WARNING. This post contains references to self-harm and suicide. It includes conversations between a human and DialoGPT, with the sole purpose of surfacing the danger of uncontrolled AI. If you or a loved one are dealing or have dealt with suicidal thoughts, I suggest skipping this article.
In the context of an accelerating mental health crisis, Natural Language Processing (NLP) is emerging as a useful tool on mental health support platforms, especially in the form of conversational AI. Chatbots like Wysa, Woebot, and Youper leverage techniques from cognitive behavioral therapy (CBT) to make people feel heard when professional mental health services are inaccessible to them. While tremendously helpful, these chatbots tend to feel scripted at times. So is there a safe way to move beyond manually crafted templates for therapy chatbots?
Throughout this article, I’ll be using DialoGPT, one of the state-of-the-art open-sourced conversational AI models. The goal is to stress-test its psychological safety and gauge how far we are from being able to replace scripted chatbots with end-to-end neural networks. Spoiler alert: very far.
Most reviews on the App Store report highly positive interactions with therapy chatbots, with some endearing testimonials of lives being saved or drastically improved:
Woebot is perfect if you feel like there’s no one you can really trust to talk to or if you feel like you would be judged for your feelings. It’s therapy you can take anywhere with you. (Positive review for Woebot on the App Store).
However, there are occasional observations about their scripted nature and lack of adaptability beyond textbook cases of mild anxiety and depression:
[…] this app’s approach is not for everyone. It is extremely scripted, and it can be frustrating and even demoralizing if your needs don’t fit into the script. I think it can be a good app for healthy folks who are experiencing a lot of stress because it can do a nice job of reminding you about mind traps and help nudge you back to more positive thinking. (Negative review for Woebot on the App Store)
Since therapy chatbots are not open-sourced, we can’t know for sure how they are implemented. My educated guess is that all possible responses live in a spreadsheet manually crafted by trained psychologists, and that explicit conditional logic selects the appropriate answer, based either directly on user input or on the verdict of trained classifiers for anxiety and depression. Alternatively, therapy chatbots could be implemented as rudimentary task-oriented dialog systems: pipelines of modules built in isolation (neural encoders, finite state machines, etc.) and tuned to accomplish a very specific goal (e.g. obtaining explicit acknowledgement from the user that their anxiety level was reduced).
Generally, large end-to-end neural networks can replace complicated manually designed rules. In particular, the field of Natural Language Generation (NLG) is dominated by such models that produce fluent text based solely on an input text prompt — the GPT model family established this technique as the status quo. However, the simplicity of a magic black box comes at the cost of losing controllability. The source of knowledge moves from human-curated if/else blocks that are inspectable and corrigible (albeit hard to maintain) to Internet-sized datasets that simultaneously reflect the light and darkness of humankind. From explicit to implicit. From structured to unruly. That might be fine for toy use cases (e.g. getting GPT-3 to generate a fun story about unicorns), but becomes a major safety concern in healthcare.
Safe data thus becomes the holy grail. For general-purpose text, there are a few large yet relatively safe sources like Wikipedia. However, conversational datasets in particular are messy; the go-to source for dialog is Reddit, which is problematic on multiple fronts. First, user anonymity leads to higher toxicity; while researchers do use heuristics to ameliorate the problem (filtering for users with high karma, excluding posts with many downvotes, applying word block-lists), ethical concerns remain. Second, the tree-like thread structure and the asynchrony of communication make Reddit interactions structurally different from live dialog.
DialoGPT was trained by Microsoft on 147M comment chains from Reddit. While such models are useful artefacts for the research community, they would be extremely unsafe to productionize, as we’ll see in the following section. Note that such ethical concerns have prompted some institutions to not release their models (e.g. LaMDA, Google’s newest conversational technology, remains locked behind closed doors at the time of writing).
I started a personal quest to gauge just how dangerous conversational models like DialoGPT can get, especially when serving people who are already prone to mental health problems. Unfortunately, the answer is very simple: extremely dangerous.
Using DialoGPT out of the box
First, I tried out the DialoGPT model in the form that was released, using the interactive demo hosted by HuggingFace. The responses are problematic, though they could arguably have been worse had the model actually provided the details I asked for:
In addition to the lack of empathy, a jarring aspect is the repetition of the answer for different questions. A common way of working around a model’s repetitive or banal answers is to slightly change the way it performs decoding, i.e. the way it navigates the vast space of potential responses in order to find one that is statistically likely.
By default, DialoGPT performs greedy decoding: output tokens are produced one at a time; at each step n, the model chooses the single token that is most likely to follow the n-1 tokens generated so far:
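As a toy illustration of this procedure (a self-contained sketch over a tiny made-up vocabulary, not DialoGPT’s actual implementation), greedy decoding reduces to taking the argmax of the next-token distribution at every step:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(step_fn, max_steps, eos_id):
    """Greedy decoding: at each step, append the single most likely token."""
    tokens = []
    for _ in range(max_steps):
        probs = softmax(step_fn(tokens))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Hypothetical 3-token "model": it strongly prefers token 2 until the
# sequence reaches length 3, then strongly prefers EOS (token 0).
def toy_model(tokens):
    if len(tokens) >= 3:
        return [5.0, 0.0, 1.0]
    return [0.0, 1.0, 3.0]

print(greedy_decode(toy_model, max_steps=10, eos_id=0))  # [2, 2, 2, 0]
```

Because the argmax is deterministic, running this repeatedly always yields the same output, which is exactly the repetitiveness discussed below.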
The core reason why greedy decoding produces repetitive and boring answers is determinism. Sampled decoding addresses this issue: at each step n, the model samples a token from the vocabulary, according to the conditional distribution P learned during training. So, instead of choosing the most likely token, it randomly selects a likely token:
There are two adjustments commonly applied to sampled decoding. First, the next token is sampled from among the top K most likely tokens (as opposed to the entire vocabulary), to exclude completely outlandish ones. Second, a temperature T is applied to the distribution P to reshape it: T > 1 flattens the distribution (unlikely tokens become more likely), while T < 1 sharpens it (probability mass concentrates on the likeliest tokens). The DialoGPT paper proposes a slightly more involved decoding algorithm, which at its core still relies on sampling.
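Both adjustments can be sketched in a few lines of self-contained Python (a toy illustration over a 5-token vocabulary; the real model samples over tens of thousands of tokens):

```python
import math
import random

def sample_top_k(logits, k, temperature, rng):
    """Sample one token id from the top-K candidates of a
    temperature-scaled distribution."""
    # Keep only the K highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide logits by the temperature: T > 1 flattens, T < 1 sharpens.
    scaled = [logits[i] / temperature for i in top]
    # Renormalize over the surviving candidates.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token id according to these probabilities.
    return rng.choices(top, weights=probs, k=1)[0]

rng = random.Random(0)
logits = [4.0, 3.5, 3.0, -2.0, -5.0]
# With K=3, tokens 3 and 4 can never be chosen, no matter the draw.
samples = [sample_top_k(logits, k=3, temperature=0.75, rng=rng) for _ in range(20)]
print(samples)
```

Unlike greedy decoding, repeated runs (with different seeds) produce different outputs, which is what breaks the repetition, but also what makes the model’s behavior harder to audit.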
After switching to sampled decoding with K=50 and T=0.75, the interaction with DialoGPT becomes less repetitive, but more harmful:
Fine-tuning DialoGPT on data produced by therapists
As discussed earlier, the quality of neural networks is constrained by the quality of data they were trained on. Currently, pre-training on Reddit is a necessary evil: despite its baggage of toxicity, it is the only source of high-volume conversational data on the Internet. However, models can be fine-tuned on smaller volumes of cleaner data: the hope is to correct some of the negative behavior learned from the noisier pre-training step.
I recently stumbled upon this excellent article that proposes fine-tuning dialog models on data from Counsel Chat, an online platform where trained counselors advise people in need. The responses are high-quality since they are produced by professionals. However, the interactions are often single-turn and lengthier than regular synchronous conversations. To work around this, I truncated the responses to two sentences during fine-tuning.
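The truncation step can be sketched as follows (a simplified version; the `truncate_to_sentences` helper and its naive regex-based sentence splitter are my own illustrative choices, not taken from the referenced article):

```python
import re

def truncate_to_sentences(text, n=2):
    """Keep only the first n sentences of a counselor response.

    Sentences are split naively on whitespace that follows a period,
    question mark, or exclamation point.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

reply = "I hear you. That sounds difficult. Have you tried journaling? It helps."
print(truncate_to_sentences(reply))  # I hear you. That sounds difficult.
```

Keeping only the opening sentences brings the long, essay-like counselor answers closer to the turn length of a live chat, at the cost of occasionally cutting off useful advice.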
Disappointingly, the fine-tuned model still condones self-harm. There’s a glimpse of empathy at the beginning, but it soon goes downhill from there:
I want to emphasize that the problematic conversations above were not cherry-picked. Model responses differ across runs due to the non-deterministic nature of sampled decoding, but they are consistently harmful and eventually encourage suicide.
On the bright side, applications like Wysa handle this situation in a professional manner: confirming your intent, then deterring you from self-harm and suggesting helpful resources.
Undoubtedly, safety should be the highest priority of any product in the mental health space. Existing therapy chatbots abide by this principle at the cost of a contrived experience. While such applications are helpful in standard textbook situations, I believe their rigid scripts prevent them from truly understanding and empathizing with the user. At the other end of the spectrum, uncontrolled natural language generation is more adaptive and engaging, but the flip side of that freedom is deadly. There’s a huge chasm between these two ends, and it will be interesting to watch if and how the next few years of research safely bridge them.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.