WARNING. This post contains references to self-harm and suicide. It includes conversations between a human and DialoGPT, with the sole purpose of surfacing the danger of uncontrolled AI. If you or a loved one are dealing or have dealt with suicidal thoughts, I suggest skipping this article.
In the context of an accelerating mental health crisis, Natural Language Processing (NLP) is emerging as a useful tool on mental health support platforms, especially in the form of conversational AI. Chatbots like Wysa, Woebot, and Youper leverage techniques from cognitive behavioral therapy (CBT) to make people feel heard when professional mental health services are inaccessible to them. While tremendously helpful, these chatbots tend to feel scripted at times. So is there a safe way to move beyond manually crafted templates for therapy chatbots?
Throughout this article, I’ll be using DialoGPT, one of the state-of-the-art open-sourced conversational AI models. The goal is to stress-test its psychological safety and gauge how far we are from being able to replace scripted chatbots with end-to-end neural networks. Spoiler alert: very far.
Most reviews on the App Store report highly positive interactions with therapy chatbots, with some endearing testimonials of lives being saved or drastically improved:
Woebot is perfect if you feel like there’s no one you can really trust to talk to or if you feel like you would be judged for your feelings. It’s therapy you can take anywhere with you. (Positive review for Woebot on the App Store).
However, there are occasional observations about their scripted nature and lack of adaptability beyond textbook cases of mild anxiety and depression:
[…] this app’s approach is not for everyone. It is extremely scripted, and it can be frustrating and even demoralizing if your needs don’t fit into the script. I think it can be a good app for healthy folks who are experiencing a lot of stress because it can do a nice job of reminding you about mind traps and help nudge you back to more positive thinking. (Negative review for Woebot on the App Store)
Since therapy chatbots are not open-sourced, we can’t know for sure how they are implemented. My educated guess is that all possible responses live in a spreadsheet manually crafted by trained psychologists, and that explicit conditional logic selects the appropriate answer, based either directly on user input or on the verdict of trained classifiers for anxiety and depression. Alternatively, therapy chatbots could be implemented as rudimentary task-oriented dialog systems: pipelines of modules built in isolation (neural encoders, finite state machines, etc.) and tuned to accomplish a very specific goal (e.g. obtaining explicit acknowledgement from the user that their anxiety level was reduced).
Generally, large end-to-end neural networks can replace complicated manually designed rules. In particular, the field of Natural Language Generation (NLG) is dominated by such models that produce fluent text based solely on an input text prompt — the GPT model family established this technique as the status quo. However, the simplicity of a magic black box comes at the cost of losing controllability. The source of knowledge moves from human-curated if/else blocks that are inspectable and corrigible (albeit hard to maintain) to Internet-sized datasets that simultaneously reflect the light and darkness of humankind. From explicit to implicit. From structured to unruly. That might be fine for toy use cases (e.g. getting GPT-3 to generate a fun story about unicorns), but becomes a major safety concern in healthcare.
Safe data thus becomes the holy grail. For general-purpose text, there are a few large yet relatively safe sources like Wikipedia. However, conversational datasets in particular are messy; the go-to source for dialog is Reddit, which is problematic on multiple fronts. First, user anonymity leads to higher toxicity; while researchers do use heuristics to ameliorate the problem (filtering for users with high karma, excluding posts with many downvotes, applying word block-lists), ethical concerns remain. Second, the tree-like thread structure and the asynchrony of communication make Reddit interactions structurally different from live dialog.
DialoGPT was trained by Microsoft on 147M comment chains from Reddit. While such models are useful artefacts for the research community, they would be extremely unsafe to productionize, as we’ll see in the following section. Note that such ethical concerns have prompted some institutions to not release their models (e.g. LaMDA, Google’s newest conversational technology, remains locked behind closed doors at the time of writing).
I started a personal quest to gauge just how dangerous conversational models like DialoGPT can get, especially when serving people who are already prone to mental health problems. Unfortunately, the answer is very simple: extremely dangerous.
Using DialoGPT out of the box
First, I tried out the DialoGPT model in the form that was released, using the interactive demo hosted by HuggingFace. The responses are problematic, though they could arguably have been worse had the model actually provided the details I asked for:
In addition to the lack of empathy, a jarring aspect is the repetition of the answer for different questions. A common way of working around a model’s repetitive or banal answers is to slightly change the way it performs decoding, i.e. the way it navigates the vast space of potential responses in order to find one that is statistically likely.
By default, DialoGPT performs greedy decoding: output tokens are produced one at a time; at each step n, the model chooses the single token that is most likely to follow the n-1 tokens generated so far:
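As a toy illustration of this procedure (a self-contained sketch over a tiny made-up vocabulary, not DialoGPT’s actual implementation), greedy decoding reduces to taking the argmax of the next-token distribution at every step:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(step_fn, max_steps, eos_id):
    """Greedy decoding: at each step, append the single most likely token."""
    tokens = []
    for _ in range(max_steps):
        probs = softmax(step_fn(tokens))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Hypothetical 3-token "model": it strongly prefers token 2 until the
# sequence reaches length 3, then strongly prefers EOS (token 0).
def toy_model(tokens):
    if len(tokens) >= 3:
        return [5.0, 0.0, 1.0]
    return [0.0, 1.0, 3.0]

print(greedy_decode(toy_model, max_steps=10, eos_id=0))  # [2, 2, 2, 0]
```

Because the argmax is deterministic, running this repeatedly always yields the same output, which is exactly the repetitiveness discussed below.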
The core reason why greedy decoding produces repetitive and boring answers is determinism. Sampled decoding addresses this issue: at each step n, the model samples a token from the vocabulary, according to the conditional distribution P learned during training. So, instead of choosing the most likely token, it randomly selects a likely token:
There are two adjustments commonly applied to sampled decoding. First, the next token is sampled from among the top K most likely tokens (as opposed to the entire vocabulary), to exclude completely outlandish ones. Second, a temperature T is applied to the distribution P to reshape it: T > 1 flattens the distribution (unlikely tokens become more likely), while T < 1 sharpens it (probability mass concentrates on the likeliest tokens). The DialoGPT paper proposes a slightly more involved decoding algorithm, which at its core still relies on sampling.
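Both adjustments can be sketched in a few lines of self-contained Python (a toy illustration over a 5-token vocabulary; the real model samples over tens of thousands of tokens):

```python
import math
import random

def sample_top_k(logits, k, temperature, rng):
    """Sample one token id from the top-K candidates of a
    temperature-scaled distribution."""
    # Keep only the K highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide logits by the temperature: T > 1 flattens, T < 1 sharpens.
    scaled = [logits[i] / temperature for i in top]
    # Renormalize over the surviving candidates.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token id according to these probabilities.
    return rng.choices(top, weights=probs, k=1)[0]

rng = random.Random(0)
logits = [4.0, 3.5, 3.0, -2.0, -5.0]
# With K=3, tokens 3 and 4 can never be chosen, no matter the draw.
samples = [sample_top_k(logits, k=3, temperature=0.75, rng=rng) for _ in range(20)]
print(samples)
```

Unlike greedy decoding, repeated runs (with different seeds) produce different outputs, which is what breaks the repetition, but also what makes the model’s behavior harder to audit.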
After switching to sampled decoding with K=50 and T=0.75, the interaction with DialoGPT becomes less repetitive, but more harmful:
Fine-tuning DialoGPT on data produced by therapists
As discussed earlier, the quality of neural networks is constrained by the quality of data they were trained on. Currently, pre-training on Reddit is a necessary evil: despite its baggage of toxicity, it is the only source of high-volume conversational data on the Internet. However, models can be fine-tuned on smaller volumes of cleaner data: the hope is to correct some of the negative behavior learned from the noisier pre-training step.
I recently stumbled upon this excellent article that proposes fine-tuning dialog models on data from Counsel Chat, an online platform where trained counselors advise people in need. The responses are high-quality since they are produced by professionals. However, the interactions are often single-turn and lengthier than regular synchronous conversations. To work around this, I truncated the responses to two sentences during fine-tuning.
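The truncation step can be sketched as follows (a simplified version; the `truncate_to_sentences` helper and its naive regex-based sentence splitter are my own illustrative choices, not taken from the referenced article):

```python
import re

def truncate_to_sentences(text, n=2):
    """Keep only the first n sentences of a counselor response.

    Sentences are split naively on whitespace that follows a period,
    question mark, or exclamation point.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

reply = "I hear you. That sounds difficult. Have you tried journaling? It helps."
print(truncate_to_sentences(reply))  # I hear you. That sounds difficult.
```

Keeping only the opening sentences brings the long, essay-like counselor answers closer to the turn length of a live chat, at the cost of occasionally cutting off useful advice.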
Disappointingly, the fine-tuned model still condones self-harm. There’s a glimpse of empathy at the beginning, but it soon goes downhill from there:
I want to emphasize that the problematic conversations above were not cherry-picked. Model responses differ across runs due to the non-deterministic nature of sampled decoding, but they are consistently harmful and eventually encourage suicide.
On the bright side, applications like Wysa handle this situation in a professional manner: confirming your intent, then deterring you from self-harm and suggesting helpful resources.
Undoubtedly, safety should be the highest priority of any product in the mental health space. Existing therapy chatbots abide by this principle at the cost of a contrived experience. While such applications are helpful in standard textbook situations, I believe their rigid scripts prevent them from truly understanding and empathizing with the user. At the other end of the spectrum, uncontrolled natural language generation is more adaptive and engaging, but the flip side of that freedom is deadly. There’s a huge chasm between these two ends, and it will be interesting to watch if and how the next few years of research safely bridge them.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.