How do you order toilet paper online?

If you were using a modern graphical user interface (GUI), you would:

  1. Go to your computer
  2. Open up a browser
  3. Type in Amazon, then type “toilet paper” into the search window
  4. Suffer analysis paralysis over the gazillion selections that pop up
  5. Make a choice but then be confronted with more choices over how many packs to get
  6. Sign in if you haven’t already
  7. Put in your payment information if you haven’t already
  8. Be confronted with yet another choice on whether you should subscribe for regular deliveries or go for a one-time order
  9. Review your order details
  10. Confirm your order, then question that decision in a spectacular display of buyer’s remorse.


Amazon Toilet Paper Search Results


Or, you can avoid the struggle and just tell your Amazon Echo to order you some toilet paper.


Alexa Order Toilet Paper


The rise of conversational AI has been made possible by recent breakthroughs in human-parity speech recognition and smarter sentiment analysis. Though humans have been speaking and writing for a lot longer than they’ve been using GUIs, systems that rely on language as their medium of interaction are difficult to build because the computer has to be able to handle user commands that are ambiguous or hard to interpret.

There are countless commercial applications for conversational AI. It powers customer engagement in chatbots, voice experiences, and digital assistants like Google Assistant, Siri, and Amazon’s Alexa. Popular user-facing applications include brand engagement and storytelling, like Disney’s hugely successful reengagement campaign for Zootopia, and customer service, where conversational AI has lowered the cost per service ticket while scaling up the number of customer support requests a business can handle. You can browse our bot directory for more inspiration on how conversational AI can be used in business.

This article provides an overview of the six primary ways that conversational AI systems are built today, including both traditional approaches and novel, state-of-the-art techniques.  


Six Ways to Build Conversational AI

Let’s say that you live in a parallel universe where conversational AI already exists but bots for the commercial trafficking of toilet paper do not.

One of the first decisions that you’d need to make is how your bot will process dialogue inputs and produce replies, since each method comes with a potentially different approach to NLP and NLU. Most current production systems use rule-based or retrieval-based methods, while generative methods, grounded learning, and interactive learning are active areas of research.


1. Rule-Based

Rule-based systems rely on a predefined hierarchy of rules that govern how to transform user input into output dialogue or actions. Rules can range from simple to complex, and a rule-based system is relatively straightforward to create. However, these systems aren’t able to respond to input patterns or keywords that don’t match existing rules.


Rule-Based Dialog Script


Remember Microsoft DOS and how painful it was to use? MS-DOS and other terminal interfaces are actually examples of rule-based conversational interfaces. Though the user has to learn a terse and unforgiving array of commands, the system responds in a predictable manner if the user provides the correct command. As older users may recall, MS-DOS offered no error recovery; if the commands and associated syntax weren’t entered exactly as directed, the system simply threw an error message and did nothing.


Microsoft DOS Bad Filename Bad UX


Rule-based conversational systems don’t have to suck. Eliza, an MIT chatbot created in the 1960s, fooled many users into thinking that it was a real therapist with its sophisticated rule-based dialogue generation. Eliza scanned the input text for keywords, assigned each keyword a programmer-designated rank, and then decomposed and reassembled the input sentence based on the highest-ranking keyword. If it encountered remarks that didn’t match any known keyword, it prompted the user to provide more input (“Tell me more about that”). Apparently that was enough to make some people think that Eliza was a better listener than their human acquaintances!


Eliza Chatbot Rogerian Therapist Chatbot UX
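
To make the keyword-ranking idea concrete, here is a minimal Python sketch of an Eliza-style responder. The three rules, their ranks, and the `respond` function are made up for illustration and are not Eliza’s actual script; the real program also swapped pronouns (“my” to “your”) when reassembling sentences, which this sketch skips.

```python
import re

# Hypothetical rules: (rank, keyword pattern, response template).
# When several keywords match, the highest rank wins.
RULES = [
    (3, re.compile(r"\bi need (.+)", re.I), "Why do you need {0}?"),
    (2, re.compile(r"\bmy (mother|father)\b", re.I), "Tell me more about your {0}."),
    (1, re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
]
FALLBACK = "Tell me more about that."


def respond(text):
    """Return the reply for the highest-ranking rule that matches the input."""
    text = text.strip().rstrip(".!?")
    matches = []
    for rank, pattern, template in RULES:
        match = pattern.search(text)
        if match:
            matches.append((rank, template, match))
    if not matches:
        # No known keyword: ask the user to keep talking.
        return FALLBACK
    _, template, match = max(matches, key=lambda item: item[0])
    # Reassemble the reply around the fragment captured from the input.
    return template.format(*match.groups())
```

With these rules, `respond("I need toilet paper.")` returns "Why do you need toilet paper?", and any input that matches no keyword falls through to the stock prompt.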


2. Retrieval-Based

Retrieval-based methods power the bulk of production systems in use today.

When given user input, the system uses heuristics to locate the best response from its database of pre-defined responses. Dialogue selection is essentially a prediction problem. Identifying the most appropriate response template may involve simple algorithms like keyword matching, or it may require more complex processing with machine learning or deep learning. Regardless of the heuristic used, these systems only regurgitate pre-defined responses and do not generate new output.
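
As a rough illustration of the simplest case, the Python sketch below scores each canned response by keyword overlap and returns the best one. The response database, the Jaccard scoring heuristic, and the 0.2 threshold are assumptions chosen for the example; a production system would likely swap in a learned ranking model.

```python
import re

# Hypothetical response database: trigger phrase -> canned reply.
RESPONSES = {
    "order toilet paper": "Sure, I can order toilet paper for you.",
    "track my order": "Your order is on its way.",
    "cancel my order": "Okay, I have cancelled your order.",
}
NO_MATCH = "Sorry, I don't have a response for that."


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def score(query, trigger):
    """Jaccard overlap between the keyword sets of query and trigger."""
    q, t = tokenize(query), tokenize(trigger)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(query, threshold=0.2):
    """Pick the pre-defined reply whose trigger best overlaps the query."""
    best = max(RESPONSES, key=lambda trigger: score(query, trigger))
    if score(query, best) < threshold:
        return NO_MATCH
    return RESPONSES[best]
```

Note that `retrieve` can only ever return one of the stored strings, which is exactly the limitation described above.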

Mitsuku, one of the world’s most popular open domain conversation chatbots, contains over 300,000 hand-coded AIML response patterns and a knowledge base containing over 3000 objects. Using these response patterns, Mitsuku can even construct poems and songs about a given topic. Here’s a rhyme she created about chatbots. 


Mitsuku Chatbot Poem About Chatbots


Retrieval-based systems need a lot of data pre-processing and custom application logic. For example, the original IBM Watson was built for the sole purpose of competing on Jeopardy!, and it had sophisticated modules to preprocess questions, generate answers, and score hypotheses.


IBM Watson DeepQA Architecture


However, all of these functionalities require specific domain expertise and a lot of hand engineering. This also means that their knowledge bases can quickly become outdated and have to be manually updated, in turn limiting their ability to adapt to new domains, languages, or use cases. All of these obstacles also make retrieval-based systems difficult to personalize or scale.


3. Generative Methods

Overcoming the limitations of the previous two approaches requires that the conversational AI be smart enough and creative enough to generate new content. Instead of drawing upon pre-defined responses, conversational AI that uses generative methods is given a large amount of conversational training data in order to learn how to generate new dialogue that resembles it.

While Mitsuku or other retrieval-based systems must follow a series of steps to prepare data and to define functionality, generative models are trained end-to-end rather than step-by-step. In other words, developers should be able to build this type of conversational AI using only machine learning and training data, and it would require no manual engineering or domain expertise. In the ideal case, these features make the system much more scalable and adaptable in the long run.

Supervised learning, reinforcement learning, and adversarial learning currently dominate in the building of generative systems, and developers can combine all three approaches to do multi-step training for conversational agents. 

Supervised learning frames conversation as a sequence-to-sequence problem, where user input is mapped to a computer-generated response. However, sequence-to-sequence learning tends to favor generic, high-probability responses (e.g. “I don’t know”). Such systems also have trouble incorporating proper nouns into their speech because proper nouns occur at a much lower frequency in dialogue than other classes of words. All of these issues add up to systems that are boring and repetitive to talk to and likely would not promote sustained human engagement.


Sequence 2 Sequence Generative Language Model


To address that issue, developers augment supervised learning with reinforcement learning, which models how agents should take actions that optimize for some cumulative reward. In this case, the reward would be sustained human interest and engagement with the conversational AI.

Though it sounds promising, Andrew Ng ranked reinforcement learning dead last in terms of its ability to provide economic value to businesses. Reinforcement learning works well if the decision process can be modeled as a Markov Decision Process (MDP), in which all of the information that the system needs to choose the optimal next action is contained in the present state, making the preceding states irrelevant to the decision-making process. While Go is a great example of a finite game of perfect information, conversation is not, and there is no guarantee, even after simulating sample responses millions and millions of times, that a system will come close to generating a “perfect” or even an “acceptable” response.

The compromise for conversational AI was to model the decision-making process as a partially observable Markov decision process (POMDP), in which system dynamics are determined by an MDP, but the actual state cannot be observed directly. Instead, the agent observes the system’s current conditions and then forms a probabilistic belief about what the system’s state may be. However, finding the optimal policy in a POMDP is generally considered to be a very difficult problem, which makes this alternative impractical for commercial deployment as well.


Partially Observable Markov Decision Process Dialog Systems Steve Young Cambridge
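
The “probabilistic belief” part is easier to see with numbers. Below is a small Python sketch of a standard Bayesian belief update over a hidden user intent. The two intents, the transition table, and the observation probabilities are all invented for illustration, and for brevity the transition table ignores which action the system just took.

```python
def update_belief(belief, transition, observation_prob, observation):
    """One belief update: b'(s') is proportional to P(o | s') * sum_s P(s' | s) * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        # Predict where the hidden state is likely to have moved.
        predicted = sum(transition[s][s_next] * belief[s] for s in states)
        # Weight the prediction by how well it explains what was observed.
        unnormalized[s_next] = observation_prob[s_next][observation] * predicted
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-intent example; every number here is made up.
belief = {"order": 0.5, "cancel": 0.5}
transition = {  # users mostly keep the same goal from turn to turn
    "order": {"order": 0.9, "cancel": 0.1},
    "cancel": {"order": 0.1, "cancel": 0.9},
}
observation_prob = {  # chance the recognizer hears each word, given the true intent
    "order": {"heard_order": 0.8, "heard_cancel": 0.2},
    "cancel": {"heard_order": 0.3, "heard_cancel": 0.7},
}
new_belief = update_belief(belief, transition, observation_prob, "heard_order")
```

After hearing something that sounds like “order”, the belief shifts toward the order intent without ever committing to it fully. Updating the belief is the easy part; choosing the best action for every possible belief is what makes POMDPs hard.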



The solution for contemporary production systems is to split training into two stages. In the observational phase, the conversational model uses supervised learning on existing dialogue to imitate human behavior. Then, during the trial and error phase, the model uses reinforcement learning to adapt to new situations and dialogue inputs that did not exist in the training data.

Adversarial learning has also been used to improve neural dialog output. With adversarial training, conversational agents learn via a miniature Turing Test where a generator network creates plausible human-like responses while a discriminator network judges whether they are real human conversations or computer generated output.


Adversarial Learning For Dialog Systems


While adversarial methods have worked well for images (such as with the use of GANs, generative adversarial networks), they aren’t as productive for use in dialog systems. Unlike pixel values, words are discrete and cannot be infinitesimally perturbed, so the discriminator’s feedback cannot be passed back to the generator as a smooth gradient.

Additionally, teaching our conversational agents to mimic humans may not be the ideal training approach. Anyone who has ever observed two grown men getting into a Twitter fight knows that even human-level intelligence is not always a sufficient condition for productive conversations.


4. Ensemble Methods

Recent, state-of-the-art conversational AI such as the Alexa Prize bots, which were designed to talk about any subject (a very difficult problem!), have been built with ensemble methods, which use some combination of rule-based, retrieval-based, and generative approaches as dictated by context.

They may use a rule-based approach to sing a song, a retrieval-based approach to talk about the news, and a generative approach to handle other, unspecified use cases. The most advanced systems use hierarchical reinforcement learning, which uses a low-level dialogue policy to address the immediate task, while a higher-level policy coordinates model selection or other strategic goals.


Alexa Prize Ensemble Chatbot
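
Here is a bare-bones Python sketch of the routing idea. The three handlers are stubs, and the keyword routing table is a hand-written stand-in for the learned higher-level policy that a real ensemble system would use; all of the names are hypothetical.

```python
import re


def rule_based(text):
    """Stub for a scripted skill, such as singing a song."""
    return "La la la, this is my song."


def retrieval_based(text):
    """Stub for a skill that looks up a stored answer, such as the news."""
    return "Here is today's top headline."


def generative(text):
    """Stub for a generative model that handles everything else."""
    return "That's interesting, tell me more."


# Hypothetical routing table: trigger keywords -> handler.
ROUTES = [
    (("sing", "song"), rule_based),
    (("news", "headline"), retrieval_based),
]


def select_handler(text):
    """High-level policy: decide which sub-system should answer."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    for keywords, handler in ROUTES:
        if words & set(keywords):
            return handler
    return generative


def reply(text):
    """Low-level step: let the selected sub-system produce the response."""
    return select_handler(text)(text)
```

The split between `select_handler` and the individual handlers loosely mirrors the two policy levels described above, with the caveat that nothing here is learned.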


Though the use of ensemble methods seems promising from a methodological perspective, two-star reviews on Amazon suggest that systems that use this approach still have a long way to go before they can replace humans for your conversational needs.


5. Grounded Learning

Human dialogue relies extensively on context and external knowledge. For example, if you told a chatbot that you were going to the Swan Oyster Depot, that chatbot would probably recognize Swan Oyster Depot as a restaurant, possibly a seafood restaurant, and it may tell you to have a good time. Telling a local may result in a recommendation for the Sicilian sashimi, but telling someone who watches a lot of CNN may instead get you a monologue about Anthony Bourdain’s fervent love of the place.

Your human conversational partner drew upon personal knowledge to tell you something novel. The chatbot would probably not have, because that data, like the bulk of human knowledge, was probably not in its training dataset. That inability to incorporate real-world knowledge also means that generative models are still very bad at creating useful or meaningful chatter.


Grounded vs. Ungrounded Dialog Systems


Most human knowledge does not reside in structured datasets and continues to exist as vast quantities of unstructured data, in the form of text and images. Mitsuku, which has won the Loebner Prize three times for being the most “human-like” chatbot, is interesting to talk to because its dialogue can draw upon related knowledge about subjects in its knowledge base. While generative models may be more inventive when creating dialogue, Mitsuku is actually better “grounded” because of its ability to learn and to use real-world knowledge representations.

The problem becomes more difficult if logical reasoning is involved. If you asked your conversational AI to identify the piece of sashimi next to the green leaf in a picture, it probably wouldn’t be able to do so. What is a fairly intuitive process for humans is difficult for the AI, as it would have to (1) identify which object is the green leaf, (2) know what sashimi is, (3) understand the concept of “next to”, (4) identify the correct piece of sashimi if there were several on the plate, and (5) match the texture and color of that piece of sashimi to the correct fish candidate.

Grounded learning is an area of active research. A potential solution to the sashimi identification task above is to use a modular neural network architecture. Much like how the sashimi identification task was broken into its conceptual parts above, small neural networks that each understand a single concept are assigned to one component of the task. As the supervising system parses the input sentence, it generates a larger neural network on the fly that is customized to that particular sentence and task.


Grounded Neural Network Model Andreas Berkeley
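
The composition step can be sketched in a few lines of Python. To keep the example runnable, plain functions stand in for the small neural modules and a list of dictionaries stands in for the image, so this shows only how a parsed layout is assembled into a task-specific pipeline, not how the modules are trained. The scene, the layout format, and the `build` function are all hypothetical.

```python
# Toy scene standing in for an image: objects laid out along one axis.
SCENE = [
    {"label": "leaf", "color": "green", "x": 0},
    {"label": "sashimi", "color": "orange", "x": 1},
    {"label": "sashimi", "color": "white", "x": 5},
]


def build(layout):
    """Recursively assemble a callable "network" from a nested layout tuple."""
    op, *args = layout
    if op == "find":
        # Module that picks out every object with the given label.
        label = args[0]
        return lambda scene: [o for o in scene if o["label"] == label]
    if op == "next_to":
        # Module that keeps targets sitting directly beside an anchor object.
        target, anchor = build(args[0]), build(args[1])

        def module(scene):
            anchors = anchor(scene)
            return [o for o in target(scene)
                    if any(abs(o["x"] - a["x"]) == 1 for a in anchors)]

        return module
    raise ValueError(f"unknown module: {op}")


# Layout a parser might produce for "the sashimi next to the leaf".
layout = ("next_to", ("find", "sashimi"), ("find", "leaf"))
network = build(layout)
result = network(SCENE)
```

Here `result` contains only the orange piece at position 1. A different sentence would yield a different layout and therefore a differently shaped pipeline built from the same reusable parts.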


Grounded learning still faces many problems and challenges, one of which is the challenge of accessing knowledge bases in the context of end-to-end differentiable training for neural networks. For backpropagation to be used to train an entire network, the mechanisms which access external knowledge bases must also be fully differentiable. 

Novel architectures, such as Neural Turing Machines, employ fully differentiable addressing mechanisms to enable neural networks to access and manipulate external memory. In the coming years, we expect to see increased integration between neural networks and knowledge graphs to enable relevant references while maintaining the scalable, data-driven approach of neural dialog models.
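
To give a feel for what “fully differentiable addressing” means, the Python sketch below implements a content-based soft read in the spirit of Neural Turing Machines: instead of picking one memory slot, it blends all slots using softmax weights over cosine similarity. The memory contents, the key, and the `sharpness` value are arbitrary, and this covers only the content-addressing step, not a full Neural Turing Machine.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_read(memory, key, sharpness=5.0):
    """Return (weights, read_vector) for a content-based soft lookup."""
    scores = [sharpness * cosine(key, row) for row in memory]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The read vector is a weighted blend of every memory row.
    width = len(memory[0])
    read = [sum(w * row[i] for w, row in zip(weights, memory)) for i in range(width)]
    return weights, read


# Hypothetical memory with three slots and a query key.
memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
weights, read = soft_read(memory, key=[1.0, 0.1])
```

Because every slot contributes a little, small changes to the key produce small changes in the output, which is what allows backpropagation to train the lookup along with the rest of the network.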


6. Interactive Learning

Language is inherently interactive. Humans use language to facilitate cooperation when they need to solve problems together, and practical needs influence how language continues to develop.

For conversational AI, interactive learning remains an area of active study despite decades of continued development. Terry Winograd’s SHRDLU (1968-1970) and Percy Liang’s more modern version SHRDLURN (2016) are two examples of simple cooperative learning games.


Terry Winograd SHRDLU


In SHRDLURN, the human operator knows the desired goal of the game but has no direct control over the game pieces; the computer has control but does not understand language. The human player’s goal is to iteratively instruct the computer to map language to concepts until it can perform the correct actions to complete the task.

Again, though this task looks intuitive to humans, it is hard for computers because they have no prior conception of language. They do not understand the difference between red and blue, or whether an object is a pyramid or a cube.


Percy Liang Stanford NLP SHRDLURN


As it turns out, the actual language that human players used to teach the computer was less important than their ability to issue clear and consistent commands. Based on his experiences with SHRDLURN, Liang observed, “How do we represent knowledge, context, memory? Maybe we shouldn’t be focused on creating better models, but rather better environments for interactive learning.”

For a more detailed overview of interactive learning, read my article on approaches in natural language processing and understanding.


Which Conversational AI Method Should You Use?

Conversational UI is becoming ubiquitous in everyday life. This trend will only become more pronounced as we get used to talking to our phones and intelligent speakers. Eventually we expect all graphical user interfaces to be replaced or augmented by conversational agents.

The bulk of production conversational AI systems that power chatbots, digital assistants, and customer support experiences use retrieval-based methods, as do most third-party platforms that enable you to develop conversational bots quickly.

If you want to use a more novel approach, such as grounded or interactive learning, you’ll likely be limited by the research capabilities of your machine learning engineering team, since these are less proven R&D directions.

Deciding on a technical approach is only the first step to building a successful bot. Unfortunately, many chatbots with impressive architectures still fall prey to common user experience failures or fail to perform against the user engagement metrics that matter to you.

In my extended talk below on “The State of Conversational Artificial Intelligence”, I give a technical overview of the approaches highlighted above and also dive into the important business and design considerations that you need to weigh when building successful conversational AI systems.