Hotel bookings, weather queries, restaurant reservations – chatbots and virtual assistants seem to be the perfect solution for these kinds of routine inquiries. Not surprisingly, task-oriented dialog systems that support these chatbots have become a hot topic in the AI and NLP research community.
In this article, we will review the recent research breakthroughs in task-oriented (or goal-oriented) dialog systems, see the open challenges for building a good task-oriented chatbot, and list some of the most appealing directions for further research. We’ll draw on the key ideas from two recent research papers:
- Recent Advances and Challenges in Task-oriented Dialog System by Zheng Zhang, Ryuichi Takanobu, Qi Zhu, Minlie Huang, and Xiaoyan Zhu
- A Survey on Dialogue Systems: Recent Advances and New Frontiers by Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang
Existing studies on task-oriented dialog systems are usually classified into two broad categories: (1) pipeline methods and (2) end-to-end methods.
In the first case, the whole system is divided into several components, such as natural language understanding (NLU), dialog state tracking (DST), dialog policy learning, and natural language generation (NLG). Other ways of combining or merging these components are also possible. In contrast to pipeline approaches, end-to-end methods build the dialog system as a single model that takes the natural language context as input and generates a natural language response as output.
With their modular structure, pipeline systems are more interpretable and stable, and thus a better fit for real-world commercial applications. End-to-end systems, on the other hand, require fewer annotations, which makes them another promising alternative for business applications. So, let's review the recent research advances in both pipeline approaches and end-to-end models.
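To make the pipeline structure concrete, here is a minimal sketch with toy rule-based stand-ins for each of the four components. All function names, rules, and strings are illustrative assumptions for this example; real systems replace each stage with a learned model.

```python
def nlu(utterance):
    """Parse an utterance into an intent and slot-value pairs (toy rules)."""
    slots = {}
    if "chinese" in utterance.lower():
        slots["cuisine"] = "Chinese"
    if "manhattan" in utterance.lower():
        slots["location"] = "Manhattan"
    return {"intent": "request_restaurant", "slots": slots}

def dst(state, nlu_output):
    """Accumulate slot-value pairs across turns into a belief state."""
    state = dict(state)
    state.update(nlu_output["slots"])
    return state

def policy(state):
    """Pick the next system action from the tracked state."""
    if "cuisine" in state and "location" in state:
        return {"act": "inform", "slots": state}
    return {"act": "request", "slot": "location"}

def nlg(action):
    """Map a dialog act to a natural-language response."""
    if action["act"] == "inform":
        s = action["slots"]
        return f"Here is a {s['cuisine']} restaurant in {s['location']}."
    return f"Which {action['slot']} would you like?"

state = {}
turn = nlu("Can you recommend a Chinese restaurant in Manhattan?")
state = dst(state, turn)
print(nlg(policy(state)))  # Here is a Chinese restaurant in Manhattan.
```

The point of the sketch is the interface between stages: each component consumes the previous one's structured output, which is exactly what makes pipelines interpretable, and also what makes them brittle when one module changes.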
Natural language understanding (NLU)
This component parses a user’s utterance into a structured semantic representation, usually consisting of intent and slot-value pairs. Intent here indicates the function of the utterance, e.g. querying or providing information. Slot–value pairs are semantic elements mentioned in the utterance. For example, in the utterance “Can you recommend a Chinese restaurant in Manhattan?”, the slot–value pairs can be “cuisine” – “Chinese” and “location” – “Manhattan”.
Intent detection and slot–value extraction can be addressed using RNNs, CNNs, recursive neural networks, recurrent conditional random fields, or a BERT model. Zhang et al. (2020) also mention some more recent approaches to strengthening the connection between intent classification and slot tagging, e.g. Liu and Lane (2016) applied attention mechanisms to allow interaction between word and sentence representations, while Goo et al. (2018) used an intent gate to direct the slot tagging process.
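Slot tagging in these models is commonly framed as sequence labeling with BIO tags, where each token gets a `B-slot`, `I-slot`, or `O` label. As an illustrative sketch (the tag sequence below is a hypothetical model output, not from a real tagger), here is how BIO tags can be decoded back into slot-value pairs:

```python
def bio_to_slots(tokens, tags):
    """Decode a BIO tag sequence into a dict of slot-value pairs."""
    slots, current_slot, current_tokens = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_slot:  # close the previous span
                slots[current_slot] = " ".join(current_tokens)
            current_slot, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_slot == tag[2:]:
            current_tokens.append(token)  # continue the current span
        else:
            if current_slot:
                slots[current_slot] = " ".join(current_tokens)
            current_slot, current_tokens = None, []
    if current_slot:  # close a span that ends the sentence
        slots[current_slot] = " ".join(current_tokens)
    return slots

tokens = ["recommend", "a", "Chinese", "restaurant", "in", "Manhattan"]
tags   = ["O", "O", "B-cuisine", "O", "O", "B-location"]
print(bio_to_slots(tokens, tags))
# {'cuisine': 'Chinese', 'location': 'Manhattan'}
```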
Dialog state tracking (DST)
At this step of the pipeline, the dialog state tracker estimates the user’s goal by taking the entire dialog context as input. In most recent studies, the user’s goal is represented by slot–value pairs.
In 2013, Henderson et al. from the University of Cambridge introduced deep learning for dialog state tracking, using a sliding window to output a sequence of probability distributions over an arbitrary number of possible values. When a slot has a fixed vocabulary, value prediction can be treated as a classification problem. For free-form slots, modern approaches either generate the value directly or predict the span of the value in the utterance.
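As a rough illustration of the classification view, the sketch below maintains a belief state for a fixed-vocabulary "cuisine" slot by interpolating the previous belief with the current turn's distribution. The scores and the interpolation rule are hypothetical stand-ins, not the method of any particular tracker:

```python
def update_belief(belief, turn_scores, smoothing=0.5):
    """Interpolate the previous belief with this turn's value distribution."""
    values = set(belief) | set(turn_scores)
    new = {v: smoothing * belief.get(v, 0.0)
              + (1 - smoothing) * turn_scores.get(v, 0.0)
           for v in values}
    total = sum(new.values())
    return {v: p / total for v, p in new.items()}  # renormalize

belief = {"Chinese": 0.2, "Italian": 0.8}   # belief after turn 1
turn2  = {"Chinese": 0.9, "Italian": 0.1}   # the user clarified in turn 2
belief = update_belief(belief, turn2)
print(max(belief, key=belief.get))  # Chinese
```

Keeping a full distribution rather than a single best value is what lets the downstream policy reason about uncertainty, e.g. asking a confirmation question when two values have similar probability.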
Dialog policy learning
After the dialog state representation is defined, the dialog policy generates the next system action. Policy learning is usually optimized either with supervised learning or reinforcement learning. Sometimes a rule-based approach is first employed to warm-start the system.
Most of the modern approaches to dialog policy learning rely on reinforcement learning (RL). Cuayáhuitl et al. (2015) applied a deep RL technique in which feature representations and the dialog policy were learned simultaneously. However, training an RL policy requires many interactions with human users, which is time-consuming and costly. Many approaches have been suggested to address this problem, including the use of user simulators. More recently, model-based RL approaches were proposed (Peng et al., 2018; Wu et al., 2019; Su et al., 2018), in which the environment is modeled to simulate the dynamics of the conversation. The dialog policy is then trained by alternating between learning from real users and planning with the environment model.
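To give a flavor of RL-based policy learning, here is tabular Q-learning on a toy dialog MDP with a hand-coded user simulator. The states, actions, and rewards are illustrative assumptions, standing in for the deep-RL setups cited above: the agent must collect both slots before informing, and is penalized slightly for each extra turn.

```python
import random

random.seed(0)
ACTIONS = ["request_cuisine", "request_location", "inform"]

def step(state, action):
    """Toy user simulator: +1 for task success, -1 for a premature inform,
    -0.1 per turn to encourage short dialogs."""
    known = set(state)
    if action == "request_cuisine":
        known.add("cuisine")
    elif action == "request_location":
        known.add("location")
    elif action == "inform":
        return state, (1.0 if known == {"cuisine", "location"} else -1.0), True
    return frozenset(known), -0.1, False

Q = {}
for _ in range(2000):  # train over simulated dialogs
    state, done = frozenset(), False
    while not done:
        if random.random() < 0.2:  # epsilon-greedy exploration
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
        nxt, reward, done = step(state, action)
        best_next = 0.0 if done else max(Q.get((nxt, a), 0.0) for a in ACTIONS)
        q = Q.get((state, action), 0.0)
        Q[(state, action)] = q + 0.1 * (reward + 0.9 * best_next - q)
        state = nxt

# Once both slots are known, the greedy policy should choose to inform.
full = frozenset({"cuisine", "location"})
print(max(ACTIONS, key=lambda a: Q.get((full, a), 0.0)))  # inform
```

Deep RL replaces the Q-table with a neural network, and model-based variants additionally learn the `step` function itself instead of hand-coding a simulator.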
Natural language generation (NLG)
This component maps the dialog act generated by the dialog policy to a natural language utterance. The response should correspond to a dialog act to ensure task completion, and also be natural, specific, and informative.
This can be approached as a conditional language generation task. Recently, Peng et al. (2020) proposed pre-training a GPT model on a large-scale corpus and then fine-tuning it on target NLG tasks with a small number of training samples.
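As a minimal illustration of the NLG input/output contract, here is a template-based stand-in; the templates are hypothetical examples. A fine-tuned GPT model learns this mapping from dialog acts to surface text from data instead of hand-written patterns:

```python
# Hypothetical act-to-template mapping for a restaurant domain.
TEMPLATES = {
    "inform": "I found a {cuisine} restaurant in {location} for you.",
    "request": "What {slot} are you looking for?",
    "confirm": "So you want {cuisine} food, is that right?",
}

def generate(dialog_act):
    """Realize a structured dialog act as a natural-language utterance."""
    template = TEMPLATES[dialog_act["act"]]
    return template.format(**dialog_act.get("slots", {}))

act = {"act": "inform",
       "slots": {"cuisine": "Chinese", "location": "Manhattan"}}
print(generate(act))
# I found a Chinese restaurant in Manhattan for you.
```

Templates guarantee that the response matches the dialog act, but they are rigid; neural NLG trades that guarantee for naturalness and variety, which is why faithfulness to the act remains an evaluation criterion.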
Chen et al. (2020) mention two key limitations of pipeline approaches:
- The credit assignment problem, where the user’s feedback is hard to assign to a specific module of a pipeline
- Process interdependence, where any changes to or retraining of one component require all the other components to be adapted accordingly
These issues can be addressed by building an end-to-end neural generative model for task-oriented dialog systems. Most of these end-to-end methods utilize seq2seq models.
However, traditional seq2seq approaches require thousands of dialogs to learn even basic behaviors. To improve data efficiency, Hybrid Code Networks (HCNs) were introduced at ACL 2017. They combine an RNN with domain-specific knowledge encoded as software and system action templates, and can be trained with supervised learning, reinforcement learning, or a mixture of both. Zhao and Eskenazi (2016) also proposed deploying deep reinforcement learning to train end-to-end task-oriented dialog systems. In particular, they used a variant of Deep Recurrent Q-Networks (DRQN).
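The HCN idea in miniature: a learned scorer proposes system actions, and hand-coded domain logic masks out actions that are invalid in the current context. Everything below (action names, the masking rule, the scores standing in for an RNN's output) is a hypothetical sketch, not the published architecture:

```python
ACTIONS = ["request_location", "request_cuisine", "api_call", "say_goodbye"]

def action_mask(context):
    """Domain-specific code: an api_call is only valid once both slots
    are known. This is the 'software' part of a hybrid code network."""
    mask = {a: 1.0 for a in ACTIONS}
    if not {"cuisine", "location"} <= set(context["slots"]):
        mask["api_call"] = 0.0
    return mask

def choose_action(scores, context):
    """Combine learned scores with the rule-based mask."""
    mask = action_mask(context)
    masked = {a: s * mask[a] for a, s in scores.items()}
    return max(masked, key=masked.get)

scores = {"request_location": 0.1, "request_cuisine": 0.2,
          "api_call": 0.6, "say_goodbye": 0.1}   # hypothetical RNN output
print(choose_action(scores, {"slots": {"cuisine": "Chinese"}}))
# request_cuisine
```

The mask is what buys data efficiency: the network never has to learn from data which actions are impossible, because the developer encodes that directly.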
Despite the significant progress of recently introduced task-oriented dialog systems, there are still many issues that remain unsolved.
Dialog is a complex NLP task, where a system needs to learn grammar, syntax, dialog reasoning, and language generation. Mastering all these skills requires lots of training data, and getting domain-specific data is expensive and time-consuming. Some recent studies (Peng et al., 2020; Budzianowski and Vulic, 2019) address this data scarcity problem by pre-training their dialog agents on large-scale corpora to capture implicit language knowledge, and then fine-tuning the system on data for the specific dialog task.
While open-domain dialog research primarily focuses on generating meaningful and consistent responses, task-oriented dialog agents should also be able to complete a specific task. Thus, the dialog management component of these dialog systems should aim at accurate user goal estimation and efficient dialog planning.
Knowledge base integration
Integrating external knowledge into end-to-end neural models is not a trivial task, since these methods don't include an explicit structured dialog state representation. Recent approaches use memory networks for knowledge integration (Wu et al., 2019), store knowledge base tuples in a separate context-free memory (Lin et al., 2019), or apply a two-step knowledge base retrieval technique (Qin et al., 2019).
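As a toy sketch in the spirit of two-step retrieval: first score knowledge base rows against the user's constraints, then read out the requested attribute from the best row. The KB contents, the overlap-count scoring, and the query are illustrative stand-ins, not the method from any of the cited papers:

```python
# Hypothetical restaurant knowledge base.
KB = [
    {"name": "Golden Wok", "cuisine": "Chinese", "location": "Manhattan"},
    {"name": "Trattoria",  "cuisine": "Italian", "location": "Brooklyn"},
]

def retrieve(constraints, requested):
    # Step 1: row selection -- score each entity by constraint overlap.
    def row_score(row):
        return sum(row.get(k) == v for k, v in constraints.items())
    best_row = max(KB, key=row_score)
    # Step 2: column selection -- read out the requested attribute.
    return best_row[requested]

print(retrieve({"cuisine": "Chinese", "location": "Manhattan"}, "name"))
# Golden Wok
```

In neural versions, both steps are soft: attention over rows replaces the argmax, and attention over columns replaces the dictionary lookup, so the whole retrieval remains differentiable and trainable end to end.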
Even the most sophisticated dialog systems are vulnerable to simple input perturbations. Models are usually trained with very little noise in the training data, while in real applications, users’ requests and responses can be out of domain and unexpected given the training data distribution. One of the possible solutions to this problem can be to combine rule-based methods with neural models (Liang et al., 2016).
A single dialog agent often serves a large number of people. Moreover, it learns from interactions with these people, and thus can inadvertently store some sensitive information. It’s important to consider this issue when building task-oriented dialog systems.
Future research directions
Given the above-mentioned challenges, some of the most appealing research directions include:
- Addressing data scarcity by using pre-training models or other approaches
- Increasing sample efficiency in policy learning via better dialog planning
- Aiming at zero-shot domain adaptation, where a dialog agent can be applied in a new domain without any domain-specific training data
- Improving the robustness of task-oriented dialog systems
- Building end-to-end task-oriented dialog models without intermediate supervision
Many interesting research ideas addressing the key challenges of building task-oriented dialog agents have been presented at the top AI and NLP academic conferences this year. In our research summaries, we’ve curated and featured the most interesting research papers that introduce new approaches to policy learning, knowledge base integration, modeling long context, and more.
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.