To build a state-of-the-art dialog system, you need challenging tasks for model training and evaluation. There are numerous dialog datasets that assist researchers in building task-oriented and chit-chat dialog agents. In particular, the Facebook Research team has introduced a framework, called ParlAI (pronounced par-lay), where they’ve gathered together 80+ popular dialog datasets.
We have reviewed ParlAI, recent research papers, and other sources to select the most popular tasks for training and evaluating dialog agents. Our list is not exhaustive, but it can be a good starting point for those developing state-of-the-art conversational agents.
Open-domain dialog datasets
Here are the key datasets for open-domain (chit-chat) dialogs.
~ 500 dialogs
~ 12K utterances
This is an English-language dataset consisting of 502 dialogs between a user and an assistant discussing movie preferences in natural language. The dataset was collected using a Wizard-of-Oz methodology, where paid crowdworkers played the roles of a user and an assistant.
~ 9K dialogs
~ 90K utterances
This is a dataset containing movie chats wherein each response is explicitly generated by copying and/or modifying sentences from plots, comments, and reviews about the movie. The dataset was collected using the self-chat setup, where the crowdworker was shown the movie’s plot, review, comments, and fact table, provided with an opening statement, and asked to continue the conversation with at least 8 utterances.
~ 13K dialogs
~ 103K utterances
This dataset was crawled from various websites that help English learners practice everyday English conversation. As a result, the dialogs are human-written, focus on a certain topic in a certain context (e.g., shopping), and include a reasonable number of speaker turns. The DailyDialog dataset is also annotated for topic, emotion, and utterance act.
~ 11K dialogs
~ 162K utterances
To build the Persona-Chat dataset, the developers first crowdsourced a set of 1155 possible personas, each consisting of at least 5 profile sentences. Then, they paired crowdworkers, assigned each of them a random persona, and asked them to chat.
~ 20K dialogs
~ 146K utterances
This dataset is an extended version of the Persona-Chat dataset and was prepared for the Conversational Intelligence Challenge at NeurIPS 2018. Considering that the original Persona-Chat test set was released publicly, the developers of the ConvAI2 task crowdsourced further data for a hidden test set with 100 new personas and over 1,015 dialogs. They also extended the training set by rephrasing, generalizing, or specializing the original utterances.
~ 22K dialogs
~ 202K utterances
This dataset spans more than 1300 diverse topics and includes conversations directly grounded in knowledge retrieved from Wikipedia. Each dialog is carried on by two participants: one plays the role of a knowledgeable expert (the wizard) and the other is a curious learner (the apprentice).
~ 107K utterances
This dataset is grounded in emotional situations to facilitate training and evaluation of empathy in dialog agents. It was collected via crowdsourcing, where one participant (speaker) selects the emotion word, describes the corresponding situation and discusses it with another participant (listener). Each conversation is allowed to be 4-8 utterances long, and the average is 4.31 utterances per conversation.
~ 7K dialogs
~ 76K utterances
This dataset is explicitly designed to exhibit multiple conversation modes: displaying personality, showing empathy, and demonstrating knowledge. It was collected via human-to-human conversations with one ‘unguided’ speaker and one ‘guided’ speaker. The guided speaker is shown three suggested responses in each turn. The responses are generated by models trained on the ConvAI2, Wizard of Wikipedia, and EmpatheticDialogues tasks, and the guided speaker is free to use, modify, or ignore them.
If you’re interested in learning more about recently introduced tasks for dialog agents, check out these research papers featuring the latest approaches to building challenging dialog datasets.
Task-oriented dialog datasets
Here are the most popular datasets that have been collected to facilitate research on task-oriented (goal-oriented) dialog systems.
~ 15K human-computer dialogs
The Dialog State Tracking Challenge (DSTC) is the first common testbed and evaluation suite for dialog state tracking. The data was collected by transcribing telephone calls between real passengers and dialog systems built by three different research groups. The goal of these dialog systems is to provide bus riders with timetable information.
~ 3.2K human-computer dialogs
This dataset comes from the second dialog state tracking challenge, which introduces some additional features: a new domain (i.e., restaurant booking), changing user goals, and a richer dialog state. In contrast to the first challenge, DSTC2 was collected with the help of Amazon Mechanical Turk workers, who were asked to call the dialog systems and find restaurants that matched specific constraints on area, price range, and food.
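As a rough illustration of what a "dialog state" with changing user goals can look like, the sketch below tracks restaurant constraints across turns. The `DialogState` class, the slot names, and the already-extracted slot values are all assumptions made for this example; they do not reproduce the DSTC2 data format or any tracker submitted to the challenge.

```python
from dataclasses import dataclass, field


@dataclass
class DialogState:
    """Hypothetical dialog state: the user's current search constraints."""
    constraints: dict = field(default_factory=dict)

    def update(self, slot_values):
        """Merge newly mentioned slot values; later values override earlier ones."""
        self.constraints.update(slot_values)
        # Return a copy so that earlier snapshots are not mutated by later turns.
        return dict(self.constraints)


# Assumed input: slot values already extracted from each user turn.
turns = [
    {"food": "italian"},
    {"area": "north", "pricerange": "cheap"},
    {"food": "chinese"},  # the user changes their goal mid-dialog
]

state = DialogState()
snapshots = [state.update(turn) for turn in turns]

for turn_index, snapshot in enumerate(snapshots, start=1):
    print(turn_index, snapshot)
```

A real tracker has to infer these slot values from noisy speech recognition output and usually keeps a distribution over possible values rather than a single one; this sketch only shows the bookkeeping.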
~ 10K human-to-human dialogs
This is a multi-domain booking dataset, covering 7 domains: Attraction, Hospital, Police, Hotel, Restaurant, Taxi, Train. The dataset was collected using a Wizard-of-Oz setup whereby conversations were conducted between two crowdworkers, a ‘wizard’ and a ‘user’. The user is provided with a goal (e.g. book a taxi to the hotel) and chats with the wizard to achieve this goal. The latest corrected version of the dataset (MultiWOZ 2.2) is available on GitHub.
~ 13K dialogs
Taskmaster is yet another multi-domain booking dataset. Its authors claim that it has richer and more diverse language and involves more real-world entities than MultiWOZ. Some of the dialogs (~ 5.5K) were collected through a web-based interface, where crowdsourced workers playing ‘users’ communicated with human operators but were led to believe they were interacting with an automated system. The rest of the dialogs (~ 7.7K) were written by crowdsourced workers based on a suggested scenario.
For more datasets, see the survey below on task-oriented dialog datasets by Zhang et al. (2020).
Critique of dialog datasets
During the last few years, we have observed significant progress with regard to the development of challenging dialog datasets. The recent task-oriented datasets, such as MultiWOZ and Taskmaster, cover several domains and include dialogs where users change their goal during the conversation. However, even the most sophisticated dialog datasets suffer from a number of issues.
The Rasa research team discusses these issues in their paper Where is the context? – A critique of recent dialogue datasets. Here are the problems they point out:
- Ambiguous system actions. Dialog agents trained on these tasks cannot learn deterministic behavior because the same dialog history is often followed by different actions. To address this problem, the researchers suggest enforcing an automatic system response during data collection if the same dialog state has been encountered before.
- History independence. Dialog models appear to benefit mostly from the last user input and the preceding system action, and can perform quite well without the rest of the dialog history. This may indicate that the dialogs in the datasets are unnaturally simple. The issue may arise from the fact that crowdworkers are not actually interested in the information they obtain and are instead motivated to finish a dialog as soon as possible.
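To make the two issues above more concrete, here is a minimal sketch of how one might count ambiguous system actions in a toy corpus. It is not the method used in the Rasa paper: the `ambiguity_rate` function, the `history_len` parameter, and the toy dialogs are all hypothetical, and the dialogs are assumed to be already reduced to (user input, system action) label pairs.

```python
from collections import defaultdict


def ambiguity_rate(dialogs, history_len=None):
    """Fraction of distinct histories that are followed by more than one system action.

    dialogs: list of dialogs, each a list of (user_input, system_action) pairs.
    history_len: number of most recent events kept in the history key
                 (must be >= 1); None keeps the full history.
    """
    next_actions = defaultdict(set)
    for dialog in dialogs:
        events = []
        for user_input, system_action in dialog:
            events.append(("user", user_input))
            recent = events if history_len is None else events[-history_len:]
            next_actions[tuple(recent)].add(system_action)
            events.append(("system", system_action))
    if not next_actions:
        return 0.0
    ambiguous = sum(1 for actions in next_actions.values() if len(actions) > 1)
    return ambiguous / len(next_actions)


# Toy corpus (made up): the first two dialogs share a history but diverge.
toy_dialogs = [
    [("hello", "greet"), ("book_taxi", "ask_destination")],
    [("hello", "greet"), ("book_taxi", "ask_time")],
    [("hi", "greet"), ("book_taxi", "ask_destination")],
]

full_history_rate = ambiguity_rate(toy_dialogs)
# history_len=2 keeps only the preceding system action and the last user input.
short_history_rate = ambiguity_rate(toy_dialogs, history_len=2)
print(full_history_rate, short_history_rate)
```

In this toy corpus, one of the four distinct full histories is followed by two different actions. Setting `history_len=2` restricts the key to the last user input and the preceding system action, which is the context the critique suggests models mostly rely on.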
In our research summaries, we’ve curated and featured the recent research papers that introduce novel approaches to building open-domain and task-oriented dialog datasets.