Generalized Language Models: Common Tasks & Datasets

EDITOR’S NOTE: Generalized Language Models is an extensive four-part series by Lillian Weng of OpenAI.

Part 1: CoVe, ELMo & Cross-View Training
Part 2: ULMFiT & OpenAI GPT
Part 3: BERT & OpenAI GPT-2
Part 4: Common Tasks & Datasets

This article finalizes the series on generalized language models:

Metric: Perplexity
Common Tasks and Datasets
Reference

Metric: Perplexity

Perplexity is often used as an intrinsic evaluation metric for gauging how well a language model can capture the real word distribution conditioned on the context.

The perplexity of a discrete probability distribution $p$ is defined as the exponentiation of its entropy:

2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}
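
As a quick illustration of this definition, the following minimal sketch computes the perplexity of a distribution given as a list of probabilities. The function name `perplexity` and the two example distributions are assumptions made for illustration only; they do not come from any particular library.

```python
import math


def perplexity(dist):
    """Return 2^{H(p)} for a discrete distribution given as a list of probabilities."""
    # Outcomes with p(x) = 0 are skipped, following the convention 0 * log 0 = 0.
    entropy = -sum(p * math.log2(p) for p in dist if p > 0)
    return 2 ** entropy


# Hypothetical example distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(perplexity(uniform))  # 4.0: as uncertain as a fair choice among 4 outcomes
print(perplexity(skewed))   # lower than 4, since one outcome dominates
```

A uniform distribution over $k$ outcomes has perplexity $k$, so perplexity can be read as the effective number of equally likely choices.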

Given a sentence with $N$ words, $s = (w_1, \dots, w_N)$ , the entropy looks as follows, simply assuming that each word has the same frequency, $\frac{1}{N}$ :

H(s) = -\sum_{i=1}^N P(w_i) \log_2 p(w_i) = -\sum_{i=1}^N \frac{1}{N} \log_2 p(w_i)

The perplexity for the sentence becomes:

2^{H(s)} = 2^{-\frac{1}{N} \sum_{i=1}^N \log_2 p(w_i)} = (2^{\sum_{i=1}^N \log_2 p(w_i)})^{-\frac{1}{N}} = (p(w_1) \dots p(w_N))^{-\frac{1}{N}}

A good language model should assign high probabilities to the words that actually occur. Therefore, the lower the perplexity, the better.
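
The sentence-level formula can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the function name `sentence_perplexity` and the probability lists are hypothetical, standing in for the per-word probabilities $p(w_i)$ that a real language model would produce.

```python
import math


def sentence_perplexity(word_probs):
    """Perplexity of a sentence from the model probabilities p(w_1), ..., p(w_N)."""
    n = len(word_probs)
    # Average log2-probability per word; summing logs avoids the numerical
    # underflow that multiplying many small probabilities would cause.
    avg_log2 = sum(math.log2(p) for p in word_probs) / n
    return 2 ** (-avg_log2)


# Hypothetical per-word probabilities from two imaginary models.
confident = [0.5, 0.5, 0.5, 0.5]
unsure = [0.1, 0.1, 0.1, 0.1]

print(sentence_perplexity(confident))  # 2.0
print(sentence_perplexity(unsure))     # approximately 10

# The same value via the geometric-mean form (p(w_1) ... p(w_N))^(-1/N).
print(math.prod(confident) ** (-1 / len(confident)))  # 2.0
```

The model that assigns higher probabilities to the observed words gets the lower perplexity, which matches the intuition above.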

Common Tasks and Datasets

Question-Answering

SQuAD (Stanford Question Answering Dataset): A reading comprehension dataset, consisting of questions posed on a set of Wikipedia articles, where the answer to every question is a span of text.
RACE (ReAding Comprehension from Examinations): A large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle school and high school students.

Commonsense Reasoning

Story Cloze Test: A commonsense reasoning framework for evaluating story understanding and generation. The test requires a system to choose the correct ending to multi-sentence stories from two options.
SWAG (Situations With Adversarial Generations): A multiple-choice dataset containing 113k sentence-pair completion examples that evaluate grounded commonsense inference.

Natural Language Inference (NLI): also known as Text Entailment, an exercise to discern in logic whether one sentence can be inferred from another.

RTE (Recognizing Textual Entailment): A set of datasets initiated by text entailment challenges.
SNLI (Stanford Natural Language Inference): A collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral.
MNLI (Multi-Genre NLI): Similar to SNLI, but with a more diverse variety of text styles and topics, collected from transcribed speech, popular fiction, and government reports.
QNLI (Question NLI): Converted from SQuAD dataset to be a binary classification task over pairs of (question, sentence).
SciTail: An entailment dataset created from multiple-choice science exams and web sentences.

Named Entity Recognition (NER): labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names

CoNLL 2003 NER task: Consists of Reuters newswire text, concentrating on four types of named entities: persons, locations, organizations, and names of miscellaneous entities.
OntoNotes 5.0: This corpus contains text in English, Arabic and Chinese, tagged with four different entity types (PER, LOC, ORG, MISC).
Reuters Corpus: A large collection of Reuters News stories.
Fine-Grained NER (FGN)

Sentiment Analysis

SST (Stanford Sentiment Treebank)
IMDb: A large dataset of movie reviews with binary sentiment classification labels.

Semantic Role Labeling (SRL): models the predicate-argument structure of a sentence, and is often described as answering “Who did what to whom”.

CoNLL-2004 & CoNLL-2005

Sentence similarity: also known as paraphrase detection

MRPC (MicRosoft Paraphrase Corpus): It contains pairs of sentences extracted from news sources on the web, with annotations indicating whether each pair is semantically equivalent.
QQP (Quora Question Pairs)
STS Benchmark: Semantic Textual Similarity

Sentence Acceptability: a task to annotate sentences for grammatical acceptability.

CoLA (Corpus of Linguistic Acceptability): a binary single-sentence classification task.

Text Chunking: divides a text into syntactically correlated parts of words.

CoNLL-2000

Part-of-Speech (POS) Tagging: assigns a part-of-speech tag, such as noun, verb, or adjective, to each token.

The Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993)

Machine Translation: See the Stanford NLP page.

WMT 2015 English-Czech data (Large)
WMT 2014 English-German data (Medium)
IWSLT 2015 English-Vietnamese data (Small)

Coreference Resolution: cluster mentions in text that refer to the same underlying real world entities.

CoNLL-2012

Long-range Dependency

LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects): A collection of narrative passages extracted from the BookCorpus. The task is to predict the last word of each passage, which requires at least 50 tokens of context for a human to predict successfully.
Children’s Book Test: Built from books that are freely available in Project Gutenberg. The task is to predict the missing word among 10 candidates.

Multi-task benchmark

GLUE multi-task benchmark: https://gluebenchmark.com
decaNLP benchmark: https://decanlp.com

Unsupervised pretraining dataset

Books corpus: The corpus contains “over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.”
1B Word Language Model Benchmark
English Wikipedia: ~2500M words