This article is authored by Keyur Faldu and Dr. Amit Sheth. It elaborates on a niche aspect of the broader cover story, “Rise of Modern NLP and the Need of Interpretability!” At Embibe, we seek answers to these open questions while building an NLP platform to solve numerous problems for academic content.
Modern NLP models (BERT, GPT, etc.) are typically trained end to end. Carefully crafted feature engineering has largely given way to complex architectures that learn end-to-end tasks (e.g., sentiment classification, question answering) without the features being specified explicitly. Linguistic features (like part-of-speech, co-reference, etc.) played a key role in classical NLP. Hence, it is important to understand how modern NLP models arrive at decisions by “probing” what they learn. Do these models learn linguistic features from unlabelled data automatically? How can we interpret the capabilities of modern NLP models? Let’s probe.
Linguistics: The Background
Linguistic knowledge is an essential aspect of natural language processing. We can think of it along the following dimensions:
- Syntax: analyzing the structure of sentences and the way words are connected.
- Morphology: the inner structure of individual words and how new words are formed from morphs of base words.
- Phonology: the study of the system of sounds comprising speech, which constitute fundamental components of language.
- Semantics: the meaning of individual words and entire texts.
In statistical methods and classical machine learning, solving any problem related to natural language processing involves deriving the linguistic knowledge described above. Thus, the research community has given attention to numerous tasks related to linguistic knowledge. A few examples follow:
- Part-of-speech: Syntactic category of words, i.e., noun, verb, adjective, pronoun, etc.
- Constituency Trees (or phrase structure grammar): Phrase structure rules consider that sentence structure is constituency-based, and a parse tree arranges these constituents in a tree structure with constituency relation.
- Dependency Trees (or dependency grammars): Dependency grammar rules consider that sentence structure is dependency-based, and the dependency parse tree arranges words in a tree structure with dependency relation.
- Coreference: Relationship between two words or phrases with a common referent.
- Lemmatization: Deriving the base (lemma) form of a word by removing prefixes or suffixes through morphological analysis.
The above are a few examples of important tasks related to linguistic knowledge: part-of-speech mainly deals with syntactic knowledge, dependency trees and co-reference are important for further understanding semantics, and lemmatization is an example of morphology.
Numerous other tasks further analyze the linguistic properties of a sentence, like semantic roles, semantic proto-roles, relation classification (lexical and semantic), subject noun, main auxiliary verb, subject-verb agreement, etc.
Modern NLP Models
Modern NLP models are either LSTM-based or transformer-based. ELMo and ULMFiT are examples of language models built on the LSTM architecture. In contrast, BERT and GPT are examples of language models built on the transformer architecture. For the rest of the study, let’s take “BERT” as the reference example.
- The BERT model is pre-trained with the objectives of masked word prediction and next sentence prediction on massive unlabeled data.
- The pre-trained BERT model is fine-tuned by extending it with task-specific layers for tasks like ‘sentiment analysis,’ ‘text classification,’ or ‘question answering’ with limited labeled data.
Representations produced by the pre-trained BERT model encode relevant information, which enables task-specific fine-tuning with very limited labeled data. The question is:
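To make the masked word prediction objective more concrete, here is a minimal Python sketch of how a training example could be built. The `mask_tokens` helper and the whitespace tokenizer are hypothetical simplifications for illustration; the 15% default rate follows the BERT paper, but real BERT uses WordPiece tokens and additional replacement rules that are omitted here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked, else None.
    Hypothetical helper: a simplified stand-in for BERT's masking procedure.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model must predict the hidden token
            labels.append(tok)
        else:
            masked.append(tok)    # visible context, no prediction target
            labels.append(None)
    return masked, labels

tokens = "kids are playing cricket all day".split()
# A high rate is used here only so that the toy sentence is likely to contain masks.
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

The model only sees `masked` and is trained to recover the entries of `labels` that are not `None`, which is why no manual annotation is needed.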
What Linguistic Knowledge is Encoded in BERT?
As a result, a flurry of research has sought to understand what kind of linguistic information is captured in neural networks. The most common theme across the different approaches can be grouped as “probes” (or probing classifiers, diagnostic classifiers, auxiliary prediction tasks), which test how well the internal mechanisms of a neural network can classify (or perform on) auxiliary linguistic tasks (or probe tasks, or ancillary tasks).
How do “Probes” work?
- Probes are shallow neural networks (often a single classifier layer) inserted on top of intermediate layers or attention heads of a neural network trained for a primary task. Probes help to investigate what information is captured by different layers or attention heads. They are trained and validated on auxiliary tasks to discover whether such auxiliary information is captured.
- Figure 3 illustrates how probe classifiers can be inserted on top of different layers or attention heads to discover the information related to auxiliary tasks that those layers and attention heads encode.
- Let’s say we want to investigate whether encoded representations from the BERT model capture linguistic information, like “if a verb is an auxiliary verb” or “if a phrase is a subject noun”. Auxiliary verbs are helping verbs, and subject nouns are noun phrases that act as a subject. These tasks can be framed as “auxiliary tasks” for probes. For example, in the sentence “Kids are playing cricket all day,” “are” is an auxiliary verb, “playing” is the main verb, “Kids” is the subject noun, and “cricket” is an object noun.
- If a probe classifier does not do well on an auxiliary task for some linguistic information, it suggests that such information is not encoded in the internal representations of the model, possibly because it is not needed to solve the model’s primary objectives.
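The idea of a probe can be sketched in a few lines of Python. The block below is a minimal illustration, assuming toy two-dimensional “representations” with made-up values (not real BERT outputs) and a hypothetical auxiliary task of detecting auxiliary verbs. The probe is a single-layer logistic-regression classifier that only reads the frozen representations.

```python
import math

# Hypothetical frozen representations for six tokens (toy values, not from BERT).
tokens = ["kids", "are", "playing", "cricket", "is", "has"]
frozen_reps = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [0.7, 0.2], [0.2, 0.8], [0.3, 0.95]]
is_auxiliary = [0, 1, 0, 0, 1, 1]  # auxiliary-task labels (1 = auxiliary verb)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(reps, labels, lr=0.5, epochs=500):
    """Fit a single-layer logistic-regression probe with batch gradient descent.

    The representations are only read, never modified.
    """
    dim, n = len(reps[0]), len(reps)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(reps, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d in range(dim):
                grad_w[d] += err * x[d] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def probe_accuracy(reps, labels, w, b):
    preds = [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in reps]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

w, b = train_probe(frozen_reps, is_auxiliary)
accuracy = probe_accuracy(frozen_reps, is_auxiliary, w, b)
```

In an actual study the probe would be validated on held-out sentences rather than on its own training data; high held-out accuracy is then read as evidence that the property is linearly recoverable from the representations.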
How are “Probes” different from Fine-Tuning or Multi-Task Learning?
- “Probes” are not related to fine-tuning for downstream tasks, either in goal or in approach.
- Table 1 shows the comparative landscape.
- “Probes” aim to discover encoded linguistic knowledge, whereas fine-tuning and multi-task learning train the model on one or more primary tasks.
- As illustrated in figure 4, “probes” can access model internals but cannot update model parameters. Fine-tuning and multi-task learning, on the other hand, do not access model internals but can update model parameters.
- “Probes” should be shallow in terms of complexity (i.e., a single-layer classifier on top of the model), whereas fine-tuning and multi-task learning can stack up deep layers depending on the complexity of the downstream task.
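The contrast in parameter updates can be illustrated with a deliberately tiny Python sketch. It assumes a hypothetical one-weight “encoder” and a one-weight head trained with plain gradient descent on made-up data; the only point is which parameters are allowed to change in each regime.

```python
def train(encoder_w, head_w, data, update_encoder, lr=0.05, steps=200):
    """Minimise the squared error of head_w * (encoder_w * x) against y.

    update_encoder=False mimics a probe (encoder frozen, only the head learns);
    update_encoder=True mimics fine-tuning (both sets of parameters change).
    """
    for _ in range(steps):
        for x, y in data:
            hidden = encoder_w * x            # the "internal representation"
            err = head_w * hidden - y
            grad_head = err * hidden
            grad_encoder = err * head_w * x
            head_w -= lr * grad_head
            if update_encoder:
                encoder_w -= lr * grad_encoder
    return encoder_w, head_w

data = [(1.0, 2.0), (2.0, 4.0)]   # toy task: y = 2 * x
pretrained_w = 0.5                # hypothetical "pre-trained" encoder weight

probe_encoder, probe_head = train(pretrained_w, 0.0, data, update_encoder=False)
tuned_encoder, tuned_head = train(pretrained_w, 0.1, data, update_encoder=True)
```

After training, `probe_encoder` is still the pre-trained value, while `tuned_encoder` has moved away from it. This is why a probe's result says something about the pre-trained model, whereas a fine-tuned model's result mostly reflects what was learned during fine-tuning.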
What are Different Types of “Probes”?
These probing classifiers can be categorized by the neural network mechanism they leverage to probe for linguistic knowledge. The two main categories are:
- Internal Representations: A small probe classifier is built on top of internal representations from different layers to analyze what linguistic information is encoded at different layers.
- Attention weights: Probe classifiers are built on top of attention weights to discover if there is an underlying linguistic phenomenon in attention weights patterns.
(A) Internal Representations based “Probes”:
Quite a few techniques probe how much linguistic knowledge is encoded in the internal representations at different layers of models like BERT. Let’s take a look at a couple of examples.
(A.1) Edge Probing: A framework introduced by Tenney et al. that aims to probe the linguistic knowledge encoded in the contextualized representations of a model.
- For auxiliary tasks like Part-of-Speech, Constituents, Dependencies, Entities, Semantic Role Labelling, Semantic Proto-Roles, and Coreference resolution, the authors compared the performance of contextualized representations from models like BERT, GPT, ELMo, and CoVe.
- Edge probing decomposes structured-prediction tasks into a common format, where a probing classifier receives a text span (or two spans) from the sentence and must predict a label, such as a constituent or relation type, from the per-token embeddings of the tokens within those target spans.
- The macro average of performance over all the auxiliary tasks for the BERT-Large model was 87.3, whereas the baseline probe using non-contextualized representations achieved 75.2. So contextualization added about 12 points, a relative gain of roughly 16%, in linguistic knowledge as measured by these probes.
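The common input format used by edge probing can be sketched as follows. This is a simplified Python illustration with made-up two-dimensional embeddings: the hypothetical helpers `pool_span` and `edge_probe_input` use plain mean pooling, which is an assumption chosen for brevity and not necessarily the pooling used in the original framework.

```python
def pool_span(token_embeddings, span):
    """Mean-pool the embeddings of the tokens in the half-open span [start, end)."""
    start, end = span
    vectors = token_embeddings[start:end]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def edge_probe_input(token_embeddings, span1, span2=None):
    """Build the feature vector a probing classifier would receive.

    One span for labelling tasks (e.g. constituents), two spans for
    relation tasks (e.g. dependencies, coreference).
    """
    features = pool_span(token_embeddings, span1)
    if span2 is not None:
        features = features + pool_span(token_embeddings, span2)
    return features

# Toy per-token embeddings for "Kids are playing cricket" (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

single = edge_probe_input(embeddings, (1, 3))         # span "are playing"
pair = edge_probe_input(embeddings, (2, 3), (3, 4))   # spans "playing" and "cricket"
```

The probing classifier never sees the rest of the sentence directly, so any information about the wider context has to come from the contextualized embeddings themselves.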
(A.2) BERT Rediscovers the Classical NLP Pipeline: Tenney et al. further analyzed where linguistic knowledge comes from.
- Center of gravity: The center of gravity reflects the average layer attended to when computing the scalar mixing (weighted pooling) of internal representations from different layers. Intuitively, a higher center of gravity for a task means that the information needed for that task is captured by higher layers.
- Expected layer: The probe classifier is trained with the scalar mixing of internal representations from different layers. The contribution (or differential score) of layer i is the difference between the “performance of the probe trained with layers 0 to i” and the “performance of the probe trained with layers 0 to i-1”. The expected layer is the expectation of the layer index weighted by these differential scores.
- In figure 5, row labels are the auxiliary tasks used for probing linguistic knowledge. F1 scores of the probe classifiers for each task are given in the first two columns, where l=0 indicates performance on non-contextual representations and l=24 indicates performance when mixing contextual representations from all 24 layers of the BERT model. Expected layers are shown in purple, and the center of gravity is shown in dark blue.
- The expected layer indicates where most of the additional linguistic knowledge comes from. It can be seen that linguistic knowledge for syntactic tasks is acquired in the initial layers, whereas knowledge for semantic tasks is acquired in later layers.
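Both statistics are simple weighted averages over layer indices. The Python sketch below shows one way to compute them from the definitions given above; the mixing weights and cumulative probe scores are made-up toy numbers, not results from the paper.

```python
def center_of_gravity(mixing_weights):
    """Average layer index under the scalar-mixing weights.

    mixing_weights[i] is the weight given to layer i (index 0 = embedding layer).
    """
    total = sum(mixing_weights)
    return sum(layer * w for layer, w in enumerate(mixing_weights)) / total

def expected_layer(cumulative_scores):
    """Expectation of the layer index weighted by differential scores.

    cumulative_scores[i] is the probe score when mixing layers 0..i.
    """
    deltas = [cumulative_scores[i] - cumulative_scores[i - 1]
              for i in range(1, len(cumulative_scores))]
    return sum(layer * d for layer, d in enumerate(deltas, start=1)) / sum(deltas)

# Hypothetical 4-layer example.
weights = [0.1, 0.2, 0.3, 0.4]        # the probe leans on the higher layers
scores = [70.0, 80.0, 85.0, 85.0]     # most of the gain arrives by layer 1

cog = center_of_gravity(weights)      # weighted mean of 0..3
exp_layer = expected_layer(scores)    # (1*10 + 2*5 + 3*0) / 15
```

In this toy case the two statistics disagree (about 2.0 versus about 1.33), which shows why both are reported: the center of gravity describes where the probe places its weight, while the expected layer describes where the score actually improves.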
(B) Attention weights based “Probes”:
In “What Does BERT Look At? An Analysis of BERT’s Attention,” Clark et al. probe attention weights for linguistic knowledge in BERT. It is intriguing to notice how specific attention heads express linguistic phenomena, and how combinations of attention heads predict linguistic structure such as dependency grammar with performance comparable to the state of the art.
(B.1) Specific Attention Heads
- As can be seen in figure 6, specific attention heads in BERT express specific linguistic phenomena, where a token attends to other tokens depending on the linguistic relation expressed by the attention head.
- Figure 6 visualizes six different attention heads. The BERT base model has 12 layers, and each layer has 12 attention heads. The top-left plot in figure 6 represents the 10th attention head in the 8th layer, where the pattern of direct objects attending to their verbs is evident. Similarly, in the 11th attention head of the 8th layer, noun modifiers (determiners, etc.) attend to their nouns. The attention heads in the other plots express linguistic knowledge in the same way.
- It is really surprising to notice how attention heads perform as readily available probe classifiers.
Table 2 shows, for each dependency relation, how well a specific attention head performs at predicting the dependent token. For relations like determiner (det), direct object (dobj), possessive word (poss), and passive auxiliary (auxpass), the performance gain was huge (~100% relative) compared to the baseline model (predicting the token at the best fixed offset).
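Using an attention head as a ready-made classifier can be sketched in Python as follows. The attention matrix and the gold annotations below are invented toy values, and the helper names are hypothetical. The head “predicts” for each token the other token it attends to most, and this is compared with a fixed-offset baseline on the determiner relation.

```python
def predict_heads(attention):
    """For each token i, predict the token j (j != i) that i attends to most."""
    preds = []
    for i, row in enumerate(attention):
        candidates = [j for j in range(len(row)) if j != i]
        preds.append(max(candidates, key=lambda j: row[j]))
    return preds

def fixed_offset_baseline(n_tokens, offset):
    """Predict the token at a fixed relative position, clipped to the sentence."""
    return [min(max(i + offset, 0), n_tokens - 1) for i in range(n_tokens)]

def accuracy(preds, gold, positions):
    return sum(preds[i] == gold[i] for i in positions) / len(positions)

sentence = ["the", "kids", "play", "a", "new", "game"]
# Toy attention weights for one head: row i is token i's attention over all tokens.
attention = [
    [0.05, 0.80, 0.05, 0.03, 0.03, 0.04],  # "the" -> "kids"
    [0.30, 0.20, 0.30, 0.05, 0.05, 0.10],
    [0.10, 0.40, 0.20, 0.10, 0.10, 0.10],
    [0.02, 0.03, 0.05, 0.05, 0.15, 0.70],  # "a" -> "game"
    [0.05, 0.05, 0.10, 0.10, 0.10, 0.60],
    [0.05, 0.10, 0.50, 0.10, 0.15, 0.10],
]
det_gold = {0: 1, 3: 5}   # determiner position -> position of its noun

head_acc = accuracy(predict_heads(attention), det_gold, list(det_gold))
baseline_acc = accuracy(fixed_offset_baseline(len(sentence), 1), det_gold, list(det_gold))
```

On this toy sentence the head gets both determiners right, while the "next token" baseline fails on “a new game” because an adjective sits between the determiner and its noun.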
(B.2) Attention Head Combinations
- Probe classifiers trained directly on linear combinations of attention weights, and on attention weights combined with non-contextual embeddings like GloVe, performed comparably to relatively complex models that depend on internal contextual representations for dependency parsing.
- Experiments on coreference resolution suggested similar potential. From this, we can conclude that attention mechanisms in BERT also encode and express linguistic phenomena.
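A linear combination of attention weights can be turned into a probe roughly as in the Python sketch below. It assumes a softmax over candidate tokens whose scores are weighted sums of attention in both directions across heads; the weights `w` and `u` are set by hand here, whereas an actual probe would learn them, and all attention values are invented for illustration.

```python
import math

def head_distribution(attention_maps, w, u, dependent):
    """Distribution over candidate head tokens for the token at index `dependent`.

    The score of candidate c sums, over attention heads k,
    w[k] * attention from dependent to c  plus  u[k] * attention from c to dependent.
    """
    n = len(attention_maps[0])
    scores = []
    for cand in range(n):
        s = sum(w[k] * att[dependent][cand] + u[k] * att[cand][dependent]
                for k, att in enumerate(attention_maps))
        scores.append(s)
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical attention heads over a three-token sentence.
head_a = [[0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
head_b = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4], [0.5, 0.25, 0.25]]

w = [4.0, 0.0]   # hand-set weights standing in for learned parameters
u = [0.0, 1.0]

probs = head_distribution([head_a, head_b], w, u, dependent=0)
predicted_head = max(range(len(probs)), key=lambda j: probs[j])
```

Because the only trainable parameters are one or two scalars per attention head, good performance from such a probe is hard to attribute to the probe itself and points back to structure already present in the attention maps.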
Probing the “Probes”
Now that we have been introduced to representation-based probes and attention-weight-based probes for discovering encoded linguistic knowledge using auxiliary tasks, it would be interesting to ask deeper questions:
- Are bigger models better at encoding linguistic knowledge?
- How can we check a model’s ability to generalize in encoding linguistic knowledge?
- Can we decode linguistic knowledge instead of relying on shallow probe classifier labels?
- What are the limitations of probes, and how should we draw conclusions from them?
- Can we infuse linguistic knowledge?
- Does encoded linguistic knowledge capture meaning?
- Is encoded linguistic knowledge good enough for natural language understanding?
Let’s elaborate further on the above questions in the next article, “Linguistic Wisdom of NLP Models”.
References
- Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. NAACL 2019.
- Belinkov et al. “Analysis Methods in Neural Language Processing: A Survey”. ACL 2019.
- Clark et al. “What Does BERT Look At? An Analysis of BERT’s Attention”. 2019.
- Tenney et al. “BERT Rediscovers the Classical NLP Pipeline”. 2019.
- Tenney et al. “What Do You Learn From Context? Probing For Sentence Structure In Contextualized Word Representations”. ICLR 2019.
- Adi et al. “Fine-Grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks”. ICLR 2017.
- Stickland et al. “BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning”. ICML 2019.
- Zhou et al. “LIMIT-BERT: Linguistic Informed Multi-Task BERT”. 2019.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.