Natural language processing is a powerful tool, but in the real world we often come across tasks that suffer from a shortage of data and poor model generalisation. Transfer learning addresses this problem by allowing us to take a model pre-trained on one task and use it for others.
Today, transfer learning is at the heart of language models like Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT), which can be used for a wide range of downstream tasks. In this article, we will look at the different types of transfer learning techniques and how they can be used to transfer knowledge to a different task, language or domain.
The following topics will be covered in this article:
- Transfer learning
- Types of transfer learning
- To tune or not to tune
At present we have achieved fantastic results for many tasks like speech recognition, machine comprehension, object detection, and machine translation due to massive efforts of collecting data. These huge models are extremely data-hungry and require an immense amount of labelled data.
Most of the time there is a large gap between the datasets available for training and the target application. For example, the training data for speech recognition may come from speakers in the USA while the target region is India.
In the last few years, we have learnt the art of making great machine learning models for audio, image, and text data with large amounts of data. Unfortunately every time we train a model on a dataset and try to use it for a different dataset, the performance deteriorates. This happens as the model fails to generalize and understand the basic patterns in the data. Even a slight difference can catch it off-guard.
This problem is important to tackle because real-world data is ever-changing and it’s impractical to keep retraining models from scratch for every new scenario. This is where transfer learning comes to the rescue.
Suppose we have a sentiment analysis task for the domain of Indian stock news. We have enough labelled data for this supervised task and train a model for it. Now when we apply the model to the task in the same domain, it will probably behave as expected. But, as soon as we apply it for the same task in another domain, such as cryptocurrencies, it will behave unexpectedly.
If we stay within the same paradigm of supervised learning, we need to collect labelled data for cryptocurrency news and train a new model. Hopefully, it will now perform well on cryptocurrency news, although who can predict cryptos! 😂
But what if we don’t have enough labelled data or cannot afford to collect it? Doesn’t it make sense to leverage the older model and train it on the small amount of labelled data we have for cryptos?
And what if we want to add another prediction class to the model, such as neutral, while the earlier model was just trained to predict positive and negative? Transfer learning allows us to deal with these scenarios and use knowledge learned from a previous task/domain for a new one.
Let’s give a formal definition of transfer learning. Given a source domain Ds with a corresponding source task Ts, and a target domain Dt with a target task Tt, the objective of transfer learning is to learn the target conditional probability distribution P(Yt|Xt) in Dt using the information gained from Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt.
In most cases, only a limited number of labelled target examples is assumed to be available, far fewer than the number of labelled source examples. The following figure illustrates the process of transfer learning:
Types of transfer learning
- Domain adaptation
- Cross-lingual learning
- Multi-task learning
- Sequential transfer learning
Depending on how the domain and task differ, transfer learning has to handle several scenarios:
- Xs ≠ Xt: The feature spaces of source and target are different. For example, the domain-specific vocabulary of stocks differs from that of cryptos: a term like initial coin offering (ICO), which is specific to cryptos, will not normally occur in the context of stocks. If the languages differ, the feature spaces can be completely mismatched. This scenario is referred to as cross-lingual learning or cross-lingual adaptation.
- Ps(Xs) ≠ Pt(Xt): The marginal probability distribution of words is different for source and target. The word ledger will be used more frequently for cryptos while it will be used rarely for stocks. This scenario is generally known as domain adaptation.
- Ys ≠ Yt: Labels differ for source and target. Source had positive and negative labels while the target has neutral as well.
- Ps(Ys) ≠ Pt(Yt): The marginal probability distribution of labels is different for source and target. Positive labels occur more than negative in training but the market crashed and the real data has more negative labels than positive.
- Ps(Ys|Xs) ≠ Pt(Yt|Xt): The conditional probability distributions of labels are different. This can happen if the same words mean different things or the class imbalance differs between source and target. For example, cold storage means something entirely different in the world of cryptos.
Now we define a taxonomy following Pan and Yang. They divide transfer learning mainly into transductive and inductive settings, which are further divided into domain adaptation, cross-lingual learning, multi-task learning and sequential transfer learning.
Domain adaptation is the most commonly occurring scenario in industry: we want to use a model trained on a task in one domain for the same task in another domain. Domain adaptation can be done with either no or minimal labelled data for the target. Let us discuss the available approaches for this.
Representation approaches try to change the underlying distribution of the data, either by finding features that are common to both domains or by representing both domains in a shared low-dimensional space:
- Distribution similarity approaches: The main idea behind distribution similarity approaches is to make the source and target data distributions similar. A naive way to achieve this is to ignore features that do not occur in the target. Most of these approaches rely on a similarity metric such as:
- Kullback-Leibler (KL) divergence
- Jensen-Shannon (JS) divergence
- Wasserstein distance
For distributional similarity approaches, a common strategy is to use a representation that minimizes the distance between the representations of the two domains, while at the same time maximizing the performance on the source domain data.
- Latent feature learning: Latent feature learning methods try to represent the data in a lower-dimensional feature space in which the source and target look more similar. The lower-dimensional space can be learnt either by a factorisation algorithm like Singular Value Decomposition (SVD) or by a neural network autoencoder.
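To make the distribution similarity idea concrete, here is a minimal sketch that compares the word distributions of two domains with KL and JS divergence. The toy corpora, the add-one smoothing and the function names are assumptions made for illustration, and Wasserstein distance is left out for brevity.

```python
import math
from collections import Counter

def word_distribution(tokens, vocab):
    """Relative frequency of each vocabulary word, with add-one smoothing."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return [(counts[w] + 1) / total for w in vocab]

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is non-zero wherever p is non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy corpora, made up for illustration
stocks = "shares dividend market ledger shares market".split()
cryptos = "ledger wallet ledger market ico wallet".split()
vocab = sorted(set(stocks) | set(cryptos))

p_source = word_distribution(stocks, vocab)
p_target = word_distribution(cryptos, vocab)

print("KL(source || target):", kl_divergence(p_source, p_target))
print("JS(source, target):  ", js_divergence(p_source, p_target))
```

A representation that brings these numbers closer to zero while keeping source-domain accuracy high is what the distribution similarity approaches aim for.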
Weighting and selecting data
In this methodology, we weight or select instances, rather than features, so that the source data used for training best represents the target data. Instance weighting can be seen as soft selection, while instance selection can be treated as hard selection. Instance selection is more efficient and also allows us to discard examples that could be harmful.
Xia et al. 2015 used PCA to represent the data and then used a distance metric of instances from PCA space to select them. Ruder et al. 2017b used maximum cluster difference (MCD) to define the similarity of an instance to the class. In both cases, weighting and selection are used as a pre-processing step to pick the most useful examples in the NLP pipeline.
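The difference between soft and hard selection can be sketched with a simple similarity score. This is not the PCA or MCD method cited above: it assumes cosine similarity to the centroid of the target data as the score, and the feature vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Toy 3-dimensional feature vectors (hypothetical values)
source = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.9, 0.1]]
target = [[0.0, 1.0, 0.0], [0.1, 0.8, 0.1]]

target_centre = centroid(target)
scores = [cosine(x, target_centre) for x in source]

# Soft selection: normalise the similarities into instance weights
weights = [s / sum(scores) for s in scores]

# Hard selection: keep only the k most target-like source instances
k = 2
selected = sorted(range(len(source)), key=lambda i: scores[i], reverse=True)[:k]

print("weights: ", weights)
print("selected:", selected)
```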
Self-labelling approaches belong to the category of semi-supervised learning where we train a model on labelled data and then use it to assign pseudo labels to unlabelled examples. Then, these examples are used to train the model again. Self-labelling approaches follow either of the two training types:
- Self-training: In this approach, we add only those pseudo-labelled examples to the training data for which the model’s confidence is high. The main downside is the model’s inability to correct its own mistakes, which can compound over time. A weighting scheme can also be used to give different weights to actual labels and pseudo labels.
- Multi-view training: In multi-view training, we train different models with different views of the data. These views can differ in the features used, the model architecture, or the subset of the data. This can be accomplished in many ways:
- Co-training: After training models on their own feature sets, only those instances on which one model is confident are moved to the training set of the others. In this way a model obtains labelled data for instances it is itself uncertain about.
- Democratic co-learning: This method uses models with different inductive biases, obtained from different neural network architectures or learning algorithms.
- Tri-training: This is similar to democratic co-learning, but uses three models whose differences come from training on different bootstrap samples of the original training data. After they are trained, an unlabelled example is added to a model’s training set if the other two models agree on its predicted label.
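The agreement rule used by tri-training can be sketched as follows. The three "models" are hypothetical keyword rules standing in for trained classifiers, and the function name is an assumption. In this version an example goes into one model's training set when the other two models agree on its label.

```python
def tri_training_round(models, labelled, unlabelled):
    """One labelling round for exactly three models.

    Returns one training set per model: the original labelled data plus every
    unlabelled example on which the *other two* models agree.
    """
    new_sets = [list(labelled) for _ in models]
    for x in unlabelled:
        preds = [m(x) for m in models]
        for i in range(3):
            j, k = [n for n in range(3) if n != i]
            if preds[j] == preds[k]:
                new_sets[i].append((x, preds[j]))
    return new_sets

# Hypothetical keyword rules standing in for three trained classifiers
def model_a(text):
    return "pos" if "gain" in text else "neg"

def model_b(text):
    return "pos" if "gain" in text or "rally" in text else "neg"

def model_c(text):
    return "pos" if "rally" in text else "neg"

labelled = [("stocks gain", "pos"), ("stocks fall", "neg")]
unlabelled = ["crypto rally", "crypto gain and rally", "crypto crash"]

augmented = tri_training_round([model_a, model_b, model_c], labelled, unlabelled)
for name, data in zip("abc", augmented):
    print(name, len(data))
```

Here "crypto rally" is added only to the training set of model_a, because model_b and model_c agree on it while model_a disagrees.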
Multi-source domain adaptation
In multi-source domain adaptation, we leverage data from several source domains and combine what is learned from them:
- Combining source models: We can either train a single model on the combined training data of all sources, or train one model per source and average or ensemble their predictions. The combined predictions on target data serve as pseudo labels that are then used to train a new model.
- Neural network-based methods: These methods use the source models to find lower-dimensional representations of target instances and then train subsequent layers with attention mechanisms or recurrent neural network layers. They can also weight the outputs of the source models by the similarity of each source domain to the target domain and fine-tune the subsequent layers.
In this section, we will try to understand cross-lingual language models, which enable us to compare words across different languages. This is important for tasks like machine translation and cross-lingual retrieval. But more importantly, these embeddings can help us transfer knowledge from resource-rich to resource-poor languages by providing a common representation space.
The data for the task can exist either in parallel form, which is an exact translation (the cat translation in Fig 1.4), or in comparable form, where the examples are only loosely related (for example, a word paired with an image of a similar cat).
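As a rough sketch of combining source models, the snippet below averages the class probabilities of several source-domain models, weighted by how similar each source domain is to the target, and turns the result into a pseudo label. All numbers and names are hypothetical.

```python
def combine_predictions(source_probs, similarities):
    """Similarity-weighted average of per-source class probabilities."""
    total = sum(similarities)
    weights = [s / total for s in similarities]
    n_classes = len(source_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, source_probs))
        for c in range(n_classes)
    ]

# Hypothetical outputs of three source-domain models for one target example,
# given as [P(negative), P(positive)]
source_probs = [[0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]

# Hypothetical similarity of each source domain to the target domain
similarities = [0.5, 0.1, 0.4]

combined = combine_predictions(source_probs, similarities)

# The most probable class becomes the pseudo label for this target example
pseudo_label = max(range(len(combined)), key=lambda c: combined[c])
print(combined, pseudo_label)
```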
Let’s discuss the three types of alignments used to learn cross-lingual word embeddings:
- Word-level alignment: This approach uses dictionaries containing word pairs in different languages. This is the most commonly used approach and can also use other modalities like images.
- Sentence-level alignment: This approach uses sentence pairs similar to those used for building machine translation systems. They typically come from the Europarl corpus, a sentence-aligned corpus of proceedings of the European Parliament.
- Document-level alignment: This approach requires parallel documents which have aligned translated sentences. As it’s rare to get such documents, comparable documents are used more often. Such data can be created by gathering Wikipedia articles on the same topic in different languages.
Usually, we train the model for only one task. But by doing so, we can lose out on information that could help the model perform better. If we train the model for multiple tasks, it might be able to generalize better by sharing representations across all tasks.
Multi-task learning (MTL) is also known as joint learning and whenever we try to optimize more than one loss function we are practically doing MTL.
“MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks” (Caruana, 1998)
The beauty of multi-task learning comes from using the same parameters for different tasks which brings us to the concept of hard and soft parameter sharing.
Hard parameter sharing
This is the most commonly used MTL method. Here the hidden layers are shared between all the tasks while the task-specific layers are kept separate, as shown in the following figure:
Baxter showed that hard parameter sharing reduces the risk of overfitting the shared parameters by an order of T, the number of tasks, because the shared parameters have to learn representations that generalise across many tasks.
Soft parameter sharing
In this approach, each task has its own model and parameters. The distance between the parameters of the model is then regularized in order to encourage the parameters to be similar as shown in the following figure:
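A minimal sketch of soft parameter sharing as a loss term, assuming a squared L2 distance between the two tasks' parameters and a hypothetical regularisation strength. Other distance measures can be used in the same place.

```python
def soft_sharing_penalty(params_a, params_b):
    """Squared L2 distance between the parameter vectors of two task models."""
    return sum((a - b) ** 2 for a, b in zip(params_a, params_b))

def total_loss(loss_a, loss_b, params_a, params_b, strength=0.1):
    """Sum of both task losses plus the penalty that pulls the parameters together.

    `strength` is a hypothetical hyperparameter: 0 means fully independent models,
    larger values push the two models towards identical parameters.
    """
    return loss_a + loss_b + strength * soft_sharing_penalty(params_a, params_b)

# Toy parameters and losses for two tasks
params_task_a = [0.0, 0.0]
params_task_b = [3.0, 4.0]
print(total_loss(1.0, 2.0, params_task_a, params_task_b))
```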
Why does multi-task learning work?
It makes intuitive sense for MTL to give us a superior model, and the reasoning behind it can be understood through the following advantages:
- Implicit data augmentation: Effectively, MTL increases the training data for our model. As all tasks have noisy data, the model has to learn a representation which ignores the data-dependent noise. As different data have different noise patterns, the model has to learn a general representation which works out best for all tasks. The joint learning averages out the noise patterns and leads to a better representation.
- Attention focusing: It can be very difficult to train a model if the data is very noisy, high-dimensional or limited. Training on multiple tasks can teach the model to focus on the most relevant features and can lead to a better model.
- Representation bias: MTL forces the model to learn representations which are useful for all tasks. This helps the model to generalize faster for all tasks in the future as the representation which works for many tasks will also work for a new one.
- Regularization: MTL acts as a regularizer (reduces over-fitting) by introducing inductive bias and reduces Rademacher complexity of the model, which is its ability to fit random noise [Søgaard and Goldberg, 2016].
MTL is used in situations where we want predictions of multiple tasks at once. Let’s go through the considerations to be taken for best interaction between main and auxiliary tasks.
As the shared layers can be adversely affected by auxiliary tasks, we need to consider which layers are actually worth sharing. Søgaard and Goldberg found that when the auxiliary tasks are low-level ones like named entity recognition (NER) or part-of-speech (POS) tagging, it makes sense to share the lower layers. Based on this, Hashimoto et al. built a hierarchical architecture consisting of several tasks for joint modelling.
Sanh et al., 2019, proposed a hierarchical architecture for semantic tasks. The model is trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low-level tasks at the bottom layers of the model and more complex tasks at the top layers of the model as shown in Fig 1.7. The tasks share common embeddings and encoders allowing an easy information flow from the lowest level to the top of the architecture.
The model achieves state-of-the-art results on the tasks of Named Entity Recognition, Entity Mention Detection and Relation Extraction and competitive results on Coreference Resolution while using simpler training and regularization procedures than previous works.
In MTL, normally, the batches are sampled uniformly from tasks. As the optimizer tries to minimize the weighted sum of loss during the training, it becomes important to find good weights. These weights can be tuned on a validation set like any other hyper-parameter.
A common approach is to give equal weight but a more sophisticated approach can be taken where the weight itself can be learnt as shown by Kendall et al., 2018.
Usually, different tasks have a different number of samples, and the optimizer will favour the task with the most samples.
To work around this problem, we can sample each task’s examples with a probability inversely proportional to the size of that task’s dataset, so that we get an equal number of training samples from each task. We can also sample more from the main task to give it more importance.
Adjusting the sampling ratio of different tasks has the same effect as assigning different weights.
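The sampling scheme can be sketched as follows, assuming hypothetical dataset sizes and a hypothetical `boost` argument for favouring the main task. Each example is weighted by the inverse of its task's size, so every task ends up with the same share of the training batches.

```python
import random

def task_probabilities(task_sizes, boost=None):
    """Probability that the next training example is drawn from each task.

    Each example is weighted by 1 / (size of its task), so every task carries
    the same total probability mass; `boost` optionally scales a task up.
    """
    boost = boost or {}
    mass = {
        task: n * (1.0 / n) * boost.get(task, 1.0)
        for task, n in task_sizes.items()
    }
    total = sum(mass.values())
    return {task: m / total for task, m in mass.items()}

# Hypothetical dataset sizes for a main task and two auxiliary tasks
task_sizes = {"sentiment": 10_000, "pos_tagging": 50_000, "ner": 2_000}

balanced = task_probabilities(task_sizes)
main_heavy = task_probabilities(task_sizes, boost={"sentiment": 2.0})

# Pick the task that supplies each of the next few batches
random.seed(0)
tasks = list(balanced)
schedule = random.choices(tasks, weights=[balanced[t] for t in tasks], k=8)

print(balanced)
print(main_heavy)
print(schedule)
```

Doubling the boost of the main task here plays the same role as doubling its weight in the summed loss.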
Auxiliary task selection
The fundamental assumption behind using an auxiliary task is that it will be related to the main task and can help the main task. The relatedness of a task can be found in many ways. One of the ways is to know if main and auxiliary tasks use the same features (low-level information) for prediction.
Xue et al. argued that two similar tasks will share similar classification boundaries. The common types of auxiliary tasks are:
- Statistical: These tasks try to predict low-level information about the input data itself such as log frequency of a word.
- Selective unsupervised: These tasks selectively try to predict a certain part of the input data. For sentiment analysis, Yu and Jiang predict whether the sentence contains a positive or negative domain-independent sentiment word, which sensitizes the model towards the sentiment of the words in the sentence.
- Supervised tasks: This is the most common use case, where the auxiliary task is itself a supervised task. Zhang et al. used head pose estimation and facial attribute inference as auxiliary tasks for facial landmark detection; Liu et al. jointly learn query classification and web search.
- Unsupervised tasks: The auxiliary tasks discussed so far are similar to the original task and learn representations which are common to both. But we can also train a model with an unsupervised task to induce general-purpose representation such as language modelling.
Related tasks in NLP
Sequential transfer learning
As the name implies, sequential transfer learning (STL) involves transferring knowledge in a sequence of steps, where the source and target task are not necessarily similar. Unlike MTL, where the tasks are learnt jointly, STL consists of two stages. In the first phase, pretraining, the model is trained on source data, and in the second phase, adaptation, the source model is adapted to the target task.
The pretraining task is usually costly but is only performed once. The adaptation task is usually faster as it acts like a fine-tuning step.
STL is useful in these three scenarios:
- Source and target task data is not available at the same time
- Source task has more data than the target task
- Adaptation to many target tasks is required
STL looks similar to MTL but is very different in the way knowledge transfer takes place. In MTL, the source and target tasks are trained together, while in STL the source task is trained first and the target task afterwards:
To get the maximum benefit, we want a source task whose training will benefit many target tasks. It’s difficult to find such a task in practice, but pretraining generally turns out better than starting from scratch. Now, let’s discuss source training, which can be accomplished in three ways:
- Distant supervision
Distant supervision uses data obtained from heuristics and domain expertise. Such data is often noisy and obtained using predefined patterns. Felbo et al. used distant supervision to predict a large number of emojis on more than a billion tweets. They then applied their pre-trained model not only to sentiment analysis but also to emotion and sarcasm detection, demonstrating that a specialized pretraining task can be useful for an array of related target tasks.
- Traditional supervision
Traditional supervision requires manually labelled training data. This method can leverage many commonly available datasets, although data from a task close to the target is preferable. Zoph et al. train a machine translation model on a high-resource language pair and then transfer this model to a low-resource language pair. Yang et al. [2017a] pre-train a POS tagging model and apply it to word segmentation.
Nowadays researchers try to choose a task which requires basic understanding of the language. Such tasks include predicting the meaning of a word and image captioning. While it’s tempting to go for a large dataset to get maximum knowledge, the value of the pre-trained model depends on the similarity of source and target domain and task.
- No supervision
Unsupervised pretraining is the easiest way to train the source model as it only requires access to a large amount of unlabelled text. In NLP it usually takes the form of language modelling. Compared to supervised learning, it is a much more scalable approach, as text for almost any domain is easily available. This approach captures much more general knowledge about the language than supervised learning, which captures only those features required for the task.
Various approaches have been tried to learn these representations, including Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Skip-gram with negative sampling (SGNS), Global vectors (GloVe), Skip-thoughts, ELMo and BERT.
- Multi-task pretraining
To leverage the advantages of the preceding three methods we can also use MTL, where all tasks are trained jointly. MTL can help these representations generalise and make them useful for different downstream tasks. Subramanian et al. perform multi-task pretraining on skip-thoughts, machine translation, constituency parsing, and natural language inference.
We have now covered the first step of STL and turn to the second step, adaptation. Currently, there are two approaches to using a pre-trained model for the target task: feature extraction and fine-tuning. Feature extraction takes the representations of a pre-trained model and feeds them to another model, while fine-tuning involves training the pre-trained model on the target task.
- Feature extraction: In feature extraction, the model weights are frozen and the output is sent directly to another model. The features can either be sent to a fully connected model or be used to train a classical model like a Support Vector Machine (SVM) or a random forest. The benefit is that the task-specific model can be used again for similar data. Also, if the same data is used repeatedly, extracting the features once can save a lot of computing resources.
- Fine-tuning: In fine-tuning, as the name implies, the weights are kept trainable and are fine-tuned for the target task. The pre-trained model thus acts as a starting point, leading to faster convergence compared to random initialization.
Fine-tuning embeddings is generally found to perform better than feature extraction. Its shortcoming is that only the words appearing in the training data get updated embeddings, while the embeddings of unseen words go stale.
This can affect performance when the training set is too small or the test set contains a lot of out-of-vocabulary (OOV) words. To deal with OOV words, most researchers nowadays use models that build representations from characters or subwords, such as ELMo and BERT.
While feature extraction and fine-tuning may look like two different approaches, they can be brought into a common framework. Let the pre-trained source model be defined in terms of parameters θs and Ls layers, and let the target parameters and layers be θt and Lt. Then the parameters of the adapted model are θA = θs ∪ θt with LA = Ls + Lt layers, where the source and target layers occupy the intervals [1, Ls] and (Ls, LA] respectively.
The main parameter in the adaptation process is the learning rate η, which can differ between layers: the initial layers are general and do not require many changes, while the last layers are task-specific and require more.
η can also change during training if a schedule is used. The learning rate during adaptation is generally kept lower than during pre-training, to keep the weights from changing too much. Let ηt(l) thus be the learning rate of the adapted model’s lth layer at iteration t. In this framework, feature extraction and fine-tuning can be defined as follows:
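- Feature extraction corresponds to the case where every source layer stays frozen throughout adaptation:
ηt(l) = 0, ∀l ∈ [1, Ls], ∀t
where ∀l means for every layer.
- Fine-tuning, on the other hand, requires updating at least one of the source layers during adaptation:
∃l ∈ [1, Ls], ∃t : ηt(l) > 0
where ∃l means there exists an l.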
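One way to read this framework in code: feature extraction keeps the learning rate of every source layer at zero for the whole run, while fine-tuning gives at least one source layer a non-zero rate at some iteration. The function name and the nested-list layout of the schedule below are assumptions for illustration.

```python
def adaptation_mode(lr_schedule, n_source_layers):
    """Classify an adaptation run from its per-layer learning rates.

    lr_schedule[t][l] is the learning rate of layer l (0-based) at iteration t.
    The first n_source_layers entries are the pre-trained source layers and the
    remaining entries are the newly added target layers.
    """
    source_updated = any(
        rates[layer] > 0
        for rates in lr_schedule
        for layer in range(n_source_layers)
    )
    return "fine-tuning" if source_updated else "feature extraction"

# Two source layers followed by one target layer, over three iterations
frozen_run = [[0.0, 0.0, 1e-3], [0.0, 0.0, 1e-3], [0.0, 0.0, 1e-3]]
thawed_run = [[0.0, 0.0, 1e-3], [0.0, 1e-4, 1e-3], [1e-5, 1e-4, 1e-3]]

print(adaptation_mode(frozen_run, n_source_layers=2))
print(adaptation_mode(thawed_run, n_source_layers=2))
```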
The source layers can be trained in a fashion where only the last layers are trained (Long et al., 2015a). We can also use an unfreezing schedule over the layers, such as chain-thaw (Felbo et al., 2017). Howard and Ruder, 2018, experimented with a gradual unfreezing schedule and got great results with their ULMFiT. Later, Peters et al., 2019, found that the relative performance of fine-tuning vs. feature extraction of language models depends on the similarity of the pretraining and target tasks. We will discuss this later in the section To tune or not to tune.
Adapting pretrained representations
Although MTL has become quite common, STL is the most popular technique at present. STL allows us to adapt pre-trained representations to a new task in a few steps and is also computationally less costly than MTL.
Universal language model fine-tuning (ULMFiT)
Inductive transfer learning has played a great role in computer vision but was long unsuccessful when applied to NLP. Howard and Ruder found that the problem lay not in the idea of language model (LM) fine-tuning but in how it was approached. Since LMs are considerably shallower than computer vision (CV) models, they require a different kind of approach. They proposed ULMFiT, which uses discriminative fine-tuning (‘Discr’) and slanted triangular learning rates (STLR) to learn task-specific features. The classifier is fine-tuned on the target task using gradual unfreezing and STLR to preserve low-level representations and adapt high-level ones. Let’s discuss this in detail:
- Discriminative fine-tuning: As we know, the top layers are task-specific and lower layers capture general representations. We need smaller learning rates for lower layers as high learning rates change weights quickly and lead to catastrophic forgetting. Also, we would like to train the model as fast as possible. To work with these constraints we need a different learning rate for different layers such that it decreases as we go from top to bottom.
The input layer E — embedding layers, L — hidden layers with different learning rates and T — final layer.
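A minimal sketch of discriminative fine-tuning, assuming the rule suggested in the ULMFiT paper: choose the learning rate of the top layer and divide by 2.6 for each layer below it. The function name and the layer count are hypothetical.

```python
def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Learning rate per layer: index 0 is the lowest layer, the last index the top.

    Each layer gets the rate of the layer above it divided by `decay`.
    """
    return [top_lr / decay ** (n_layers - 1 - layer) for layer in range(n_layers)]

# Example: embedding layer, three hidden layers and a final layer
rates = discriminative_lrs(top_lr=0.01, n_layers=5)
for layer, rate in enumerate(rates):
    print(f"layer {layer}: lr = {rate:.6f}")
```

In a framework such as PyTorch, these values would typically be passed to the optimizer as one parameter group per layer.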
- Gradual unfreezing: It was found empirically that training all layers at the same time on data of a different distribution and task may lead to instability and poor solutions. Hence layers need to be trained individually to give them time to adapt to the new task and data. In light of this, Long et al., ICML 2015 proposed freezing all layers except the top one. Felbo et al., EMNLP 2017 came up with a method called chain-thaw, which trains one layer at a time and then trains all layers together.
Recently Chronopoulou et al. proposed to fine-tune additional parameters for n epochs, pre-trained parameters without embedding layer for k epochs and then train all layers until convergence. ULMFiT proposed gradual unfreezing from top to bottom as shown in Fig 1.11.
First the last layer is unfrozen and gradually other layers are unfrozen to avoid catastrophic forgetting.
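Gradual unfreezing can be sketched as a schedule that says which layers are trainable in each epoch. The sketch assumes one additional layer is unfrozen per epoch, starting from the top, and the function name is hypothetical.

```python
def trainable_layers(epoch, n_layers):
    """Indices of the unfrozen layers at a given epoch (both 0-based).

    Epoch 0 trains only the top layer; each later epoch unfreezes the next
    lower layer until the whole model is trainable.
    """
    n_unfrozen = min(epoch + 1, n_layers)
    return list(range(n_layers - n_unfrozen, n_layers))

for epoch in range(5):
    print(f"epoch {epoch}: training layers {trainable_layers(epoch, n_layers=4)}")
```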
Main idea — Use appropriate learning rate to avoid over-writing useful information:
- Lower layers — capture general information
- Early in training — model still needs to adapt to target distribution
- Late in training — model is close to convergence
- Slanted triangular learning rates: Now that we know that we need a different learning rate for each layer, we need to find a suitable learning rate (LR) for every layer. Using the same LR or an annealed learning rate throughout training is not the best way to achieve this behaviour. Fig 1.12 shows the behaviours we get for different learning rates:
Smith, L. N. proposed cyclical learning rates (CLR), which give us a way to find an LR that converges quickly without being too slow or too noisy. To find the lowest and highest learning rates, train for a few mini-batches while increasing the learning rate. Note the rate at which the loss starts increasing: that is the maximum LR you can afford.
Fig 1.13 shows how you can train the model with a triangular rate schedule where it increases and decreases periodically:
You can have variations where the maximum rate keeps decreasing, since we need a higher rate initially and a lower rate later to converge to a minimum.
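A sketch of the triangular schedule, following the formulation in Smith's CLR paper. The `halve_each_cycle` flag corresponds to the variant in which the peak shrinks from cycle to cycle, and the concrete rates and step size are hypothetical.

```python
import math

def triangular_lr(iteration, step_size, base_lr, max_lr, halve_each_cycle=False):
    """Cyclical learning rate: linear ramp from base_lr to max_lr and back.

    step_size is the number of iterations in half a cycle. With
    halve_each_cycle=True the amplitude is halved after every full cycle.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    scale = 1 / (2 ** (cycle - 1)) if halve_each_cycle else 1.0
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * scale

for it in range(0, 17, 2):
    plain = triangular_lr(it, step_size=4, base_lr=0.001, max_lr=0.01)
    shrinking = triangular_lr(it, 4, 0.001, 0.01, halve_each_cycle=True)
    print(f"iteration {it:2d}: {plain:.5f}  {shrinking:.5f}")
```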
ULMFiT built on this idea and came up with the slanted triangular learning rate (STLR), which first linearly increases the learning rate and then linearly decays it. As shown in Fig 1.14, the ramp-up is faster than the ramp-down, which makes this even faster than CLR:
By allowing the learning rate to increase at times, we can jump out of sharp minima, which would temporarily increase our loss but may ultimately lead to convergence on a more desirable minimum. Additionally, increasing the learning rate can also allow for more rapid traversal of saddle point plateaus.
ULMFiT uses the then state-of-the-art language model AWD-LSTM (ASGD Weight-Dropped LSTM) [Merity et al., 2017a], a regular LSTM (with no attention, short-cut connections, or other sophisticated additions) with various tuned dropout hyperparameters. The following table shows the number of samples in different datasets:
We see that TREC-6 and IMDB have relatively few samples for training.
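The STLR schedule can be written down directly from the formula in the ULMFiT paper. The default values below (`cut_frac=0.1`, `ratio=32`, `max_lr=0.01`) are the ones reported there, the function name is an assumption, and the sketch assumes the run is long enough for the warm-up to last at least one iteration.

```python
import math

def stlr(t, total_iters, max_lr=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t.

    cut_frac is the fraction of iterations spent increasing the rate, and
    ratio is how much smaller the lowest rate is than max_lr.
    Assumes total_iters * cut_frac >= 1.
    """
    cut = math.floor(total_iters * cut_frac)
    if t < cut:
        p = t / cut                                     # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # long linear decay
    return max_lr * (1 + p * (ratio - 1)) / ratio

for t in (0, 50, 100, 550, 1000):
    print(f"iteration {t:4d}: lr = {stlr(t, total_iters=1000):.6f}")
```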
- General-domain LM pretraining: To capture general features of the language in different layers. AWD-LSTM outperforms a vanilla LSTM language model due to its superior techniques.
- Target task LM fine-tuning: The full LM is fine-tuned on target task data using discriminative fine-tuning (‘Discr’) and STLR to learn task-specific features. Fine-tuning the LM on task data clearly helps, especially when the target dataset is small, as with TREC-6.
- Target task classifier fine-tuning: The classifier is fine-tuned using gradual unfreezing with discriminative learning rates and STLR to preserve low-level representations and adapt high-level ones.
The learning rate schedule combined with discriminative fine-tuning not only gives a further, smaller boost in accuracy but also requires fewer epochs.
The following abbreviations are used for classifier fine-tuning:
- Full — fine-tuning the full model
- Last — only fine-tuning the last layer
- Freez — gradual unfreezing
- Cos — aggressive cosine annealing schedule for triangular learning rate
Sequential Transfer Learning with fastai’s ULMFiT
We have now gone through the literature on transfer learning. Now let’s try an example of sequential transfer learning with Howard’s fastai library:
Import the library (the code below uses the fastai v1 API):
from fastai.text import *
Use a sample of the IMDB movie review dataset for training the model. The fastai library has built-in methods for downloading and loading the data:
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df.head()
The data consists of the true label, text and is_valid column which states whether the row will be used for validation.
Read the data which will be used for the LM (language model) fine tuning. The complete review text will be used for the task LM fine-tuning:
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
Define the data loader for the classifier, which generates batches of text for training and reuses the vocabulary of the LM data:
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab, bs=32)
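Fine-tune the LM for one cycle. fit_one_cycle follows fastai’s one-cycle learning rate policy, which ramps the learning rate up and then back down, much like the slanted triangular learning rates (STLR) discussed earlier:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)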
This will give the following accuracy result:
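Unfreeze all layers, fit for one more cycle, and save the encoder so that the classifier can load it later:
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)
learn.save_encoder('ft_enc')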
As you can see, the accuracy of LM has improved:
Now, to build the classifier, define the text classifier model using the existing AWD_LSTM model:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')
Train the model for one cycle:
This will give the following accuracy result:
Train for one more cycle with only the last two layer groups unfrozen. Passing slice(start, end) sets the learning rates of the first and last layer groups, and the rates of the groups in between are geometrically spaced:
learn.freeze_to(-2)  # freeze everything except the last 2 layer groups
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))
This will give the following accuracy result:
We can improve the result by training for more cycles. Finally, we test the model on a sample text.
learn.predict("This was a great movie!")
(Category positive, tensor(1), tensor([0.0049, 0.9951]))
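Looking back at the slice(5e-3/2., 5e-3) argument used in the last training step, the geometric spacing of learning rates across layer groups can be sketched in a few lines. This is our own illustration of the idea, not fastai's code. The function name and the number of layer groups are hypothetical.

```python
def spread_learning_rates(start, end, n_groups):
    """Geometrically spaced learning rates, one per layer group.

    The first (lowest) group gets start, the last group gets end, and each
    group's rate is a constant multiple of the one below it.
    """
    if n_groups == 1:
        return [end]
    step = (end / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]


if __name__ == "__main__":
    # Hypothetical model with three layer groups.
    print(spread_learning_rates(5e-3 / 2., 5e-3, 3))
```

With a start rate of half the end rate, the lower groups are updated more cautiously than the classifier head, which is the same intuition as the discriminative learning rates discussed earlier.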
Now that you have seen how transfer learning works in practice, let us discuss in the next section whether or not to tune the model during transfer.
To tune or not to tune 😆
As we discussed earlier, transfer learning can broadly be done in two ways:
- feature extraction
- fine-tuning
In feature extraction (EX), we take representations from the frozen pre-trained model and pass them to the task model. In fine-tuning (FT), we keep all the layers of the model unfrozen and train it on the task.
EX has the advantage that the features are generated once and can then be tried with different models, which saves valuable compute on retraining and experimentation. FT, on the other hand, is great for adapting one model that can be reused for many different tasks, and it makes our work easier since we don’t have to experiment with downstream model variations.
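The difference between the two modes comes down to which parameters receive gradient updates. The toy sketch below uses a one-weight "encoder" and a one-weight "head" trained with plain gradient descent. Everything in it (the function name, the data, the starting weights) is made up for illustration and says nothing about how ELMo or BERT behave.

```python
def train(w_enc, w_head, data, lr=0.01, epochs=50, freeze_encoder=True):
    """Fit y ~ w_head * (w_enc * x) by gradient descent on squared error.

    freeze_encoder=True mimics feature extraction (only the head learns);
    freeze_encoder=False mimics fine-tuning (encoder and head both learn).
    """
    for _ in range(epochs):
        for x, y in data:
            feature = w_enc * x  # representation from the "pre-trained" encoder
            error = w_head * feature - y
            grad_head = 2 * error * feature
            grad_enc = 2 * error * w_head * x
            w_head -= lr * grad_head
            if not freeze_encoder:
                w_enc -= lr * grad_enc
    return w_enc, w_head


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy target: y = 2x
    print("EX:", train(1.0, 0.5, data, freeze_encoder=True))
    print("FT:", train(1.0, 0.5, data, freeze_encoder=False))
```

In the EX run the encoder weight never changes, so its features could be computed once and cached. In the FT run both weights move, so the features have to be recomputed whenever the encoder is updated.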
Peters et al. (2019) analysed both approaches and came up with practical advice. They compared two state-of-the-art pre-trained models, ELMo [Peters et al., 2018a] and BERT [Devlin et al., 2018], using both EX and FT across seven diverse tasks.
They find that both approaches achieve similar performance most of the time, but fine-tuning performs better when source and target tasks are similar, while feature extraction performs better when the source and target tasks are distant.
As we can see in the table, ELMo and BERT behave differently when it comes to EX vs. FT. With ELMo, FT always performs worse than EX, while with BERT, FT is better than EX.
One hypothesis for the superior performance of BERT on the similarity tasks is architectural. ELMo uses an LSTM, which works sequentially and considers one token at a time, while BERT, which is a stack of transformer layers with an attention mechanism, considers the whole sequence at once. This helps BERT encode the sequence-pair interaction better than ELMo.
To mitigate catastrophic forgetting, the authors also experimented with gradual unfreezing of the pre-trained layers. They observed that the model deteriorates as soon as the lower layers start training, even when the learning rates are controlled for a smoother transfer.
We have reached a point in time where we cannot go back to not using transfer learning.
As NLP gains more traction and is applied to new problems, it will become crucial for us to find ways to leverage data from other domains, tasks and languages.
In this chapter, we covered various transfer learning techniques, namely domain adaptation, cross-lingual learning, multi-task learning and sequential transfer learning, which can help us build better machine learning models with less time and fewer resources. We also looked at an example of sequential transfer learning and then discussed whether or not to tune our model during transfer learning.
Neural Transfer Learning for Natural Language Processing by Sebastian Ruder
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
Xia, R., Zong, C., Hu, X., and Cambria, E. (2015). Feature Ensemble plus Sample Selection: A Comprehensive Approach to Domain Adaptation for Sentiment Classification. Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015).
Ruder, S., Ghaffari, P., and Breslin, J. G. (2017b). Knowledge Adaptation: Teaching to Adapt. In arXiv preprint arXiv:1702.02052.
Caruana, R. (1998). Multitask Learning. Autonomous Agents and Multi-Agent Systems, 27(1):95–133.
Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28:7–39.
Søgaard, A. and Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 231–235.
Hashimoto, K., Xiong, C., Tsuruoka, Y., and Socher, R. (2017). A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. In Proceedings of EMNLP.
Sanh, V., Wolf, T., and Ruder, S. (2019). A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks. In Proceedings of AAAI 2019.
Kendall, A., Gal, Y., and Cipolla, R. (2018). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of CVPR 2018.
Xue, Y., Liao, X., Carin, L., and Krishnapuram, B. (2007). Multi-Task Learning for Classification with Dirichlet Process Priors. Journal of Machine Learning Research, 8:35–63.
Zhang, Z., Luo, P., Loy, C. C., and Tang, X. (2014). Facial Landmark Detection by Deep Multi-task Learning. In European Conference on Computer Vision, pages 94–108.
Liu, X., Gao, J., He, X., Deng, L., Duh, K., and Wang, Y.-Y. (2015). Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. NAACL-2015, pages 912–921.
Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., and Lehmann, S. (2017). Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of EMNLP.
Zoph, B., Yuret, D., May, J., and Knight, K. (2016). Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of EMNLP 2016.
Yang, J., Zhang, Y., and Dong, F. (2017a). Neural Word Segmentation with Rich Pretraining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Subramanian, S., Trischler, A., Bengio, Y., and Pal, C. J. (2018). Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. In Proceedings of ICLR 2018.
Peters, M., Ruder, S., and Smith, N. A. (2019). To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. arXiv preprint arXiv:1903.05987.
Howard, J. and Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL 2018.
Smith, L. N. (2017). Cyclical learning rates for training neural networks. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 464–472. IEEE.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.