Natural language processing is a powerful tool, but in the real world we often come across tasks that suffer from a shortage of data and poor model generalisation. Transfer learning addresses this problem by allowing us to take a model pre-trained on one task and reuse it for others.
Today, transfer learning is at the heart of language models like Embeddings from Language Models (ELMo) and Bidirectional Encoder Representations from Transformers (BERT), which can be adapted to a wide range of downstream tasks. In this article, we will look at the different types of transfer learning and how they can be used to transfer knowledge to a different task, language or domain.
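This article covers transfer learning and its main types (domain adaptation, cross-lingual learning, multi-task learning and sequential transfer learning), followed by a hands-on ULMFiT example with fastai and a discussion of when to fine-tune.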
Transfer learning
At present we have achieved fantastic results on many tasks like speech recognition, machine comprehension, object detection, and machine translation thanks to massive data-collection efforts. The huge models behind these results are extremely data-hungry and require an immense amount of labelled data.
There is often a large gap between the data available for training and the target application. For example, the training data for a speech recognition system may come from speakers in the USA while the target region is India.
In the last few years, we have learnt the art of building great machine learning models for audio, image, and text data when large amounts of data are available. Unfortunately, every time we train a model on one dataset and try to use it on a different one, the performance deteriorates. This happens because the model fails to generalize: it has not captured the underlying patterns in the data, so even a slight difference can catch it off guard.
This problem is important to tackle because real-world data is ever-changing and it is impractical to keep retraining models from scratch for every new scenario. This is where transfer learning comes to the rescue.
Suppose we have a sentiment analysis task for the domain of Indian stock news. We have enough labelled data for this supervised task and train a model for it. Now when we apply the model to the task in the same domain, it will probably behave as expected. But, as soon as we apply it for the same task in another domain, such as cryptocurrencies, it will behave unexpectedly.
If we have to use the same paradigm of supervised learning, we need to collect labelled data for cryptocurrency news and train a new model. Hopefully, it will now perform well on cryptocurrency news, although who can predict cryptos! 😂
But what if we don’t have enough labelled data or cannot afford to collect it? Doesn’t it make sense to leverage the older model and train it on the small amount of labelled data we have for cryptos?
And what if we want to add another prediction class to the model, such as neutral, while the earlier model was just trained to predict positive and negative? Transfer learning allows us to deal with these scenarios and use knowledge learned from a previous task/domain for a new one.
Let’s give a formal definition of transfer learning. Given a source domain Ds with a corresponding source task Ts, and a target domain Dt with a target task Tt, the objective of transfer learning is to learn the target conditional probability distribution P(Yt|Xt) in Dt using the information gained from Ds, where Ds ≠ Dt or Ts ≠ Tt.
In most cases, only a limited number of labelled target examples is assumed to be available, far fewer than the number of labelled source examples. The following figure illustrates the process of transfer learning:
Types of transfer learning
- Domain adaptation
- Cross-lingual learning
- Multi-task learning
- Sequential transfer learning
Depending on how the domains and tasks differ, transfer learning has to handle several scenarios:
- Xs ≠ Xt: The feature spaces of source and target are different. For example, the domain-specific vocabulary of stocks differs from that of cryptos: the term initial coin offering (ICO), which is specific to cryptos, will never occur in the context of stocks. If the languages differ, there can be a complete mismatch of feature spaces. This scenario is referred to as cross-lingual learning or cross-lingual adaptation.
- Ps(Xs) ≠ Pt(Xt): The marginal probability distribution of words is different for source and target. The word ledger will be used more frequently for cryptos while it will be used rarely for stocks. This scenario is generally known as domain adaptation.
- Ys ≠ Yt: Labels differ for source and target. Source had positive and negative labels while the target has neutral as well.
- Ps(Ys) ≠ Pt(Yt): The marginal probability distribution of labels is different for source and target. Positive labels occur more than negative in training but the market crashed and the real data has more negative labels than positive.
- Ps(Ys|Xs) ≠ Pt(Yt|Xt): The conditional probability distribution of labels is different. This can happen if the same words mean different things in the two domains or if the data imbalance differs between source and target. For example, cold storage means something totally different in the world of cryptos.
We now adopt the taxonomy of Pan and Yang [2010], who divide transfer learning mainly into transductive and inductive settings. These are further divided into domain adaptation, cross-lingual learning, multi-task learning and sequential transfer learning.
Domain adaptation
This is the most common scenario in industry: we want to take a model trained on a task in one domain and use it in another domain. Domain adaptation can be done with either no or minimal labelled data for the target. Let us discuss the available approaches.
Representation approaches
Representation approaches try to change the underlying distribution of the data, either by finding features that are common to both domains or by representing both datasets in a shared low-dimensional space:
- Distribution similarity approaches: The reasoning behind distribution similarity approaches is to make the source and target data distributions similar. A naive way to achieve this is to ignore features that do not occur in the target. Most of these approaches rely on a similarity metric such as:
- Kullback-Leibler (KL) divergence
- Jensen-Shannon (JS) divergence
- Wasserstein distance
For distributional similarity approaches, a common strategy is to use a representation that minimizes the distance between the representations of the two domains, while at the same time maximizing the performance on the source domain data.
- Latent feature learning: Latent feature learning methods try to represent the data in a lower-dimensional feature space in which the source and target data are more similar. The lower-dimensional space can be learnt by either a factorisation algorithm like singular value decomposition (SVD) or a neural network autoencoder.
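To make the idea of a distribution similarity metric concrete, here is a minimal sketch that compares the unigram distributions of two domains with the Jensen-Shannon divergence. The two toy corpora, the add-one smoothing and all function names are illustrative assumptions, not something prescribed by the approaches above.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab):
    """Relative frequency of each vocabulary word, with add-one smoothing."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return [(counts[w] + 1) / total for w in vocab]

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q has no zero entries."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and within [0, 1] when using log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy, made-up corpora standing in for the two domains (assumption).
stocks = "shares dividend ledger market shares earnings market".split()
cryptos = "ledger wallet ico ledger token market ledger".split()
vocab = sorted(set(stocks) | set(cryptos))

p_s = unigram_dist(stocks, vocab)
p_t = unigram_dist(cryptos, vocab)
print(f"JS(stocks, cryptos) = {js_divergence(p_s, p_t):.3f}")
print(f"JS(stocks, stocks)  = {js_divergence(p_s, p_s):.3f}")
```

A representation that drives this kind of distance towards zero, while still performing well on the source task, is what the distribution similarity approaches aim for.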
Weighting and selecting data
In this methodology, we weight or select instances, rather than features, so that the source data better represents the target data. Instance weighting can be seen as soft selection, while instance selection can be treated as hard selection. Instance selection is more efficient and also allows us to discard examples that could be harmful.
Xia et al. 2015 used PCA to represent the data and then selected instances based on a distance metric in PCA space. Ruder et al. 2017b used maximum cluster difference (MCD) to define the similarity of an instance to the class. In both cases, weighting and selection are used as pre-processing steps to pick the most useful examples for the NLP pipeline.
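The difference between soft and hard selection can be sketched in a few lines. The example below is not the method of Xia et al. or Ruder et al.; it simply assumes that instances are already feature vectors and uses Euclidean distance to the target centroid as a stand-in similarity measure. All names and data are hypothetical.

```python
import math

def centroid(rows):
    """Mean vector of a list of equally sized feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def instance_weights(source, target):
    """Soft selection: weight each source instance by closeness to the target centroid."""
    c = centroid(target)
    scores = [math.exp(-distance(x, c)) for x in source]
    total = sum(scores)
    return [s / total for s in scores]

def select_instances(source, target, k):
    """Hard selection: indices of the k source instances closest to the target centroid."""
    c = centroid(target)
    ranked = sorted(range(len(source)), key=lambda i: distance(source[i], c))
    return sorted(ranked[:k])

# Toy feature vectors (assumption): the third source instance is far from the target.
source = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
target = [[1.0, 1.0], [1.2, 0.8]]

weights = instance_weights(source, target)
kept = select_instances(source, target, k=2)
print("weights:", [round(w, 3) for w in weights])
print("kept indices:", kept)
```

With weighting, the distant instance still contributes a little; with selection, it is dropped entirely.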
Self-labelling approaches
Self-labelling approaches belong to the category of semi-supervised learning where we train a model on labelled data and then use it to assign pseudo labels to unlabelled examples. Then, these examples are used to train the model again. Self-labelling approaches follow either of the two training types:
- Self-training: In this approach, we add only those pseudo-labelled examples that the model predicts with high confidence to the training data. The main downside is that the model cannot correct its own mistakes, so errors can be amplified over time. A weighting scheme can also be used to give different weights to actual labels and pseudo labels.
- Multi-view training: In multi-view training, we train different models on different views of the data. The views can differ in the features used, the model architecture or the training data. This can be accomplished in many ways:
- Co-training: After training the models on their own feature sets, only those instances on which one model is confident are moved to the training set of the others. In this way a model obtains labelled data for instances it is uncertain about.
- Democratic co-learning: This method uses models with different inductive biases, obtained from either different neural network architectures or different learning algorithms.
- Tri-training: This is similar to democratic co-learning, but uses three models with their own inductive biases, trained on different variations of the original training data obtained by bootstrap sampling. After they are trained, an unlabelled example is added to the training set if any two models agree on its predicted label.
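The agreement rule used in tri-training can be sketched as follows. This assumes, as in the original tri-training algorithm, that an example on which two models agree is added to the training set of the third model; the function name and the toy predictions are hypothetical.

```python
def tri_training_pseudo_labels(predictions):
    """Select pseudo-labelled examples for each of three models.

    predictions: three lists of predicted labels, one per model, aligned on
    the same unlabelled examples.
    Returns a dict mapping each model index to the (example_index, label)
    pairs on which the OTHER two models agree.
    """
    assert len(predictions) == 3
    additions = {0: [], 1: [], 2: []}
    for i, labels in enumerate(zip(*predictions)):
        for m in range(3):
            # Labels predicted by the two models other than m.
            a, b = (labels[j] for j in range(3) if j != m)
            if a == b:
                additions[m].append((i, a))
    return additions

# Toy predictions of three models on three unlabelled examples (assumption).
preds = [
    ["pos", "neg", "pos"],  # model 0
    ["pos", "pos", "neg"],  # model 1
    ["neg", "pos", "pos"],  # model 2
]
additions = tri_training_pseudo_labels(preds)
print(additions)
```

Each model would then be retrained on its original data plus its additions, and the procedure repeated until the models stop changing.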
Multi-source domain adaptation
In multi-source domain adaptation, we leverage data available from several source domains and combine what is learnt from them:
- Combining source models: We can either train a single model on the combined training data of all sources, or train one model per source and average or ensemble their predictions. The resulting predictions on target data can then serve as pseudo labels for training a new model.
- Neural network-based methods: These methods use the source models to find lower-dimensional representations of target instances and then train subsequent layers, such as attention or recurrent layers, on top. They can also weight the outputs of the source models by the similarity of each source domain to the target domain before fine-tuning the subsequent layers.
Cross-lingual learning
In this section, we look at cross-lingual embedding models, which enable us to compare words across different languages. This is important for tasks like machine translation and cross-lingual retrieval. More importantly, these embeddings help us transfer knowledge from resource-rich to resource-poor languages by providing a common representation space.
The data for the task can exist either in parallel form, which is an exact translation (the cat translation in Fig 1.4), or in comparable form, where the examples are only similar in some way (for instance, a word paired with a similar cat image).
Let’s discuss the three types of alignments used to learn cross-lingual word embeddings:
- Word-level alignment: This approach uses dictionaries containing word pairs in different languages. It is the most commonly used approach and can also use other modalities like images.
- Sentence-level alignment: This approach uses sentence pairs similar to those used for building machine translation systems. A typical source is the Europarl corpus, a sentence-aligned corpus of the proceedings of the European Parliament.
- Document-level alignment: This approach requires parallel documents which have aligned translated sentences. As it’s rare to get such documents, comparable documents are used more often. Such data can be created using topics of Wikipedia and gathering data in different languages.
Multi-task learning
Usually, we train a model for only one task. In doing so, we can lose out on information that could help the model perform better. If we instead train the model on multiple tasks, it may generalize better by sharing representations across all tasks.
Multi-task learning (MTL) is also known as joint learning and whenever we try to optimize more than one loss function we are practically doing MTL.
“MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks” (Caruana, 1998)
The beauty of multi-task learning comes from using the same parameters for different tasks which brings us to the concept of hard and soft parameter sharing.
Hard parameter sharing
This is the most commonly used MTL method. Here the hidden layers are shared between all the tasks while the task-specific layers are kept separate, as shown in the following figure:
Baxter showed that hard parameter sharing reduces the risk of overfitting the shared parameters by an order of T, where T is the number of tasks, because the shared parameters have to learn representations that are common to many tasks.
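Hard parameter sharing can be sketched as one shared hidden layer feeding several task-specific output layers. The snippet below is a dependency-free forward pass only, with randomly initialised weights; the task names, layer sizes and helper functions are hypothetical, and a real model would be built and trained in a deep learning framework.

```python
import random

def linear(x, weights, bias):
    """Dense layer: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
shared = init_layer(4, 8, rng)            # hidden layer shared by all tasks
heads = {
    "sentiment": init_layer(8, 2, rng),   # task-specific output layer, 2 classes
    "topic": init_layer(8, 5, rng),       # task-specific output layer, 5 classes
}

def forward(x, task):
    hidden = relu(linear(x, *shared))     # same parameters for every task
    return linear(hidden, *heads[task])   # separate parameters per task

x = [0.2, -0.1, 0.4, 0.3]
print(len(forward(x, "sentiment")), len(forward(x, "topic")))
```

During training, gradients from every task would update `shared`, while each head is updated only by its own task.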
Soft parameter sharing
In this approach, each task has its own model and parameters. The distance between the parameters of the models is then regularized in order to encourage the parameters to be similar, as shown in the following figure:
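As a rough sketch of soft parameter sharing, the joint objective can be written as the sum of the task losses plus a penalty on the distance between the task models' parameters. The squared L2 distance, the `strength` coefficient and the toy numbers below are assumptions for illustration; other distance measures are possible.

```python
def l2_distance_sq(theta_a, theta_b):
    """Squared L2 distance between two flat parameter vectors."""
    return sum((a - b) ** 2 for a, b in zip(theta_a, theta_b))

def soft_sharing_loss(task_losses, task_params, strength=0.1):
    """Sum of task losses plus a penalty on pairwise parameter distances."""
    tasks = list(task_params)
    penalty = 0.0
    for i in range(len(tasks)):
        for j in range(i + 1, len(tasks)):
            penalty += l2_distance_sq(task_params[tasks[i]], task_params[tasks[j]])
    return sum(task_losses.values()) + strength * penalty

# Toy parameter vectors and losses for two task models (assumption).
params = {"ner": [0.5, -1.0, 2.0], "pos": [0.4, -0.8, 2.5]}
losses = {"ner": 0.9, "pos": 1.1}
total = soft_sharing_loss(losses, params)
print(round(total, 4))
```

A larger `strength` pulls the two models closer together; with `strength=0` the tasks are trained independently.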
Why does multi-task learning work?
It makes intuitive sense for MTL to give us a superior model and the reasoning of the methodology can be understood by the following advantages:
- Implicit data augmentation: Effectively, MTL increases the training data for our model. As all tasks have noisy data, the model has to learn a representation which ignores the data-dependent noise. As different data have different noise patterns, the model has to learn a general representation which works out best for all tasks. The joint learning averages out the noise patterns and leads to a better representation.
- Attention focusing: It can be very difficult to train a model if the data is very noisy, high-dimensional or limited. Training on multiple tasks can teach the model to focus on the most relevant features and lead to a better model.
- Representation bias: MTL forces the model to learn representations which are useful for all tasks. This helps the model to generalize faster for all tasks in the future as the representation which works for many tasks will also work for a new one.
- Regularization: MTL acts as a regularizer (reduces over-fitting) by introducing inductive bias and reduces Rademacher complexity of the model, which is its ability to fit random noise [Søgaard and Goldberg, 2016].
MTL considerations
MTL is used in situations where we want predictions of multiple tasks at once. Let’s go through the considerations to be taken for best interaction between main and auxiliary tasks.
Shared layers
As the layers can be affected adversely by auxiliary tasks, we need to consider which layers are actually worth sharing. Søgaard and Goldberg [2016] found that when the main task has auxiliary tasks like named entity recognition (NER) or part-of-speech (POS) tagging, it makes sense to share lower layers. Based on this, Hashimoto et al. [2017] built a hierarchical architecture which consisted of several tasks for joint modelling.
Sanh et al., 2019, proposed a hierarchical architecture for semantic tasks. The model is trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low-level tasks at the bottom layers of the model and more complex tasks at the top layers of the model as shown in Fig 1.7. The tasks share common embeddings and encoders allowing an easy information flow from the lowest level to the top of the architecture.
The model achieves state-of-the-art results on the tasks of Named Entity Recognition, Entity Mention Detection and Relation Extraction and competitive results on Coreference Resolution while using simpler training and regularization procedures than previous works.
Task interactions
In MTL, normally, the batches are sampled uniformly from tasks. As the optimizer tries to minimize the weighted sum of loss during the training, it becomes important to find good weights. These weights can be tuned on a validation set like any other hyper-parameter.
A common approach is to give equal weight but a more sophisticated approach can be taken where the weight itself can be learnt as shown by Kendall et al., 2018.
Usually, different tasks have a different number of samples and the optimizer will optimize for the task which has maximum samples.
To work around this problem, we can sample from each task with a probability inversely proportional to its number of samples, so as to get an equal number of training samples from each task. We can also sample more from the main task to give it more importance.
Adjusting the sampling ratio of different tasks has the same effect as assigning different weights.
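The sampling scheme and the weighted loss described above can be sketched as follows. Here each example is weighted inversely to the size of its task's dataset, so that every task receives the same expected share of training samples; an optional `boost` factor gives the main task a larger share. The dataset sizes, task names and the `boost` parameter are hypothetical.

```python
def per_example_weights(task_sizes):
    """Weight of a single example: inversely proportional to its task's dataset size."""
    return {task: 1.0 / n for task, n in task_sizes.items()}

def task_probs(task_sizes, boost=None):
    """Probability that a sampled training example comes from each task."""
    boost = boost or {}
    weights = per_example_weights(task_sizes)
    mass = {t: task_sizes[t] * weights[t] * boost.get(t, 1.0) for t in task_sizes}
    total = sum(mass.values())
    return {t: m / total for t, m in mass.items()}

def weighted_loss(losses, loss_weights):
    """Weighted sum of per-task losses, as minimised by the optimizer."""
    return sum(loss_weights[t] * losses[t] for t in losses)

sizes = {"sentiment": 50_000, "ner": 5_000, "pos": 500}
equal = task_probs(sizes)
boosted = task_probs(sizes, boost={"sentiment": 2.0})
print(equal)
print(boosted)
print(weighted_loss({"sentiment": 1.0, "ner": 2.0}, {"sentiment": 0.5, "ner": 0.25}))
```

Without the inverse weighting, an example drawn at random from the pooled data would come from the largest task about 90% of the time in this toy setup.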
Auxiliary task selection
The fundamental assumption behind using an auxiliary task is that it will be related to the main task and can help the main task. The relatedness of a task can be found in many ways. One of the ways is to know if main and auxiliary tasks use the same features (low-level information) for prediction.
Xue et al. [2007] argued that two similar tasks will share similar classification boundaries. The common types of auxiliary tasks are:
- Statistical: These tasks try to predict low-level information about the input data itself such as log frequency of a word.
- Selective unsupervised: These tasks selectively try to predict a certain part of the input data. For sentiment analysis, Yu and Jiang [2016] predict whether the sentence contains a positive or negative domain-independent sentiment word, which sensitizes the model towards the sentiment of the words in the sentence.
- Supervised tasks: This is the most common use case, where the auxiliary task is itself a supervised task. Zhang et al. [2014] used head pose estimation and facial attribute inference as auxiliary tasks for facial landmark detection; Liu et al. [2015] jointly learn query classification and web search.
- Unsupervised tasks: The auxiliary tasks discussed so far are similar to the original task and learn representations which are common to both. But we can also train a model with an unsupervised task to induce general-purpose representation such as language modelling.
Related tasks in NLP
Sequential transfer learning
As the name implies, sequential transfer learning (STL) involves transferring knowledge in a sequence of steps, where the source and target tasks are not necessarily similar. Unlike MTL, where the tasks are learnt jointly, STL consists of two stages. In the first stage, pretraining, the model is trained on source data; in the second stage, adaptation, the source model is trained on the target task.
The pretraining stage is usually costly but is only performed once. The adaptation stage is usually faster as it acts like a fine-tuning step.
STL is useful in these three scenarios:
- Source and target task data is not available at the same time
- Source task has more data than the target task
- Adaptation to many target tasks is required
STL looks similar to MTL but is very different in the way knowledge transfer takes place. In MTL, both the source and target are trained together while in STL, first the source is trained and later target is trained:
Pretraining
To get the maximum benefit, we want a source task whose training will benefit many target tasks. It’s difficult to find such a task in practice, but pretraining usually turns out better than starting from scratch. Now, let’s discuss how source training can be accomplished:
- Distant supervision
Distant supervision uses data obtained from heuristics and domain expertise. Such data is often noisy and obtained using predefined patterns. Felbo et al. [2017] used distant supervision to predict a large number of emojis on more than a billion tweets. Later they apply their pre-trained model not only to sentiment analysis, but also to emotion and sarcasm detection tasks, demonstrating that a specialized pretraining task can be useful for an array of related target tasks.
- Traditional supervision
Traditional supervision requires manually labelled training data. This method can leverage many commonly available datasets, although data from a suitably related task is preferable. Zoph et al. [2016] train a machine translation model on a high-resource language pair and then transfer this model to a low-resource language pair. Yang et al. [2017a] pre-train a POS tagging model and apply it to word segmentation.
Nowadays researchers try to choose a task which requires basic understanding of the language. Such tasks include predicting the meaning of a word and image captioning. While it’s tempting to go for a large dataset to get maximum knowledge, the value of the pre-trained model depends on the similarity of source and target domain and task.
- No supervision
Unsupervised learning is the easiest way to train the source model as it only requires access to a large unlabelled text corpus. In NLP it is most commonly done through language modelling. Compared to supervised learning, it’s a much more scalable approach, as text for almost any domain is easily available. This approach captures much more general knowledge about the language than supervised learning, which captures only those features required for the task.
Various approaches have been tried to learn these representations, including Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Skip-gram with negative sampling (SGNS), Global vectors (GloVe), Skip-thoughts, ELMo and BERT.
- Multi-task pretraining
To leverage the advantages of the preceding three methods we can also use MTL where all tasks can be trained jointly. MTL can help these representations generalise and make them useful for different downstream tasks. Subramanian et al. [2018] perform multi-task pretraining on skip-thoughts, machine translation, constituency parsing, and natural language inference.
Adaptation
We have just discussed the first step of STL; now we turn to the second step, adaptation. Currently, there are two approaches to using a pre-trained model for the target task: feature extraction and fine-tuning. Feature extraction takes the representations of a pre-trained model and feeds them to another model, while fine-tuning involves training the pre-trained model itself on the target task.
- Feature extraction: In feature extraction, the model weights are frozen and the model's output is sent directly to another model. The features can either be sent to a fully connected model or be used to train a classical model like a support vector machine (SVM) or random forest. The benefit is that the task-specific model can be reused for similar data. Also, if the same data is used repeatedly, extracting the features once can save a lot of computing resources.
- Fine-tuning: In fine-tuning, as the name implies, the weights are kept trainable and are fine-tuned for the target task. Thus the pre-trained model acts as a starting point for the model, leading to faster convergence compared to random initialization.
Fine-tuning embeddings is generally found to perform better than feature extraction. Its shortcoming is that only words appearing in the training data get updated embeddings, while the embeddings of unseen words go stale.
This can affect performance when the training set is too small or the test set contains a lot of out-of-vocabulary (OOV) words. To deal with OOV words, most researchers nowadays use models that build representations from characters or subword units, such as ELMo and BERT.
While feature extraction and fine-tuning may look like two different approaches, they can be brought into a common framework. Let the pre-trained source model have parameters θs and Ls layers, and let the target-specific parameters and layers be θt and Lt. Then the adapted model has parameters θA = θs ∪ θt and LA = Ls + Lt layers, where the source and target layers occupy the intervals [1, Ls] and (Ls, LA] respectively. The main hyperparameter in the adaptation process is the learning rate η, which can differ between layers: the initial layers are general and do not require much change, while the last layers are task-specific and require more.
η can also change during training if a schedule is used. The learning rate during adaptation is generally kept lower than during pretraining so that the weights do not change too much. Let ηt(l) be the learning rate of the adapted model’s lth layer at iteration t. In this framework, feature extraction and fine-tuning can be defined as follows:
- Feature extraction corresponds to the case where no source layer is ever updated:
ηt(l) = 0, ∀l ∈ [1, Ls], ∀t
where ∀l means for every layer.
- Fine-tuning, on the other hand, requires updating at least one of the source layers during adaptation:
∃l ∈ [1, Ls], ∃t : ηt(l) > 0
where ∃l means there exists an l.
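The two definitions can be checked mechanically. The sketch below assumes a hypothetical table of learning rates for the source layers, indexed by iteration and then by layer, and labels an adaptation run accordingly.

```python
def adaptation_mode(source_layer_lrs):
    """Classify an adaptation run from the learning rates of its source layers.

    source_layer_lrs[t][l] is the learning rate of source layer l at iteration t.
    Feature extraction: every source-layer rate is zero at every iteration.
    Fine-tuning: at least one source layer has a positive rate at some iteration.
    """
    if any(lr > 0 for step in source_layer_lrs for lr in step):
        return "fine-tuning"
    return "feature extraction"

# Three source layers, a few iterations each (toy values, assumption).
frozen = [[0.0, 0.0, 0.0] for _ in range(4)]
last_layer_only = [[0.0, 0.0, 1e-3] for _ in range(4)]
gradual = [[0.0, 0.0, 1e-3], [0.0, 5e-4, 1e-3], [2.5e-4, 5e-4, 1e-3]]

for name, lrs in [("frozen", frozen), ("last layer only", last_layer_only), ("gradual", gradual)]:
    print(name, "->", adaptation_mode(lrs))
```

Note that training only the last source layer already counts as fine-tuning under this definition.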
The source layers can be trained so that only the last layers are updated (Long et al., 2015a). We can also use an unfreezing schedule on the layers, such as chain-thaw (Felbo et al., 2017). Howard and Ruder, 2018, experimented with a gradual unfreezing schedule and got great results with their ULMFiT. Later, Peters et al., 2019, found that the relative performance of fine-tuning vs. feature extraction of language models depends on the similarity of the pretraining and target tasks. We will discuss this later in the section To tune or not to tune.
Adapting pretrained representations
Although MTL has become quite common, STL is the most popular technique at present. STL allows us to adapt pre-trained representations to a new task by following a few steps and is also computationally less costly than MTL.
Universal language model fine-tuning (ULMFiT)
Inductive transfer learning has played a great role in computer vision but had been largely unsuccessful when applied to NLP. Howard and Ruder found that the problem lay not in the idea of language model (LM) fine-tuning but in how it was approached. Since LMs are considerably shallower than computer vision (CV) models, they require a different kind of approach. They proposed ULMFiT, which fine-tunes the LM using discriminative fine-tuning (‘Discr’) and slanted triangular learning rates (STLR) to learn task-specific features. The classifier is then fine-tuned on the target task using gradual unfreezing and STLR to preserve low-level representations and adapt high-level ones. Let’s discuss this in detail:
- Discriminative fine-tuning: As we know, the top layers are task-specific and lower layers capture general representations. We need smaller learning rates for lower layers as high learning rates change weights quickly and lead to catastrophic forgetting. Also, we would like to train the model as fast as possible. To work with these constraints we need a different learning rate for different layers such that it decreases as we go from top to bottom.
Here E denotes the embedding (input) layer, L the hidden layers, each with its own learning rate, and T the final layer.
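A minimal sketch of how such per-layer learning rates could be derived: start from the learning rate of the top layer and divide by a constant factor for each layer below it. The factor 2.6 is the value reported in the ULMFiT paper; the function name and the four-layer setup are assumptions.

```python
def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Learning rate per layer; index 0 is the lowest layer, index -1 the top layer."""
    return [top_lr / decay ** (n_layers - 1 - layer) for layer in range(n_layers)]

# Four layers, e.g. an embedding layer, two hidden layers and a final layer (assumption).
lrs = discriminative_lrs(top_lr=0.01, n_layers=4)
for layer, lr in enumerate(lrs):
    print(f"layer {layer}: lr = {lr:.6f}")
```

The lowest layer ends up with the smallest learning rate, which protects its general-purpose representations from being overwritten.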
- Gradual unfreezing: It was found empirically that training all layers at the same time on data of a different distribution and task may lead to instability and poor solutions. Hence layers need to be trained individually to give them time to adapt to the new task and data. To this end, Long et al. (ICML 2015) proposed freezing all layers except the top one. Felbo et al. (EMNLP 2017) came up with a method called chain-thaw, which unfreezes and trains one layer at a time and then trains all layers together.
Recently, Chronopoulou et al. proposed fine-tuning the additional parameters for n epochs, then the pre-trained parameters without the embedding layer for k epochs, and finally all layers until convergence. ULMFiT proposed gradual unfreezing from top to bottom, as shown in Fig 1.11.
First the last layer is unfrozen and gradually other layers are unfrozen to avoid catastrophic forgetting.
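The unfreezing schedule can be written down as a small function that returns, for each epoch, which layers are trainable. One additional layer per epoch is the schedule described for ULMFiT; the function name and the layer indexing (0 for the lowest layer) are assumptions.

```python
def gradual_unfreezing(n_layers, n_epochs):
    """For each epoch, the trainable layer indices (0 = lowest layer).

    Epoch 0 trains only the top layer; each following epoch unfreezes the
    next lower layer until all layers are trainable.
    """
    schedule = []
    for epoch in range(n_epochs):
        n_unfrozen = min(epoch + 1, n_layers)
        schedule.append(list(range(n_layers - n_unfrozen, n_layers)))
    return schedule

for epoch, trainable in enumerate(gradual_unfreezing(n_layers=4, n_epochs=5)):
    print(f"epoch {epoch}: trainable layers {trainable}")
```

Once every layer is unfrozen, training simply continues on the full model until convergence.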
Main idea — Use appropriate learning rate to avoid over-writing useful information:
- Lower layers — capture general information
- Early in training — model still needs to adapt to target distribution
- Late in training — model is close to convergence
- Slanted triangular learning rates: Now that we know each layer needs its own learning rate, we need to find a suitable learning rate (LR) and a way to vary it during training. Using the same LR or an annealed LR throughout training is not the best way to achieve the behaviour described above. Fig 1.12 shows the behaviours we get for different learning rates:
Smith, L. N. proposed the cyclical learning rate (CLR), which gives us a way to find a good LR for faster convergence without being too slow or too noisy. To find the highest and lowest learning rates, run training for a few mini-batches while increasing the learning rate. Note the rate at which the loss starts increasing: that is the maximum LR you can afford.
Fig 1.13 shows how you can train the model with a triangular rate schedule where it increases and decreases periodically:
You can have variations where the maximum rate keeps decreasing, since we need a higher rate initially and a lower rate later to converge to a good minimum.
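The triangular schedule, and the variant in which the peak shrinks over time, can be sketched with the formula from Smith's CLR paper. The concrete values for `base_lr`, `max_lr` and `step_size`, and the `halve_each_cycle` flag name, are illustrative assumptions.

```python
import math

def triangular_clr(iteration, base_lr, max_lr, step_size, halve_each_cycle=False):
    """Cyclical learning rate with a triangular shape.

    step_size is the number of iterations in half a cycle. With
    halve_each_cycle=True the amplitude is halved after every full cycle.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    scale = 1 / 2 ** (cycle - 1) if halve_each_cycle else 1.0
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * scale

for it in (0, 50, 100, 200, 300):
    plain = triangular_clr(it, base_lr=0.001, max_lr=0.006, step_size=100)
    decayed = triangular_clr(it, base_lr=0.001, max_lr=0.006, step_size=100, halve_each_cycle=True)
    print(f"iteration {it}: triangular = {plain:.4f}, halving = {decayed:.4f}")
```

In this setup the rate climbs from `base_lr` to `max_lr` over 100 iterations and falls back over the next 100, after which the cycle repeats.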
ULMFiT built on this idea and came up with slanted triangular learning rates (STLR), which first linearly increase the learning rate and then linearly decay it. As shown in Fig 1.14, the ramp-up is much shorter than the ramp-down, which lets the model quickly reach a suitable region of parameter space and then refine its parameters:
By allowing the learning rate to increase at times, we can jump out of sharp minima, which would temporarily increase our loss but may ultimately lead to convergence on a more desirable minimum. Additionally, increasing the learning rate can also allow for more rapid traversal of saddle point plateaus.
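The STLR schedule can be written down directly from the formula in the ULMFiT paper. The defaults below (`cut_frac=0.1`, `ratio=32`, `max_lr=0.01`) are the values reported there; the function name and the choice of 1000 total steps are assumptions.

```python
import math

def stlr(t, total_steps, max_lr=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at training iteration t.

    cut_frac is the fraction of iterations spent increasing the rate, and
    ratio is how much smaller the lowest rate is than max_lr.
    """
    cut = math.floor(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                        # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))     # long linear decay
    return max_lr * (1 + p * (ratio - 1)) / ratio

total_steps = 1000
for t in (0, 50, 100, 550, 1000):
    print(f"t = {t}: lr = {stlr(t, total_steps):.6f}")
```

With these settings the rate peaks after 10% of the iterations and spends the remaining 90% decaying back to `max_lr / ratio`.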
ULMFiT uses AWD-LSTM (ASGD Weight-Dropped LSTM) [Merity et al., 2017a], a state-of-the-art language model at the time: a regular LSTM (with no attention, short-cut connections, or other sophisticated additions) with various tuned dropout hyperparameters. The following table shows the number of samples in different datasets:
We see that TREC-6 and IMDB have relatively fewer samples for training.
- General-domain LM pretraining: The LM is pretrained on a general-domain corpus to capture general features of the language in its different layers. AWD-LSTM outperforms a vanilla LSTM language model thanks to its regularization techniques.
- Target task LM fine-tuning: The full LM is fine-tuned on target task data using discriminative fine-tuning (‘Discr’) and STLR to learn task-specific features. Fine-tuning the LM on task data helps achieve better results, especially when target data is scarce, as with TREC-6.
- Target task classifier fine-tuning: The classifier is fine-tuned using gradual unfreezing with discriminative learning rates and STLR to preserve low-level representations and adapt high-level ones.
The learning rate schedule combined with discriminative fine-tuning not only gives a further, smaller boost in accuracy but also requires fewer epochs.
The following abbreviations are used for the classifier fine-tuning variants:
- Full — fine-tuning the full model
- Last — only fine-tuning the last layer
- Freez — gradual unfreezing
- Cos — aggressive cosine annealing schedule for triangular learning rate
Sequential Transfer Learning with fastai’s ULMFiT
We have just gone through the literature on transfer learning. Now let’s try an example of sequential transfer learning with Howard’s fastai library (the code below uses the fastai v1 API):
Import the library:
from fastai.text import *
Use a sample of the IMDB movie review dataset for training the model. The fastai library has built-in methods for downloading and loading the data:
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df.head()
The data consists of the true label, the text and an is_valid column, which states whether the row will be used for validation.
Read the data which will be used for the LM (language model) fine tuning. The complete review text will be used for the task LM fine-tuning:
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
Define the data loader for the classifier, which generates batches of text for training. It reuses the LM vocabulary so that both models share the same token ids:
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab, bs=32)
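Fine-tune the LM for one cycle. fit_one_cycle applies fastai's one-cycle learning rate policy, which ramps the learning rate up and then down in the same spirit as the slanted triangular learning rates (STLR) discussed earlier:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)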
This will give the following accuracy result:
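Unfreeze all layers, fit for one more cycle, and save the fine-tuned encoder so that the classifier can load it later:
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)
learn.save_encoder('ft_enc')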
As you can see, the accuracy of LM has improved:
Now, to build the classifier, define the text classifier on the same AWD_LSTM architecture and load the fine-tuned encoder:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')
Train the model for one cycle:
learn.fit_one_cycle(1, 1e-2)
This will give the following accuracy result:
Train one more cycle with only the last two layer groups unfrozen. Passing a slice sets the start and end learning rates across the layer groups, with the rates in between spaced geometrically:
learn.freeze_to(-2)  # freeze all layer groups except the last two
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))
This will give the following accuracy result:
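To see what "geometrically spaced" means for the slice above, the following sketch computes the per-group learning rates by hand. It is a plain-Python illustration, not fastai's own code: the function name is hypothetical, and the number of layer groups (five here) is an assumption that depends on the model.

```python
def spread_learning_rates(start, stop, n_groups):
    """Geometrically space n_groups learning rates from start to stop."""
    if n_groups == 1:
        return [stop]
    # Constant multiplier between consecutive groups.
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

# Same endpoints as slice(5e-3/2., 5e-3); n_groups=5 is an assumed value.
lrs = spread_learning_rates(5e-3 / 2, 5e-3, n_groups=5)

for i, lr in enumerate(lrs):
    print(f"layer group {i}: lr = {lr:.5f}")
```

The earliest group gets the smallest rate and the last group the largest, which is the same idea as discriminative fine-tuning: lower layers change slowly, higher layers adapt faster.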
We can improve the result by training for more cycles. Finally, we test the model on a sample text.
learn.predict("This was a great movie!")
(Category positive, tensor(1), tensor([0.0049, 0.9951]))
Now that you have seen how transfer learning works in practice, the next section discusses whether or not to fine-tune your model during transfer.
To tune or not to tune 😆
As we discussed earlier, transfer learning is mainly done in two ways:
- feature extraction
- fine-tuning
In feature extraction (EX), we take representations from the frozen pre-trained model and pass them to the task model. In fine-tuning (FT), we keep all the layers of the pre-trained model unfrozen and train them on the task.
With EX, we can generate the features once and try different task models on top of them, which saves compute when retraining and experimenting. FT, on the other hand, lets us adapt a single model to many different tasks, and it makes our work easier because we don’t have to experiment with downstream model variations.
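The difference between the two modes can be shown with a deliberately tiny toy model, where the "encoder" is a single pre-trained weight and the "task head" is another. This is only an illustrative sketch in plain Python with made-up data and values; a real setup would freeze or unfreeze the layers of a neural network in the same spirit.

```python
def train(data, w_enc, w_head, tune_encoder, lr=0.01, epochs=200):
    """Fit y ~ w_head * (w_enc * x) with squared loss and plain SGD.

    tune_encoder=False -> feature extraction (encoder stays frozen)
    tune_encoder=True  -> fine-tuning (encoder is updated as well)
    """
    for _ in range(epochs):
        for x, y in data:
            h = w_enc * x                  # encoder representation
            err = w_head * h - y           # prediction error
            g_head = 2 * err * h           # gradient w.r.t. the head weight
            g_enc = 2 * err * w_head * x   # gradient w.r.t. the encoder weight
            w_head -= lr * g_head
            if tune_encoder:
                w_enc -= lr * g_enc
    return w_enc, w_head

data = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]  # toy target: y = 3x
pretrained_w_enc = 1.5                          # stands in for a pre-trained encoder

ex_enc, ex_head = train(data, pretrained_w_enc, 0.0, tune_encoder=False)
ft_enc, ft_head = train(data, pretrained_w_enc, 0.0, tune_encoder=True)

print("EX: encoder", ex_enc, "head", ex_head)
print("FT: encoder", ft_enc, "head", ft_head)
```

In the EX run the encoder weight is untouched, so its outputs could be computed once and cached; in the FT run both weights move, which is more flexible but means the encoder must be run and updated throughout training.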
Peters et al. (2019) analysed the effect of both approaches and derived practical advice from the results. They compared two state-of-the-art pre-trained models, ELMo [Peters et al., 2018a] and BERT [Devlin et al., 2018], using both EX and FT across seven diverse tasks.
They found that both approaches achieve similar performance most of the time, but fine-tuning performs better when the source and target tasks are similar, while feature extraction performs better when the source and target tasks are distant.
As we can see in the table, ELMo and BERT behave differently when it comes to EX vs. FT. ELMo with FT always performs worse than with EX, while BERT with FT is better than with EX.
One hypothesis for the superior performance of BERT on the similarity tasks is architectural: ELMo uses an LSTM, which works sequentially and considers one token at a time, while BERT, which stacks transformer layers with an attention mechanism, considers the whole sequence at once. This helps BERT encode the interaction between a pair of sequences better than ELMo.
To mitigate catastrophic forgetting, the authors also experimented with gradual unfreezing of the pre-trained layers. They observed that the model deteriorates as soon as they start training the lower layers, even when they control the learning rates for a smoother transfer.
Conclusion
We have reached a point in time where we cannot go back to not using transfer learning.
As NLP gains more traction and is applied to new problems, it will become crucial for us to find ways to leverage data from other domains, tasks and languages.
In this chapter, we covered various transfer learning techniques, namely domain adaptation, cross-lingual learning, multi-task learning and sequential transfer learning, which can help us build better machine learning models with less time and fewer resources. We also looked at an example of sequential transfer learning and then discussed whether or not to fine-tune our model when transferring.
Further reading
Neural Transfer Learning for Natural Language Processing by Sebastian Ruder
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
Xia, R., Zong, C., Hu, X., and Cambria, E. (2015). Feature Ensemble plus Sample Selection: A Comprehensive Approach to Domain Adaptation for Sentiment Classification. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015).
Ruder, S., Ghaffari, P., and Breslin, J. G. (2017b). Knowledge Adaptation: Teaching to Adapt. In arXiv preprint arXiv:1702.02052.
Caruana, R. (1998). Multitask Learning. Autonomous Agents and Multi-Agent Systems, 27(1):95–133.
Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28:7–39.
Søgaard, A. and Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 231–235.
Hashimoto, K., Xiong, C., Tsuruoka, Y., and Socher, R. (2017). A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. In Proceedings of EMNLP.
Sanh, V., Wolf, T., and Ruder, S. (2019). A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks. In Proceedings of AAAI 2019.
Kendall, A., Gal, Y., and Cipolla, R. (2018). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of CVPR 2018.
Xue, Y., Liao, X., Carin, L., and Krishnapuram, B. (2007). Multi-Task Learning for Classification with Dirichlet Process Priors. Journal of Machine Learning Research, 8:35–63.
Zhang, Z., Luo, P., Loy, C. C., and Tang, X. (2014). Facial Landmark Detection by Deep Multi-task Learning. In European Conference on Computer Vision, pages 94–108.
Liu, X., Gao, J., He, X., Deng, L., Duh, K., and Wang, Y.-Y. (2015). Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. NAACL-2015, pages 912–921.
Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., and Lehmann, S. (2017). Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of EMNLP.
Zoph, B., Yuret, D., May, J., and Knight, K. (2016). Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of EMNLP 2016.
Yang, J., Zhang, Y., and Dong, F. (2017a). Neural Word Segmentation with Rich Pretraining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Subramanian, S., Trischler, A., Bengio, Y., and Pal, C. J. (2018). Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. In Proceedings of ICLR 2018.
Peters, M., Ruder, S., and Smith, N. A. (2019). To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. arXiv preprint arXiv:1903.05987.
Howard, J. and Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL 2018.
Smith, L. N. (2017). Cyclical learning rates for training neural networks. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 464–472. IEEE.
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.