April 2, 2021
Fine-Tuning Transformer-Based Language Models
Transformer-based language models have revolutionized the NLP space since the introduction of the Transformer, a novel neural network architecture, in 2017. Today, the most advanced language models heavily rely on transformers and are now considered the state-of-the-art models for all major NLP/NLU tasks. Google’s BERT (2018) and OpenAI’s GPT (2018) models are two of the earliest pre-trained language models that used transformers in their architecture.
Pre-trained language models are NLP models that use a large corpus of textual data and undergo a heavy training procedure which often takes days or weeks, and uses extensive computational resources. These models are trained in such a way that they “learn” the grammatical structures and semantics of a language, and “understand” sentences so that they can predict the next word after a sequence of words or a missing word within a sentence. Once the training process is completed, the resulting model is called a “pre-trained language model”.
There are plenty of transformer-based pre-trained language models in the field. They differ from one another in model architecture, hyperparameter selection, training objective, and the dataset that the model was trained on. For instance, Google’s BERT was trained on English Wikipedia and the BooksCorpus, whereas GPT-2 was trained on WebText, a dataset scraped mainly from outbound links shared on Reddit (Radford et al., 2019). These two famous models differ not only in their training datasets but also in their pre-training objectives and in how they use the transformer architecture. More on this in the transformers section below.
Since the style of language used to train these pre-trained models may differ from what one needs for a particular application, it is often necessary to fine-tune these models for one’s own domain and target task. “Fine-tuning” in NLP refers to the procedure of re-training a pre-trained language model using your own custom data. As a result of the fine-tuning procedure, the weights of the original model are updated to account for the characteristics of the domain data and the task you are interested in. This is one of the most significant promises of pre-trained language models. In this post, we will discuss different strategies for fine-tuning pre-trained language models. We will specifically focus on the BERT architecture and its variants, such as RoBERTa and DistilBERT, and discuss our findings.
As the name suggests, transformer-based pre-trained language models utilize the transformer architecture introduced by Vaswani et al. (2017).
As shown in Fig. 1 (from here), transformers are composed of two blocks: an encoder and a decoder. The encoder block takes the text inputs and processes them through embedding, attention, and feed-forward layers, generating an abstract, high-dimensional representation for each word or word piece (called a “token”) in the text. The decoder block, on the other hand, takes the encoder outputs and, together with its own previous outputs, predicts the next token in the target language.
Since the introduction of the Transformer architecture, there has been significant development in NLP methods, tools, and applications. Transformers are now at the core of many language models such as Google’s BERT, Facebook’s RoBERTa, OpenAI’s GPTs, and XLNet. These language models consistently achieve the best results on all major NLP/NLU tasks such as sequence classification, question answering, token classification, and language inference.
Although all these models use transformers in their architecture, some of these models - called autoencoders - only employ the encoder component of the transformer. For example, BERT (along with its successors) is an autoencoder model which only uses the encoder part of the transformer. The goal of autoencoder models is to represent the input text in a high dimensional space so that the entire text (and each token as well) is encoded for downstream tasks. Autoencoder models are used to solve problems that involve Language Understanding (NLU), such as understanding the intent of a message or extracting entities from a given text.
Models that rely on the decoder component of the transformer, on the other hand, are called autoregressive models. The goal of autoregressive models is to predict the next word given the previous words, and they are therefore mainly used in language generation tasks. GPT and its successors (GPT-2 and GPT-3) are the most famous autoregressive models. That being said, one can still generalize autoregressive models by training on an autoencoder task and vice versa; XLNet is a good example of this effort.
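One way to picture the difference between the two families is through which positions each token is allowed to "see". The sketch below is only an illustration in plain Python (the function name `visibility_mask` is ours, not part of any library): an encoder-style model lets every token attend to every other token, while a decoder-style model lets a token attend only to itself and to earlier tokens.

```python
def visibility_mask(num_tokens, bidirectional):
    """Return a num_tokens x num_tokens matrix of 0/1 values.

    Entry [i][j] is 1 if position i may attend to position j, else 0.
    """
    return [
        [1 if bidirectional or j <= i else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


# Encoder-style (BERT-like): every token sees the full sentence.
encoder_style = visibility_mask(4, bidirectional=True)

# Decoder-style (GPT-like): each token sees only itself and earlier tokens.
decoder_style = visibility_mask(4, bidirectional=False)

for row in decoder_style:
    print(row)
```

The decoder-style matrix is lower triangular, which is what makes next-word prediction meaningful: the model cannot peek at the word it is asked to predict.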
At YMeadows, we develop NLP solutions for our clients to help them understand their customers' messages. Since language understanding (NLU) is our main focus, we extensively use different autoencoder models for custom data and target tasks. Some of the pre-trained language models (autoencoders) that we fine-tune for our datasets are BERT, RoBERTa, DistilBERT, and CamemBERT & FlauBERT (French language models).
Before going through the fine-tuning strategies, we will review the BERT architecture and dive deep into its components to better understand how fine-tuning can be performed.
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based pre-trained language model which was introduced by Google (Devlin et al., 2018). BERT relies on the encoder part of the transformer and is bidirectional, as opposed to GPT, in which transformers are used in a unidirectional fashion during training. BERT is composed of three main parts:
- Embedding Layer
- Encoder layers (encoder parts of transformers - 12 of them for BERT-base and 24 for BERT-large)
- Pooled output layer (dense neural network for the CLS token)
In a nutshell, BERT is trained to predict a masked token (word) given its context, and also to predict whether two sentences follow each other (the next sentence prediction task). After the introduction of BERT, there was a series of attempts to either improve BERT or apply its architecture to different datasets. Each resulting pre-trained language model then adopted the naming convention.
For instance, Facebook created a pre-trained language model using the BERT architecture and named it RoBERTa (Robustly Optimized BERT Pretraining Approach). Facebook used different hyperparameters than BERT and trained the model to predict only the masked token, omitting the next sentence prediction task. The training dataset is another component that distinguishes Facebook’s RoBERTa from Google’s BERT: Facebook used CC-News, a novel dataset collected from news articles, in addition to other open NLP datasets. RoBERTa improved on the performance of BERT in a variety of NLP tasks and advanced the state-of-the-art results in NLP in 2019 (Liu et al., 2019).
BERT’s architecture was also adapted to train models in other languages. CamemBERT, for instance, is a French pre-trained language model that relies on BERT and was trained using the OSCAR corpus, which contains text of French webpages (Martin et al., 2020).
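The masked-token objective shared by BERT and the variants above can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual pre-training code: the helper name `mask_tokens` is hypothetical, tokens are split on whitespace rather than with a real tokenizer, and the refinement used in the BERT paper (sometimes substituting a random token or leaving the chosen token unchanged) is left out. The 15% default follows the masking rate reported in the BERT paper.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    on the positions the model has to reconstruct.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position ignored by the loss
    return masked, labels


sentence = "my order has not arrived yet".split()
masked, labels = mask_tokens(sentence, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```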
The fine-tuning procedure for all these models is similar since they share similar architectures, i.e., embedding layers, encoder layers, and pooled output layers. Therefore, the fine-tuning strategies we are going to see in the following section can be applied to any model that uses the general BERT architecture. In general, the steps we describe can be applied to any pre-trained language model, from ULMFiT (a language model introduced before transformers) to the most recent and advanced transformer-based language models. However, in this post we will mainly focus on fine-tuning the BERT model and its successors.
A high-level view of the BERT model is shown here: it consists of an embedding layer, which generates numerical representations for words; blocks of encoders (12 for BERT-base and 24 for BERT-large); and a fully connected neural network that performs the target task, e.g., a classification layer for sequence classification tasks.
BERT and its variants can be fine-tuned in many different ways. Here we briefly summarize some of the most popular approaches:
- Fine-Tuning with Target Data/Task: One can fine-tune the entire architecture of the BERT model with the custom data and target task. This is usually what is meant by fine-tuning in the NLP community; however, it is not the only way to fine-tune the BERT model and its variants.
- Layer Freezing: Since different encoders of BERT capture different syntactic and semantic information, one can choose to fine-tune only certain encoders or blocks of encoders. More on this in the following sections.
- Layer-wise Learning Rate: Similar to the second point, different layers can be fine-tuned with different learning rates during training.
In our experiments at YMeadows, we focused on the first two approaches and applied these methods to a variety of pre-trained language models including BERT, RoBERTa, DistilBERT, CamemBERT, and FlauBERT on different client datasets. The target task for all our models was to predict the intent of a customer from their messages/emails.
Recall that fine-tuning is the procedure of updating the parameter weights of a pre-trained language model for the domain data and target task. Layer freezing, on the other hand, refers to the procedure wherein one or more layers of a model are frozen, i.e., their parameter weights from the original pre-trained model are left untouched. In other words, freezing a layer means that fine-tuning is not applied to that layer.
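As a rough sketch of what a freezing scheme looks like in practice, the snippet below decides, from a parameter's name alone, whether that parameter should be frozen. It makes a few assumptions that are not from our experiments: the helper `should_freeze` is hypothetical, the example names imitate the dotted naming pattern used by common BERT implementations (`embeddings`, `encoder.layer.N`, `pooler`, `classifier`), and the choice to freeze encoder layers 0 to 7 is purely illustrative.

```python
import re


def should_freeze(param_name, frozen_encoder_layers, freeze_embeddings=True):
    """Return True if the named parameter should be left untouched."""
    # Embedding-layer parameters, e.g. "bert.embeddings.word_embeddings.weight".
    if ".embeddings." in f".{param_name}":
        return freeze_embeddings
    # Encoder parameters, e.g. "bert.encoder.layer.3.output.dense.weight".
    match = re.search(r"encoder\.layer\.(\d+)\.", param_name)
    if match:
        return int(match.group(1)) in frozen_encoder_layers
    # Pooled output layer and task head stay trainable.
    return False


# Assumed parameter names, for illustration only.
example_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.11.output.dense.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",
]

frozen = set(range(8))  # hypothetical scheme: freeze encoder layers 0-7
plan = {name: should_freeze(name, frozen) for name in example_names}
for name, is_frozen in plan.items():
    print(f"{'frozen   ' if is_frozen else 'trainable'}  {name}")
```

In a PyTorch model, a plan like this would typically be applied by looping over `model.named_parameters()` and setting `requires_grad = False` on the parameters marked as frozen, so that the optimizer never updates them.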
Since BERT-base has 12 encoder layers and an embedding layer, there are thousands of different ways to perform layer freezing: each of the 13 layers can be either frozen or trainable, which gives 2^13 = 8,192 combinations. We have only applied some of the most common patterns of layer freezing. This paper summarizes different combinations of layer freezing techniques and presents results for each experiment.
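The size of this search space is easy to check by enumeration. The snippet below counts every frozen/trainable assignment over the 13 layers and, for contrast, a much smaller family of schemes that only freeze the bottom k layers. The "bottom k" family is an assumption chosen for illustration; it is not necessarily the set of patterns we used.

```python
from itertools import product

NUM_LAYERS = 13  # 1 embedding layer + 12 encoder layers (BERT-base)

# Every layer is independently frozen (True) or trainable (False).
all_schemes = list(product([True, False], repeat=NUM_LAYERS))
print(len(all_schemes))  # 8192, i.e. 2 ** 13

# Hypothetical, much smaller family: freeze only the bottom k layers.
bottom_k_schemes = [
    tuple(layer < k for layer in range(NUM_LAYERS))
    for k in range(NUM_LAYERS + 1)
]
print(len(bottom_k_schemes))  # 14 (k = 0 is full fine-tuning, k = 13 freezes everything)
```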
When it comes to layer freezing, a natural question might be “why would you freeze some of the layers when you could fine-tune an entire model with your domain data?” Both in our experiments and in the literature, we have seen that freezing some of the layers often improves model performance, and the improvements can be significant depending on the strategy followed. That said, applying layer freezing without prior knowledge can produce undesired results. At YMeadows, we have extensively evaluated how different approaches work and which fine-tuning strategy to follow to improve model performance for a given domain.
In our experiments, we started by freezing the entire pre-trained model, i.e., we relied on the weights of pre-trained language models that were trained on different datasets, e.g., news articles, Wikipedia, and literary works. Because we work in the customer service domain, the language of our text data is quite different from the language those pre-trained models were trained on. Therefore, as expected, the performance of the models on target tasks with this approach was very low. Note that this is not fine-tuning of a language model; we tested it to confirm the need for fine-tuning on datasets from different domains. This also supports the idea that the language distribution of one domain can differ substantially from that of another, which can result in poor model performance.
We then trained models by fine-tuning the entire model architecture, meaning that we re-trained the pre-trained language models and updated their parameter weights using domain data. The results were significantly better than what we had observed when freezing the entire architecture. This was our baseline for the fine-tuning results. After establishing a baseline, we applied an extensive set of layer freezing schemes to the same datasets and tasks, and observed significant improvements with certain schemes. Our models, therefore, achieve higher accuracy in intent classification tasks than a fully fine-tuned model. We have observed that, compared to the baseline model, the results can be improved by 5 to 10% depending on the layer freezing scheme selected.
In addition to freezing and unfreezing certain layers for fine-tuning, one can apply different learning rates to each layer, as opposed to applying one learning rate to the entire model. The idea behind layer-wise learning rates is to treat different layers separately because each layer captures a different aspect of the domain language and supports the target task uniquely. Similar to freezing layers, this approach requires extensive experimentation, and special care must be exercised so that the fine-tuned model “learns” the domain language and performs the task at hand.
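One common way to assign per-layer learning rates is a geometric decay from the top of the model to the bottom, so that layers closer to the input change more slowly. The sketch below assumes that scheme; the function name, the decay factor of 0.9, and the layer names are illustrative choices, not values from our experiments.

```python
def layerwise_learning_rates(base_lr=2e-5, decay=0.9, num_encoder_layers=12):
    """Map each layer name to a learning rate that shrinks toward the input.

    Depth 0 is the embedding layer and the last depth is the task head,
    which receives the full base_lr.
    """
    names = (
        ["embeddings"]
        + [f"encoder.layer.{i}" for i in range(num_encoder_layers)]
        + ["head"]
    )
    top = len(names) - 1
    return {
        name: base_lr * decay ** (top - depth)
        for depth, name in enumerate(names)
    }


rates = layerwise_learning_rates()
for name, lr in rates.items():
    print(f"{name:<18} {lr:.2e}")
```

In PyTorch, a mapping like this is usually handed to the optimizer as a list of parameter groups, each with its own `lr` entry. Setting a layer's rate to zero is equivalent to freezing it, which is why layer freezing can be seen as a special case of this approach.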
When fine-tuning a pre-trained language model with different learning rates, it is possible to fall into a trap known as catastrophic forgetting if the learning rate is not selected properly. Recall that pre-trained language models are trained on a large corpus of textual data so that they “learn” and “understand” the language, and fine-tuning is the process of transferring this learned knowledge to new domain data to solve the target task. The parameter weights of the resulting fine-tuned model are usually within the neighborhood of the weights of the original pre-trained model.
As a matter of fact, the name fine-tuning itself implies that the resulting model should be a “slightly” modified version of the original pre-trained model. Otherwise, the information learned during pre-training might be lost, i.e., catastrophic forgetting may occur, which would result in poor model performance on the target task. Catastrophic forgetting mainly occurs when the learning rate is too high; the original BERT paper suggests a fine-tuning learning rate between 0.00002 and 0.00005 (2e-5 to 5e-5). On the other hand, selecting much lower learning rates such as 0.000001 or 0.0000001 may not help improve model performance, since the fine-tuned model would not be much different from the original pre-trained model. At YMeadows, we have experimented with different learning rates and identified a range of values that works best for fine-tuning in our domain.
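The effect of the learning rate on how far a model moves away from its pre-trained weights can be illustrated with a deliberately tiny toy problem. The sketch below is not BERT: it runs plain gradient descent on a single weight with a made-up quadratic loss, and all names and values other than the learning rates quoted above are assumptions. It only shows the general trend that, for a fixed number of update steps, very small learning rates leave the weight almost where it started while large ones pull it far from the pre-trained value.

```python
def drift_after_fine_tuning(learning_rate, steps=100, pretrained=0.0, target=1.0):
    """Run gradient descent on loss = (w - target)^2 starting from the
    pre-trained weight, and return how far w ends up from where it started."""
    w = pretrained
    for _ in range(steps):
        grad = 2.0 * (w - target)   # derivative of (w - target)^2
        w -= learning_rate * grad
    return abs(w - pretrained)


for lr in (1e-7, 1e-6, 2e-5, 5e-5, 1e-1):
    print(f"lr={lr:g}  drift={drift_after_fine_tuning(lr):.6f}")
```

In a real model, the "drift" would be measured across millions of weights and the loss surface is far more complex, so this toy should be read as intuition for the neighborhood argument above rather than as evidence for any particular learning rate.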
Discussion and Conclusion
Recent advancements in the NLP space have created significant interest in tools, applications, and methods for processing and understanding textual data, which is generated at the petabyte level on a daily basis. In this blog post, we discussed different strategies that can help one get the most out of pre-trained language models for their own data and tasks. In short, although these pre-trained language models advance the state-of-the-art results, they still need to be tailored to perform custom tasks using domain datasets, which are usually different from what these models have seen during training. One must be careful when fine-tuning these advanced language models in order to obtain improved results, as doing it wrong usually deteriorates model performance.
In addition to fine-tuning a pre-trained language model, one can pre-train a domain-specific language model with the BERT architecture using just the domain data. This approach requires a large amount of data and an extensive training procedure. For a given domain, intuitively, a model pre-trained on domain data should perform better than a fine-tuned BERT model since it will learn the style of a specific domain, rather than just general English. To exemplify this phenomenon, consider customer service language vs. general textbook language; their language distributions will be different since different domains use certain aspects of a language more frequently than others. At Y Meadows, we plan to develop a pre-trained language model that is specific to the customer service message domain and would better understand the customer service style of language. This is an exciting project and, to the best of our knowledge, it will be the first attempt for the customer service domain, since there is no publicly available pre-trained language model specifically trained on customer service text at the moment.