Transformers (BERT and GPT)

Transformers are a type of deep learning model architecture with significant capability for handling NLP tasks. This has made them broadly used in NLP tasks like machine translation, text summarization, question answering, and language understanding. Pre-trained transformer models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) have achieved remarkable performance and have been used as the foundation for many downstream applications in natural language understanding.

The basic components of a transformer are the encoder and the decoder. The encoder processes the input sequence, while the decoder generates the output sequence. Both the encoder and the decoder consist of multiple layers of self-attention mechanisms and feedforward neural networks. Encoders and decoders can be used standalone or combined, resulting in encoder-only, decoder-only, and encoder-decoder models.
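To make this structure concrete, here is a minimal sketch of an encoder-decoder pair built from PyTorch's standard transformer layers; the layer sizes, number of layers, and random inputs are illustrative choices, not values prescribed by this text.

    import torch
    import torch.nn as nn

    d_model, nhead = 64, 4  # small, illustrative sizes

    # Encoder stack: layers of self-attention + feedforward networks
    enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    # Decoder stack: self-attention, cross-attention over the encoder output, and feedforward networks
    dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    decoder = nn.TransformerDecoder(dec_layer, num_layers=2)

    src = torch.rand(1, 10, d_model)   # input sequence: 1 batch, 10 tokens, d_model features per token
    tgt = torch.rand(1, 7, d_model)    # output sequence generated so far

    memory = encoder(src)              # the encoder turns the input into contextual representations
    out = decoder(tgt, memory)         # the decoder attends to its own tokens and to the encoder output
    print(out.shape)                   # torch.Size([1, 7, 64])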

  • Encoders

Encoders are one of the basic building blocks of transformers. Encoders are used for processing the input sequence, which can be a set of words or tokens. An encoder processes the input sequence and converts each token into a corresponding meaningful numerical representation, which is an array of numbers for a single word or token. These numerical representations are called feature vectors. An example of feature vectors for a sequence of tokens is shown below:

Transformer, tutorial, from, Styrish, AI -> [.3, .23, .5], [.42, .22, .75], [.14, .33, .61], [.5, .21, .14], [.58, .77, .6]

As shown above, each token is mapped to its corresponding feature vector. The feature vector for a specific token is generated by the transformer's self-attention mechanism, which computes attention scores for each token in the sequence in relation to all other tokens in the sentence (the attention score for the same token can differ across sequences, depending on its context). These scores reflect the importance of each word when forming the representation of the token. In encoders, the self-attention mechanism assigns a token's attention scores based on the words on both sides of it (left and right), to capture the importance of the target token in its specific context.
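As a concrete illustration, the short sketch below extracts one feature vector per token from a pre-trained BERT encoder through the Hugging Face transformers library; the model name and example sentence are illustrative choices, and the exact token count depends on how the tokenizer splits the words.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Transformer tutorial from Styrish AI", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One feature vector per token: shape (batch_size, sequence_length, 768) for bert-base,
    # where the sequence also includes the special [CLS] and [SEP] tokens.
    features = outputs.last_hidden_state
    print(features.shape)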

Thus, these models are characterized as having a bi-directional attention mechanism and are often called auto-encoding models. Models designed for NLP tasks like text classification and sentiment analysis usually have encoders only, as these tasks require both sides of the sequence (left and right) in order to deduce a token's importance within the context. For example, consider "I am going for running." and "Running is important to stay fit.". Here, "running" may have a different importance in each sequence, which the model can decide by moving across the sequence in both directions (right to left and left to right); a small sketch comparing the two contexts follows below. One of the popular encoder models based on the transformer architecture is BERT (Bidirectional Encoder Representations from Transformers), which was developed by Google AI and is primarily used for NLP tasks such as text classification (sentiment analysis, spam detection, etc.) and question answering.
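The sketch below compares the contextual feature vector of "running" in the two example sentences, assuming "running" is a single token in BERT's vocabulary; the model name is again an illustrative choice.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embedding_of(sentence, word):
        """Return the contextual feature vector of `word` inside `sentence`."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return hidden[tokens.index(word)]

    v1 = embedding_of("i am going for running.", "running")
    v2 = embedding_of("running is important to stay fit.", "running")

    # Same word, different contexts -> different (though related) feature vectors.
    print(torch.cosine_similarity(v1, v2, dim=0).item())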

Let's understand BERT. BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model based on the transformer architecture and is mainly used for NLP tasks such as text classification, sentiment analysis, and spam detection. As discussed above, BERT is an encoder-based model that travels across the sequence in both directions to decide the importance of a target token in the input sequence. The algorithm that performs this task is called the self-attention mechanism.

  • BERT (Bidirectional Encoder Representations from Transformers)

  1. Pre-training of BERT

BERT has been trained on a large amount of publicly available text from the internet. This process is called pre-training. BERT's pre-training includes a masked-word prediction task, where a word within the context is masked and BERT learns to predict it, as well as a next-sentence prediction task for understanding the relationship between sentences. A pre-trained BERT model can be loaded through the Hugging Face pipelines API and is ready to use.
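For example, the sketch below loads BERT through the Hugging Face pipeline API. The fill-mask pipeline demonstrates the masked-word prediction task BERT is pre-trained on, and the question-answering pipeline uses the plain pre-trained checkpoint; the model name, prompts, and context text are illustrative choices.

    from transformers import pipeline

    # Masked-word prediction -- the task BERT is pre-trained on.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    print(fill_mask("The capital of France is [MASK]."))

    # Question answering with the plain pre-trained checkpoint. The question-answering
    # head is not fine-tuned yet, so the answers are usually poor (see the note below).
    qa = pipeline("question-answering", model="bert-base-uncased")
    print(qa(question="What is BERT used for?",
             context="BERT is a transformer encoder model used for text classification "
                     "and question answering."))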

As the example above shows, BERT can be loaded from a Hugging Face pipeline and used for a question-answering task. However, the plain pre-trained model does not give very effective answers; for that, we need to fine-tune the pre-trained model on a task-specific dataset. Below is the same example using a Hugging Face fine-tuned model. (You can also fine-tune BERT on your own custom dataset.)
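A minimal sketch of the same question-answering example, this time using one publicly available BERT checkpoint that has already been fine-tuned on the SQuAD question-answering dataset (the specific model name is just one example):

    from transformers import pipeline

    # A BERT checkpoint already fine-tuned for question answering.
    qa = pipeline("question-answering",
                  model="bert-large-uncased-whole-word-masking-finetuned-squad")

    result = qa(question="What is BERT used for?",
                context="BERT is a transformer encoder model used for text classification "
                        "and question answering.")
    print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}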

  2. Finetuning of BERT

Fine-tuning of BERT involves steps similar to those needed for training any traditional deep learning model. Fine-tuning means adapting a pre-trained transformer model to a custom dataset. Any pre-trained transformer model can be loaded from Hugging Face and further fine-tuned on a specific dataset (the dataset may be specific to your organization or use case). The fine-tuning process includes the steps below; a minimal end-to-end sketch follows the list.

  1. Select a pre-trained model.

    • Load a pre-trained transformer model that is suitable for your task. The Hugging Face library provides various pre-trained BERT models available for further fine-tuning.

  2. Prepare the dataset.

    • Prepare the dataset specific to your requirements, on which the BERT model will be fine-tuned.

    • Ensure that data is formatted appropriately.

  3. Tokenize the data and create embeddings.

    • Tokenize the data using the BERT tokenizer. The same tokenizer that BERT was pre-trained with should be used.

    • Create the encoded inputs from the tokenized data. These include token IDs, segment IDs, and attention masks, from which the model builds its embeddings.

  4. Train and validate the model.

    1. Run the training of BERT on the prepared dataset. This process is similar to training other deep learning models: compute the loss, update the weights using backpropagation, and apply the optimization step.

    2. Run validation on the validation dataset, making sure the loss is decreasing and the accuracy is increasing after each iteration or epoch.

  5. Save the model.

    1. Once satisfied with validation results, save the model for future use.
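Putting the steps together, here is a minimal end-to-end sketch of fine-tuning BERT for a binary text-classification task with PyTorch and Hugging Face transformers; the tiny in-memory dataset, label scheme, hyperparameters, and output directory name are illustrative assumptions, not part of the attached lab exercise.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # 1. Select a pre-trained model (binary classification head assumed here).
    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # 2. Prepare the dataset (tiny in-memory example; replace with your own data).
    texts = ["I loved this movie", "Worst purchase ever"]
    labels = [1, 0]

    # 3. Tokenize with the same tokenizer BERT was pre-trained with; this produces
    #    token IDs and attention masks (and segment IDs for sentence pairs).
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
    loader = DataLoader(dataset, batch_size=2, shuffle=True)

    # 4. Training: compute the loss, back-propagate, and update the weights.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for epoch in range(3):
        for input_ids, attention_mask, y in loader:
            optimizer.zero_grad()
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: loss {out.loss.item():.4f}")

    # Validation would follow the same pattern on a held-out split, with model.eval()
    # and torch.no_grad(), tracking the loss and accuracy after each epoch.

    # 5. Save the fine-tuned model and tokenizer for future use.
    model.save_pretrained("bert-finetuned-classifier")
    tokenizer.save_pretrained("bert-finetuned-classifier")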

(Lab exercise: Fine-tuning BERT using PyTorch is attached at the bottom of the page.)


  • Decoders