
Demystifying BERT: An Intuitive Dive into Transformer-based Language Models
The transformer neural network architecture was initially created to solve the problem of language translation. It was very well received because previous models, such as LSTM networks, had several issues.
Issues with LSTM Networks
LSTM networks are slow to train because words are processed sequentially, so it takes many time steps for the network to learn, and they are not the best at capturing the true meaning of words. Even bi-directional LSTMs, which learn the left-to-right and right-to-left contexts separately and then combine them, can lose some of the true context.
Advantages of Transformer Architecture
The transformer architecture addresses these concerns:
- Speed: Words can be processed simultaneously, making it faster.
- Context: Better context learning as it processes words in both directions simultaneously.
Components of the Transformer
The transformer consists of two key components: the encoder and the decoder.
- Encoder: Takes all the words of the input sentence (e.g., English) at once and generates an embedding for each word. These embeddings are vectors that capture the word’s meaning in context, with similar words having closer vector values (a short sketch follows this list).
- Decoder: Takes these embeddings, together with the previously generated words of the translated sentence, and generates the next word in the target language, repeating until the end of the sentence is reached.
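To make the encoder's output concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is prescribed by the text above): it feeds a sentence through BERT's encoder and checks that one contextual vector comes out per token.

```python
# A minimal sketch of what the encoder produces: one contextual vector per token.
# Assumes the Hugging Face `transformers` library and the `bert-base-uncased`
# checkpoint, used here purely for illustration.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional embedding per token (including the special [CLS] and [SEP] tokens).
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 8, 768])
```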
BERT and GPT Architectures
- GPT (Generative Pre-trained Transformer): Focuses on the decoder part of the transformer architecture.
- BERT (Bidirectional Encoder Representations from Transformers): Focuses on the encoder part. BERT can be used for tasks requiring language understanding, such as language translation, question answering, sentiment analysis, text summarization, and more.
Training Phases of BERT
Pre-training
The goal is to make BERT learn language and context using two unsupervised tasks (a code sketch of masked language modeling follows this list):
- Masked Language Modeling: BERT takes a sentence with random words masked. The goal is to predict these masked tokens, helping BERT understand bi-directional context within a sentence.
- Next Sentence Prediction: BERT takes two sentences and determines if the second sentence follows the first. This helps BERT understand context across different sentences.
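As a concrete illustration of masked language modeling, the sketch below, which assumes the Hugging Face transformers library and a pre-trained bert-base-uncased checkpoint, masks one word and lets BERT fill it in. Next sentence prediction is illustrated later, in the pre-training sketch under Detailed Explanation.

```python
# Sketch of masked language modeling: BERT predicts the token hidden behind [MASK].
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of the [MASK] token and take the most likely vocabulary entry.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # expected to be something like "paris"
```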
Fine-tuning
After pre-training, BERT can be fine-tuned for a specific NLP task (sketched in code after this list) by:
- Replacing the fully connected output layers with fresh output layers specific to the task (e.g., question answering).
- Performing supervised training on a dataset for that task. This step is relatively quick because only the new output layers are learned from scratch; the pre-trained parameters of the rest of the model are merely fine-tuned.
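A minimal sketch of the fine-tuning recipe, assuming the Hugging Face transformers library: the pre-trained encoder is loaded, a fresh classification head is attached, and a toy, invented labeled batch drives one supervised update.

```python
# Sketch of fine-tuning: a pre-trained encoder plus a freshly initialised
# classification head, trained with ordinary supervised learning.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The classification head (a small linear layer) is new and randomly initialised;
# the encoder weights are the pre-trained ones and are only fine-tuned.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy sentiment batch, invented for illustration.
batch = tokenizer(["I loved this movie", "This was a waste of time"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # returns the cross-entropy loss directly
outputs.loss.backward()
optimizer.step()
```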
Detailed Explanation
Pre-training
During pre-training, BERT trains on masked language modeling and next sentence prediction simultaneously. The input is a pair of sentences with some of their words masked, and each word is first converted into an embedding vector (see Generating Embeddings below).
- Output: A binary output for next sentence prediction (1 if sentence B follows sentence A, 0 if it doesn’t) and word vectors for the masked language modeling task.
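The sketch below shows both pre-training outputs side by side, assuming Hugging Face's BertForPreTraining, which bundles the masked-word head and the next-sentence head on top of the encoder; the sentence pair is an invented example.

```python
# Sketch of the two pre-training heads: word-level logits for masked language
# modeling and a 2-way "is sentence B the real next sentence?" output.
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

sentence_a = "The man went to the store."
sentence_b = "He bought a gallon of milk."
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.prediction_logits.shape)        # (1, sequence length, vocabulary size)
print(outputs.seq_relationship_logits.shape)  # (1, 2): does B follow A or not?
```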
Fine-tuning
For tasks like question answering, modify the inputs and the output layer (see the sketch after this list):
- Inputs: Pass the question followed by a passage containing the answer.
- Output: Predict the start and end words of the answer within the passage.
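A minimal question-answering sketch, assuming the Hugging Face transformers library and the bert-large-uncased-whole-word-masking-finetuned-squad checkpoint (a publicly available BERT model already fine-tuned on SQuAD); the question and passage are invented.

```python
# Sketch of extractive question answering: predict the start and end tokens of
# the answer span inside the passage.
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "Where do polar bears live?"
passage = "Polar bears live in the Arctic, where they hunt seals on the sea ice."
inputs = tokenizer(question, passage, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

start = outputs.start_logits.argmax().item()  # most likely first token of the answer
end = outputs.end_logits.argmax().item()      # most likely last token of the answer
answer_ids = inputs["input_ids"][0, start:end + 1]
print(tokenizer.decode(answer_ids))           # expected: something like "the arctic"
```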
Generating Embeddings
Embeddings are constructed from three vectors:
- Token Embeddings: Learned WordPiece embeddings with a vocabulary of about 30,000 tokens.
- Segment Embeddings: Sentence number encoded into a vector.
- Position Embeddings: Position of a word within a sentence encoded into a vector.
Adding these vectors together forms the embedding vector used as input to BERT.
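A minimal sketch of this addition, reading the three embedding tables out of a pre-trained Hugging Face BertModel (the library's own embedding module performs the same sum, followed by layer normalization and dropout):

```python
# Sketch: the input embedding is the sum of token, segment (token type) and
# position embeddings, taken here from a pre-trained BertModel.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
emb = model.embeddings

inputs = tokenizer("My dog is cute", "He likes playing", return_tensors="pt")
input_ids = inputs["input_ids"]                        # token ids
segment_ids = inputs["token_type_ids"]                 # 0 for sentence A, 1 for sentence B
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

token_emb = emb.word_embeddings(input_ids)             # vocabulary of ~30,000 WordPieces
segment_emb = emb.token_type_embeddings(segment_ids)   # which sentence the token is in
position_emb = emb.position_embeddings(position_ids)   # where in the sequence it sits

# Element-wise sum gives the vector fed into the encoder
# (the real module also applies layer normalization and dropout afterwards).
input_embedding = token_emb + segment_emb + position_emb
print(input_embedding.shape)  # (1, sequence length, 768) for BERT base
```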
Output Side
- Binary Value (C): For next sentence prediction.
- Word Vectors: For masked language modeling.
Training minimizes a cross-entropy loss; for masked language modeling, only the predictions at the masked positions are scored, which forces the model to use the surrounding context to fill in the blanks, as sketched below.
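A plain PyTorch sketch of that loss, assuming the usual convention of marking non-masked positions with the ignore index -100 so that only masked tokens are scored; all tensor values below are toy numbers invented for illustration.

```python
# Sketch of the masked-LM loss: cross-entropy over the vocabulary, but only at
# the positions that were actually masked (ignore_index skips everything else).
import torch
import torch.nn.functional as F

vocab_size = 30522                      # BERT's WordPiece vocabulary size
batch, seq_len = 2, 8                   # toy sizes, invented for illustration

logits = torch.randn(batch, seq_len, vocab_size)              # stand-in for BERT's word vectors
labels = torch.full((batch, seq_len), -100, dtype=torch.long)  # -100 = "do not score this position"
labels[0, 3] = 2054                                            # true ids of the masked tokens (toy values)
labels[1, 5] = 7592

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss)  # scalar loss driving pre-training
```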
Summary
BERT is pre-trained using masked language modeling and next sentence prediction. It then goes through a fine-tuning phase for specific NLP tasks, making it a versatile language model. The BERT large model, with 340 million parameters, can achieve higher accuracies compared to the smaller BERT base model with 110 million parameters.
This simplified explanation of BERT covers its structure, training phases, and how it can be fine-tuned for various NLP tasks.