BERT

Original Paper link: https://arxiv.org/abs/1810.04805

BERT, which stands for Bidirectional Encoder Representations from Transformers, represents a landmark shift in the field of Natural Language Processing (NLP). Its core contribution was to overcome the limitations of previous pre-training methods by introducing a framework for pre-training a deeply bidirectional model. Unlike its predecessors which were constrained to processing text in a single direction (left-to-right or right-to-left), BERT leverages a novel “Masked Language Model” (MLM) objective to learn information from the entire context of a word simultaneously. This, combined with a “Next Sentence Prediction” (NSP) task, allowed BERT to achieve a profound understanding of language. The pre-trained BERT model could then be quickly fine-tuned for a wide array of specific tasks, achieving state-of-the-art results on 11 major NLP benchmarks and establishing a new paradigm for transfer learning in NLP.

The Core Problem: The Unidirectionality Constraint

Before BERT, the dominant strategy for high-performance NLP was a two-stage process: pre-train a language model on a massive unlabeled text corpus, then fine-tune it on a smaller, task-specific labeled dataset. However, the pre-training methods were fundamentally limited.

The Pre-BERT Pre-training Task: Next-Word Prediction

The dominant pre-training strategy was unidirectional language modeling: predicting the next word in a sequence given the words that came before it. For the sentence “The cat sat on the mat”, the model would be trained to predict “cat” after seeing “The”, then “sat” after seeing “The cat”, and so on.

  • Loss Function: The model’s inaccuracy was measured by Cross-Entropy Loss, which quantifies the difference between the model’s predicted probability distribution for the next word and the actual correct word.

  • Architectural Limitation: This task created a significant conflict with powerful architectures like the Transformer. A full Transformer Encoder is inherently bidirectional; its self-attention mechanism allows every token to look at every other token in the sentence. If such an encoder were used for next-word prediction, it could simply "cheat" by looking at the word it was supposed to predict.

To resolve this conflict, models like OpenAI GPT had to cripple the Transformer by using a unidirectional "decoder" architecture. This was achieved with a masked self-attention mechanism that explicitly forbade a token from attending to any tokens to its right (i.e., future tokens). This architectural compromise was the central problem BERT aimed to solve, as this unidirectionality is sub-optimal for tasks requiring holistic understanding of the entire input.
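To make this concrete, here is a minimal NumPy sketch (an illustration, not code from the paper or from GPT) of how a causal mask forbids each position from attending to positions on its right:

```python
import numpy as np

def causal_attention_weights(q, k):
    """Toy scaled dot-product attention with a causal (left-to-right) mask.

    Each position may only attend to itself and positions to its left,
    which is how decoder-style models avoid "seeing" the token they are
    trained to predict.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)            # (seq_len, seq_len) raw attention scores
    mask = np.triu(np.ones_like(scores), k=1)  # 1s above the diagonal = future positions
    scores = np.where(mask == 1, -1e9, scores) # forbid attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: 4 tokens with 8-dimensional query/key vectors
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
print(causal_attention_weights(q, k).round(3))  # upper triangle is ~0
```

Every row of the resulting attention matrix places zero weight on the positions to its right; removing exactly this constraint, without letting the model cheat, is what BERT’s pre-training tasks accomplish.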

BERT’s Core Solution: Novel Pre-training Tasks

BERT’s innovation was not to invent a new architecture, but to invent new pre-training tasks that were compatible with a fully bidirectional Transformer encoder.

Task #1: Masked Language Model (MLM)

The MLM task is the core idea that enables deep bidirectionality. Instead of predicting the next word, BERT predicts randomly hidden words.

  • Process: 15% of the tokens in an input sequence are randomly selected. These tokens are then replaced according to a specific strategy: 80% are replaced with a [MASK] token, 10% with a random word, and 10% are left unchanged.

  • Objective: The model’s goal is to predict the original value of these replaced tokens. To do this, it must rely on the full context surrounding them—both to the left and the right.

  • Model Output and Loss: The BERT model outputs a final hidden vector for every input token. For the MLM task, only the vectors corresponding to the masked positions are fed into a final softmax layer to produce a probability distribution over the vocabulary. Critically, the loss is only calculated on these 15% of predictions; the other 85% of tokens are used purely as context. This makes the MLM task different from a denoising autoencoder, which would try to reconstruct the entire input.

This pre-training objective forces the model to learn rich, contextual representations for every token, as it never knows which ones it will be asked to predict.
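A minimal Python sketch of the 80/10/10 corruption strategy described above. This is an illustration of the procedure only, not the paper’s actual data pipeline; `mask_tokens` and its vocabulary argument are hypothetical:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply BERT-style MLM corruption to a list of tokens (illustrative sketch).

    15% of positions are selected; of those, 80% become [MASK], 10% become a
    random vocabulary token, and 10% are left unchanged. Returns the corrupted
    tokens and the prediction targets (None where no loss is computed).
    """
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                         # loss is computed only here
            r = random.random()
            if r < 0.8:
                corrupted.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                   # 10%: keep the original token
        else:
            corrupted.append(tok)
            targets.append(None)                        # context-only position
    return corrupted, targets

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, vocab=tokens))
```

In a real pipeline the targets would be vocabulary indices, and the loss would be computed only at positions where a target is present, matching the rule that the other 85% of tokens serve purely as context.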

Task #2: Next Sentence Prediction (NSP)

To teach the model to understand relationships between sentences, BERT was also trained on a binary classification task. The model is given two sentences, A and B; half of the time B is the actual sentence that follows A in the corpus, and half of the time it is a random sentence from the corpus. The model must predict which case it is, which forces it to learn about sentence coherence and logical flow.
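A small sketch of how such training pairs can be assembled. The helper below is hypothetical and assumes a corpus that has already been split into sentences:

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Build one Next Sentence Prediction training pair (illustrative sketch).

    With 50% probability, sentence B is the true continuation of A ("IsNext");
    otherwise B is a random sentence from the corpus ("NotNext").
    """
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        return sent_a, doc_sentences[i + 1], "IsNext"
    # A real pipeline would also make sure the random sentence is not the true continuation.
    return sent_a, random.choice(all_sentences), "NotNext"

doc = ["the man went to the store.",
       "he bought a gallon of milk.",
       "then he walked home."]
corpus = doc + ["penguins are flightless birds."]
print(make_nsp_example(doc, corpus))
```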

The BERT Framework: Architecture and Implementation

Model Architecture

BERT’s architecture is a multi-layer Transformer Encoder, as described in the original “Attention Is All You Need” paper. The paper presents two sizes:

  • \(BERT_{BASE}\): 12 layers, 768-dimensional hidden vectors, 12 attention heads (roughly 110M parameters).

  • \(BERT_{LARGE}\): 24 layers, 1024-dimensional hidden vectors, 16 attention heads (roughly 340M parameters).
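For reference, these two sizes map onto configuration objects as follows. This uses the Hugging Face `transformers` library purely for illustration; it is not part of the original paper:

```python
from transformers import BertConfig

# Hyperparameters of the two published model sizes, expressed as configs.
bert_base = BertConfig(
    num_hidden_layers=12,     # L = 12 Transformer encoder layers
    hidden_size=768,          # H = 768-dimensional hidden vectors
    num_attention_heads=12,   # A = 12 self-attention heads
)
bert_large = BertConfig(
    num_hidden_layers=24,     # L = 24
    hidden_size=1024,         # H = 1024
    num_attention_heads=16,   # A = 16
    intermediate_size=4096,   # feed-forward size is conventionally 4H
)
print(bert_base, bert_large)
```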

A Critical Consequence of the Architecture

Because BERT is an encoder pre-trained with MLM, it is not designed for causal generation (i.e., open-ended text generation like “what comes next?”). Its strength lies in understanding and representing existing text, not in generating new text autoregressively. This is a fundamental trade-off against decoder-style models like GPT.

Input Representation

A key feature of BERT is its ability to handle both single sentences and sentence pairs. The input representation for each token is the sum of three embeddings:

  1. Token Embeddings: The embedding for the specific word or sub-word piece.

  2. Segment Embeddings: An embedding indicating if the token belongs to Sentence A or Sentence B.

  3. Position Embeddings: An embedding that encodes the token’s position in the sequence, compensating for the Transformer’s lack of inherent order.

The input is also framed by two special tokens: [CLS] at the beginning, whose final representation is used for sequence-level classification tasks, and [SEP] to separate sentences.
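A NumPy sketch of how the three embeddings combine for a sentence pair. The token and segment IDs below are illustrative placeholders, and the embedding tables are randomly initialized rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, hidden = 30522, 512, 2, 768

# Three embedding tables (random here purely for illustration).
token_emb    = rng.normal(size=(vocab_size, hidden))
segment_emb  = rng.normal(size=(n_segments, hidden))
position_emb = rng.normal(size=(max_len, hidden))

# Hypothetical encoding of "[CLS] <sentence A> [SEP] <sentence B> [SEP]"
token_ids   = np.array([7, 1154, 2090, 3311, 8, 940, 2751, 8])  # placeholder IDs
segment_ids = np.array([0, 0,    0,    0,    0, 1,   1,    1])  # 0 = sentence A, 1 = sentence B
positions   = np.arange(len(token_ids))                         # 0, 1, 2, ...

# Each token's input representation is the element-wise sum of the three embeddings.
input_repr = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(input_repr.shape)  # (8, 768)
```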

The Two-Stage Process: Pre-training and Fine-tuning

BERT’s workflow is a simple yet powerful two-stage process:

  1. Pre-training: The expensive, one-time phase where the model is trained on MLM and NSP tasks using a massive unlabeled dataset (Wikipedia, BooksCorpus). This results in a powerful, general-purpose language model.

  2. Fine-tuning: A much faster and cheaper phase. The model is initialized with the pre-trained weights, a minimal task-specific output layer is added, and the entire model is then trained for a few epochs on a small, labeled dataset for a specific downstream task (e.g., sentiment analysis, question answering).
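As an illustration of step 2, here is a hedged fine-tuning sketch using the Hugging Face `transformers` library (not the paper’s original codebase) on a toy two-example sentiment batch:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# A fresh classification head is placed on top of the pre-trained encoder,
# and the whole model is tuned end-to-end.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts  = ["a delightful film", "a tedious, joyless slog"]   # toy sentiment data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                          # a few epochs suffice
    optimizer.zero_grad()
    out = model(**batch, labels=labels)                     # loss from the [CLS]-based head
    out.loss.backward()
    optimizer.step()
```

The learning rate and epoch count follow the small ranges the paper recommends for fine-tuning (on the order of 2e-5 to 5e-5 for 2 to 4 epochs).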

Experimental Results and Key Findings

BERT’s performance was a watershed moment for NLP. It established new state-of-the-art results on 11 diverse tasks. The paper’s ablation studies validated its core design choices:

  • Bidirectionality is Crucial: An ablation experiment directly comparing BERT to a left-to-right (LTR) model of the same size showed that the LTR model performed significantly worse on all tasks, demonstrating that the deep bidirectionality enabled by the MLM task is “strictly more powerful.” The NSP task was also shown to be vital for tasks requiring sentence-pair understanding.

  • Model Size Matters: Larger BERT models led to strictly better accuracy across all tasks, even those with very small datasets. This demonstrated that with sufficient pre-training, scaling model size provides significant benefits.

  • BERT is Versatile: The model proved highly effective for both the primary fine-tuning approach and a feature-based approach (where BERT’s embeddings are used as input to another model), making it a flexible tool for a wide range of applications.
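A minimal sketch of the feature-based usage (again using the Hugging Face `transformers` library as a stand-in for the paper’s setup): the encoder is kept frozen and its hidden states are handed to a separate downstream model.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()  # frozen: the encoder weights are not fine-tuned

with torch.no_grad():
    batch = tokenizer(["BERT used as a frozen feature extractor"], return_tensors="pt")
    outputs = encoder(**batch)

token_features = outputs.last_hidden_state  # (1, seq_len, 768) contextual embeddings
cls_vector = token_features[:, 0]           # [CLS] vector for a downstream classifier
print(token_features.shape, cls_vector.shape)
```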