GPT-2 - Language Models are Unsupervised Multitask Learners

Original Paper link: https://cdn.openai…learners.pdf

Executive Summary

This paper introduces GPT-2, a large-scale language model, and presents a groundbreaking hypothesis: that a sufficiently large neural network trained on a massive, high-quality, and diverse dataset can learn to perform a wide range of tasks (like translation, summarization, and question answering) without any explicit supervision.

The central idea is to move away from the dominant “supervised learning” paradigm, which requires creating a separate, hand-labeled dataset for every single task. The authors argue this supervised approach is:

  • Brittle: Models trained this way are “narrow experts” that fail when inputs change slightly.

  • Unscalable: It is not feasible to manually create labeled datasets for the thousands of tasks we might want an AI to perform.

The paper proposes that the simple, unsupervised objective of next-token prediction, when scaled up massively, is a “meta-task” that forces the model to learn grammar, facts, reasoning, and even the “patterns” of various tasks found in the data. This positions language modeling as a path toward more general and robust AI.

What is the core approach?

The approach has four main pillars:

  1. A new, massive, high-quality dataset (WebText).
  2. An adaptive, universal input representation (Byte Pair Encoding).
  3. A scaled-up Transformer architecture (GPT-2).
  4. A new evaluation method (“zero-shot” learning).

The Training Objective: Unsupervised Language Modeling

The model is trained on a single, simple task: guess the next word in a sequence of text. The core hypothesis is that for a model to get extremely good at this on a massive, varied dataset, it must implicitly learn concepts like grammar, facts, reasoning, and even the “patterns” of tasks (e.g., a news article followed by TL;DR: and a summary).

The model optimizes the probability of a text sequence \(x\) by factorizing it as a product of conditional probabilities:

\[p(x)=\prod_{i=1}^{n}p(s_{i}\mid s_{1},\ldots,s_{i-1})\]

Where:

  • \(p(x)\): The probability of a sequence of symbols \(x\).
  • \(x\): A sequence of symbols (words or tokens) \((s_{1},s_{2},...,s_{n})\).
  • \(p(s_i \mid s_1, ..., s_{i-1})\): The conditional probability of the \(i\)-th symbol, given all of the preceding symbols.

It is critical to distinguish how this objective functions during training versus during inference (a minimal code sketch of both modes follows the two cases below):

  1. During Training (Optimizing): The model is given the correct sequence (e.g., Delhi is the capital). At each step, it is forced to maximize the probability it assigns to the known correct next token. For the context Delhi is the, the model is adjusted to increase the probability it assigns to the token capital.
  2. During Inference (Generating): After training, the model’s weights are frozen. If we give the trained model the prompt Delhi is the, it outputs a probability distribution for all possible next tokens. We then select a token (e.g., the one with the highest probability, capital) to generate the next word.
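
To make the two modes concrete, here is a minimal sketch in PyTorch. The toy vocabulary, the one-layer model, and the single training step are illustrative assumptions for exposition only; GPT-2 itself is a deep Transformer trained at vastly larger scale.

```python
# Minimal sketch of next-token training vs. greedy inference (PyTorch assumed).
# The toy model below only conditions on the current token; a real GPT-2
# attends to the entire preceding context.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = {"Delhi": 0, "is": 1, "the": 2, "capital": 3, "of": 4, "India": 5}

def to_ids(words):
    return torch.tensor([vocab[w] for w in words])

class ToyLM(nn.Module):
    def __init__(self, vocab_size=6, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):                    # x: (seq_len,) token ids
        return self.head(self.embed(x))      # (seq_len, vocab_size) logits

model = ToyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# --- Training: push up the probability of the known correct next token ---
tokens = to_ids(["Delhi", "is", "the", "capital", "of", "India"])
logits = model(tokens[:-1])                  # predictions for positions 1..n
loss = F.cross_entropy(logits, tokens[1:])   # average of -log p(s_i | context)
loss.backward()
optimizer.step()

# --- Inference: weights frozen, generate a continuation for a prompt ---
with torch.no_grad():
    prompt = to_ids(["Delhi", "is", "the"])
    next_id = model(prompt)[-1].argmax().item()  # greedy: most probable next token
    # A well-trained model would put most of its probability mass on "capital" here.
    print("predicted next token id:", next_id)
```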

The Dataset: WebText

The paper’s hypothesis requires a “sufficiently large and diverse” dataset. Existing datasets were unsuitable:

  • Small Benchmarks (e.g., Penn Treebank): Too small and single-domain.
  • Large Web Scrapes (e.g., Common Crawl): Massive but low-quality, full of “junk text.”

The “unintelligible or junk text” found in Common Crawl is a key problem. It is not just poorly written prose, but non-linguistic or incoherent text that gets scraped from websites, including:

  1. Boilerplate Text: Navigation menus, “Copyright 2024”, “Terms of Service”, “Login”, “View Cart.”
  2. SEO Spam: Pages auto-generated to stuff keywords (e.g., “Best cheap laptops. Our cheap laptops are best. Buy cheap laptops.”).
  3. Garbled/Machine Text: Placeholder text like “Lorem ipsum…”, error messages, or jumbled characters from encoding errors.

If a model trains on this data, it wastes capacity learning to predict “junk” rather than useful human knowledge.

To solve this, the authors created WebText (a toy illustration of the filtering heuristic follows this list):

  • Source: All outbound links from the social media platform Reddit.
  • Filtering: They only kept links that received at least 3 karma (user upvotes).
  • The Heuristic: Using “3 karma” as a proxy for quality assumes that if a human found the link interesting or valuable, the content is likely to be coherent, human-written text.
  • Final Dataset: 40 GB of text from ~8 million documents (with all Wikipedia content removed to prevent data contamination on tests).
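
The paper describes the heuristic but not its implementation; the sketch below is a purely hypothetical illustration of the filtering idea (the input format, field names, and Wikipedia check are assumptions).

```python
# Hypothetical illustration of the WebText filtering heuristic:
# keep outbound links whose Reddit posts earned at least 3 karma,
# drop duplicates, and exclude Wikipedia to avoid test-set contamination.
def filter_links(submissions, min_karma=3):
    seen = set()
    for post in submissions:                 # post: {"url": ..., "karma": ...}
        url = post["url"]
        if post["karma"] >= min_karma and "wikipedia.org" not in url and url not in seen:
            seen.add(url)
            yield url

posts = [
    {"url": "https://example.com/good-article", "karma": 57},
    {"url": "https://example.com/keyword-spam", "karma": 0},
    {"url": "https://en.wikipedia.org/wiki/Delhi", "karma": 12},
]
print(list(filter_links(posts)))  # -> ['https://example.com/good-article']
```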

The Input Representation: Byte Pair Encoding (BPE)

Before the model can process the WebText data, the text must be converted into a list of numeric tokens. This is a critical step, and the paper avoids two common but flawed approaches (both failure modes are illustrated in the short snippet after this list):

  • Word-level Tokenization: This treats every word (e.g., “apple”, “banana”) as a unique token. The problem is that the vocabulary becomes massive (500,000+ words), and the model has no way to handle “Out-of-Vocabulary” (OOV) words. If it sees a new word like “anthropomorphism”, it maps it to a single <UNK> (unknown) token, losing all meaning.
  • Character-level Tokenization: This treats every character (“a”, “p”, “p”, “l”, “e”) as a token. This solves the OOV problem (it can represent any word), but it is extremely inefficient. The model must learn the concept of “apple” from scratch, and text sequences become very long, straining the model’s context window.
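
A tiny illustration of both failure modes (the four-word vocabulary is an assumption made up for this example):

```python
# Word-level: any unseen word collapses to a single <UNK> id, losing its meaning.
word_vocab = {"the": 0, "apple": 1, "banana": 2, "<UNK>": 3}
word_ids = [word_vocab.get(w, word_vocab["<UNK>"]) for w in "the anthropomorphism".split()]
print(word_ids)                       # -> [0, 3]  ("anthropomorphism" is just <UNK>)

# Character-level: nothing is unknown, but sequences become very long.
char_ids = [ord(c) for c in "the anthropomorphism"]
print(len(char_ids))                  # -> 20 tokens for a two-word phrase
```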

The Solution: Byte Pair Encoding (BPE)

BPE is an adaptive “middle-ground” solution. It is a sub-word tokenization algorithm that “learns” its vocabulary from the data.

How it works (Simplified Example):

  1. Start: The initial vocabulary consists of all individual characters (e.g., {'b', 'g', 'h', 'i', 'n', 's', 'u'}).
  2. Analyze Corpus: It analyzes the training data (e.g., {"hug", "hugging", "bug", "bugs"}).
  3. Find & Merge: It finds the most frequently occurring adjacent pair of tokens. In this example, (u, g) is the most common pair (it appears 4 times).
  4. Update Vocab: It “merges” this pair into a new, single token "ug" and adds it to the vocabulary. The vocab is now {'b', 'g', 'h', 'i', 'n', 's', 'u', 'ug'}.
  5. Repeat: It repeats this process. After the first merge, the most common remaining pairs are ("h", "ug") and ("b", "ug") (each appearing twice), so the next merges produce the new tokens "hug" and "bug".
  6. This process is repeated for a set number of merges (e.g., 50,000), resulting in a vocabulary that contains common characters, common sub-words (like "ing"), and very common full words (like "the"). A minimal code sketch of this merge loop appears directly below.
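
Here is a minimal, self-contained sketch of that merge loop on the same toy corpus. It ignores details of real BPE implementations (word frequencies, pre-tokenization, tie-breaking rules), so the exact merge order is only indicative.

```python
# Minimal sketch of the BPE merge loop: repeatedly merge the most frequent
# adjacent pair of tokens across the corpus into a new, single token.
from collections import Counter

def bpe_merges(corpus, num_merges):
    tokenized = [list(word) for word in corpus]   # start from character-level tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in tokenized:
            pairs.update(zip(toks, toks[1:]))     # count adjacent pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
        merges.append(a + b)
        for toks in tokenized:                    # apply the merge everywhere
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, tokenized

merges, tokenized = bpe_merges(["hug", "hugging", "bug", "bugs"], num_merges=3)
print(merges)     # e.g. ['ug', 'hug', 'bug'] -- exact order depends on tie-breaking
print(tokenized)  # e.g. [['hug'], ['hug', 'g', 'i', 'n', 'g'], ['bug'], ['bug', 's']]
```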

Advantages of BPE:

  1. Eliminates OOV: It provides the “best of both worlds.” It can represent any new or rare word (like “anthropomorphism”) by breaking it down into sub-word pieces it knows (e.g., ["anthro", "po", "morph", "ism"]), as the snippet after this list demonstrates.
  2. Efficiency: It is far more efficient than character-level, as common words are still represented as single tokens.
  3. Shared Meaning: The model can infer meaning about new words. It can understand that “hugging” is related to “hug” because they both contain the "hug" token.
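
As a concrete check, the vocabulary GPT-2 eventually shipped with can be inspected via the third-party tiktoken package (not part of the paper; `pip install tiktoken` is assumed):

```python
# Inspect how the released GPT-2 BPE vocabulary splits a rare word.
import tiktoken  # third-party package that bundles the GPT-2 encoding

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("anthropomorphism")
print(ids)                              # token ids -- no <UNK> required
print([enc.decode([i]) for i in ids])   # the sub-word pieces those ids stand for
```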

GPT-2’s Specific Implementation (Byte-Level BPE)

The authors use a byte-level BPE. Instead of starting from a base vocabulary of roughly 130,000 Unicode code points, they start from the 256 possible byte values. This is a crucial decision: it means the model can represent literally any text string in any language, including emojis or garbled code, without ever needing an <UNK> token. It is a truly universal input representation.
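
A quick standard-library demonstration of why 256 byte values are a sufficient base alphabet (the example string is arbitrary):

```python
# Any text in any script -- including emoji -- is ultimately a sequence of
# bytes in the range 0-255, so a byte-level vocabulary never meets an
# unknown symbol.
text = "Déjà vu 👋"
byte_values = list(text.encode("utf-8"))
print(byte_values)                          # each value is in range(256)
print(all(b < 256 for b in byte_values))    # True
```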

The Model: GPT-2 Architecture

The model is a scaled-up version of the first GPT, a “decoder-only” Transformer architecture. To test their “scaling” hypothesis, the authors trained four sizes:

  • Small (117M parameters): Identical to the original GPT.
  • Medium (345M parameters): Similar in size to BERT-Large.
  • Large (762M parameters).
  • Extra-Large (1.5B parameters): The full model, referred to as GPT-2.

To successfully train a model this deep (e.g., 48 layers in the 1.5B version), the authors made several key architectural modifications:

  1. Layer Normalization (Pre-activation): In the original Transformer block, Layer Normalization (LayerNorm) is applied after each sub-block (self-attention and feed-forward). In GPT-2, it is moved to the input of each sub-block. This “pre-activation” style gives the gradient a cleaner path through the residual connections, which is crucial for stabilizing very deep networks.
  2. Additional Layer Normalization: A single, final LayerNorm operation was added after the last Transformer block to ensure the final output of the network was well-scaled and stable.
  3. Modified Initialization: The initial weights of the residual layers were scaled by a factor of \(1/\sqrt{N}\) (where \(N\) is the number of residual layers). This prevents the outputs of many layers from “accumulating” and “exploding” at the start of training.
  4. Increased Context Size (1024 tokens): The model’s “field of view” was doubled from 512 to 1024 tokens. This was done by creating a larger learned positional embedding table (1024 entries vs. 512) and training the model on longer text sequences. This longer “ruler” allows the model to learn relationships across a much wider span of text, which is essential for tasks like summarization. (A condensed sketch of these four modifications appears below.)
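
A condensed PyTorch sketch of those four modifications, under stated assumptions: the layer sizes are illustrative, the attention module is PyTorch's stock implementation rather than OpenAI's, and only the structural points above are meant to be faithful.

```python
# Sketch of GPT-2's structural changes: pre-LayerNorm blocks, a final LayerNorm,
# residual-projection weights scaled by 1/sqrt(N), and a 1024-entry learned
# position table. Sizes are illustrative, not the released configurations.
import math
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim, heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, mask):
        # Pre-activation: normalize *before* each sub-block, then add the residual.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x

class TinyGPT2(nn.Module):
    def __init__(self, vocab=50257, ctx=1024, dim=768, layers=12):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(ctx, dim)        # doubled context: 1024 positions
        self.blocks = nn.ModuleList(Block(dim) for _ in range(layers))
        self.ln_f = nn.LayerNorm(dim)            # extra LayerNorm after the last block
        self.head = nn.Linear(dim, vocab, bias=False)
        # Scale residual-path output projections by 1/sqrt(number of residual layers).
        n_residual = 2 * layers                  # one attention + one MLP residual per block
        for block in self.blocks:
            for proj in (block.attn.out_proj, block.mlp[-1]):
                proj.weight.data *= 1.0 / math.sqrt(n_residual)

    def forward(self, ids):                      # ids: (batch, seq_len)
        seq = ids.shape[1]
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=ids.device), 1)
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        for block in self.blocks:
            x = block(x, causal)
        return self.head(self.ln_f(x))           # (batch, seq_len, vocab) next-token logits

logits = TinyGPT2(vocab=1000, layers=2)(torch.randint(0, 1000, (1, 16)))
print(logits.shape)   # torch.Size([1, 16, 1000])
```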

The Evaluation: Zero-Shot Task Transfer

This is the key method for testing the paper’s hypothesis.

  • Zero-Shot: The model’s parameters are frozen after its unsupervised training. It is evaluated on a new task (e.g., translation) without receiving any examples or gradient updates from that task’s training set.

The authors “condition” the model by providing a prompt that mimics the format of the desired task. The model’s “answer” is simply its generated continuation of that prompt (see the prompt sketch after the examples below).

Examples:

  • Reading Comprehension: The model is fed the document, the conversation history, and the final token A:. The text it generates next is its answer.
  • Summarization: The model is fed the full article text followed by the prompt TL;DR:. The text it generates next is its summary.
  • Translation: The model is prompted with a few examples (english sentence = french sentence) and then the final prompt (new english sentence =).
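
Concretely, the prompts are just strings. The snippet below is an illustrative mock-up: the article, document, and sentence contents are invented, and only the trailing cues follow the paper's descriptions.

```python
# Illustrative zero-shot prompt formats; the model's generated continuation
# of each string is taken as its "answer".
article = "(full news article text goes here)"
summarization_prompt = f"{article}\nTL;DR:"

document = "(reading-comprehension passage goes here)"
history = "Q: Who founded the company?\nA: Two students.\nQ: When?"
reading_prompt = f"{document}\n{history}\nA:"

translation_prompt = (
    "good morning = bonjour\n"
    "thank you = merci\n"
    "how are you? ="
)

for prompt in (summarization_prompt, reading_prompt, translation_prompt):
    print(prompt, end="\n---\n")
```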

What are the key findings?

The paper’s results are presented in two parts: its performance as a language model and its performance as a zero-shot task learner.

Findings: Language Modeling

Before testing tasks, the authors confirmed GPT-2 was a powerful language model.

  • The Result: GPT-2 (1.5B) achieved state-of-the-art (SOTA) results on 7 out of 8 standard language modeling datasets in a zero-shot setting.

The one dataset GPT-2 did not achieve SOTA on was the One Billion Word (1BW) Benchmark. This was not a failure of the model, but a fundamental mismatch between the model’s design and the dataset’s construction. The 1BW benchmark’s creators “pre-processed” the data by shuffling all the sentences, thereby destroying all long-range relationships between them. GPT-2’s core strength (its 1024-token context window) was specifically designed to learn these exact long-range dependencies. Because the 1BW benchmark intentionally removes this structure, it nullifies the model’s primary advantage, leading to poor performance.

  • Key Example (LAMBADA): In sharp contrast, on datasets designed to test long-range dependencies like LAMBADA, GPT-2 drastically improved the SOTA perplexity (a measure of “confusion”, where lower is better) from 99.8 to 8.6.
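
For reference, perplexity is the exponentiated average negative log-likelihood the model assigns to the test tokens, in the same notation as the factorization above:

\[\text{PPL}(x)=\exp\!\left(-\frac{1}{n}\sum_{i=1}^{n}\log p(s_{i}\mid s_{1},\ldots,s_{i-1})\right)\]

Intuitively, a perplexity of 8.6 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly nine equally likely next tokens, versus roughly one hundred for the previous state of the art.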

Findings: Zero-Shot Task Transfer

This was the primary test of the hypothesis, and the results were “promising” but mixed. To quantify performance on these tasks, the authors used standard NLP metrics, most notably the F1 score and the BLEU score (a simplified F1 computation is sketched after the two definitions below).

  • The F1 Score, used for CoQA, is the harmonic mean of Precision and Recall. It measures the overlap between the model’s generated answer and the correct human-written answer. A high F1 score (0-100) means the model is both correct (high precision) and complete (high recall).
  • The BLEU Score (or BiLingual Evaluation Understudy), used for translation, measures the quality of a machine translation by comparing its n-grams (word pairs, triplets, etc.) to high-quality human translations. A high BLEU score (0-100) means the translation is very similar in phrasing to a human’s.
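
For intuition, a simplified version of the token-overlap F1 used for CoQA-style scoring looks like this (the official script also lowercases, strips punctuation and articles, and takes the maximum over several reference answers, all omitted here):

```python
# Token-overlap F1 between a predicted answer and a reference answer (simplified).
from collections import Counter

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)   # how much of the prediction is correct
    recall = overlap / len(ref_tokens)       # how much of the reference is covered
    return 2 * precision * recall / (precision + recall)

print(f1_score("the capital of India", "capital of India"))  # partial-credit overlap
```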

Task-by-Task Results:

  • Reading Comprehension (CoQA): This was a major success. The 1.5B model achieved an F1 score of 55, which matched or exceeded the performance of 3 out of 4 supervised baseline systems that were explicitly trained on over 127,000 question-answer pairs.

  • Summarization (CNN/Daily Mail): This was a weak point. Quantitatively, the performance was poor, only slightly better than a baseline of selecting 3 random sentences.

  • Translation (WMT-14 Fr-En): The model achieved 11.5 BLEU. This was better than some older unsupervised baselines but significantly worse than the SOTA unsupervised method at the time.

  • Question Answering (Natural Questions): The 1.5B model correctly answered 4.1% of questions, which was far behind dedicated open-domain QA systems (which scored 30-50%).

What is the core conclusion? (Why is this paper important?)

The single most important finding of the paper is not that GPT-2 beat every benchmark, but how its performance changed.

The Finding: For every task evaluated (Summarization, QA, etc.), performance consistently and predictably improved as model size increased.

This relationship is described as log-linear: performance improves roughly linearly with the logarithm of model size, so each multiplicative increase in parameters yields a similar-sized gain. This suggests that the “emergent” zero-shot capabilities are a direct, predictable function of model scale.

The paper’s conclusion is that model capacity is essential for the success of zero-shot task transfer. This finding provided a clear, actionable path forward for the field: build bigger models.

What are the future implications?

The authors note that even their largest 1.5B parameter model was still underfitting the 40 GB WebText dataset. This implies that performance had not yet plateaued and that building even larger models trained on even more data would likely lead to further gains.

This paper set the stage for the “scaling laws” era of AI and led directly to the development of its successor, GPT-3 (175 billion parameters), which demonstrated far more powerful zero-shot and few-shot abilities, confirming the paper’s central hypothesis.