Re: Implementation [01]: Decoder-Only GPT - Building a Character-Level Language Model

Author: Shuqi Wang

Re: Implementation Series · Episode 01

Welcome to my open notebook. In this series, I am rebuilding the most influential models in AI history to prepare for my MPhil research. No black boxes, just code and first principles.

Overview

We begin with one of the most influential references for learning transformer-based language models: Andrej Karpathy's nanoGPT video and repository. This minimal, elegantly designed implementation offers both a clear pedagogical path and a solid foundation for understanding how GPT-style models work.

In this post, we will focus specifically on the pretraining phase of a decoder-only, character-level GPT architecture. We'll walk through the theoretical foundations, dissect each component of the architecture, and then build the complete model step-by-step from scratch.

Architecture Overview

The decoder-only architecture is the blueprint behind modern autoregressive language models. Unlike encoder-decoder architectures (e.g., T5), a decoder-only model has no separate encoder: it is trained to predict the next token from all previous tokens in the sequence, with causal masking ensuring that no position can attend to the future.
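
Formally, the model factorizes the probability of a sequence into a product of next-token conditionals; this standard autoregressive factorization is what makes left-to-right generation possible:

p(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1})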

Below is a visual representation of the decoder-only transformer block structure:

[Figure: Decoder-Only Architecture]

Implementation: Building the Model Step-by-Step

Now, let's implement this architecture from first principles. The following code dissects the standard decoder-only architecture used in GPT-style models. We've refactored Andrej Karpathy's nanoGPT to use clearer naming conventions and added detailed comments explaining each component.

Section 1: Hyperparameters & Data Loading

First, we define our training hyperparameters and load/encode our text data:

import torch
import torch.nn as nn
from torch.nn import functional as F

# ==================== Hyperparameters ====================
batch_size = 16          # Number of independent sequences processed in parallel
block_size = 32          # Maximum context length for predictions (sequence length)
max_iters = 5000         # Total training iterations
eval_interval = 100      # Evaluate loss every N iterations
learning_rate = 1e-3     # Learning rate for AdamW optimizer
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200         # Number of iterations for loss estimation
n_embd = 64              # Embedding dimension (hidden size)
n_head = 4               # Number of attention heads
n_layer = 4              # Number of transformer blocks
dropout = 0.0            # Dropout rate

torch.manual_seed(1337)

# ==================== Data Loading ====================
# Load text data (download from: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Build vocabulary: character-to-integer and integer-to-character mappings
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}          # string to integer
itos = {i: ch for i, ch in enumerate(chars)}          # integer to string
encode = lambda s: [stoi[c] for c in s]               # encoder: string → list of integers
decode = lambda l: ''.join([itos[i] for i in l])      # decoder: list of integers → string

# Convert entire text to tensor and split into train/val
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

# ==================== Batch Loading ====================
def get_batch(split):
    """Generate a small batch of data with inputs x and targets y."""
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    """Estimate average loss on train and validation sets."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
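
Before moving on, it helps to sanity-check the data pipeline. The snippet below is an illustrative check (not part of the training script) showing that the targets returned by get_batch are simply the inputs shifted one character to the right:

# Illustrative sanity check: targets are inputs shifted by one character
xb, yb = get_batch('train')
print(xb.shape, yb.shape)        # torch.Size([16, 32]) torch.Size([16, 32])
print(decode(xb[0].tolist()))    # a random 32-character snippet from the text
print(decode(yb[0].tolist()))    # the same snippet, offset by one character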

Section 2: Self-Attention Mechanism

The attention mechanism is the core of the transformer. Here, we implement a single attention head and multi-head attention:

class AttentionHead(nn.Module):
    """Single head of scaled dot-product self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # Register a causal mask to prevent attending to future tokens
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)      # (B, T, head_size)
        q = self.query(x)    # (B, T, head_size)
        v = self.value(x)    # (B, T, head_size)

        # Compute attention scores: Q @ K^T / sqrt(d_k)
        # Formula: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
        wei = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)  # (B, T, T), scaled by 1/sqrt(head_size)

        # Apply causal mask: prevent attention to future positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)

        # Weighted aggregation of values
        out = wei @ v  # (B, T, T) @ (B, T, head_size) → (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """Multiple attention heads running in parallel."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([AttentionHead(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)  # Linear projection after concatenation
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate outputs from all heads
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
        # Project back to embedding dimension
        out = self.dropout(self.proj(out))
        return out

Mathematical Intuition: Each attention head learns to focus on different aspects of the input. By running them in parallel and combining their outputs, the model can capture diverse relationships in the data.
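
To make the causal mask concrete, here is a tiny standalone sketch (reusing the torch and F imports from above, independent of the classes we just defined) showing how the lower-triangular mask restricts each position to itself and earlier positions:

# Toy illustration of causal masking (illustrative only)
T = 4
scores = torch.randn(T, T)                          # raw attention scores for a 4-token sequence
mask = torch.tril(torch.ones(T, T))                 # lower-triangular causal mask
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights)  # row i has non-zero weights only in columns 0..i, and each row sums to 1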

Section 3: Feed-Forward Network

After attention, the model applies a position-wise feed-forward network:

class FeedForwardNetwork(nn.Module):
    """Position-wise feed-forward network: Linear → ReLU → Linear."""

    def __init__(self, n_embd):
        super().__init__()
        # Expand to 4x dimension, then contract back
        # Formula: FFN(x) = ReLU(x * W1 + b1) * W2 + b2
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

Section 4: Transformer Block

A transformer block combines attention, layer normalization, and feed-forward components:

class TransformerBlock(nn.Module):
    """Transformer decoder block: Multi-head attention → FFN with residual connections and layer normalization."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.attention = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForwardNetwork(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)  # Pre-norm layer normalization
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Residual connection around attention (Pre-LayerNorm architecture)
        # Formula: x' = x + Attention(LayerNorm(x))
        x = x + self.attention(self.ln1(x))
        # Residual connection around FFN
        # Formula: x'' = x' + FFN(LayerNorm(x'))
        x = x + self.ffwd(self.ln2(x))
        return x
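
As a quick, illustrative sanity check (using the hyperparameters defined above), a block maps a tensor of shape (B, T, n_embd) to a tensor of the same shape, which is what lets us stack blocks with nn.Sequential:

# Illustrative shape check: a TransformerBlock preserves (B, T, n_embd)
block = TransformerBlock(n_embd, n_head).to(device)
dummy = torch.randn(batch_size, block_size, n_embd, device=device)
print(block(dummy).shape)  # torch.Size([16, 32, 64])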

Section 5: Complete GPT Model

Finally, we assemble all components into the full decoder-only GPT model:

class DecoderOnlyGPT(nn.Module):
    """Decoder-only GPT language model for character-level text generation."""

    def __init__(self):
        super().__init__()
        # Token embedding: map character indices to embedding vectors
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # Position embedding: encode absolute position in sequence
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # Stack of transformer blocks
        self.blocks = nn.Sequential(*[TransformerBlock(n_embd, n_head=n_head) for _ in range(n_layer)])
        # Final layer normalization
        self.ln_f = nn.LayerNorm(n_embd)
        # Language modeling head: project to vocabulary size
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        """
        Forward pass of the model.

        Args:
            idx: (B, T) tensor of character indices
            targets: (B, T) tensor of target character indices (optional, for computing loss)

        Returns:
            logits: (B, T, vocab_size) tensor of unnormalized next-token scores
            loss: scalar loss value (None if targets not provided)
        """
        B, T = idx.shape

        # Embedding: combine token and position embeddings
        tok_emb = self.token_embedding_table(idx)                           # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, n_embd)
        x = tok_emb + pos_emb                                               # (B, T, n_embd)

        # Process through transformer blocks
        x = self.blocks(x)                                                  # (B, T, n_embd)

        # Final layer normalization
        x = self.ln_f(x)                                                    # (B, T, n_embd)

        # Project to vocabulary logits
        logits = self.lm_head(x)                                            # (B, T, vocab_size)

        # Compute cross-entropy loss if targets provided
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        """
        Generate text autoregressively from the model.

        Args:
            idx: (B, T) tensor of initial character indices (the context/prompt)
            max_new_tokens: number of tokens to generate

        Returns:
            (B, T + max_new_tokens) tensor of generated character indices
        """
        for _ in range(max_new_tokens):
            # Crop context to block_size (only attend to recent tokens)
            idx_cond = idx[:, -block_size:]

            # Get model predictions
            logits, _ = self(idx_cond)

            # Focus on the last time step
            logits = logits[:, -1, :]  # (B, vocab_size)

            # Convert to probabilities
            probs = F.softmax(logits, dim=-1)

            # Sample next token from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)

            # Append to sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)

        return idx

Note: Pre-Norm vs. Post-Norm

You may notice a structural difference between the architectural diagram (which depicts Post-Norm) and our implementation (which uses Pre-Norm).

  • Post-Norm: x = LayerNorm(x + Sublayer(x)) (Original Transformer paper).
  • Pre-Norm: x = x + Sublayer(LayerNorm(x)) (Used in this code, GPT-2/3, Llama).

Modern LLMs predominantly favour Pre-Norm because it stabilizes gradients during training, allowing for deeper networks without convergence issues.
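
For comparison, a Post-Norm version of the same block would look roughly like the sketch below. This variant is shown purely for illustration and is not used anywhere in this implementation:

class PostNormTransformerBlock(nn.Module):
    """Hypothetical Post-Norm variant, shown only for comparison with the Pre-Norm block above."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.attention = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForwardNetwork(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Post-Norm: normalize *after* adding the residual
        # Formula: x' = LayerNorm(x + Attention(x))
        x = self.ln1(x + self.attention(x))
        x = self.ln2(x + self.ffwd(x))
        return x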

Section 6: Training Loop

Now we instantiate the model and train it:

# Initialize model
model = DecoderOnlyGPT()
m = model.to(device)

# Print model size
print(f"{sum(p.numel() for p in m.parameters()) / 1e6:.2f}M parameters")

# Create optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for iter in range(max_iters):
    # Evaluate and log loss periodically
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter:4d}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Get batch of training data
    xb, yb = get_batch('train')

    # Forward pass and optimization step
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# ==================== Generation ====================
# Generate new text starting from a single token (index 0, the first character in the vocabulary)
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = m.generate(context, max_new_tokens=2000)
print(decode(generated_text[0].tolist()))
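
To prompt the model with your own text instead of a single start token, a minimal sketch (assuming the prompt only uses characters present in the training vocabulary) looks like this:

# Prompted generation (sketch): encode a string, generate, decode the result
prompt = "ROMEO:"
context = torch.tensor([encode(prompt)], dtype=torch.long, device=device)
out = m.generate(context, max_new_tokens=500)
print(decode(out[0].tolist()))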

Summary

This implementation builds a complete decoder-only GPT model from scratch, handling:

  • Tokenization & Embedding: Converting raw text to numerical representations
  • Self-Attention: Enabling the model to learn dependencies between tokens
  • Multi-Head Attention: Allowing parallel focus on different aspects
  • Position Encoding: Preserving sequence order information
  • Feed-Forward Networks: Introducing non-linearity and learning capacity
  • Residual Connections: Facilitating training of deep networks
  • Layer Normalization: Stabilizing the training process
  • Autoregressive Generation: Sampling tokens sequentially to produce new text

The pretraining phase demonstrated here minimizes the cross-entropy loss between predicted and actual next tokens, learning the statistical patterns of the training data. This foundation enables the model to generate coherent text and serves as the basis for downstream applications.
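
Concretely, for a sequence of characters x_1, ..., x_T, the objective minimized above is the average negative log-likelihood of each character given its preceding context:

\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})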

In upcoming episodes, we'll explore fine-tuning strategies, evaluation metrics, and how to scale this architecture to larger datasets and model sizes.

Here are the key resources used in this implementation and for further study:

  • Andrej Karpathy's nanoGPT repository: https://github.com/karpathy/nanoGPT
  • Andrej Karpathy's video "Let's build GPT: from scratch, in code, spelled out"
  • Tiny Shakespeare dataset: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
  • Vaswani et al., "Attention Is All You Need" (the original Transformer paper)

Thanks for reading. Stay curious!