Re: Implementation [01]: Decoder-Only GPT - Building a Character-Level Language Model
Author: Shuqi Wang
Re: Implementation Series — Episode 01
Welcome to my open notebook. In this series, I am rebuilding the most influential models in AI history to prepare for my MPhil research. No black boxes, just code and first principles.
Overview
We begin with one of the most influential references for learning transformer-based language models: Andrej Karpathy's nanoGPT video and repository. This minimal, elegantly-designed implementation offers both a clear pedagogical pathway and a solid foundation for understanding how GPT-style models work.
In this post, we will focus specifically on the pretraining phase of a decoder-only, character-level GPT architecture. We'll walk through the theoretical foundations, dissect each component of the architecture, and then build the complete model step-by-step from scratch.
Architecture Overview
The decoder-only architecture is the blueprint behind modern autoregressive language models. Unlike encoder-decoder architectures (e.g., T5), a decoder-only model uses only the causally masked self-attention stack: each position can attend only to earlier positions, and the model is trained to predict the next token from everything that precedes it.
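Before looking at the diagram, here is a tiny illustrative sketch (plain Python, independent of the model code below) of what "predicting the next token from all previous tokens" means at the character level: every prefix of a sequence is a training example whose target is simply the next character.
# Illustrative only: each prefix of a character sequence becomes a training example.
text = "hello"
for t in range(1, len(text)):
    context, target = text[:t], text[t]
    print(f"context {context!r} -> predict {target!r}")
# context 'h' -> predict 'e'
# context 'he' -> predict 'l'
# context 'hel' -> predict 'l'
# context 'hell' -> predict 'o'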
Below is a visual representation of the decoder-only transformer block structure:

Implementation: Building the Model Step-by-Step
Now, let's implement this architecture from first principles. The following code dissects the standard decoder-only architecture used in GPT-style models. We've refactored Andrej Karpathy's nanoGPT to use clearer naming conventions and added detailed comments explaining each component.
Section 1: Hyperparameters & Data Loading
First, we define our training hyperparameters and load/encode our text data:
import torch
import torch.nn as nn
from torch.nn import functional as F
# ==================== Hyperparameters ====================
batch_size = 16 # Number of independent sequences processed in parallel
block_size = 32 # Maximum context length for predictions (sequence length)
max_iters = 5000 # Total training iterations
eval_interval = 100 # Evaluate loss every N iterations
learning_rate = 1e-3 # Learning rate for AdamW optimizer
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200 # Number of iterations for loss estimation
n_embd = 64 # Embedding dimension (hidden size)
n_head = 4 # Number of attention heads
n_layer = 4 # Number of transformer blocks
dropout = 0.0 # Dropout rate
torch.manual_seed(1337)
# ==================== Data Loading ====================
# Load text data (download from: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Build vocabulary: character-to-integer and integer-to-character mappings
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)} # string to integer
itos = {i: ch for i, ch in enumerate(chars)} # integer to string
encode = lambda s: [stoi[c] for c in s] # encoder: string → list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: list of integers → string
# Convert entire text to tensor and split into train/val
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
# ==================== Batch Loading ====================
def get_batch(split):
    """Generate a small batch of data with inputs x and targets y."""
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
@torch.no_grad()
def estimate_loss():
    """Estimate average loss on train and validation sets."""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
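As a quick sanity check (this snippet assumes input.txt has been downloaded and the code above has been run), you can verify the encode/decode round trip and inspect one batch; note how y is simply x shifted one character to the right:
# Sanity check: round-trip the vocabulary and inspect one training batch.
sample = "To be, or not to be"
assert decode(encode(sample)) == sample       # all characters exist in the Shakespeare vocab

xb, yb = get_batch('train')
print(xb.shape, yb.shape)                     # both (batch_size, block_size) = (16, 32)
print(decode(xb[0].tolist()))                 # a 32-character chunk of text
print(decode(yb[0].tolist()))                 # the same chunk shifted one character right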
Section 2: Self-Attention Mechanism
The attention mechanism is the core of the transformer. Here, we implement a single attention head and multi-head attention:
class AttentionHead(nn.Module):
    """Single head of scaled dot-product self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Register a causal mask to prevent attending to future tokens
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        v = self.value(x)  # (B, T, head_size)
        # Compute attention scores: Q @ K^T / sqrt(d_k), where d_k is the head size
        # Formula: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # Apply causal mask: prevent attention to future positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)
        # Weighted aggregation of values
        out = wei @ v  # (B, T, T) @ (B, T, head_size) → (B, T, head_size)
        return out
class MultiHeadAttention(nn.Module):
    """Multiple attention heads running in parallel."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([AttentionHead(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)  # Linear projection after concatenation
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Concatenate outputs from all heads
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
        # Project back to embedding dimension
        out = self.dropout(self.proj(out))
        return out
Mathematical Intuition: Each attention head learns to focus on different aspects of the input. By running them in parallel and combining their outputs, the model can capture diverse relationships in the data.
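If you want to see the mechanism in isolation, the following standalone sketch (toy shapes, independent of the classes above) runs one scaled dot-product attention step by hand and prints the causally masked weights; each row only assigns probability mass to positions at or before it.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d_k = 1, 4, 8                                   # toy batch, sequence length, head size
q, k, v = (torch.randn(B, T, d_k) for _ in range(3))

scores = q @ k.transpose(-2, -1) * d_k ** -0.5        # (B, T, T) scaled dot products
mask = torch.tril(torch.ones(T, T))                   # lower-triangular causal mask
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)                   # each row sums to 1
print(weights[0])                                     # upper triangle is exactly zero
out = weights @ v                                     # (B, T, d_k) weighted sum of values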
Section 3: Feed-Forward Network
After attention, the model applies a position-wise feed-forward network:
class FeedForwardNetwork(nn.Module):
    """Position-wise feed-forward network: Linear → ReLU → Linear."""

    def __init__(self, n_embd):
        super().__init__()
        # Expand to 4x dimension, then contract back
        # Formula: FFN(x) = ReLU(x * W1 + b1) * W2 + b2
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
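A detail worth emphasizing is that this network is position-wise: the same weights are applied to every time step independently, so no information moves between positions here (that is attention's job). A small check, assuming the definitions above are in scope:
# Position-wise check (assumes FeedForwardNetwork, n_embd, block_size from above).
ffwd = FeedForwardNetwork(n_embd).eval()      # eval() so dropout cannot interfere
x = torch.randn(2, block_size, n_embd)
full = ffwd(x)                                # applied to the whole (B, T, C) tensor
single = ffwd(x[:, 3, :])                     # applied to one time step alone
print(torch.allclose(full[:, 3, :], single))  # True: positions do not interact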
Section 4: Transformer Block
A transformer block combines attention, layer normalization, and feed-forward components:
class TransformerBlock(nn.Module):
    """Transformer decoder block: Multi-head attention → FFN with residual connections and layer normalization."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.attention = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForwardNetwork(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)  # Pre-norm layer normalization
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Residual connection around attention (Pre-LayerNorm architecture)
        # Formula: x' = x + Attention(LayerNorm(x))
        x = x + self.attention(self.ln1(x))
        # Residual connection around FFN
        # Formula: x'' = x' + FFN(LayerNorm(x'))
        x = x + self.ffwd(self.ln2(x))
        return x
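Because each block maps a (B, T, n_embd) tensor to a tensor of the same shape, blocks can be stacked with nn.Sequential to arbitrary depth, which is exactly what the full model does next. A quick shape check, assuming the classes and hyperparameters above:
# Shape check (assumes TransformerBlock, n_embd, n_head, block_size from above).
block = TransformerBlock(n_embd, n_head)
x = torch.randn(2, block_size, n_embd)
print(block(x).shape)                                  # torch.Size([2, 32, 64]): shape preserved
stack = nn.Sequential(*[TransformerBlock(n_embd, n_head) for _ in range(3)])
print(stack(x).shape)                                  # still (B, T, n_embd) after 3 blocks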
Section 5: Complete GPT Model
Finally, we assemble all components into the full decoder-only GPT model:
class DecoderOnlyGPT(nn.Module):
    """Decoder-only GPT language model for character-level text generation."""

    def __init__(self):
        super().__init__()
        # Token embedding: map character indices to embedding vectors
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # Position embedding: encode absolute position in sequence
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # Stack of transformer blocks
        self.blocks = nn.Sequential(*[TransformerBlock(n_embd, n_head=n_head) for _ in range(n_layer)])
        # Final layer normalization
        self.ln_f = nn.LayerNorm(n_embd)
        # Language modeling head: project to vocabulary size
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        """
        Forward pass of the model.

        Args:
            idx: (B, T) tensor of character indices
            targets: (B, T) tensor of target character indices (optional, for computing loss)
        Returns:
            logits: (B, T, vocab_size) tensor of unnormalized next-token scores
            loss: scalar loss value (None if targets not provided)
        """
        B, T = idx.shape
        # Embedding: combine token and position embeddings
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, n_embd)
        x = tok_emb + pos_emb  # (B, T, n_embd)
        # Process through transformer blocks
        x = self.blocks(x)  # (B, T, n_embd)
        # Final layer normalization
        x = self.ln_f(x)  # (B, T, n_embd)
        # Project to vocabulary logits
        logits = self.lm_head(x)  # (B, T, vocab_size)
        # Compute cross-entropy loss if targets provided
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        """
        Generate text autoregressively from the model.

        Args:
            idx: (B, T) tensor of initial character indices (context)
            max_new_tokens: number of tokens to generate
        Returns:
            (B, T + max_new_tokens) tensor of generated character indices
        """
        for _ in range(max_new_tokens):
            # Crop context to block_size (only attend to recent tokens)
            idx_cond = idx[:, -block_size:]
            # Get model predictions
            logits, _ = self(idx_cond)
            # Focus on the last time step
            logits = logits[:, -1, :]  # (B, vocab_size)
            # Convert to probabilities
            probs = F.softmax(logits, dim=-1)
            # Sample next token from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append to sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx
Note: Pre-Norm vs. Post-Norm
You may notice a structural difference between the architectural diagram (which depicts Post-Norm) and our implementation (which uses Pre-Norm).
- Post-Norm: x = LayerNorm(x + Sublayer(x)) (original Transformer paper).
- Pre-Norm: x = x + Sublayer(LayerNorm(x)) (used in this code, GPT-2/3, Llama).

Modern LLMs predominantly favour Pre-Norm because it stabilizes gradients during training, allowing for deeper networks without convergence issues.
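To make the difference concrete, here is a minimal sketch (illustrative helper functions, not used by the model above), where sublayer stands for either the attention module or the FFN and ln for its LayerNorm:
# Illustrative only: the two residual/normalization orderings.
def post_norm_step(x, sublayer, ln):
    # Original Transformer: normalize after the residual addition
    return ln(x + sublayer(x))

def pre_norm_step(x, sublayer, ln):
    # GPT-2/3 style (used in this post): normalize inside the branch,
    # keeping the residual path as a clean identity "highway"
    return x + sublayer(ln(x))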
Section 6: Training Loop
Now we instantiate the model and train it:
# Initialize model
model = DecoderOnlyGPT()
m = model.to(device)
# Print model size
print(f"{sum(p.numel() for p in m.parameters()) / 1e6:.2f}M parameters")
# Create optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
# Training loop
for iter in range(max_iters):
    # Evaluate and log loss periodically
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter:4d}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    # Get batch of training data
    xb, yb = get_batch('train')
    # Forward pass and optimization step
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
# ==================== Generation ====================
# Generate new text starting from a single zero-index token as context
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = m.generate(context, max_new_tokens=2000)
print(decode(generated_text[0].tolist()))
Summary
This implementation builds a complete decoder-only GPT model from scratch, handling:
- Tokenization & Embedding: Converting raw text to numerical representations
- Self-Attention: Enabling the model to learn dependencies between tokens
- Multi-Head Attention: Allowing parallel focus on different aspects
- Position Encoding: Preserving sequence order information
- Feed-Forward Networks: Introducing non-linearity and learning capacity
- Residual Connections: Facilitating training of deep networks
- Layer Normalization: Stabilizing the training process
- Autoregressive Generation: Sampling tokens sequentially to produce new text
The pretraining phase demonstrated here minimizes the cross-entropy loss between predicted and actual next tokens, learning the statistical patterns of the training data. This foundation enables the model to generate coherent text and serves as the basis for downstream applications.
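To tie the objective back to the code: the value returned by F.cross_entropy in the forward pass is just the mean negative log-probability the model assigns to the true next character. A small equivalence check with toy numbers (assuming the torch imports above; the vocabulary size of 65 is arbitrary here):
# Cross-entropy == average negative log-likelihood of the correct next tokens.
logits = torch.randn(8, 65)                    # 8 positions, toy vocabulary of 65 characters
targets = torch.randint(0, 65, (8,))
log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs[torch.arange(8), targets].mean()
print(torch.allclose(nll, F.cross_entropy(logits, targets)))  # True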
In upcoming episodes, we'll explore fine-tuning strategies, evaluation metrics, and how to scale this architecture to larger datasets and model sizes.
Recommended Reading & References
Here are the key resources used in this implementation and for further study:
- Original Paper: Improving Language Understanding by Generative Pre-Training (GPT-1)
- Primary Source: Andrej Karpathy: Let's build GPT (Video)
- Code Reference: Andrej Karpathy's nanoGPT repository
- Further Reading: Building a Decoder-Only Transformer Model (Machine Learning Mastery)