Advanced Model Architecture with NanoGPT
Overview
NanoGPT is Andrej Karpathy's minimalist GPT implementation that distills the transformer architecture down to approximately 300 lines of model code and 300 lines of training code. Despite its simplicity, NanoGPT can reproduce GPT-2 (124M parameters) on OpenWebText with competitive performance, making it the definitive educational and prototyping tool for understanding how GPT models work at the implementation level. This template covers everything from character-level Shakespeare training to full GPT-2 reproduction, custom dataset preparation, architecture modifications, and advanced training strategies.
When to Use
- Learning transformers: You want to understand every line of a GPT implementation without framework abstractions
- Rapid prototyping: You need to test a new architectural idea (attention variant, activation function, normalization) quickly
- Small-scale experiments: You want to train character-level or BPE-tokenized models on custom data with minimal setup
- Educational settings: You are teaching or taking a course on deep learning and need a clear reference implementation
- Architecture research: You want to modify transformer internals and measure the impact directly
- CPU-friendly training: You need a model small enough to train on a laptop CPU in minutes
Choose alternatives when: you need production fine-tuning workflows (LitGPT, Axolotl), maximum distributed training performance (Megatron-LM), or inference serving (vLLM).
Quick Start
```sh
# Clone the repository
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT

# Install dependencies
pip install torch numpy transformers datasets tiktoken wandb tqdm

# Prepare Shakespeare character-level dataset
python data/shakespeare_char/prepare.py

# Train (runs in ~5 minutes on CPU)
python train.py config/train_shakespeare_char.py

# Generate text from trained model
python sample.py --out_dir=out-shakespeare-char
```
Expected output:
```text
ROMEO:
What say'st thou? Shall I speak, and be a man?

JULIET:
I am afeard, and yet I'll speak; for thou art
One that hath been a man...
```
Core Concepts
The GPT Model (~300 lines)
NanoGPT's model.py implements the complete GPT architecture in pure PyTorch.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head causal (masked) self-attention."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # Q, K, V combined
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # Output projection
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding dim
        # Compute Q, K, V for all heads in one linear pass
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Scaled dot-product attention with causal mask
        y = F.scaled_dot_product_attention(
            q, k, v, is_causal=True,
            dropout_p=self.dropout if self.training else 0
        )
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)


class MLP(nn.Module):
    """Feed-forward network with GELU activation."""

    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))


class Block(nn.Module):
    """Transformer block: LayerNorm -> Attention -> Residual -> LayerNorm -> MLP -> Residual"""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```
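The 124M parameter count cited for GPT-2 can be sanity-checked against this architecture with a back-of-the-envelope count. This is a rough sketch that ignores biases and LayerNorm parameters, and it assumes the token embedding is tied with the output head (as in GPT-2):

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size):
    """Rough parameter count for the GPT architecture above (biases ignored)."""
    attn = 3 * n_embd * n_embd + n_embd * n_embd      # c_attn + c_proj
    mlp = 4 * n_embd * n_embd + 4 * n_embd * n_embd   # c_fc + c_proj
    blocks = n_layer * (attn + mlp)                   # 12 * n_embd^2 per block
    embeddings = vocab_size * n_embd                  # token embedding (tied with lm_head)
    positions = block_size * n_embd                   # learned positional embedding
    return blocks + embeddings + positions

# GPT-2 small: n_layer=12, n_embd=768, vocab=50257, context=1024
total = gpt_param_count(12, 768, 50257, 1024)
print(f"{total / 1e6:.1f}M parameters")  # -> 124.3M parameters
```

The per-block cost works out to 12 * n_embd^2 (4 * n_embd^2 for attention plus 8 * n_embd^2 for the MLP), which is why parameter counts scale quadratically with embedding width.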
Training Configurations
```python
# Character-level Shakespeare (CPU-friendly, ~5 min)
# config/train_shakespeare_char.py
n_layer = 6
n_head = 6
n_embd = 384
block_size = 256        # Context window (characters)
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500
device = 'cpu'
compile = False

# GPT-2 124M reproduction (8x A100, ~4 days)
# config/train_gpt2.py
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024       # Context window (BPE tokens)
batch_size = 12
gradient_accumulation_steps = 40  # Effective batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000
compile = True          # PyTorch 2.0 compilation
```
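NanoGPT's train.py pairs these settings with a learning-rate schedule: linear warmup, then cosine decay down to a minimum rate at `lr_decay_iters`. The sketch below follows that shape; `warmup_iters` and `min_lr` are additional config values not shown above, set here to illustrative numbers:

```python
import math

learning_rate = 6e-4
min_lr = 6e-5           # typically learning_rate / 10
warmup_iters = 2000     # illustrative; see your config
lr_decay_iters = 600000

def get_lr(it):
    """Linear warmup, cosine decay to min_lr, then constant floor."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

The cosine shape means most of the decay happens in the middle of training, with the rate flattening out near `min_lr` at the end.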
Custom Dataset Preparation
```python
# data/custom/prepare.py - Character-level tokenization
import pickle

import numpy as np

with open('my_data.txt', 'r') as f:
    text = f.read()

# Build character vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encode entire text
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# Train/validation split (90/10)
n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

# Save as binary files
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')

# Save metadata
meta = {'vocab_size': vocab_size, 'itos': itos, 'stoi': stoi}
with open('data/custom/meta.pkl', 'wb') as f:
    pickle.dump(meta, f)
```
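Once the `.bin` files exist, train.py reads them back with `np.memmap` and slices random contiguous windows to form batches. The following is a minimal numpy-only sketch of that loading pattern (the synthetic token file and sizes are illustrative, not from the repository):

```python
import os
import tempfile

import numpy as np

def get_batch(path, block_size, batch_size, rng):
    """Sample random contiguous windows from a uint16 token file."""
    data = np.memmap(path, dtype=np.uint16, mode='r')
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i:i + block_size] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1:i + 1 + block_size] for i in ix]).astype(np.int64)
    return x, y  # y is x shifted by one token: next-token prediction targets

# Demo on a tiny synthetic token file
path = os.path.join(tempfile.mkdtemp(), 'train.bin')
np.arange(1000, dtype=np.uint16).tofile(path)
x, y = get_batch(path, block_size=8, batch_size=4, rng=np.random.default_rng(0))
assert x.shape == (4, 8) and (y[:, :-1] == x[:, 1:]).all()
```

Using `np.memmap` keeps memory usage flat regardless of dataset size, since only the sampled windows are actually read from disk.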
```python
# BPE tokenization using tiktoken (for larger models)
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
with open('my_data.txt', 'r') as f:
    text = f.read()

# Encode with BPE (GPT-2's vocabulary of 50257 tokens fits in uint16)
tokens = enc.encode(text)
data = np.array(tokens, dtype=np.uint16)

# Split and save
n = len(data)
data[:int(n * 0.9)].tofile('data/custom/train.bin')
data[int(n * 0.9):].tofile('data/custom/val.bin')
```
Fine-Tuning Pretrained GPT-2
```python
# config/finetune_shakespeare.py
init_from = 'gpt2'        # Load OpenAI GPT-2 weights
dataset = 'shakespeare'   # BPE-tokenized Shakespeare (must match GPT-2's vocabulary)
batch_size = 1
block_size = 1024

# Fine-tuning hyperparameters (lower LR than pretraining)
learning_rate = 3e-5
max_iters = 2000
warmup_iters = 100
weight_decay = 1e-1

# GPT-2 variants available:
# init_from = 'gpt2'         # 124M parameters
# init_from = 'gpt2-medium'  # 350M parameters
# init_from = 'gpt2-large'   # 774M parameters
# init_from = 'gpt2-xl'      # 1558M parameters
```
Distributed Training with DDP
```sh
# Multi-GPU training with PyTorch Distributed Data Parallel
torchrun --standalone --nproc_per_node=8 \
    train.py config/train_gpt2.py

# The training script auto-detects DDP and adjusts:
# - gradient_accumulation_steps divided by world_size
# - Each GPU processes different data slices
# - Gradients synchronized via all-reduce
```
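Because the accumulation steps are divided across ranks, the effective batch size stays constant whether you train on one GPU or eight. A quick check of the arithmetic, using the GPT-2 config values from above:

```python
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 40   # as specified in config/train_gpt2.py
world_size = 8                     # GPUs launched by torchrun

# train.py divides accumulation steps evenly across ranks
assert gradient_accumulation_steps % world_size == 0
per_rank_accum = gradient_accumulation_steps // world_size  # 5 micro-steps per rank

# Tokens processed per optimizer step, summed over all ranks
tokens_per_iter = batch_size * per_rank_accum * block_size * world_size
print(tokens_per_iter)  # 491520, i.e. the ~0.5M tokens noted in the config
```

If `gradient_accumulation_steps` is not divisible by the world size, the division silently changes the effective batch, so it is worth asserting as above when you modify either value.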
Sampling and Generation
```sh
# Generate from trained model
python sample.py \
    --out_dir=out-shakespeare-char \
    --num_samples=5 \
    --max_new_tokens=500 \
    --temperature=0.8 \
    --top_k=200

# Generate from pretrained GPT-2
python sample.py \
    --init_from=gpt2 \
    --start="The meaning of life is" \
    --num_samples=3
```
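Under the hood, `temperature` and `top_k` transform the model's logits before each token is sampled. The numpy sketch below illustrates that decoding step on random logits (it is a standalone illustration, not code from sample.py, which operates on torch tensors):

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_k=200, rng=None):
    """Temperature-scale logits, keep only the top-k, then sample one token."""
    rng = rng or np.random.default_rng()
    logits = logits / temperature          # <1.0 sharpens, >1.0 flattens
    if top_k and top_k < len(logits):
        kth = np.sort(logits)[-top_k]      # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)  # mask the rest out
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)             # stand-in for model output over the vocab
token = sample_next(logits, temperature=0.8, top_k=5, rng=rng)
assert token in np.argsort(logits)[-5:]    # always one of the 5 highest-logit tokens
```

Since temperature scaling preserves the ranking of logits, `top_k` always selects the same candidate set; temperature only changes how the probability mass is spread among them.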
Configuration Reference
| Parameter | Shakespeare (char) | GPT-2 124M | Fine-tune GPT-2 |
|---|---|---|---|
| n_layer | 6 | 12 | 12 (fixed by checkpoint) |
| n_head | 6 | 12 | 12 (fixed by checkpoint) |
| n_embd | 384 | 768 | 768 (fixed by checkpoint) |
| block_size | 256 | 1024 | 1024 |
| batch_size | 64 | 12 | 1 |
| learning_rate | 1e-3 | 6e-4 | 3e-5 |
| max_iters | 5000 | 600000 | 2000 |
| device | cpu | cuda | cuda |
| compile | False | True | True |

| Hardware Target | Training Time | VRAM Required |
|---|---|---|
| Shakespeare (CPU) | ~5 minutes | <1 GB |
| Shakespeare (T4 GPU) | ~1 minute | <1 GB |
| GPT-2 124M (1x A100) | ~1 week | ~16 GB |
| GPT-2 124M (8x A100) | ~4 days | ~16 GB/GPU |
| GPT-2 Medium 350M (8x A100) | ~2 weeks | ~40 GB/GPU |
| Sampling Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Higher = more random, lower = more deterministic |
| top_k | 200 | Keep only the top-k tokens; 0 = no filtering |
| max_new_tokens | 500 | Number of tokens to generate |
| num_samples | 10 | Number of independent completions |
Best Practices
- Start with character-level Shakespeare: Always validate your setup on the Shakespeare dataset first -- it trains in minutes and exposes configuration issues immediately.
- Use PyTorch 2.0 compilation: Set `compile=True` for a roughly 2x speedup on compatible GPUs. Requires PyTorch 2.0+ and a CUDA-capable GPU.
- Use BFloat16 for numerical stability: Set `dtype='bfloat16'` instead of `float16` to avoid loss spikes from numerical overflow during training.
- Adjust gradient accumulation for effective batch size: When reducing `batch_size` for memory, increase `gradient_accumulation_steps` proportionally to maintain the same effective batch.
- Lower the learning rate for fine-tuning: Use a 10-20x lower learning rate (3e-5) when fine-tuning pretrained weights compared to pretraining from scratch (6e-4).
- Add warmup when fine-tuning pretrained models: Set `warmup_iters` to 100-200 to avoid destabilizing pretrained weights with large initial gradients.
- Monitor validation loss, not training loss: Evaluate every `eval_interval` iterations and stop training when validation loss stops improving to avoid overfitting.
- Use lower temperature for coherent generation: Temperature 0.7-0.8 produces coherent text; temperature 1.0+ produces creative but less reliable output.
- Experiment with architecture changes in model.py: NanoGPT's simplicity makes it ideal for testing new attention mechanisms, activation functions, or normalization strategies.
- Use wandb for experiment tracking: NanoGPT has built-in W&B integration. Set `wandb_log=True` and `wandb_project='nanoGPT'` in your config.
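Several of these practices can be expressed together as a single config override. The fragment below is an illustrative sketch (values are not tuned, and `wandb_run_name` is a hypothetical label):

```python
# Illustrative config fragment applying the practices above
dataset = 'custom'
dtype = 'bfloat16'                 # avoids float16 loss spikes
compile = True                     # PyTorch 2.0+ with a CUDA GPU

# Halved batch_size, doubled accumulation: same effective batch
batch_size = 6
gradient_accumulation_steps = 80

warmup_iters = 200                 # gentle start for pretrained weights
eval_interval = 250                # check validation loss frequently

wandb_log = True                   # built-in W&B tracking
wandb_project = 'nanoGPT'
wandb_run_name = 'custom-run'      # hypothetical run name
```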
Troubleshooting
CUDA out of memory
Reduce `batch_size` to 1 and increase `gradient_accumulation_steps`. Reduce `block_size` from 1024 to 512. Use `dtype='bfloat16'` for roughly 50% memory reduction versus float32.
Training too slow on CPU
Set `compile=False` (compilation offers little benefit on CPU). Reduce the model size: `n_layer=4`, `n_head=4`, `n_embd=256`. Reduce `block_size` to 128.
Poor generation quality
Train for more iterations (increase `max_iters`). Lower the sampling temperature to 0.7. Add `top_k=200` filtering. Ensure the dataset is large enough (minimum ~1MB of text).
Cannot load GPT-2 weights
Install the transformers library: `pip install transformers`. Verify the model name: valid options are `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`.
Loss not decreasing
Verify data files exist and are non-empty: check the size of `data/custom/train.bin`. Reduce the learning rate if the loss oscillates. Check that `vocab_size` in the config matches the dataset vocabulary.
DDP training hangs
Ensure all GPUs are visible: `echo $CUDA_VISIBLE_DEVICES`. Use `torchrun` instead of the deprecated `torch.distributed.launch`. Debug NCCL with `NCCL_DEBUG=INFO`.
Generated text is repetitive
Increase `temperature` above 0.8. Add or increase `top_k` sampling. Train for more iterations to improve model quality. Use a larger model if the dataset is complex.