Advanced Model Architecture with NanoGPT
Overview
NanoGPT is Andrej Karpathy's minimalist GPT implementation that distills the transformer architecture down to approximately 300 lines of model code and 300 lines of training code. Despite its simplicity, NanoGPT can reproduce GPT-2 (124M parameters) on OpenWebText with competitive performance, making it the definitive educational and prototyping tool for understanding how GPT models work at the implementation level. This template covers everything from character-level Shakespeare training to full GPT-2 reproduction, custom dataset preparation, architecture modifications, and advanced training strategies.
When to Use
- Learning transformers: You want to understand every line of a GPT implementation without framework abstractions
- Rapid prototyping: You need to test a new architectural idea (attention variant, activation function, normalization) quickly
- Small-scale experiments: You want to train character-level or BPE-tokenized models on custom data with minimal setup
- Educational settings: You are teaching or taking a course on deep learning and need a clear reference implementation
- Architecture research: You want to modify transformer internals and measure the impact directly
- CPU-friendly training: You need a model small enough to train on a laptop CPU in minutes
Choose alternatives when: you need production fine-tuning workflows (LitGPT, Axolotl), maximum distributed training performance (Megatron-LM), or inference serving (vLLM).
Quick Start
```sh
# Clone the repository
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT

# Install dependencies
pip install torch numpy transformers datasets tiktoken wandb tqdm

# Prepare Shakespeare character-level dataset
python data/shakespeare_char/prepare.py

# Train (runs in ~5 minutes on CPU)
python train.py config/train_shakespeare_char.py

# Generate text from trained model
python sample.py --out_dir=out-shakespeare-char
```
Expected output:
```text
ROMEO:
What say'st thou? Shall I speak, and be a man?

JULIET:
I am afeard, and yet I'll speak; for thou art
One that hath been a man...
```
Core Concepts
The GPT Model (~300 lines)
NanoGPT's model.py implements the complete GPT architecture in pure PyTorch.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head causal (masked) self-attention."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # Q, K, V combined
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # Output projection
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding dim
        # Compute Q, K, V for all heads in one linear pass
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Scaled dot-product attention with causal mask
        y = F.scaled_dot_product_attention(
            q, k, v, is_causal=True,
            dropout_p=self.dropout if self.training else 0
        )
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)


class MLP(nn.Module):
    """Feed-forward network with GELU activation."""

    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))


class Block(nn.Module):
    """Transformer block: LayerNorm -> Attention -> Residual -> LayerNorm -> MLP -> Residual"""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```
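The 124M parameter count cited for GPT-2 can be sanity-checked against this architecture with a back-of-the-envelope count. This is a rough sketch that ignores biases and LayerNorm parameters, and it assumes the token embedding is tied with the output head (as in GPT-2):

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size):
    """Rough parameter count for the GPT architecture above (biases ignored)."""
    attn = 3 * n_embd * n_embd + n_embd * n_embd      # c_attn + c_proj
    mlp = 4 * n_embd * n_embd + 4 * n_embd * n_embd   # c_fc + c_proj
    blocks = n_layer * (attn + mlp)                   # 12 * n_embd^2 per block
    embeddings = vocab_size * n_embd                  # token embedding (tied with lm_head)
    positions = block_size * n_embd                   # learned positional embedding
    return blocks + embeddings + positions

# GPT-2 small: n_layer=12, n_embd=768, vocab=50257, context=1024
total = gpt_param_count(12, 768, 50257, 1024)
print(f"{total / 1e6:.1f}M parameters")  # -> 124.3M parameters
```

The per-block cost works out to 12 * n_embd^2 (4 * n_embd^2 for attention plus 8 * n_embd^2 for the MLP), which is why parameter counts scale quadratically with embedding width.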
Training Configurations
```python
# Character-level Shakespeare (CPU-friendly, ~5 min)
# config/train_shakespeare_char.py
n_layer = 6
n_head = 6
n_embd = 384
block_size = 256        # Context window (characters)
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500
device = 'cpu'
compile = False

# GPT-2 124M reproduction (8x A100, ~4 days)
# config/train_gpt2.py
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024       # Context window (BPE tokens)
batch_size = 12
gradient_accumulation_steps = 40  # Effective batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000
compile = True          # PyTorch 2.0 compilation
```
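NanoGPT's train.py pairs these settings with a learning-rate schedule: linear warmup, then cosine decay down to a minimum rate at `lr_decay_iters`. The sketch below follows that shape; `warmup_iters` and `min_lr` are additional config values not shown above, set here to illustrative numbers:

```python
import math

learning_rate = 6e-4
min_lr = 6e-5           # typically learning_rate / 10
warmup_iters = 2000     # illustrative; see your config
lr_decay_iters = 600000

def get_lr(it):
    """Linear warmup, cosine decay to min_lr, then constant floor."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

The cosine shape means most of the decay happens in the middle of training, with the rate flattening out near `min_lr` at the end.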
Custom Dataset Preparation
```python
# data/custom/prepare.py - Character-level tokenization
import pickle

import numpy as np

with open('my_data.txt', 'r') as f:
    text = f.read()

# Build character vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encode entire text
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# Train/validation split (90/10)
n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

# Save as binary files
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')

# Save metadata
meta = {'vocab_size': vocab_size, 'itos': itos, 'stoi': stoi}
with open('data/custom/meta.pkl', 'wb') as f:
    pickle.dump(meta, f)
```
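Once the `.bin` files exist, train.py reads them back with `np.memmap` and slices random contiguous windows to form batches. The following is a minimal numpy-only sketch of that loading pattern (the synthetic token file and sizes are illustrative, not from the repository):

```python
import os
import tempfile

import numpy as np

def get_batch(path, block_size, batch_size, rng):
    """Sample random contiguous windows from a uint16 token file."""
    data = np.memmap(path, dtype=np.uint16, mode='r')
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i:i + block_size] for i in ix]).astype(np.int64)
    y = np.stack([data[i + 1:i + 1 + block_size] for i in ix]).astype(np.int64)
    return x, y  # y is x shifted by one token: next-token prediction targets

# Demo on a tiny synthetic token file
path = os.path.join(tempfile.mkdtemp(), 'train.bin')
np.arange(1000, dtype=np.uint16).tofile(path)
x, y = get_batch(path, block_size=8, batch_size=4, rng=np.random.default_rng(0))
assert x.shape == (4, 8) and (y[:, :-1] == x[:, 1:]).all()
```

Using `np.memmap` keeps memory usage flat regardless of dataset size, since only the sampled windows are actually read from disk.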
```python
# BPE tokenization using tiktoken (for larger models)
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
with open('my_data.txt', 'r') as f:
    text = f.read()

# Encode with BPE (GPT-2's vocabulary of 50257 tokens fits in uint16)
tokens = enc.encode(text)
data = np.array(tokens, dtype=np.uint16)

# Split and save
n = len(data)
data[:int(n * 0.9)].tofile('data/custom/train.bin')
data[int(n * 0.9):].tofile('data/custom/val.bin')
```
Fine-Tuning Pretrained GPT-2
```python
# config/finetune_shakespeare.py
init_from = 'gpt2'        # Load OpenAI GPT-2 weights
dataset = 'shakespeare'   # BPE-tokenized Shakespeare (must match GPT-2's vocabulary)
batch_size = 1
block_size = 1024

# Fine-tuning hyperparameters (lower LR than pretraining)
learning_rate = 3e-5
max_iters = 2000
warmup_iters = 100
weight_decay = 1e-1

# GPT-2 variants available:
# init_from = 'gpt2'         # 124M parameters
# init_from = 'gpt2-medium'  # 350M parameters
# init_from = 'gpt2-large'   # 774M parameters
# init_from = 'gpt2-xl'      # 1558M parameters
```
Distributed Training with DDP
```sh
# Multi-GPU training with PyTorch Distributed Data Parallel
torchrun --standalone --nproc_per_node=8 \
    train.py config/train_gpt2.py

# The training script auto-detects DDP and adjusts:
# - gradient_accumulation_steps divided by world_size
# - Each GPU processes different data slices
# - Gradients synchronized via all-reduce
```
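Because the accumulation steps are divided across ranks, the effective batch size stays constant whether you train on one GPU or eight. A quick check of the arithmetic, using the GPT-2 config values from above:

```python
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 40   # as specified in config/train_gpt2.py
world_size = 8                     # GPUs launched by torchrun

# train.py divides accumulation steps evenly across ranks
assert gradient_accumulation_steps % world_size == 0
per_rank_accum = gradient_accumulation_steps // world_size  # 5 micro-steps per rank

# Tokens processed per optimizer step, summed over all ranks
tokens_per_iter = batch_size * per_rank_accum * block_size * world_size
print(tokens_per_iter)  # 491520, i.e. the ~0.5M tokens noted in the config
```

If `gradient_accumulation_steps` is not divisible by the world size, the division silently changes the effective batch, so it is worth asserting as above when you modify either value.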
Sampling and Generation
```sh
# Generate from trained model
python sample.py \
    --out_dir=out-shakespeare-char \
    --num_samples=5 \
    --max_new_tokens=500 \
    --temperature=0.8 \
    --top_k=200

# Generate from pretrained GPT-2
python sample.py \
    --init_from=gpt2 \
    --start="The meaning of life is" \
    --num_samples=3
```
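Under the hood, `temperature` and `top_k` transform the model's logits before each token is sampled. The numpy sketch below illustrates that decoding step on random logits (it is a standalone illustration, not code from sample.py, which operates on torch tensors):

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_k=200, rng=None):
    """Temperature-scale logits, keep only the top-k, then sample one token."""
    rng = rng or np.random.default_rng()
    logits = logits / temperature          # <1.0 sharpens, >1.0 flattens
    if top_k and top_k < len(logits):
        kth = np.sort(logits)[-top_k]      # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)  # mask the rest out
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)             # stand-in for model output over the vocab
token = sample_next(logits, temperature=0.8, top_k=5, rng=rng)
assert token in np.argsort(logits)[-5:]    # always one of the 5 highest-logit tokens
```

Since temperature scaling preserves the ranking of logits, `top_k` always selects the same candidate set; temperature only changes how the probability mass is spread among them.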
Configuration Reference
| Parameter | Shakespeare (char) | GPT-2 124M | Fine-tune GPT-2 |
|---|---|---|---|
| n_layer | 6 | 12 | 12 (fixed by checkpoint) |
| n_head | 6 | 12 | 12 (fixed by checkpoint) |
| n_embd | 384 | 768 | 768 (fixed by checkpoint) |
| block_size | 256 | 1024 | 1024 |
| batch_size | 64 | 12 | 1 |
| learning_rate | 1e-3 | 6e-4 | 3e-5 |
| max_iters | 5000 | 600000 | 2000 |
| device | cpu | cuda | cuda |
| compile | False | True | True |

| Hardware Target | Training Time | VRAM Required |
|---|---|---|
| Shakespeare (CPU) | ~5 minutes | <1 GB |
| Shakespeare (T4 GPU) | ~1 minute | <1 GB |
| GPT-2 124M (1x A100) | ~1 week | ~16 GB |
| GPT-2 124M (8x A100) | ~4 days | ~16 GB/GPU |
| GPT-2 Medium 350M (8x A100) | ~2 weeks | ~40 GB/GPU |
| Sampling Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Higher = more random, lower = more deterministic |
| top_k | 200 | Keep only the top-k tokens; 0 = no filtering |
| max_new_tokens | 500 | Number of tokens to generate |
| num_samples | 10 | Number of independent completions |
Best Practices
- Start with character-level Shakespeare: Always validate your setup on the Shakespeare dataset first -- it trains in minutes and exposes configuration issues immediately.
- Use PyTorch 2.0 compilation: Set `compile=True` for a roughly 2x speedup on compatible GPUs. Requires PyTorch 2.0+ and a CUDA-capable GPU.
- Use BFloat16 for numerical stability: Set `dtype='bfloat16'` instead of `float16` to avoid loss spikes from numerical overflow during training.
- Adjust gradient accumulation for effective batch size: When reducing `batch_size` for memory, increase `gradient_accumulation_steps` proportionally to maintain the same effective batch.
- Lower the learning rate for fine-tuning: Use a 10-20x lower learning rate (3e-5) when fine-tuning pretrained weights compared to pretraining from scratch (6e-4).
- Add warmup when fine-tuning pretrained models: Set `warmup_iters` to 100-200 to avoid destabilizing pretrained weights with large initial gradients.
- Monitor validation loss, not training loss: Evaluate every `eval_interval` iterations and stop training when validation loss stops improving to avoid overfitting.
- Use lower temperature for coherent generation: Temperature 0.7-0.8 produces coherent text; temperature 1.0+ produces creative but less reliable output.
- Experiment with architecture changes in model.py: NanoGPT's simplicity makes it ideal for testing new attention mechanisms, activation functions, or normalization strategies.
- Use wandb for experiment tracking: NanoGPT has built-in W&B integration. Set `wandb_log=True` and `wandb_project='nanoGPT'` in your config.
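Several of these practices can be expressed together as a single config override. The fragment below is an illustrative sketch (values are not tuned, and `wandb_run_name` is a hypothetical label):

```python
# Illustrative config fragment applying the practices above
dataset = 'custom'
dtype = 'bfloat16'                 # avoids float16 loss spikes
compile = True                     # PyTorch 2.0+ with a CUDA GPU

# Halved batch_size, doubled accumulation: same effective batch
batch_size = 6
gradient_accumulation_steps = 80

warmup_iters = 200                 # gentle start for pretrained weights
eval_interval = 250                # check validation loss frequently

wandb_log = True                   # built-in W&B tracking
wandb_project = 'nanoGPT'
wandb_run_name = 'custom-run'      # hypothetical run name
```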
Troubleshooting
CUDA out of memory
Reduce `batch_size` to 1 and increase `gradient_accumulation_steps`. Reduce `block_size` from 1024 to 512. Use `dtype='bfloat16'` for roughly 50% memory reduction versus float32.
Training too slow on CPU
Set `compile=False` (compilation offers little benefit on CPU). Reduce the model size: `n_layer=4`, `n_head=4`, `n_embd=256`. Reduce `block_size` to 128.
Poor generation quality
Train for more iterations (increase `max_iters`). Lower the sampling temperature to 0.7. Add `top_k=200` filtering. Ensure the dataset is large enough (minimum ~1MB of text).
Cannot load GPT-2 weights
Install the transformers library: `pip install transformers`. Verify the model name: valid options are `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`.
Loss not decreasing
Verify data files exist and are non-empty: check the size of `data/custom/train.bin`. Reduce the learning rate if the loss oscillates. Check that `vocab_size` in the config matches the dataset vocabulary.
DDP training hangs
Ensure all GPUs are visible: `echo $CUDA_VISIBLE_DEVICES`. Use `torchrun` instead of the deprecated `torch.distributed.launch`. Debug NCCL with `NCCL_DEBUG=INFO`.
Generated text is repetitive
Increase `temperature` above 0.8. Add or increase `top_k` sampling. Train for more iterations to improve model quality. Use a larger model if the dataset is complex.