
Advanced Model Architecture with NanoGPT

Overview

NanoGPT is Andrej Karpathy's minimalist GPT implementation that distills the transformer architecture down to approximately 300 lines of model code and 300 lines of training code. Despite its simplicity, NanoGPT can reproduce GPT-2 (124M parameters) on OpenWebText with competitive performance, making it one of the best-known educational and prototyping tools for understanding how GPT models work at the implementation level. This template covers everything from character-level Shakespeare training to full GPT-2 reproduction, custom dataset preparation, architecture modifications, and advanced training strategies.

When to Use

  • Learning transformers: You want to understand every line of a GPT implementation without framework abstractions
  • Rapid prototyping: You need to test a new architectural idea (attention variant, activation function, normalization) quickly
  • Small-scale experiments: You want to train character-level or BPE-tokenized models on custom data with minimal setup
  • Educational settings: You are teaching or taking a course on deep learning and need a clear reference implementation
  • Architecture research: You want to modify transformer internals and measure the impact directly
  • CPU-friendly training: You need a model small enough to train on a laptop CPU in minutes

Choose alternatives when: you need production fine-tuning workflows (LitGPT, Axolotl), maximum distributed training performance (Megatron-LM), or inference serving (vLLM).

Quick Start

```bash
# Clone the repository
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT

# Install dependencies
pip install torch numpy transformers datasets tiktoken wandb tqdm

# Prepare Shakespeare character-level dataset
python data/shakespeare_char/prepare.py

# Train (runs in ~5 minutes on CPU)
python train.py config/train_shakespeare_char.py

# Generate text from trained model
python sample.py --out_dir=out-shakespeare-char
```

Expected output:

ROMEO:
What say'st thou? Shall I speak, and be a man?

JULIET:
I am afeard, and yet I'll speak; for thou art
One that hath been a man...

Core Concepts

The GPT Model (~300 lines)

NanoGPT's model.py implements the complete GPT architecture in pure PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head causal (masked) self-attention."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # Q, K, V combined
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # Output projection
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding dim
        # Compute Q, K, V for all heads in one linear pass
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Scaled dot-product attention with causal mask
        y = F.scaled_dot_product_attention(
            q, k, v, is_causal=True,
            dropout_p=self.dropout if self.training else 0,
        )
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)


class MLP(nn.Module):
    """Feed-forward network with GELU activation."""

    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))


class Block(nn.Module):
    """Transformer block: LayerNorm -> Attention -> Residual -> LayerNorm -> MLP -> Residual"""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```
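On PyTorch versions without `F.scaled_dot_product_attention`, nanoGPT falls back to a manual attention path with an explicit lower-triangular mask. The mechanics of that causal mask can be sketched in plain NumPy (single head, single batch, for clarity; the function name is illustrative, not nanoGPT's API):

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention: position t attends only to positions <= t."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (T, T) scaled dot products
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, w = causal_attention(q, k, v)
# Each row of w is a probability distribution over current and past positions only.
```

Because masked entries become `-inf` before the softmax, their weights are exactly zero, which is why each token's output depends only on earlier tokens.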

Training Configurations

```python
# Character-level Shakespeare (CPU-friendly, ~5 min)
# config/train_shakespeare_char.py
n_layer = 6
n_head = 6
n_embd = 384
block_size = 256      # Context window (characters)
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500
device = 'cpu'
compile = False
```

```python
# GPT-2 124M reproduction (8x A100, ~4 days)
# config/train_gpt2.py
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024     # Context window (BPE tokens)
batch_size = 12
gradient_accumulation_steps = 40  # Effective batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000
compile = True        # PyTorch 2.0 compilation
```
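As a sanity check on the GPT-2 config, the parameter count implied by the architecture can be computed by hand. A back-of-the-envelope sketch (it assumes GPT-2's 50,257-token BPE vocabulary and the weight tying nanoGPT uses between the token embedding and the output head, so `lm_head` adds no extra parameters; the helper name is ours):

```python
def gpt_param_count(n_layer, n_embd, vocab_size=50257, block_size=1024):
    """Rough parameter count for a GPT-2-style model (weights + biases)."""
    tok_emb = vocab_size * n_embd              # token embedding (tied with lm_head)
    pos_emb = block_size * n_embd              # learned position embedding
    attn = n_embd * 3 * n_embd + 3 * n_embd    # c_attn weight + bias
    attn += n_embd * n_embd + n_embd           # attention output c_proj
    mlp = n_embd * 4 * n_embd + 4 * n_embd     # c_fc
    mlp += 4 * n_embd * n_embd + n_embd        # MLP c_proj
    ln = 2 * (2 * n_embd)                      # two LayerNorms per block
    block = attn + mlp + ln
    final_ln = 2 * n_embd
    return tok_emb + pos_emb + n_layer * block + final_ln

print(gpt_param_count(12, 768))  # -> 124439808, i.e. the "124M" in GPT-2 124M
```

Running the same function on the Shakespeare config (`n_layer=6`, `n_embd=384`, character vocab) gives a model of roughly 10M parameters, which is why it trains in minutes.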

Custom Dataset Preparation

```python
# data/custom/prepare.py - Character-level tokenization
import pickle

import numpy as np

with open('my_data.txt', 'r') as f:
    text = f.read()

# Build character vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encode entire text
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# Train/validation split (90/10)
n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

# Save as binary files
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')

# Save metadata (train.py reads vocab_size from meta.pkl)
meta = {'vocab_size': vocab_size, 'itos': itos, 'stoi': stoi}
with open('data/custom/meta.pkl', 'wb') as f:
    pickle.dump(meta, f)
```
```python
# BPE tokenization using tiktoken (for larger models)
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
with open('my_data.txt', 'r') as f:
    text = f.read()

# Encode with BPE
tokens = enc.encode(text)
data = np.array(tokens, dtype=np.uint16)

# Split and save (90/10)
n = len(data)
data[:int(n * 0.9)].tofile('data/custom/train.bin')
data[int(n * 0.9):].tofile('data/custom/val.bin')
```

Fine-Tuning Pretrained GPT-2

```python
# config/finetune_shakespeare.py
init_from = 'gpt2'       # Load OpenAI GPT-2 weights
dataset = 'shakespeare'  # BPE-tokenized Shakespeare (GPT-2 weights require the BPE vocab)
batch_size = 1
block_size = 1024

# Fine-tuning hyperparameters (lower LR than pretraining)
learning_rate = 3e-5
max_iters = 2000
warmup_iters = 100
weight_decay = 1e-1

# GPT-2 variants available:
# init_from = 'gpt2'         # 124M parameters
# init_from = 'gpt2-medium'  # 350M parameters
# init_from = 'gpt2-large'   # 774M parameters
# init_from = 'gpt2-xl'      # 1558M parameters
```

Distributed Training with DDP

```bash
# Multi-GPU training with PyTorch Distributed Data Parallel
torchrun --standalone --nproc_per_node=8 \
    train.py config/train_gpt2.py

# The training script auto-detects DDP and adjusts:
# - gradient_accumulation_steps divided by world_size
# - Each GPU processes different data slices
# - Gradients synchronized via all-reduce
```
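The arithmetic behind that adjustment: because gradient_accumulation_steps is divided by world_size, the number of tokens processed per optimizer step stays identical to the single-GPU config. A small sketch (the helper name is ours, not part of nanoGPT):

```python
def tokens_per_step(batch_size, grad_accum, block_size, world_size=1):
    # train.py divides gradient_accumulation_steps by the DDP world size,
    # so each of the world_size GPUs runs grad_accum // world_size micro-batches
    per_gpu_accum = grad_accum // world_size
    return batch_size * per_gpu_accum * world_size * block_size

# GPT-2 config on 8 GPUs: 12 * (40 // 8) * 8 * 1024 tokens per optimizer step
print(tokens_per_step(12, 40, 1024, world_size=8))  # -> 491520 (~0.5M tokens)
```

The same call with `world_size=1` returns the same number, which is why single-GPU and 8-GPU runs see the same effective batch size, just at different wall-clock speeds.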

Sampling and Generation

```bash
# Generate from trained model
python sample.py \
    --out_dir=out-shakespeare-char \
    --num_samples=5 \
    --max_new_tokens=500 \
    --temperature=0.8 \
    --top_k=200

# Generate from pretrained GPT-2
python sample.py \
    --init_from=gpt2 \
    --start="The meaning of life is" \
    --num_samples=3
```
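What `temperature` and `top_k` do at each generation step can be sketched in NumPy (this mirrors the logic in sample.py/model.generate; the function name and signature are illustrative):

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_k=200, rng=None):
    """Sample one token id from raw logits with temperature and top-k filtering."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature  # <1 sharpens, >1 flattens
    if top_k and top_k < logits.size:
        kth = np.sort(logits)[-top_k]                        # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)     # drop everything below it
    probs = np.exp(logits - logits.max())                    # stable softmax
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```

With `top_k=1` this degenerates to greedy decoding (always the argmax); with a high temperature and `top_k=0` it samples from the full, nearly uniform distribution.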

Configuration Reference

| Parameter | Shakespeare (char) | GPT-2 124M | Fine-tune GPT-2 |
|---|---|---|---|
| n_layer | 6 | 12 | 12 (fixed by checkpoint) |
| n_head | 6 | 12 | 12 (fixed by checkpoint) |
| n_embd | 384 | 768 | 768 (fixed by checkpoint) |
| block_size | 256 | 1024 | 1024 |
| batch_size | 64 | 12 | 1 |
| learning_rate | 1e-3 | 6e-4 | 3e-5 |
| max_iters | 5000 | 600000 | 2000 |
| device | cpu | cuda | cuda |
| compile | False | True | True |
| Hardware Target | Training Time | VRAM Required |
|---|---|---|
| Shakespeare (CPU) | ~5 minutes | <1 GB |
| Shakespeare (T4 GPU) | ~1 minute | <1 GB |
| GPT-2 124M (1x A100) | ~1 week | ~16 GB |
| GPT-2 124M (8x A100) | ~4 days | ~16 GB/GPU |
| GPT-2 Medium 350M (8x A100) | ~2 weeks | ~40 GB/GPU |
| Sampling Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Higher = more random, lower = more deterministic |
| top_k | 200 | Keep only top-k tokens; 0 = no filtering |
| max_new_tokens | 500 | Number of tokens to generate |
| num_samples | 10 | Number of independent completions |

Best Practices

  1. Start with character-level Shakespeare: Always validate your setup on the Shakespeare dataset first -- it trains in minutes and exposes configuration issues immediately.

  2. Use PyTorch 2.0 compilation: Set compile=True for a free 2x speedup on compatible GPUs. Requires PyTorch 2.0+ and a CUDA-capable GPU.

  3. Use BFloat16 for numerical stability: Set dtype='bfloat16' instead of float16 to avoid loss spikes from numerical overflow during training.

  4. Adjust gradient accumulation for effective batch size: When reducing batch_size for memory, increase gradient_accumulation_steps proportionally to maintain the same effective batch.

  5. Lower learning rate for fine-tuning: Use 10-20x lower learning rate (3e-5) when fine-tuning pretrained weights compared to pretraining from scratch (6e-4).

  6. Add warmup for pretrained model fine-tuning: Set warmup_iters to 100-200 to avoid destabilizing pretrained weights with large initial gradients.

  7. Monitor validation loss, not training loss: Check eval_interval regularly and stop training when validation loss stops improving to avoid overfitting.

  8. Use lower temperature for factual generation: Temperature 0.7-0.8 produces coherent text. Temperature 1.0+ produces creative but less reliable output.

  9. Experiment with architecture changes in model.py: NanoGPT's simplicity makes it ideal for testing new attention mechanisms, activation functions, or normalization strategies.

  10. Use wandb for experiment tracking: NanoGPT has built-in W&B integration. Set wandb_log=True and wandb_project='nanoGPT' in your config.
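
The warmup and decay settings from items 5 and 6 come together in nanoGPT's learning-rate schedule: linear warmup, then cosine decay down to a floor. A sketch of that schedule (defaults match the GPT-2 config above; min_lr is conventionally learning_rate / 10):

```python
import math

def get_lr(it, learning_rate=6e-4, warmup_iters=2000,
           lr_decay_iters=600000, min_lr=6e-5):
    """Learning rate at iteration `it`: warmup, cosine decay, then a floor."""
    if it < warmup_iters:                       # 1) linear warmup from 0
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                     # 3) constant floor after decay ends
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # 2) cosine from 1 down to 0
    return min_lr + coeff * (learning_rate - min_lr)
```

For fine-tuning, plugging in `learning_rate=3e-5` and `warmup_iters=100` gives the gentler ramp recommended above.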

Troubleshooting

CUDA out of memory: Reduce batch_size to 1 and increase gradient_accumulation_steps to compensate. Reduce block_size from 1024 to 512. Use dtype='bfloat16' for a ~50% memory reduction versus float32.

Training too slow on CPU: Set compile=False (compilation targets GPUs). Reduce model size: n_layer=4, n_head=4, n_embd=256. Reduce block_size to 128.

Poor generation quality: Train for more iterations (increase max_iters). Lower the sampling temperature to 0.7. Add top_k=200 filtering. Ensure the dataset is large enough (at least ~1 MB of text).

Cannot load GPT-2 weights: Install the transformers library (pip install transformers). Verify the model name; valid options are gpt2, gpt2-medium, gpt2-large, and gpt2-xl.

Loss not decreasing: Verify the data files exist and are non-empty (check the size of data/custom/train.bin). Reduce the learning rate if the loss oscillates. Check that vocab_size in the config matches the dataset vocabulary.

DDP training hangs: Ensure all GPUs are visible (echo $CUDA_VISIBLE_DEVICES). Use torchrun instead of the deprecated torch.distributed.launch. Debug NCCL issues with NCCL_DEBUG=INFO.

Generated text is repetitive: Increase temperature above 0.8. Add or increase top_k sampling. Train for more iterations to improve model quality. Use a larger model if the dataset is complex.
