# LitGPT Model Architecture Toolkit
## Overview
LitGPT is a comprehensive toolkit from Lightning AI that provides clean, hackable implementations of 20+ large language model architectures with production-ready training, fine-tuning, and deployment workflows. Unlike monolithic frameworks, LitGPT prioritizes readable code and modular design, making it ideal for researchers who want to understand model internals while having access to battle-tested training recipes. It supports the full lifecycle from pretraining through LoRA fine-tuning to quantized deployment, with first-class support for multi-GPU training via FSDP.
## When to Use
- **Understanding LLM architectures**: You want clean, readable implementations of GPT, LLaMA, Gemma, Phi, Mistral, and other model families
- **Fine-tuning with LoRA**: You need memory-efficient adapter-based fine-tuning on a single GPU (12-16 GB VRAM)
- **Full fine-tuning**: You have sufficient hardware (40+ GB VRAM) and want to update all model weights
- **Pretraining from scratch**: You want to train a new language model on custom domain data
- **Model deployment**: You need to convert, quantize, and serve LLM checkpoints with a simple API
- **Educational purposes**: You are learning transformer architectures and need clear reference implementations
Choose alternatives when: you need the broadest model support (HuggingFace Transformers), maximum distributed training performance at 70B+ scale (Megatron-Core), or inference-only serving (vLLM).
## Quick Start
```bash
# Install LitGPT with all extras
pip install 'litgpt[extra]'

# List all available models
litgpt download list

# Download a model
litgpt download microsoft/phi-2

# Run inference
python -c "
from litgpt import LLM
llm = LLM.load('microsoft/phi-2')
print(llm.generate('Explain quantum computing in simple terms:', max_new_tokens=100))
"
```
## Core Concepts
### Model Loading and Inference
```python
from litgpt import LLM

# Load any supported model
llm = LLM.load("microsoft/phi-2")

# Basic generation
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7,
    top_k=40
)
print(result)

# Streaming generation
for token in llm.generate("Explain neural networks:", stream=True):
    print(token, end="", flush=True)

# Batch inference
prompts = [
    "Translate to French: Hello",
    "Summarize: Machine learning is...",
    "Write a haiku about coding"
]
results = [llm.generate(p, max_new_tokens=50) for p in prompts]
```
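Instruction-tuned checkpoints respond best when prompts follow the template they were fine-tuned on. Below is a stdlib-only sketch of the standard Alpaca template; the `format_alpaca_prompt` helper is ours for illustration, not part of the LitGPT API, and the resulting string is simply passed to `llm.generate()` as a plain prompt:

```python
def format_alpaca_prompt(instruction: str, inp: str = "") -> str:
    """Build a prompt in the standard Alpaca template.

    Illustrative helper, not a LitGPT function; pass the returned
    string to llm.generate() like any other prompt.
    """
    if inp:
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

prompt = format_alpaca_prompt("Translate to French", "Hello")
```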
### LoRA Fine-Tuning (Single GPU)
The most memory-efficient approach for customizing a pretrained model.
```bash
# Step 1: Download base model
litgpt download microsoft/phi-2

# Step 2: Prepare Alpaca-format dataset
# data/my_dataset.json should contain:
# [{"instruction": "...", "input": "", "output": "..."}, ...]

# Step 3: Run LoRA fine-tuning
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_query true \
  --lora_value true \
  --lora_projection true \
  --train.epochs 3 \
  --train.learning_rate 1e-4 \
  --train.micro_batch_size 4 \
  --train.global_batch_size 32 \
  --out_dir out/phi2-lora
```
```bash
# Step 4: Merge LoRA adapters into base model
litgpt merge_lora out/phi2-lora/final --out_dir out/phi2-merged
```

```python
# Step 5: Use the fine-tuned model
from litgpt import LLM

llm = LLM.load("out/phi2-merged")
print(llm.generate("Your domain-specific prompt here"))
```
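Step 2's dataset must be a non-empty JSON array of instruction records. A small stdlib-only sanity check can catch format problems before you spend GPU hours; `validate_alpaca` is our own sketch, not a LitGPT utility:

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_alpaca(path: str) -> int:
    """Check that a JSON file matches the Alpaca format LitGPT's
    JSON data loader expects; return the number of records."""
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list) or not data:
        raise ValueError("dataset must be a non-empty JSON array")
    for i, rec in enumerate(data):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
        if not rec["instruction"] or not rec["output"]:
            raise ValueError(f"record {i} has an empty instruction or output")
    return len(data)
```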
### Full Fine-Tuning
```bash
# Requires 40GB+ GPU for 7B models
litgpt finetune \
  meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --train.max_steps 1000 \
  --train.learning_rate 2e-5 \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16
```
### Pretraining from Scratch
```bash
# Step 1: Prepare tokenized data
python scripts/prepare_dataset.py \
  --source_path data/my_corpus.txt \
  --checkpoint_dir checkpoints/tokenizer \
  --destination_path data/pretrain \
  --split train,val
```

```yaml
# Step 2: Configure architecture (config/custom-model.yaml)
model_name: custom-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
```

```bash
# Step 3: Launch pretraining
litgpt pretrain \
  --config config/custom-model.yaml \
  --data.data_dir data/pretrain \
  --train.max_tokens 10_000_000_000

# Multi-GPU with FSDP
litgpt pretrain \
  --config config/custom-model.yaml \
  --data.data_dir data/pretrain \
  --devices 8 \
  --train.max_tokens 100_000_000_000
```
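The `--train.max_tokens` budget translates into optimizer steps through the batch geometry: each step consumes one global batch of full-length sequences. A back-of-the-envelope sketch (plain arithmetic, not a LitGPT API; the batch size of 512 below is an assumed example):

```python
def pretrain_steps(max_tokens: int, global_batch_size: int, block_size: int) -> int:
    """Each optimizer step consumes global_batch_size sequences of
    block_size tokens, so the step count is the token budget divided
    by tokens-per-step (rounded up)."""
    tokens_per_step = global_batch_size * block_size
    return -(-max_tokens // tokens_per_step)  # ceiling division

# 10B tokens with a global batch of 512 sequences of 2048 tokens each:
steps = pretrain_steps(10_000_000_000, 512, 2048)  # 9537 steps
```

Doubling the global batch or the context length halves the step count but not the compute, since each step processes twice the tokens.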
### Quantization and Deployment
```bash
# Quantize model for smaller deployment
litgpt convert_lit_checkpoint out/phi2-lora/final --quantize bnb.nf4
```

```python
# Deploy with FastAPI (save as api.py)
from fastapi import FastAPI
from litgpt import LLM

app = FastAPI()
llm = LLM.load("out/phi2-merged")

@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
    result = llm.generate(prompt, max_new_tokens=max_tokens, temperature=0.7)
    return {"response": result}

# Run: uvicorn api:app --host 0.0.0.0 --port 8000
```
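To see why NF4 helps, compare raw weight memory across precisions. This is a rough illustrative estimator of our own (real usage adds KV-cache and activation overhead, and NF4 stores a little more than 4 bits per weight because of block-wise quantization constants):

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: parameters x bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1024**3

phi2 = 2.7e9                              # Phi-2 parameter count
bf16 = weight_memory_gib(phi2, 16)        # roughly 5 GiB of weights
nf4 = weight_memory_gib(phi2, 4.5)        # ~4 bits + quantization constants
```

The roughly 3.5x reduction in weight memory is what lets a quantized 2.7B model fit comfortably under the 6 GB inference figure in the hardware table.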
## Configuration Reference
| LoRA Parameter | Typical Value | Description |
|---|---|---|
| `lora_r` | 8, 16, 32, 64 | LoRA rank (higher = more capacity, larger adapters) |
| `lora_alpha` | 2x `lora_r` | LoRA scaling factor |
| `lora_dropout` | 0.05 | Dropout for LoRA layers |
| `lora_query` | true | Apply LoRA to attention query projection |
| `lora_key` | false | Apply LoRA to attention key projection |
| `lora_value` | true | Apply LoRA to attention value projection |
| `lora_projection` | true | Apply LoRA to attention output projection |
| `lora_mlp` | false | Apply LoRA to MLP layers |
| `lora_head` | false | Apply LoRA to language model head |
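The rank directly sets adapter size: each adapted weight matrix of shape (d_out, d_in) gains two low-rank factors, A (r x d_in) and B (d_out x r). A small sketch of the parameter count (our own arithmetic; the hidden size and layer count below are assumed Phi-2-like figures, not exact LitGPT internals):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted matrix."""
    return r * d_in + d_out * r

# Query + value projections in one attention block, hidden size 2560:
d = 2560
per_layer = lora_params(d, d, 16) * 2   # lora_query and lora_value
total = per_layer * 32                  # assume 32 transformer layers
adapter_mb = total * 2 / 1e6            # bf16 = 2 bytes per parameter
```

Because the count is linear in r, doubling the rank doubles adapter size, which is the trade-off the "higher = more capacity, larger adapters" note describes.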
| Training Parameter | Default | Description |
|---|---|---|
| `train.learning_rate` | 1e-4 (LoRA), 2e-5 (full) | Optimizer learning rate |
| `train.micro_batch_size` | 1-4 | Per-device batch size |
| `train.global_batch_size` | 16-64 | Effective batch size across accumulation steps |
| `train.max_steps` | 1000 | Maximum training steps |
| `train.epochs` | 3 | Number of training epochs |
| `train.gradient_accumulation_iters` | auto | Gradient accumulation steps per optimizer step |
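The accumulation count is derived from the other batch settings: the global batch is assembled from micro-batches across devices. A sketch of the relationship (our own arithmetic mirroring how the `auto` default behaves conceptually, not LitGPT source code):

```python
def accumulation_iters(global_batch_size: int, micro_batch_size: int,
                       devices: int = 1) -> int:
    """Micro-batch forward/backward passes accumulated per optimizer
    step; the global batch must divide evenly across them."""
    per_step = micro_batch_size * devices
    if global_batch_size % per_step:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * devices")
    return global_batch_size // per_step

# 32 global / (4 micro x 1 GPU) -> 8 accumulation iterations
iters = accumulation_iters(32, 4)
```

This is why shrinking `micro_batch_size` to fit VRAM does not change training dynamics: the optimizer still sees the same effective batch, just accumulated over more passes.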
| Hardware Requirement | VRAM Needed |
|---|---|
| Inference (Phi-2, 2.7B) | 6 GB |
| LoRA fine-tuning (7B) | 16 GB |
| Full fine-tuning (7B) | 40+ GB |
| Pretraining (1B) | 24 GB |
## Best Practices
- **Start with LoRA before full fine-tuning**: LoRA achieves 90-95% of full fine-tuning quality at a fraction of the memory cost, making it the default choice for most fine-tuning tasks.
- **Choose the right LoRA rank**: Use `r=8` for lightweight adapters (2-4 MB), `r=16` for standard quality, `r=32` for complex tasks, and `r=64` only when maximum capacity is needed.
- **Use gradient accumulation for large effective batches**: Set `micro_batch_size=1` and increase `gradient_accumulation_iters` to simulate large batches on limited VRAM.
- **Monitor with TensorBoard**: Add `--train.logger_name tensorboard` to any training command and run `tensorboard --logdir out/` to visualize training curves.
- **Validate early and often**: Set `--eval_interval` to check validation loss frequently during training, catching overfitting before wasting compute.
- **Use Alpaca format consistently**: Structure all instruction datasets as `{"instruction", "input", "output"}` JSON for compatibility with LitGPT data loaders.
- **Merge LoRA weights for deployment**: Always merge adapters into the base model before production deployment to avoid the overhead of adapter loading at inference time.
- **Flash Attention is enabled automatically**: LitGPT enables Flash Attention on Ampere+ GPUs (A100, RTX 30/40 series) by default with no configuration needed.
- **Use BFloat16 precision**: Set `--precision bf16-true` for training stability on modern GPUs, avoiding the numerical issues of float16.
- **Test on small models first**: Prototype your data pipeline and training configuration on Phi-2 (2.7B) or Llama-3-1B before scaling to larger models.
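The "validate early and often" advice can be made mechanical: stop once validation loss has plateaued for several evaluations in a row. A stdlib-only sketch of such a check (our own helper, not a LitGPT feature; you would apply it to the losses logged at each `--eval_interval`):

```python
def should_stop(val_losses: list, patience: int = 3) -> bool:
    """Return True when validation loss has not improved over the last
    `patience` evaluations -- the early overfitting signal that frequent
    validation checks are meant to catch."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Loss bottoms out at 1.7 then climbs for three evals -> stop
stop = should_stop([2.0, 1.8, 1.7, 1.75, 1.8, 1.85])
```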
## Troubleshooting
### Out of memory during fine-tuning
Switch from `litgpt finetune` to `litgpt finetune_lora`. If already using LoRA, reduce `micro_batch_size` to 1 and increase `gradient_accumulation_iters`. Reduce `lora_r` from 16 to 8.
### Model not found after download
Run `ls checkpoints/` to verify the directory structure. Model names are case-sensitive: use the exact name from `litgpt download list`.
### Training loss not decreasing
Lower the learning rate (try 3e-5 for full fine-tuning, 5e-5 for LoRA). Verify your dataset format matches the expected Alpaca JSON structure. Check that the dataset is not empty or corrupt.
### LoRA adapters too large
Reduce `lora_r` to 8. Disable LoRA on less critical layers: set `--lora_projection false` and `--lora_mlp false`.
### Slow training speed
Ensure PyTorch 2.0+ is installed for automatic compilation. Verify GPU utilization with `nvidia-smi`. Set `--train.micro_batch_size` as large as VRAM allows.
### Generation quality poor after fine-tuning
Train for more steps or epochs. Increase the dataset size (at least 1,000 examples is recommended). Lower the temperature during inference (0.5-0.7).
### FSDP multi-GPU errors
Ensure all GPUs have the same VRAM. Use `torchrun --nproc_per_node=N` for multi-GPU launches. Check NCCL connectivity with `NCCL_DEBUG=INFO`.