# LitGPT Model Architecture Toolkit
## Overview
LitGPT is a comprehensive toolkit from Lightning AI that provides clean, hackable implementations of 20+ large language model architectures with production-ready training, fine-tuning, and deployment workflows. Unlike monolithic frameworks, LitGPT prioritizes readable code and modular design, making it ideal for researchers who want to understand model internals while having access to battle-tested training recipes. It supports the full lifecycle from pretraining through LoRA fine-tuning to quantized deployment, with first-class support for multi-GPU training via FSDP.
## When to Use
- **Understanding LLM architectures**: You want clean, readable implementations of GPT, LLaMA, Gemma, Phi, Mistral, and other model families
- **Fine-tuning with LoRA**: You need memory-efficient adapter-based fine-tuning on a single GPU (12-16 GB VRAM)
- **Full fine-tuning**: You have sufficient hardware (40+ GB VRAM) and want to update all model weights
- **Pretraining from scratch**: You want to train a new language model on custom domain data
- **Model deployment**: You need to convert, quantize, and serve LLM checkpoints with a simple API
- **Educational purposes**: You are learning transformer architectures and need clear reference implementations
Choose alternatives when: you need the broadest model support (HuggingFace Transformers), maximum distributed training performance at 70B+ scale (Megatron-Core), or inference-only serving (vLLM).
## Quick Start
```bash
# Install LitGPT with all extras
pip install 'litgpt[extra]'

# List all available models
litgpt download list

# Download a model
litgpt download microsoft/phi-2

# Run inference
python -c "
from litgpt import LLM
llm = LLM.load('microsoft/phi-2')
print(llm.generate('Explain quantum computing in simple terms:', max_new_tokens=100))
"
```
## Core Concepts
### Model Loading and Inference
```python
from litgpt import LLM

# Load any supported model
llm = LLM.load("microsoft/phi-2")

# Basic generation
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7,
    top_k=40
)
print(result)

# Streaming generation
for token in llm.generate("Explain neural networks:", stream=True):
    print(token, end="", flush=True)

# Batch inference
prompts = [
    "Translate to French: Hello",
    "Summarize: Machine learning is...",
    "Write a haiku about coding"
]
results = [llm.generate(p, max_new_tokens=50) for p in prompts]
```
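Instruction-tuned checkpoints respond best when prompts follow the template they were fine-tuned on. Below is a stdlib-only sketch of the standard Alpaca template; the `format_alpaca_prompt` helper is ours for illustration, not part of the LitGPT API, and the resulting string is simply passed to `llm.generate()` as a plain prompt:

```python
def format_alpaca_prompt(instruction: str, inp: str = "") -> str:
    """Build a prompt in the standard Alpaca template.

    Illustrative helper, not a LitGPT function; pass the returned
    string to llm.generate() like any other prompt.
    """
    if inp:
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

prompt = format_alpaca_prompt("Translate to French", "Hello")
```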
### LoRA Fine-Tuning (Single GPU)
The most memory-efficient approach for customizing a pretrained model.
```bash
# Step 1: Download base model
litgpt download microsoft/phi-2

# Step 2: Prepare Alpaca-format dataset
# data/my_dataset.json should contain:
# [{"instruction": "...", "input": "", "output": "..."}, ...]

# Step 3: Run LoRA fine-tuning
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_query true \
  --lora_value true \
  --lora_projection true \
  --train.epochs 3 \
  --train.learning_rate 1e-4 \
  --train.micro_batch_size 4 \
  --train.global_batch_size 32 \
  --out_dir out/phi2-lora
```
```bash
# Step 4: Merge LoRA adapters into base model
litgpt merge_lora out/phi2-lora/final --out_dir out/phi2-merged
```

```python
# Step 5: Use the fine-tuned model
from litgpt import LLM

llm = LLM.load("out/phi2-merged")
print(llm.generate("Your domain-specific prompt here"))
```
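Step 2's dataset must be a non-empty JSON array of instruction records. A small stdlib-only sanity check can catch format problems before you spend GPU hours; `validate_alpaca` is our own sketch, not a LitGPT utility:

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_alpaca(path: str) -> int:
    """Check that a JSON file matches the Alpaca format LitGPT's
    JSON data loader expects; return the number of records."""
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list) or not data:
        raise ValueError("dataset must be a non-empty JSON array")
    for i, rec in enumerate(data):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
        if not rec["instruction"] or not rec["output"]:
            raise ValueError(f"record {i} has an empty instruction or output")
    return len(data)
```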
### Full Fine-Tuning
```bash
# Requires 40GB+ GPU for 7B models
litgpt finetune \
  meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --train.max_steps 1000 \
  --train.learning_rate 2e-5 \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16
```
### Pretraining from Scratch
```bash
# Step 1: Prepare tokenized data
python scripts/prepare_dataset.py \
  --source_path data/my_corpus.txt \
  --checkpoint_dir checkpoints/tokenizer \
  --destination_path data/pretrain \
  --split train,val
```

```yaml
# Step 2: Configure architecture (config/custom-model.yaml)
model_name: custom-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
```

```bash
# Step 3: Launch pretraining
litgpt pretrain \
  --config config/custom-model.yaml \
  --data.data_dir data/pretrain \
  --train.max_tokens 10_000_000_000

# Multi-GPU with FSDP
litgpt pretrain \
  --config config/custom-model.yaml \
  --data.data_dir data/pretrain \
  --devices 8 \
  --train.max_tokens 100_000_000_000
```
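The `--train.max_tokens` budget translates into optimizer steps through the batch geometry: each step consumes one global batch of full-length sequences. A back-of-the-envelope sketch (plain arithmetic, not a LitGPT API; the batch size of 512 below is an assumed example):

```python
def pretrain_steps(max_tokens: int, global_batch_size: int, block_size: int) -> int:
    """Each optimizer step consumes global_batch_size sequences of
    block_size tokens, so the step count is the token budget divided
    by tokens-per-step (rounded up)."""
    tokens_per_step = global_batch_size * block_size
    return -(-max_tokens // tokens_per_step)  # ceiling division

# 10B tokens with a global batch of 512 sequences of 2048 tokens each:
steps = pretrain_steps(10_000_000_000, 512, 2048)  # 9537 steps
```

Doubling the global batch or the context length halves the step count but not the compute, since each step processes twice the tokens.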
### Quantization and Deployment
```bash
# Quantize model for smaller deployment
litgpt convert_lit_checkpoint out/phi2-lora/final --quantize bnb.nf4
```

```python
# Deploy with FastAPI (save as api.py)
from fastapi import FastAPI
from litgpt import LLM

app = FastAPI()
llm = LLM.load("out/phi2-merged")

@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
    result = llm.generate(prompt, max_new_tokens=max_tokens, temperature=0.7)
    return {"response": result}

# Run: uvicorn api:app --host 0.0.0.0 --port 8000
```
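To see why NF4 helps, compare raw weight memory across precisions. This is a rough illustrative estimator of our own (real usage adds KV-cache and activation overhead, and NF4 stores a little more than 4 bits per weight because of block-wise quantization constants):

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: parameters x bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1024**3

phi2 = 2.7e9                              # Phi-2 parameter count
bf16 = weight_memory_gib(phi2, 16)        # roughly 5 GiB of weights
nf4 = weight_memory_gib(phi2, 4.5)        # ~4 bits + quantization constants
```

The roughly 3.5x reduction in weight memory is what lets a quantized 2.7B model fit comfortably under the 6 GB inference figure in the hardware table.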
## Configuration Reference
| LoRA Parameter | Typical Value | Description |
|---|---|---|
| `lora_r` | 8, 16, 32, 64 | LoRA rank (higher = more capacity, larger adapters) |
| `lora_alpha` | 2x `lora_r` | LoRA scaling factor |
| `lora_dropout` | 0.05 | Dropout for LoRA layers |
| `lora_query` | true | Apply LoRA to attention query projection |
| `lora_key` | false | Apply LoRA to attention key projection |
| `lora_value` | true | Apply LoRA to attention value projection |
| `lora_projection` | true | Apply LoRA to attention output projection |
| `lora_mlp` | false | Apply LoRA to MLP layers |
| `lora_head` | false | Apply LoRA to language model head |
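The rank directly sets adapter size: each adapted weight matrix of shape (d_out, d_in) gains two low-rank factors, A (r x d_in) and B (d_out x r). A small sketch of the parameter count (our own arithmetic; the hidden size and layer count below are assumed Phi-2-like figures, not exact LitGPT internals):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted matrix."""
    return r * d_in + d_out * r

# Query + value projections in one attention block, hidden size 2560:
d = 2560
per_layer = lora_params(d, d, 16) * 2   # lora_query and lora_value
total = per_layer * 32                  # assume 32 transformer layers
adapter_mb = total * 2 / 1e6            # bf16 = 2 bytes per parameter
```

Because the count is linear in r, doubling the rank doubles adapter size, which is the trade-off the "higher = more capacity, larger adapters" note describes.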
| Training Parameter | Default | Description |
|---|---|---|
| `train.learning_rate` | 1e-4 (LoRA), 2e-5 (full) | Optimizer learning rate |
| `train.micro_batch_size` | 1-4 | Per-device batch size |
| `train.global_batch_size` | 16-64 | Effective batch size across accumulation steps |
| `train.max_steps` | 1000 | Maximum training steps |
| `train.epochs` | 3 | Number of training epochs |
| `train.gradient_accumulation_iters` | auto | Gradient accumulation steps per optimizer step |
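The accumulation count is derived from the other batch settings: the global batch is assembled from micro-batches across devices. A sketch of the relationship (our own arithmetic mirroring how the `auto` default behaves conceptually, not LitGPT source code):

```python
def accumulation_iters(global_batch_size: int, micro_batch_size: int,
                       devices: int = 1) -> int:
    """Micro-batch forward/backward passes accumulated per optimizer
    step; the global batch must divide evenly across them."""
    per_step = micro_batch_size * devices
    if global_batch_size % per_step:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * devices")
    return global_batch_size // per_step

# 32 global / (4 micro x 1 GPU) -> 8 accumulation iterations
iters = accumulation_iters(32, 4)
```

This is why shrinking `micro_batch_size` to fit VRAM does not change training dynamics: the optimizer still sees the same effective batch, just accumulated over more passes.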
| Hardware Requirement | VRAM Needed |
|---|---|
| Inference (Phi-2, 2.7B) | 6 GB |
| LoRA fine-tuning (7B) | 16 GB |
| Full fine-tuning (7B) | 40+ GB |
| Pretraining (1B) | 24 GB |
## Best Practices
- **Start with LoRA before full fine-tuning**: LoRA achieves 90-95% of full fine-tuning quality at a fraction of the memory cost, making it the default choice for most fine-tuning tasks.
- **Choose the right LoRA rank**: Use `r=8` for lightweight adapters (2-4 MB), `r=16` for standard quality, `r=32` for complex tasks, and `r=64` only when maximum capacity is needed.
- **Use gradient accumulation for large effective batches**: Set `micro_batch_size=1` and increase `gradient_accumulation_iters` to simulate large batches on limited VRAM.
- **Monitor with TensorBoard**: Add `--train.logger_name tensorboard` to any training command and run `tensorboard --logdir out/` to visualize training curves.
- **Validate early and often**: Set `--eval_interval` to check validation loss frequently during training, catching overfitting before wasting compute.
- **Use Alpaca format consistently**: Structure all instruction datasets as `{"instruction", "input", "output"}` JSON for compatibility with LitGPT data loaders.
- **Merge LoRA weights for deployment**: Always merge adapters into the base model before production deployment to avoid the overhead of adapter loading at inference time.
- **Flash Attention is enabled automatically**: LitGPT enables Flash Attention on Ampere+ GPUs (A100, RTX 30/40 series) by default with no configuration needed.
- **Use BFloat16 precision**: Set `--precision bf16-true` for training stability on modern GPUs, avoiding the numerical issues of float16.
- **Test on small models first**: Prototype your data pipeline and training configuration on Phi-2 (2.7B) or Llama-3-1B before scaling to larger models.
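The "validate early and often" advice can be made mechanical: stop once validation loss has plateaued for several evaluations in a row. A stdlib-only sketch of such a check (our own helper, not a LitGPT feature; you would apply it to the losses logged at each `--eval_interval`):

```python
def should_stop(val_losses: list, patience: int = 3) -> bool:
    """Return True when validation loss has not improved over the last
    `patience` evaluations -- the early overfitting signal that frequent
    validation checks are meant to catch."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Loss bottoms out at 1.7 then climbs for three evals -> stop
stop = should_stop([2.0, 1.8, 1.7, 1.75, 1.8, 1.85])
```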
## Troubleshooting
### Out of memory during fine-tuning
Switch from `litgpt finetune` to `litgpt finetune_lora`. If already using LoRA, reduce `micro_batch_size` to 1 and increase `gradient_accumulation_iters`. Reduce `lora_r` from 16 to 8.
### Model not found after download
Run `ls checkpoints/` to verify the directory structure. Model names are case-sensitive: use the exact name from `litgpt download list`.
### Training loss not decreasing
Lower the learning rate (try 3e-5 for full fine-tuning, 5e-5 for LoRA). Verify your dataset format matches the expected Alpaca JSON structure. Check that the dataset is not empty or corrupt.
### LoRA adapters too large
Reduce `lora_r` to 8. Disable LoRA on less critical layers: set `--lora_projection false` and `--lora_mlp false`.
### Slow training speed
Ensure PyTorch 2.0+ is installed for automatic compilation. Verify GPU utilization with `nvidia-smi`. Set `--train.micro_batch_size` as large as VRAM allows.
### Generation quality poor after fine-tuning
Train for more steps or epochs. Increase the dataset size (at least 1,000 examples is recommended). Lower the temperature during inference (0.5-0.7).
### FSDP multi-GPU errors
Ensure all GPUs have the same VRAM. Use `torchrun --nproc_per_node=N` for multi-GPU launches. Check NCCL connectivity with `NCCL_DEBUG=INFO`.