
LitGPT Model Architecture Toolkit

Overview

LitGPT is a comprehensive toolkit from Lightning AI that provides clean, hackable implementations of 20+ large language model architectures with production-ready training, fine-tuning, and deployment workflows. Unlike monolithic frameworks, LitGPT prioritizes readable code and modular design, making it ideal for researchers who want to understand model internals while having access to battle-tested training recipes. It supports the full lifecycle from pretraining through LoRA fine-tuning to quantized deployment, with first-class support for multi-GPU training via FSDP.

When to Use

  • Understanding LLM architectures: You want clean, readable implementations of GPT, LLaMA, Gemma, Phi, Mistral, and other model families
  • Fine-tuning with LoRA: You need memory-efficient adapter-based fine-tuning on a single GPU (12-16 GB VRAM)
  • Full fine-tuning: You have sufficient hardware (40+ GB VRAM) and want to update all model weights
  • Pretraining from scratch: You want to train a new language model on custom domain data
  • Model deployment: You need to convert, quantize, and serve LLM checkpoints with a simple API
  • Educational purposes: You are learning transformer architectures and need clear reference implementations

Choose alternatives when: you need the broadest model support (HuggingFace Transformers), maximum distributed training performance at 70B+ scale (Megatron-Core), or inference-only serving (vLLM).

Quick Start

```bash
# Install LitGPT with all extras
pip install 'litgpt[extra]'

# List all available models
litgpt download list

# Download a model
litgpt download microsoft/phi-2

# Run inference
python -c "
from litgpt import LLM
llm = LLM.load('microsoft/phi-2')
print(llm.generate('Explain quantum computing in simple terms:', max_new_tokens=100))
"
```

Core Concepts

Model Loading and Inference

```python
from litgpt import LLM

# Load any supported model
llm = LLM.load("microsoft/phi-2")

# Basic generation
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7,
    top_k=40,
)
print(result)

# Streaming generation
for token in llm.generate("Explain neural networks:", stream=True):
    print(token, end="", flush=True)

# Batch inference
prompts = [
    "Translate to French: Hello",
    "Summarize: Machine learning is...",
    "Write a haiku about coding",
]
results = [llm.generate(p, max_new_tokens=50) for p in prompts]
```

LoRA Fine-Tuning (Single GPU)

The most memory-efficient approach for customizing a pretrained model.

```bash
# Step 1: Download base model
litgpt download microsoft/phi-2

# Step 2: Prepare Alpaca-format dataset
# data/my_dataset.json should contain:
# [{"instruction": "...", "input": "", "output": "..."}, ...]

# Step 3: Run LoRA fine-tuning
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_query true \
  --lora_value true \
  --lora_projection true \
  --train.epochs 3 \
  --train.learning_rate 1e-4 \
  --train.micro_batch_size 4 \
  --train.global_batch_size 32 \
  --out_dir out/phi2-lora

# Step 4: Merge LoRA adapters into base model
litgpt merge_lora out/phi2-lora/final --out_dir out/phi2-merged
```

```python
# Step 5: Use the fine-tuned model
from litgpt import LLM

llm = LLM.load("out/phi2-merged")
print(llm.generate("Your domain-specific prompt here"))
```
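For Step 2, a minimal sketch of writing an Alpaca-format dataset file. The two records are made-up placeholders, and the filename is illustrative (the fine-tuning command above expects it under `data/`):

```python
import json

# Hypothetical example records in the Alpaca schema the JSON loader expects:
# each entry has "instruction", "input" (may be empty), and "output".
records = [
    {
        "instruction": "Classify the sentiment of this review.",
        "input": "The battery life is fantastic.",
        "output": "positive",
    },
    {
        "instruction": "Write a one-line summary of LoRA.",
        "input": "",
        "output": "LoRA fine-tunes a model by training small low-rank adapter matrices.",
    },
]

with open("my_dataset.json", "w") as f:
    json.dump(records, f, indent=2)

# Sanity-check that the file round-trips and every record has the required keys.
loaded = json.load(open("my_dataset.json"))
assert all({"instruction", "input", "output"} <= set(r) for r in loaded)
print(len(loaded))  # → 2
```

A quick check like this before launching a training run is cheap insurance against the "dataset format" problems listed under Troubleshooting.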

Full Fine-Tuning

```bash
# Requires a 40 GB+ GPU for 7B models
litgpt finetune \
  meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --train.max_steps 1000 \
  --train.learning_rate 2e-5 \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16
```

Pretraining from Scratch

```bash
# Step 1: Prepare tokenized data
python scripts/prepare_dataset.py \
  --source_path data/my_corpus.txt \
  --checkpoint_dir checkpoints/tokenizer \
  --destination_path data/pretrain \
  --split train,val

# Step 2: Configure architecture
# config/custom-model.yaml
#   model_name: custom-160m
#   block_size: 2048
#   vocab_size: 50304
#   n_layer: 12
#   n_head: 12
#   n_embd: 768

# Step 3: Launch pretraining
litgpt pretrain \
  --config config/custom-model.yaml \
  --data.data_dir data/pretrain \
  --train.max_tokens 10_000_000_000

# Multi-GPU with FSDP
litgpt pretrain \
  --config config/custom-model.yaml \
  --data.data_dir data/pretrain \
  --devices 8 \
  --train.max_tokens 100_000_000_000
```
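As a rough sanity check on `--train.max_tokens`, the common "Chinchilla" rule of thumb suggests about 20 training tokens per model parameter. This is a heuristic, not a LitGPT requirement, and the arithmetic below is just that heuristic applied to the 160M-parameter config sketched above:

```python
# Rough compute-optimal token budget, assuming the ~20 tokens-per-parameter
# heuristic from compute-optimal scaling studies.
def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    return n_params * tokens_per_param

# For a 160M-parameter model:
budget = chinchilla_tokens(160_000_000)
print(budget)  # → 3200000000 (3.2B tokens)

# The example command's 10B-token budget exceeds this several times over,
# which is common when over-training a small model for inference quality.
assert budget == 3_200_000_000
```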

Quantization and Deployment

```python
# Quantize model for smaller deployment:
#   litgpt convert_lit_checkpoint out/phi2-lora/final --quantize bnb.nf4

# Deploy with FastAPI
from fastapi import FastAPI

from litgpt import LLM

app = FastAPI()
llm = LLM.load("out/phi2-merged")

@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
    result = llm.generate(prompt, max_new_tokens=max_tokens, temperature=0.7)
    return {"response": result}

# Run: uvicorn api:app --host 0.0.0.0 --port 8000
```

Configuration Reference

| LoRA Parameter | Typical Value | Description |
| --- | --- | --- |
| lora_r | 8, 16, 32, 64 | LoRA rank (higher = more capacity, larger adapters) |
| lora_alpha | 2× lora_r | LoRA scaling factor |
| lora_dropout | 0.05 | Dropout for LoRA layers |
| lora_query | true | Apply LoRA to attention query projection |
| lora_key | false | Apply LoRA to attention key projection |
| lora_value | true | Apply LoRA to attention value projection |
| lora_projection | true | Apply LoRA to output projection |
| lora_mlp | false | Apply LoRA to MLP layers |
| lora_head | false | Apply LoRA to language model head |
| Training Parameter | Default | Description |
| --- | --- | --- |
| train.learning_rate | 1e-4 (LoRA), 2e-5 (full) | Optimizer learning rate |
| train.micro_batch_size | 1-4 | Per-device batch size |
| train.global_batch_size | 16-64 | Effective batch size across accumulation steps |
| train.max_steps | 1000 | Maximum training steps |
| train.epochs | 3 | Number of training epochs |
| train.gradient_accumulation_iters | auto | Gradient accumulation steps |
| Hardware Requirement | VRAM Needed |
| --- | --- |
| Inference (Phi-2, 2.7B) | 6 GB |
| LoRA fine-tuning (7B) | 16 GB |
| Full fine-tuning (7B) | 40+ GB |
| Pretraining (1B) | 24 GB |
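To see how lora_r drives adapter size, a back-of-the-envelope count: a LoRA adapter on a d_in×d_out linear layer adds r·(d_in + d_out) trainable parameters. The layer count and hidden size below are illustrative round numbers, not any specific model's real config:

```python
def lora_param_count(r: int, d_in: int, d_out: int) -> int:
    # LoRA factors the weight update as B @ A, with A: (r, d_in)
    # and B: (d_out, r) -> r*d_in + d_out*r trainable parameters.
    return r * d_in + d_out * r

# Illustrative model: 32 layers, hidden size 2560, LoRA applied to the
# square query/value/output projections (assumed dims, for the arithmetic only).
n_layer, d = 32, 2560
adapted_matrices_per_layer = 3  # lora_query, lora_value, lora_projection

total = n_layer * adapted_matrices_per_layer * lora_param_count(16, d, d)
print(total)  # → 7864320 trainable parameters

# At 2 bytes/param (bf16) that is roughly 15 MB of adapter weights;
# the count scales linearly with r, so halving r halves the adapter.
assert lora_param_count(8, d, d) * 2 == lora_param_count(16, d, d)
```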

Best Practices

  1. Start with LoRA before full fine-tuning: LoRA achieves 90-95% of full fine-tuning quality at a fraction of the memory cost, making it the default choice for most fine-tuning tasks.

  2. Choose the right LoRA rank: Use r=8 for lightweight adapters (2-4 MB), r=16 for standard quality, r=32 for complex tasks, and r=64 only when maximum capacity is needed.

  3. Use gradient accumulation for large effective batches: Set micro_batch_size=1 and increase gradient_accumulation_iters to simulate large batches on limited VRAM.

  4. Monitor with TensorBoard: Add --train.logger_name tensorboard to any training command and run tensorboard --logdir out/ to visualize training curves.

  5. Validate early and often: Set --eval_interval to check validation loss frequently during training, catching overfitting before wasting compute.

  6. Use Alpaca format consistently: Structure all instruction datasets as {"instruction", "input", "output"} JSON for compatibility with LitGPT data loaders.

  7. Merge LoRA weights for deployment: Always merge adapters into the base model before production deployment to avoid the overhead of adapter loading at inference time.

  8. Rely on automatic Flash Attention: LitGPT enables Flash Attention on Ampere and newer GPUs (A100, RTX 30/40 series) by default, with no configuration needed.

  9. Use BFloat16 precision: Set --precision bf16-true for training stability on modern GPUs, avoiding the numerical issues of float16.

  10. Test on small models first: Prototype your data pipeline and training configuration on Phi-2 (2.7B) or Llama-3-1B before scaling to larger models.
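The arithmetic behind practice 3 is simple: effective batch = micro batch × devices × accumulation steps, so the accumulation count is their quotient. A sketch of that bookkeeping (the formula mirrors the usual convention; treat the exact LitGPT internals as an assumption):

```python
def accumulation_iters(global_batch: int, micro_batch: int, devices: int = 1) -> int:
    # Effective batch = micro_batch * devices * accumulation steps,
    # so accumulation steps = global_batch / (micro_batch * devices).
    per_step = micro_batch * devices
    assert global_batch % per_step == 0, "global batch must divide evenly"
    return global_batch // per_step

# Single limited-VRAM GPU: micro_batch_size=1, accumulate up to 32.
print(accumulation_iters(32, 1))     # → 32
# The LoRA example above (micro_batch_size=4, global_batch_size=32):
print(accumulation_iters(32, 4))     # → 8
# Eight GPUs with micro batch 2 and a global batch of 64:
print(accumulation_iters(64, 2, 8))  # → 4
```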

Troubleshooting

Out of memory during fine-tuning
Switch from litgpt finetune to litgpt finetune_lora. If already using LoRA, reduce micro_batch_size to 1 and increase gradient_accumulation_iters. Reduce lora_r from 16 to 8.

Model not found after download
Run ls checkpoints/ to verify the directory structure. Model names are case-sensitive: use the exact name from litgpt download list.

Training loss not decreasing
Lower the learning rate (try 3e-5 for full fine-tuning, 5e-5 for LoRA). Verify your dataset format matches the expected Alpaca JSON structure. Check that the dataset is not empty or corrupt.

LoRA adapters too large
Reduce lora_r to 8. Disable LoRA on less critical layers: set lora_projection false and lora_mlp false.

Slow training speed
Ensure PyTorch 2.0+ is installed for automatic compilation. Verify GPU utilization with nvidia-smi. Use --train.micro_batch_size as large as VRAM allows.

Generation quality poor after fine-tuning
Train for more steps or epochs. Increase dataset size (minimum 1000 examples recommended). Lower temperature during inference (0.5-0.7).

FSDP multi-GPU errors
Ensure all GPUs have the same VRAM. Use torchrun --nproc_per_node=N for multi-GPU launches. Check NCCL connectivity with NCCL_DEBUG=INFO.
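The temperature advice above can be made concrete: sampling divides the logits by the temperature before the softmax, so T < 1 sharpens the distribution toward the top token while T > 1 flattens it. A standalone sketch of that effect, independent of LitGPT:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/T before the softmax; T < 1 sharpens the
    # distribution, T > 1 flattens it. Subtract the max for stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]

sharp = softmax_with_temperature(logits, 0.5)  # closer to greedy decoding
flat = softmax_with_temperature(logits, 1.5)   # more diverse sampling

# Lower temperature concentrates probability mass on the top token.
assert max(sharp) > max(flat)
print(round(max(sharp), 2), round(max(flat), 2))  # → 0.84 0.53
```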
