Model Architecture Complete
Powerful skill covering state-space models, transformers, and model complexity tradeoffs. Includes structured workflows, validation checks, and reusable patterns for AI research.
Complete Model Architecture Guide
Overview
Selecting and implementing the right model architecture is one of the most consequential decisions in any deep learning project. This guide covers the landscape of modern neural network architectures, from transformer-based large language models to efficient alternatives such as state-space models, along with practical considerations for architecture selection, modification, and deployment. Whether you are pretraining from scratch, fine-tuning an existing model, or designing a custom architecture for a specific domain, this guide provides the decision frameworks, implementation patterns, and operational knowledge needed to make informed choices.
When to Use
- Architecture selection: You are starting a new project and need to choose between transformers, SSMs, RNNs, or hybrid approaches based on your requirements
- Model customization: You want to modify an existing architecture (add layers, change attention patterns, adjust dimensions) for your specific use case
- Pretraining planning: You are planning a pretraining run and need to understand how architecture choices affect compute cost, memory, and downstream performance
- Fine-tuning strategy: You need to decide between full fine-tuning, LoRA, QLoRA, prefix tuning, or other parameter-efficient methods
- Performance optimization: You need to understand the compute and memory tradeoffs of different architectural components
- Scaling decisions: You are deciding model size, depth, width, and context length based on available hardware
Quick Start
```bash
# Install LitGPT for clean architecture implementations
pip install 'litgpt[extra]'

# Explore available architectures
litgpt download list

# Download and inspect a model
litgpt download microsoft/phi-2
python -c "
from litgpt import LLM
llm = LLM.load('microsoft/phi-2')
print(llm.generate('Hello world', max_new_tokens=20))
"
```
```python
# Or use HuggingFace for the broadest model support
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Inspect architecture
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
```
Core Concepts
Architecture Families
Transformer-based (Decoder-only)
The dominant architecture for generative language models. All tokens attend to all previous tokens via causal self-attention.
```python
# Key components of a transformer block (schematic)
class TransformerBlock:
    """
    Input -> LayerNorm -> Multi-Head Attention -> Residual
          -> LayerNorm -> MLP -> Residual
    """
    def forward(self, x):
        # Pre-norm architecture (modern standard)
        h = x + self.attention(self.ln1(x))
        out = h + self.mlp(self.ln2(h))
        return out
```
Major model families:
- LLaMA / Llama 3: RoPE embeddings, SwiGLU activation, GQA attention
- GPT / GPT-NeoX: Learned position embeddings, GELU activation
- Phi: Small but capable, optimized training data
- Mistral / Mixtral: Sliding window attention, Mixture of Experts
- Gemma: Google's open models with multi-query attention
- Qwen: Alibaba's multilingual models with long context
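The schematic block sketched earlier can be made runnable. Below is a minimal pre-norm decoder block using `torch.nn.MultiheadAttention`, purely for illustration: real LLM blocks typically use RMSNorm, RoPE, and fused attention kernels, and the dimensions here are arbitrary.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Minimal pre-norm transformer block (illustrative; no RoPE, no KV cache)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        # Causal mask: True entries are masked, so each token sees only the past
        n = x.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                      # residual around attention
        return x + self.mlp(self.ln2(x))      # residual around MLP

x = torch.randn(2, 16, 64)   # (batch, seq, d_model)
y = PreNormBlock()(x)
print(y.shape)               # same shape as the input
```

The block preserves the input shape, which is what lets dozens of such blocks be stacked.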
State-Space Models (SSMs)
Linear-time alternatives to transformers for sequence modeling.
```python
# RWKV: combines transformer training with RNN inference
# Key advantage: O(1) memory per token during inference
from rwkv.model import RWKV

model = RWKV(model='RWKV-6-World-7B', strategy='cuda fp16')

# Process tokens sequentially with constant memory
state = None
for token in token_sequence:
    out, state = model.forward([token], state)
    # state size is constant regardless of sequence length
```
- RWKV: RNN-transformer hybrid, linear time complexity, infinite context
- Mamba: Selective state-space model, hardware-aware design
- RetNet: Retention mechanism for parallel training, recurrent inference
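To see why recurrent inference needs only constant memory, consider a toy scalar state-space recurrence. This is purely illustrative: real SSMs like Mamba use vector-valued states and input-dependent (selective) parameters, but the O(1)-state property is the same.

```python
# Toy linear state-space model: h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t
# The entire "memory" of the past is the single state h, however long the input.
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # O(1) state update per token
        ys.append(c * h)
    return ys

print(ssm_scan([1.0, 0.0, 0.0]))  # impulse response decays geometrically
```

Contrast with attention, where generating token t requires keys and values for all t-1 previous tokens.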
Encoder-Decoder
Bidirectional encoding with autoregressive decoding, primarily for translation and summarization.
```python
# T5-style architecture
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")
# Encoder sees full input bidirectionally
# Decoder generates output autoregressively
```
Attention Mechanisms
```python
import torch
import torch.nn.functional as F

# Standard Multi-Head Attention
def multi_head_attention(Q, K, V, num_heads):
    """Standard O(n^2) attention."""
    d_k = Q.size(-1) // num_heads
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, V)

# Grouped-Query Attention (GQA) - used in Llama 3, Mistral
# Reduces KV cache size by sharing key-value heads
def grouped_query_attention(Q, K, V, num_q_heads, num_kv_heads):
    """GQA: num_kv_heads < num_q_heads, KV heads are shared."""
    group_size = num_q_heads // num_kv_heads
    K = K.repeat_interleave(group_size, dim=1)
    V = V.repeat_interleave(group_size, dim=1)
    return multi_head_attention(Q, K, V, num_q_heads)

# Multi-Query Attention (MQA) - used in Falcon, PaLM
# A single KV head shared across all query heads:
# equivalent to GQA with num_kv_heads=1
```
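The practical payoff of GQA and MQA is a smaller KV cache. A quick back-of-envelope calculator makes the savings concrete; the Llama-3.1-8B-style numbers below (32 layers, head dim 128, 8 KV heads vs 32 query heads, 8K context, fp16) are used for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of the KV cache in bytes; leading 2 accounts for both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")  # 4x smaller cache
```

With 8 KV heads instead of 32, the cache shrinks from 4 GiB to 1 GiB per sequence, which directly translates into larger batch sizes at serving time.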
Position Encoding Strategies
```python
import torch

# Rotary Position Embeddings (RoPE) - LLaMA, Mistral, Qwen
def apply_rotary_emb(x, cos, sin):
    """RoPE: encodes position via rotation in embedding space."""
    x_rot = torch.stack([-x[..., 1::2], x[..., ::2]], dim=-1).flatten(-2)
    return x * cos + x_rot * sin

# ALiBi (Attention with Linear Biases) - BLOOM, MPT
def alibi_bias(seq_len, num_heads):
    """ALiBi: adds distance-based bias to attention scores."""
    # Head-specific slopes 2^(-8h/H) for h = 1..H
    slopes = torch.tensor([2 ** (-8 * (i + 1) / num_heads) for i in range(num_heads)])
    positions = torch.arange(seq_len)
    bias = -slopes[:, None, None] * (positions[None, None, :] - positions[None, :, None]).abs()
    return bias

# Learned absolute position embeddings - GPT-2, BERT
class LearnedPositionEncoding(torch.nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.embedding = torch.nn.Embedding(max_len, d_model)
```
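A sketch of how a RoPE helper like the one above is typically driven: build per-position cos/sin tables, then rotate. The interleaved-pair convention and `base=10000.0` follow common practice, but exact shapes and conventions vary between implementations. Two properties are worth checking: rotation preserves vector norms, and position 0 is the identity rotation.

```python
import torch

def apply_rotary_emb(x, cos, sin):
    # Pairs (x0, x1) are rotated: (x0*cos - x1*sin, x1*cos + x0*sin)
    x_rot = torch.stack([-x[..., 1::2], x[..., ::2]], dim=-1).flatten(-2)
    return x * cos + x_rot * sin

def rope_cos_sin(seq_len, head_dim, base=10000.0):
    # One rotation angle per dimension pair; repeat so both pair members share it
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    angles = angles.repeat_interleave(2, dim=-1)  # (seq_len, head_dim)
    return angles.cos(), angles.sin()

x = torch.randn(8, 16)          # (seq_len, head_dim)
cos, sin = rope_cos_sin(8, 16)
y = apply_rotary_emb(x, cos, sin)
print(torch.allclose(x.norm(dim=-1), y.norm(dim=-1), atol=1e-5))  # norms preserved
print(torch.allclose(x[0], y[0], atol=1e-6))                      # position 0 unchanged
```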
Model Scaling Decisions
```python
# Chinchilla scaling laws: optimal ratio of parameters to training tokens
# For compute budget C:
#   optimal parameters N ~ C^0.5
#   optimal tokens     D ~ C^0.5
#   ratio D/N ~ 20 (train on ~20 tokens per parameter)

# Example scaling table
scaling_configs = {
    "125M": {"n_layer": 12, "n_head": 12, "n_embd": 768, "tokens": "2.5B"},
    "350M": {"n_layer": 24, "n_head": 16, "n_embd": 1024, "tokens": "7B"},
    "1.3B": {"n_layer": 24, "n_head": 32, "n_embd": 2048, "tokens": "26B"},
    "6.7B": {"n_layer": 32, "n_head": 32, "n_embd": 4096, "tokens": "134B"},
    "13B": {"n_layer": 40, "n_head": 40, "n_embd": 5120, "tokens": "260B"},
    "70B": {"n_layer": 80, "n_head": 64, "n_embd": 8192, "tokens": "1.4T"},
}
```
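Since training compute is roughly C ≈ 6·N·D FLOPs, the D/N ≈ 20 rule pins down a compute-optimal size directly. A small helper, treating both the 6·N·D approximation and the ratio of 20 as the usual rules of thumb rather than exact laws:

```python
import math

def chinchilla_optimal(flops, tokens_per_param=20.0):
    """Compute-optimal params N and tokens D for budget C, using C ~ 6*N*D."""
    # With D = 20*N: C = 6*N*(20*N) = 120*N^2  =>  N = sqrt(C/120)
    n = math.sqrt(flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

n, d = chinchilla_optimal(1e21)
print(f"~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")  # roughly a 2.9B model on 58B tokens
```

This gives a starting point; in practice, inference-cost considerations often push toward smaller models trained well past the compute-optimal token count.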
Parameter-Efficient Fine-Tuning (PEFT)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# LoRA configuration
lora_config = LoraConfig(
    r=16,                                 # Rank of adaptation matrices
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # Typically 0.1-1% of total parameters

# QLoRA: LoRA + 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
)
model = get_peft_model(model, lora_config)
# Fine-tune an 8B model on a 16GB GPU
```
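To sanity-check the "0.1-1% trainable" claim, you can count LoRA parameters directly: each adapted weight of shape (d_out, d_in) gains two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r). The layer shapes below are Llama-3.1-8B-style (hidden size 4096, GQA `v_proj` output 1024, 32 layers) and are used for illustration only.

```python
def lora_param_count(d_in, d_out, r):
    # A: (r, d_in) and B: (d_out, r) -> r * (d_in + d_out) extra trainable params
    return r * (d_in + d_out)

hidden, layers, r = 4096, 32, 16
per_layer = lora_param_count(hidden, 4096, r)   # q_proj: 4096 -> 4096
per_layer += lora_param_count(hidden, 1024, r)  # v_proj: 4096 -> 1024 (GQA)
total = layers * per_layer
print(f"{total / 1e6:.1f}M trainable params")   # ~6.8M, under 0.1% of an 8B model
```

Targeting more modules (`k_proj`, `o_proj`, the MLP projections) or raising `r` scales this count linearly, which is why the trainable fraction spans roughly 0.1-1%.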
Configuration Reference
| Architecture Feature | Transformer | RWKV | Mamba |
|---|---|---|---|
| Training complexity | O(n^2) | O(n) | O(n) |
| Inference per token | O(n) KV cache | O(1) constant | O(1) constant |
| Memory (inference) | Grows with context | Constant | Constant |
| Parallelizable training | Yes | Yes | Yes |
| Long context | Limited by memory | Unlimited | Unlimited |
| Ecosystem maturity | Highest | Growing | Growing |

| PEFT Method | Trainable Params | Memory Savings | Quality |
|---|---|---|---|
| Full fine-tuning | 100% | None | Highest |
| LoRA (r=16) | 0.1-1% | 60-80% | Near-full |
| QLoRA | 0.1-1% + 4-bit base | 80-90% | Good |
| Prefix tuning | <0.1% | 90%+ | Moderate |
| IA3 | <0.01% | 95%+ | Lower |
Best Practices
- Follow Chinchilla scaling laws: Train on approximately 20 tokens per parameter for compute-optimal results. A model trained on too few tokens wastes parameters; a model too small for its token budget wastes compute.
- Use GQA over MHA for inference efficiency: Grouped-Query Attention shrinks the KV cache in proportion to the reduction in KV heads, directly improving inference throughput.
- Start with LoRA for fine-tuning: Default to LoRA (r=16) for any fine-tuning task, and move to full fine-tuning only when LoRA quality is demonstrably insufficient for your use case.
- Prefer pre-norm over post-norm: Pre-normalization (LayerNorm before attention/MLP) gives more stable training dynamics and is the standard in modern architectures.
- Use RoPE for position encoding: Rotary Position Embeddings support context-length extrapolation and are the default choice for new transformer architectures.
- Use SwiGLU activation in MLPs: SwiGLU provides better quality than GELU or ReLU at marginal extra compute cost, and is used in LLaMA, Mistral, and most modern models.
- Consider SSMs for very long sequences: If your application requires 100K+ token contexts with constant memory, evaluate RWKV or Mamba before defaulting to transformers.
- Profile before optimizing: Measure actual throughput and memory usage before making architecture changes; use `torch.profiler` or `nsys` to identify real bottlenecks.
- Match architecture to data scale: Small datasets (under 10K examples) benefit more from PEFT on large models than from training smaller models from scratch.
- Test context length at inference: Validate model quality at the actual context lengths your application uses, not just the maximum supported length.
Troubleshooting
Model quality degrades at long context lengths
Check position-encoding support for the target length. RoPE models may need NTK-aware scaling or YaRN for lengths beyond training. Evaluate perplexity at the target length.
Out of memory during training
Apply gradient checkpointing (`model.gradient_checkpointing_enable()`). Switch to LoRA or QLoRA. Reduce batch size and increase gradient accumulation. Use FSDP for multi-GPU sharding.
LoRA fine-tuning not matching full fine-tuning quality
Increase LoRA rank from 16 to 32 or 64. Target more modules (add `k_proj` and the MLP layers). Increase training steps. Verify the learning rate is appropriate (typically 1e-4 for LoRA).
KV cache consuming too much memory during inference
Switch from MHA to GQA or MQA attention. Use a quantized KV cache (FP8). Reduce the maximum context length. Consider paged attention (vLLM).
Training loss spikes or diverges
Reduce the learning rate. Add warmup steps (5-10% of total steps). Check for data-quality issues (corrupted examples, extreme length outliers). Use gradient clipping (`max_norm=1.0`).
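The warmup-plus-clipping recipe can be sketched in a few lines. Linear warmup into cosine decay is one common schedule, not the only one, and the step counts below are illustrative.

```python
import math
import torch

model = torch.nn.Linear(16, 16)  # stand-in for a real model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
total_steps, warmup_steps = 1000, 100  # ~10% warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                        # linear warmup from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay to 0

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

# One training step with gradient clipping
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clamp spike gradients
opt.step()
sched.step()
```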
State-space model quality not matching transformers
SSMs may underperform transformers on tasks requiring precise long-range retrieval. Consider hybrid architectures that combine SSM layers with attention layers.
Multi-GPU training hangs or crashes
Verify the NCCL installation and GPU interconnect. Set `NCCL_DEBUG=INFO` for diagnostics. Ensure all GPUs have identical VRAM. Check that the batch size is divisible by the device count.