Smart Model Architecture Selection
Overview
Choosing the right model architecture is the single most impactful decision in any machine learning project, yet it is often made based on hype rather than systematic analysis. This guide provides a structured decision framework for selecting neural network architectures based on your specific requirements: task type, data characteristics, hardware constraints, latency budgets, and deployment targets. It covers the complete landscape from transformer variants through state-space models, mixture-of-experts, and hybrid approaches, with practical decision trees, benchmark comparisons, and implementation patterns for each architecture family.
When to Use
- Starting a new ML project: You need a systematic framework for choosing between architectures rather than defaulting to the most popular option
- Evaluating architecture alternatives: You want to compare transformers vs. SSMs vs. hybrid models for your specific workload
- Hardware-constrained deployment: You need to select architectures that fit within specific VRAM, latency, or throughput budgets
- Scaling decisions: You are deciding model size, depth, and width based on your data volume and compute budget
- Fine-tuning strategy: You need to choose between full fine-tuning, LoRA, QLoRA, or other PEFT methods based on your hardware and quality requirements
- Long-context applications: You need to evaluate architectures that handle extended sequences efficiently
Quick Start
```python
# Architecture selection decision tree (pseudocode)
def select_architecture(requirements):
    if requirements.task == "text_generation":
        if requirements.context_length > 100_000:
            if requirements.memory_budget == "limited":
                return "RWKV or Mamba"  # O(1) memory per token
            else:
                return "Transformer + RoPE + NTK scaling"
        elif requirements.quality == "maximum":
            return "Transformer (LLaMA/Mistral family)"
        elif requirements.training_compute == "minimal":
            return "Start with pretrained, LoRA fine-tune"
    elif requirements.task == "classification":
        if requirements.data_size < 10_000:
            return "Pretrained encoder + linear head"
        else:
            return "Fine-tuned transformer encoder"
    elif requirements.task == "multimodal":
        return "Vision encoder + LLM bridge (LLaVA-style)"
    return "Default: Transformer with LoRA fine-tuning"
```
```shell
# Quick evaluation setup
pip install 'litgpt[extra]' transformers accelerate

# Test a small model on your task
python -c "
from litgpt import LLM
llm = LLM.load('microsoft/phi-2')  # 2.7B, good for prototyping
print(llm.generate('Your task-specific prompt here', max_new_tokens=100))
"
```
Core Concepts
Architecture Decision Matrix
```python
# Decision factors and weights for architecture selection
decision_factors = {
    "quality": {
        "weight": 0.3,
        "transformers": 10,       # Best quality on most benchmarks
        "rwkv": 8,                # Competitive, improving rapidly
        "mamba": 8,               # Strong on language tasks
        "mixture_of_experts": 9,  # Excellent for diverse tasks
    },
    "inference_speed": {
        "weight": 0.2,
        "transformers": 6,        # KV cache overhead
        "rwkv": 9,                # O(1) per token
        "mamba": 9,               # O(1) per token
        "mixture_of_experts": 7,  # Router overhead, but sparse
    },
    "memory_efficiency": {
        "weight": 0.2,
        "transformers": 5,        # O(n) KV cache
        "rwkv": 10,               # O(1) state
        "mamba": 10,              # O(1) state
        "mixture_of_experts": 4,  # Large total params
    },
    "ecosystem_maturity": {
        "weight": 0.15,
        "transformers": 10,       # HuggingFace, vLLM, etc.
        "rwkv": 6,                # Growing community
        "mamba": 6,               # Growing community
        "mixture_of_experts": 7,  # Good support
    },
    "training_efficiency": {
        "weight": 0.15,
        "transformers": 7,        # O(n^2) but well-optimized
        "rwkv": 8,                # O(n) parallelizable
        "mamba": 8,               # O(n) with hardware-aware design
        "mixture_of_experts": 6,  # Load balancing complexity
    },
}
```
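The matrix above can be collapsed into one weighted score per architecture. A minimal sketch, using the illustrative weights and scores from the table (they are judgment calls, not measurements):

```python
def weighted_scores(factors):
    """Collapse {factor: {"weight": w, arch: score, ...}} into one score per arch."""
    scores = {}
    for entries in factors.values():
        weight = entries["weight"]
        for arch, value in entries.items():
            if arch == "weight":
                continue  # skip the weight key itself
            scores[arch] = scores.get(arch, 0.0) + weight * value
    return scores

# Abridged copy of the decision_factors table above
factors = {
    "quality":             {"weight": 0.30, "transformers": 10, "rwkv": 8,  "mamba": 8,  "mixture_of_experts": 9},
    "inference_speed":     {"weight": 0.20, "transformers": 6,  "rwkv": 9,  "mamba": 9,  "mixture_of_experts": 7},
    "memory_efficiency":   {"weight": 0.20, "transformers": 5,  "rwkv": 10, "mamba": 10, "mixture_of_experts": 4},
    "ecosystem_maturity":  {"weight": 0.15, "transformers": 10, "rwkv": 6,  "mamba": 6,  "mixture_of_experts": 7},
    "training_efficiency": {"weight": 0.15, "transformers": 7,  "rwkv": 8,  "mamba": 8,  "mixture_of_experts": 6},
}
print(weighted_scores(factors))
```

With these particular weights the SSMs edge out transformers; shifting weight toward quality and ecosystem maturity flips the ranking, which is exactly why making the weights explicit is useful.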
Model Size Selection
```python
# Chinchilla-optimal scaling: train on ~20 tokens per parameter
# Llama-style over-training: 100-300+ tokens per parameter
# (costs more training compute, but yields better quality per inference FLOP)
scaling_guide = {
    # Format: (params, recommended_tokens, use_case)
    "tiny":   ("125M",   "2.5B-25B", "Education, rapid prototyping"),
    "small":  ("1-3B",   "50B-300B", "On-device, edge deployment"),
    "medium": ("7-8B",   "1T-2T",    "General purpose, single GPU serving"),
    "large":  ("13B",    "2T-4T",    "High quality, multi-GPU serving"),
    "xl":     ("34-70B", "4T-15T",   "Maximum quality, cluster serving"),
}

# Rule of thumb for fine-tuning data requirements
fine_tuning_data = {
    "LoRA": "1K-100K examples",           # Task-specific adaptation
    "Full fine-tune": "10K-1M examples",  # Significant behavior change
    "Continued pretrain": "1B+ tokens",   # Domain adaptation
}
```
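The 20-tokens-per-parameter rule combines with the standard dense-transformer estimate of ~6 FLOPs per parameter per token into a quick budgeting helper. A sketch; real budgets should also account for data quality and any over-training target:

```python
def chinchilla_tokens(params):
    """Chinchilla-optimal token count: ~20 training tokens per parameter."""
    return 20 * params

def training_flops(params, tokens):
    """Standard dense-transformer estimate: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

# Example: budgeting a 7B-parameter pretraining run
n = 7_000_000_000
d = chinchilla_tokens(n)  # 140B tokens
print(f"tokens: {d:.2e}, FLOPs: {training_flops(n, d):.2e}")
```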
Transformer Variant Comparison
```python
# Key architectural choices in modern transformers
architectures = {
    "LLaMA-3": {
        "attention": "GQA (8 KV heads for 32 Q heads)",
        "position": "RoPE",
        "activation": "SwiGLU",
        "normalization": "RMSNorm (pre-norm)",
        "context": "8K-128K",
        "strengths": "Best open-source quality, strong ecosystem",
    },
    "Mistral-7B": {
        "attention": "GQA + Sliding Window (4096)",
        "position": "RoPE",
        "activation": "SwiGLU",
        "normalization": "RMSNorm (pre-norm)",
        "context": "32K",
        "strengths": "Excellent quality/size ratio, sliding window efficiency",
    },
    "Mixtral-8x7B": {
        "attention": "GQA",
        "position": "RoPE",
        "activation": "SwiGLU + MoE (top-2 of 8 experts)",
        "normalization": "RMSNorm",
        "context": "32K",
        "strengths": "47B total params, 13B active per token",
    },
    "Phi-3": {
        "attention": "MHA with Flash Attention",
        "position": "RoPE (LongRoPE for 128K)",
        "activation": "SwiGLU",
        "normalization": "RMSNorm",
        "context": "4K-128K",
        "strengths": "Small but capable, optimized training data",
    },
    "Qwen-2.5": {
        "attention": "GQA",
        "position": "RoPE (YaRN for extrapolation)",
        "activation": "SwiGLU",
        "normalization": "RMSNorm",
        "context": "32K-128K",
        "strengths": "Strong multilingual, code, and math",
    },
}
```
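The attention choices above matter mostly because of KV-cache size: GQA shrinks the cache in proportion to the number of KV heads. A back-of-envelope estimator (the LLaMA-3-8B numbers — 32 layers, 8 KV heads, head dim 128 — are its published config; the rest is illustrative):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes held by the KV cache for one batch of sequences.
    Leading 2 accounts for both K and V; bytes_per_elem=2 assumes FP16/BF16."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# LLaMA-3-8B with GQA (8 KV heads) at an 8K context
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
# The same model with full MHA (32 KV heads), for comparison
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
print(f"GQA: {gqa / 2**30:.1f} GiB, MHA: {mha / 2**30:.1f} GiB")  # 4x reduction
```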
PEFT Method Selection
```python
from peft import LoraConfig, get_peft_model, TaskType

# Decision tree for PEFT method selection.
# Rough heuristic: model_size is the BF16 weight footprint in GB;
# the multipliers approximate gradients, optimizer state, and activations.
def select_peft_method(gpu_vram, model_size, data_size, quality_requirement):
    if gpu_vram >= model_size * 4:  # room for weights, grads, optimizer states
        return "full_fine_tuning"
    elif gpu_vram >= model_size * 1.5:
        if quality_requirement == "high":
            return "lora_r32"
        return "lora_r16"
    elif gpu_vram >= model_size * 0.5:
        return "qlora_4bit"
    return "prefix_tuning_or_smaller_model"

# LoRA configuration by use case
lora_configs = {
    "lightweight": LoraConfig(
        r=8, lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05, task_type=TaskType.CAUSAL_LM,
    ),
    "standard": LoraConfig(
        r=16, lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05, task_type=TaskType.CAUSAL_LM,
    ),
    "high_capacity": LoraConfig(
        r=64, lora_alpha=128,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.1, task_type=TaskType.CAUSAL_LM,
    ),
}
```
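To see why LoRA trains so few parameters, count them: each adapted weight matrix W (out×in) gains low-rank factors A (r×in) and B (out×r), i.e. r·(in+out) trainable parameters. A sketch using LLaMA-3-8B-like shapes (4096 hidden dim, 1024-dim K/V projections under GQA, 32 layers — assumed here for illustration):

```python
def lora_trainable_params(module_shapes, r, num_layers):
    """module_shapes: {name: (in_features, out_features)} for targeted modules.
    Each module adds A (r x in) and B (out x r), i.e. r * (in + out) params."""
    per_layer = sum(r * (fin + fout) for fin, fout in module_shapes.values())
    return per_layer * num_layers

# The "standard" config above: q/k/v/o projections at r=16
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),  # GQA: 8 KV heads x head_dim 128
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
n = lora_trainable_params(shapes, r=16, num_layers=32)
print(f"{n:,} trainable params (~{100 * n / 8e9:.2f}% of an 8B model)")
```

The result lands in the 0.1-1% range claimed in the PEFT comparison table below.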
Benchmarking Framework
```python
import time

import torch


def benchmark_architecture(model, tokenizer, test_prompts, device="cuda"):
    """Benchmark model on latency, throughput, and memory."""
    results = {}
    prompt = test_prompts[0]
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # 1. Time to first token (TTFT)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=1)
    torch.cuda.synchronize()
    results["ttft_ms"] = (time.perf_counter() - start) * 1000

    # 2. Decode throughput (tokens per second)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Count tokens actually produced: generation may stop early at EOS
    generated = outputs.shape[1] - inputs["input_ids"].shape[1]
    results["tokens_per_second"] = generated / elapsed

    # 3. Peak VRAM usage
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100)
    results["peak_vram_gb"] = torch.cuda.max_memory_allocated() / 1e9

    # 4. Quality evaluation (perplexity on held-out data)
    # ... add task-specific evaluation here

    return results
```
Configuration Reference
| Architecture | Best For | Params/Quality Sweet Spot | Inference Memory |
|---|---|---|---|
| LLaMA 3 | General tasks | 8B (single GPU), 70B (cluster) | O(n) KV cache |
| Mistral | Efficiency-focused | 7B | O(n) with sliding window |
| Mixtral MoE | Diverse tasks | 8x7B (13B active) | O(n) + expert weights |
| Phi-3 | Small deployment | 3.8B | O(n) KV cache |
| RWKV | Long context | 7B-14B | O(1) constant |
| Mamba | Long context | 1.4B-2.8B | O(1) constant |

| PEFT Method | Trainable % | VRAM Savings | Quality vs Full FT |
|---|---|---|---|
| Full fine-tuning | 100% | 0% | 100% (baseline) |
| LoRA (r=16) | 0.1-1% | 60-75% | 95-99% |
| QLoRA (r=16, 4-bit) | 0.1-1% | 80-90% | 90-97% |
| Prefix tuning | <0.1% | 85-95% | 80-90% |
| Adapter tuning | 1-5% | 50-70% | 93-98% |

| Hardware Budget | Recommended Model | Method |
|---|---|---|
| 8 GB VRAM | Phi-3 Mini (3.8B) | QLoRA |
| 16 GB VRAM | LLaMA-3-8B or Mistral-7B | LoRA |
| 24 GB VRAM | LLaMA-3-8B | Full fine-tune or LoRA r=64 |
| 40 GB VRAM | 13B-class model (e.g., LLaMA-2-13B) | LoRA |
| 80 GB VRAM | LLaMA-3-70B | QLoRA |
| 8x80 GB | LLaMA-3-70B | Full fine-tune with FSDP |
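The hardware budget table follows from rough per-parameter memory costs. A sketch with coarse rules of thumb (mixed-precision Adam assumed; activation memory, which depends on batch and sequence length, is ignored, and some stacks also keep FP32 master weights, adding another 4 bytes per parameter):

```python
def finetune_vram_gb(params_billions, method):
    """Very rough weight + optimizer memory per fine-tuning method, in GB.
    Excludes activation memory and framework overhead."""
    bytes_per_param = {
        "full":  2 + 2 + 8,  # BF16 weights + BF16 grads + FP32 Adam m and v
        "lora":  2,          # frozen BF16 base; adapter states are negligible
        "qlora": 0.5,        # 4-bit base weights; adapter states negligible
    }
    return params_billions * bytes_per_param[method]

for method in ("full", "lora", "qlora"):
    print(f"8B {method}: ~{finetune_vram_gb(8, method):.0f} GB")
```

These estimates line up with the table: an 8B model under LoRA just fits a 16 GB card, under QLoRA a consumer 8 GB card, while full fine-tuning pushes toward ~96 GB before activations.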
Best Practices
- Prototype on small models first: Always validate your data pipeline, training setup, and evaluation on a 1-3B parameter model before scaling. Issues found at scale are expensive to debug.
- Match architecture to deployment constraints: Select your architecture based on inference requirements (latency, throughput, memory) first, then maximize quality within those constraints.
- Use Chinchilla or Llama scaling laws: For pretraining, allocate compute according to established scaling laws. For fine-tuning, more data almost always helps up to diminishing returns.
- Default to LoRA for fine-tuning: Start with LoRA (r=16) on a strong pretrained model. Only move to full fine-tuning when LoRA demonstrably underperforms on your specific evaluation suite.
- Benchmark on YOUR task, not general benchmarks: MMLU scores do not predict performance on domain-specific tasks. Build a task-specific evaluation set and use it to compare architectures.
- Consider inference cost, not just training cost: A model that is 10% better but 3x more expensive to serve may not be worth it. Calculate total cost of ownership including serving infrastructure.
- Evaluate SSMs for long-context workloads: If your application regularly processes sequences longer than 32K tokens, benchmark RWKV or Mamba against transformers on your workload.
- Use MoE for diverse task portfolios: Mixture-of-Experts models provide excellent quality across varied tasks while keeping per-token compute low, ideal for general-purpose APIs.
- Plan for quantization at deployment: Design your training pipeline knowing the model will be quantized (INT8, INT4, or GPTQ) for production serving. Evaluate quality at the target quantization level.
- Version your architecture decisions: Document why you chose a specific architecture, what alternatives you considered, and what benchmarks you used. This prevents relitigating decisions and enables informed future upgrades.
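As a companion to the quantization practice above: weight footprint scales linearly with bit width, so target-precision planning reduces to a one-liner. A sketch; real quantized checkpoints carry some extra metadata (scales, zero points):

```python
def quantized_size_gb(params_billions, bits):
    """Approximate weight footprint after quantization, in GB.
    Weights only: excludes KV cache, activations, and quantization metadata."""
    return params_billions * bits / 8

# An 8B model at common precisions
for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"8B @ {name}: ~{quantized_size_gb(8, bits):.0f} GB")
```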
Troubleshooting
Model too slow for latency requirements
Profile to identify the bottleneck (prefill vs. decode). Use quantization (INT8 typically reduces latency by roughly 30%). Consider speculative decoding for faster generation. Switch to an SSM architecture for constant-time per-token inference.

Quality does not match published benchmarks
Verify you are using the correct prompt format and system prompt. Check tokenizer compatibility. Evaluate on the same benchmark splits (and watch for contaminated data).

LoRA not converging
Increase the learning rate (try 2e-4 to 5e-4). Target more modules (add k_proj, o_proj, and the MLP layers). Increase rank from 16 to 32 or 64. Verify data quality and format.

Out of memory when loading model
Use quantization: 4-bit loading reduces weight memory by ~75% versus FP16. Use device_map="auto" for automatic device splitting. Consider a smaller model in the same family.

Fine-tuned model hallucinates more than base model
Reduce the learning rate (the model is overfitting to the training distribution). Add more diverse training data. Train for fewer epochs. Evaluate with temperature=0 for deterministic comparison.

Cannot decide between two architectures
Run a controlled A/B test: same data, same compute budget, same evaluation suite. If the difference is within noise, choose the one with the better ecosystem and deployment tooling.

MoE model routing is unbalanced
Add an auxiliary load-balancing loss during training. Monitor expert utilization rates. Ensure batch diversity so the router sees varied inputs at each step.
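For the unbalanced-routing issue, the usual remedy is a Switch-Transformer-style auxiliary loss: num_experts · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ is the mean router probability for expert i. The loss reaches its minimum of 1.0 under uniform routing. A pure-Python sketch (real implementations compute this on the router tensors and scale it by a small coefficient before adding it to the task loss):

```python
def load_balancing_loss(router_probs, expert_assignments, num_experts):
    """router_probs: per-token probability lists over experts (post-softmax).
    expert_assignments: top-1 expert index chosen for each token."""
    n_tokens = len(expert_assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [expert_assignments.count(i) / n_tokens for i in range(num_experts)]
    # P_i: mean router probability mass on expert i
    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing over 4 experts -> loss == 1.0 (the minimum)
uniform = [[0.25, 0.25, 0.25, 0.25]] * 8
balanced = load_balancing_loss(uniform, [0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
# Collapsed routing (every token to expert 0) -> loss == 4.0
collapsed = load_balancing_loss([[1.0, 0.0, 0.0, 0.0]] * 8, [0] * 8, num_experts=4)
print(balanced, collapsed)
```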