Smart Model Architecture Selection
Overview
Choosing the right model architecture is the single most impactful decision in any machine learning project, yet it is often made based on hype rather than systematic analysis. This guide provides a structured decision framework for selecting neural network architectures based on your specific requirements: task type, data characteristics, hardware constraints, latency budgets, and deployment targets. It covers the complete landscape from transformer variants through state-space models, mixture-of-experts, and hybrid approaches, with practical decision trees, benchmark comparisons, and implementation patterns for each architecture family.
When to Use
- Starting a new ML project: You need a systematic framework for choosing between architectures rather than defaulting to the most popular option
- Evaluating architecture alternatives: You want to compare transformers vs. SSMs vs. hybrid models for your specific workload
- Hardware-constrained deployment: You need to select architectures that fit within specific VRAM, latency, or throughput budgets
- Scaling decisions: You are deciding model size, depth, and width based on your data volume and compute budget
- Fine-tuning strategy: You need to choose between full fine-tuning, LoRA, QLoRA, or other PEFT methods based on your hardware and quality requirements
- Long-context applications: You need to evaluate architectures that handle extended sequences efficiently
Quick Start
```python
# Architecture selection decision tree (pseudocode)
def select_architecture(requirements):
    if requirements.task == "text_generation":
        if requirements.context_length > 100_000:
            if requirements.memory_budget == "limited":
                return "RWKV or Mamba"  # O(1) memory per token
            else:
                return "Transformer + RoPE + NTK scaling"
        elif requirements.quality == "maximum":
            return "Transformer (LLaMA/Mistral family)"
        elif requirements.training_compute == "minimal":
            return "Start with pretrained, LoRA fine-tune"
    elif requirements.task == "classification":
        if requirements.data_size < 10_000:
            return "Pretrained encoder + linear head"
        else:
            return "Fine-tuned transformer encoder"
    elif requirements.task == "multimodal":
        return "Vision encoder + LLM bridge (LLaVA-style)"
    return "Default: Transformer with LoRA fine-tuning"
```
```shell
# Quick evaluation setup
pip install 'litgpt[extra]' transformers accelerate

# Test a small model on your task
python -c "
from litgpt import LLM
llm = LLM.load('microsoft/phi-2')  # 2.7B, good for prototyping
print(llm.generate('Your task-specific prompt here', max_new_tokens=100))
"
```
Core Concepts
Architecture Decision Matrix
```python
# Decision factors and weights for architecture selection
decision_factors = {
    "quality": {
        "weight": 0.3,
        "transformers": 10,       # Best quality on most benchmarks
        "rwkv": 8,                # Competitive, improving rapidly
        "mamba": 8,               # Strong on language tasks
        "mixture_of_experts": 9,  # Excellent for diverse tasks
    },
    "inference_speed": {
        "weight": 0.2,
        "transformers": 6,        # KV cache overhead
        "rwkv": 9,                # O(1) per token
        "mamba": 9,               # O(1) per token
        "mixture_of_experts": 7,  # Router overhead, but sparse
    },
    "memory_efficiency": {
        "weight": 0.2,
        "transformers": 5,        # O(n) KV cache
        "rwkv": 10,               # O(1) state
        "mamba": 10,              # O(1) state
        "mixture_of_experts": 4,  # Large total params
    },
    "ecosystem_maturity": {
        "weight": 0.15,
        "transformers": 10,       # HuggingFace, vLLM, etc.
        "rwkv": 6,                # Growing community
        "mamba": 6,               # Growing community
        "mixture_of_experts": 7,  # Good support
    },
    "training_efficiency": {
        "weight": 0.15,
        "transformers": 7,        # O(n^2) but well-optimized
        "rwkv": 8,                # O(n) parallelizable
        "mamba": 8,               # O(n) with hardware-aware design
        "mixture_of_experts": 6,  # Load balancing complexity
    },
}
```
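The matrix above can be collapsed into one weighted score per architecture. A minimal sketch, using the illustrative weights and scores from the table (they are judgment calls, not measurements):

```python
def weighted_scores(factors):
    """Collapse {factor: {"weight": w, arch: score, ...}} into one score per arch."""
    scores = {}
    for entries in factors.values():
        weight = entries["weight"]
        for arch, value in entries.items():
            if arch == "weight":
                continue  # skip the weight key itself
            scores[arch] = scores.get(arch, 0.0) + weight * value
    return scores

# Abridged copy of the decision_factors table above
factors = {
    "quality":             {"weight": 0.30, "transformers": 10, "rwkv": 8,  "mamba": 8,  "mixture_of_experts": 9},
    "inference_speed":     {"weight": 0.20, "transformers": 6,  "rwkv": 9,  "mamba": 9,  "mixture_of_experts": 7},
    "memory_efficiency":   {"weight": 0.20, "transformers": 5,  "rwkv": 10, "mamba": 10, "mixture_of_experts": 4},
    "ecosystem_maturity":  {"weight": 0.15, "transformers": 10, "rwkv": 6,  "mamba": 6,  "mixture_of_experts": 7},
    "training_efficiency": {"weight": 0.15, "transformers": 7,  "rwkv": 8,  "mamba": 8,  "mixture_of_experts": 6},
}
print(weighted_scores(factors))
```

With these particular weights the SSMs edge out transformers; shifting weight toward quality and ecosystem maturity flips the ranking, which is exactly why making the weights explicit is useful.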
Model Size Selection
```python
# Chinchilla-optimal scaling: train on ~20 tokens per parameter
# Llama-style over-training: 100-300+ tokens per parameter
# (costs more training compute, but yields better quality per inference FLOP)
scaling_guide = {
    # Format: (params, recommended_tokens, use_case)
    "tiny":   ("125M",   "2.5B-25B", "Education, rapid prototyping"),
    "small":  ("1-3B",   "50B-300B", "On-device, edge deployment"),
    "medium": ("7-8B",   "1T-2T",    "General purpose, single GPU serving"),
    "large":  ("13B",    "2T-4T",    "High quality, multi-GPU serving"),
    "xl":     ("34-70B", "4T-15T",   "Maximum quality, cluster serving"),
}

# Rule of thumb for fine-tuning data requirements
fine_tuning_data = {
    "LoRA": "1K-100K examples",           # Task-specific adaptation
    "Full fine-tune": "10K-1M examples",  # Significant behavior change
    "Continued pretrain": "1B+ tokens",   # Domain adaptation
}
```
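The 20-tokens-per-parameter rule combines with the standard dense-transformer estimate of ~6 FLOPs per parameter per token into a quick budgeting helper. A sketch; real budgets should also account for data quality and any over-training target:

```python
def chinchilla_tokens(params):
    """Chinchilla-optimal token count: ~20 training tokens per parameter."""
    return 20 * params

def training_flops(params, tokens):
    """Standard dense-transformer estimate: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

# Example: budgeting a 7B-parameter pretraining run
n = 7_000_000_000
d = chinchilla_tokens(n)  # 140B tokens
print(f"tokens: {d:.2e}, FLOPs: {training_flops(n, d):.2e}")
```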
Transformer Variant Comparison
```python
# Key architectural choices in modern transformers
architectures = {
    "LLaMA-3": {
        "attention": "GQA (8 KV heads for 32 Q heads)",
        "position": "RoPE",
        "activation": "SwiGLU",
        "normalization": "RMSNorm (pre-norm)",
        "context": "8K-128K",
        "strengths": "Best open-source quality, strong ecosystem",
    },
    "Mistral-7B": {
        "attention": "GQA + Sliding Window (4096)",
        "position": "RoPE",
        "activation": "SwiGLU",
        "normalization": "RMSNorm (pre-norm)",
        "context": "32K",
        "strengths": "Excellent quality/size ratio, sliding window efficiency",
    },
    "Mixtral-8x7B": {
        "attention": "GQA",
        "position": "RoPE",
        "activation": "SwiGLU + MoE (top-2 of 8 experts)",
        "normalization": "RMSNorm",
        "context": "32K",
        "strengths": "47B total params, 13B active per token",
    },
    "Phi-3": {
        "attention": "MHA with Flash Attention",
        "position": "RoPE (LongRoPE for 128K)",
        "activation": "SwiGLU",
        "normalization": "RMSNorm",
        "context": "4K-128K",
        "strengths": "Small but capable, optimized training data",
    },
    "Qwen-2.5": {
        "attention": "GQA",
        "position": "RoPE (YaRN for extrapolation)",
        "activation": "SwiGLU",
        "normalization": "RMSNorm",
        "context": "32K-128K",
        "strengths": "Strong multilingual, code, and math",
    },
}
```
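The attention choices above matter mostly because of KV-cache size: GQA shrinks the cache in proportion to the number of KV heads. A back-of-envelope estimator (the LLaMA-3-8B numbers — 32 layers, 8 KV heads, head dim 128 — are its published config; the rest is illustrative):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes held by the KV cache for one batch of sequences.
    Leading 2 accounts for both K and V; bytes_per_elem=2 assumes FP16/BF16."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# LLaMA-3-8B with GQA (8 KV heads) at an 8K context
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
# The same model with full MHA (32 KV heads), for comparison
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
print(f"GQA: {gqa / 2**30:.1f} GiB, MHA: {mha / 2**30:.1f} GiB")  # 4x reduction
```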
PEFT Method Selection
```python
from peft import LoraConfig, get_peft_model, TaskType

# Decision tree for PEFT method selection.
# Rough heuristic: model_size is the BF16 weight footprint in GB;
# the multipliers approximate gradients, optimizer state, and activations.
def select_peft_method(gpu_vram, model_size, data_size, quality_requirement):
    if gpu_vram >= model_size * 4:  # room for weights, grads, optimizer states
        return "full_fine_tuning"
    elif gpu_vram >= model_size * 1.5:
        if quality_requirement == "high":
            return "lora_r32"
        return "lora_r16"
    elif gpu_vram >= model_size * 0.5:
        return "qlora_4bit"
    return "prefix_tuning_or_smaller_model"

# LoRA configuration by use case
lora_configs = {
    "lightweight": LoraConfig(
        r=8, lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05, task_type=TaskType.CAUSAL_LM,
    ),
    "standard": LoraConfig(
        r=16, lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05, task_type=TaskType.CAUSAL_LM,
    ),
    "high_capacity": LoraConfig(
        r=64, lora_alpha=128,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.1, task_type=TaskType.CAUSAL_LM,
    ),
}
```
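To see why LoRA trains so few parameters, count them: each adapted weight matrix W (out×in) gains low-rank factors A (r×in) and B (out×r), i.e. r·(in+out) trainable parameters. A sketch using LLaMA-3-8B-like shapes (4096 hidden dim, 1024-dim K/V projections under GQA, 32 layers — assumed here for illustration):

```python
def lora_trainable_params(module_shapes, r, num_layers):
    """module_shapes: {name: (in_features, out_features)} for targeted modules.
    Each module adds A (r x in) and B (out x r), i.e. r * (in + out) params."""
    per_layer = sum(r * (fin + fout) for fin, fout in module_shapes.values())
    return per_layer * num_layers

# The "standard" config above: q/k/v/o projections at r=16
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),  # GQA: 8 KV heads x head_dim 128
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
n = lora_trainable_params(shapes, r=16, num_layers=32)
print(f"{n:,} trainable params (~{100 * n / 8e9:.2f}% of an 8B model)")
```

The result lands in the 0.1-1% range claimed in the PEFT comparison table below.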
Benchmarking Framework
```python
import time

import torch


def benchmark_architecture(model, tokenizer, test_prompts, device="cuda"):
    """Benchmark model on latency, throughput, and memory."""
    results = {}
    prompt = test_prompts[0]
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # 1. Time to first token (TTFT)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=1)
    torch.cuda.synchronize()
    results["ttft_ms"] = (time.perf_counter() - start) * 1000

    # 2. Decode throughput (tokens per second)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Count tokens actually produced: generation may stop early at EOS
    generated = outputs.shape[1] - inputs["input_ids"].shape[1]
    results["tokens_per_second"] = generated / elapsed

    # 3. Peak VRAM usage
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=100)
    results["peak_vram_gb"] = torch.cuda.max_memory_allocated() / 1e9

    # 4. Quality evaluation (perplexity on held-out data)
    # ... add task-specific evaluation here

    return results
```
Configuration Reference
| Architecture | Best For | Params/Quality Sweet Spot | Inference Memory |
|---|---|---|---|
| LLaMA 3 | General tasks | 8B (single GPU), 70B (cluster) | O(n) KV cache |
| Mistral | Efficiency-focused | 7B | O(n) with sliding window |
| Mixtral MoE | Diverse tasks | 8x7B (13B active) | O(n) + expert weights |
| Phi-3 | Small deployment | 3.8B | O(n) KV cache |
| RWKV | Long context | 7B-14B | O(1) constant |
| Mamba | Long context | 1.4B-2.8B | O(1) constant |

| PEFT Method | Trainable % | VRAM Savings | Quality vs Full FT |
|---|---|---|---|
| Full fine-tuning | 100% | 0% | 100% (baseline) |
| LoRA (r=16) | 0.1-1% | 60-75% | 95-99% |
| QLoRA (r=16, 4-bit) | 0.1-1% | 80-90% | 90-97% |
| Prefix tuning | <0.1% | 85-95% | 80-90% |
| Adapter tuning | 1-5% | 50-70% | 93-98% |

| Hardware Budget | Recommended Model | Method |
|---|---|---|
| 8 GB VRAM | Phi-3 Mini (3.8B) | QLoRA |
| 16 GB VRAM | LLaMA-3-8B or Mistral-7B | LoRA |
| 24 GB VRAM | LLaMA-3-8B | Full fine-tune or LoRA r=64 |
| 40 GB VRAM | 13B-class model (e.g., LLaMA-2-13B) | LoRA |
| 80 GB VRAM | LLaMA-3-70B | QLoRA |
| 8x80 GB | LLaMA-3-70B | Full fine-tune with FSDP |
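The hardware budget table follows from rough per-parameter memory costs. A sketch with coarse rules of thumb (mixed-precision Adam assumed; activation memory, which depends on batch and sequence length, is ignored, and some stacks also keep FP32 master weights, adding another 4 bytes per parameter):

```python
def finetune_vram_gb(params_billions, method):
    """Very rough weight + optimizer memory per fine-tuning method, in GB.
    Excludes activation memory and framework overhead."""
    bytes_per_param = {
        "full":  2 + 2 + 8,  # BF16 weights + BF16 grads + FP32 Adam m and v
        "lora":  2,          # frozen BF16 base; adapter states are negligible
        "qlora": 0.5,        # 4-bit base weights; adapter states negligible
    }
    return params_billions * bytes_per_param[method]

for method in ("full", "lora", "qlora"):
    print(f"8B {method}: ~{finetune_vram_gb(8, method):.0f} GB")
```

These estimates line up with the table: an 8B model under LoRA just fits a 16 GB card, under QLoRA a consumer 8 GB card, while full fine-tuning pushes toward ~96 GB before activations.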
Best Practices
- Prototype on small models first: Always validate your data pipeline, training setup, and evaluation on a 1-3B parameter model before scaling. Issues found at scale are expensive to debug.
- Match architecture to deployment constraints: Select your architecture based on inference requirements (latency, throughput, memory) first, then maximize quality within those constraints.
- Use Chinchilla or Llama scaling laws: For pretraining, allocate compute according to established scaling laws. For fine-tuning, more data almost always helps up to diminishing returns.
- Default to LoRA for fine-tuning: Start with LoRA (r=16) on a strong pretrained model. Only move to full fine-tuning when LoRA demonstrably underperforms on your specific evaluation suite.
- Benchmark on YOUR task, not general benchmarks: MMLU scores do not predict performance on domain-specific tasks. Build a task-specific evaluation set and use it to compare architectures.
- Consider inference cost, not just training cost: A model that is 10% better but 3x more expensive to serve may not be worth it. Calculate total cost of ownership including serving infrastructure.
- Evaluate SSMs for long-context workloads: If your application regularly processes sequences longer than 32K tokens, benchmark RWKV or Mamba against transformers on your workload.
- Use MoE for diverse task portfolios: Mixture-of-Experts models provide excellent quality across varied tasks while keeping per-token compute low, ideal for general-purpose APIs.
- Plan for quantization at deployment: Design your training pipeline knowing the model will be quantized (INT8, INT4, or GPTQ) for production serving. Evaluate quality at the target quantization level.
- Version your architecture decisions: Document why you chose a specific architecture, what alternatives you considered, and what benchmarks you used. This prevents relitigating decisions and enables informed future upgrades.
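As a companion to the quantization practice above: weight footprint scales linearly with bit width, so target-precision planning reduces to a one-liner. A sketch; real quantized checkpoints carry some extra metadata (scales, zero points):

```python
def quantized_size_gb(params_billions, bits):
    """Approximate weight footprint after quantization, in GB.
    Weights only: excludes KV cache, activations, and quantization metadata."""
    return params_billions * bits / 8

# An 8B model at common precisions
for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"8B @ {name}: ~{quantized_size_gb(8, bits):.0f} GB")
```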
Troubleshooting
Model too slow for latency requirements
Profile to identify the bottleneck (prefill vs. decode). Use quantization (INT8 typically reduces latency by roughly 30%). Consider speculative decoding for faster generation. Switch to an SSM architecture for constant-time per-token inference.

Quality does not match published benchmarks
Verify you are using the correct prompt format and system prompt. Check tokenizer compatibility. Evaluate on the same benchmark splits (and watch for contaminated data).

LoRA not converging
Increase the learning rate (try 2e-4 to 5e-4). Target more modules (add k_proj, o_proj, and the MLP layers). Increase rank from 16 to 32 or 64. Verify data quality and format.

Out of memory when loading model
Use quantization: 4-bit loading reduces weight memory by ~75% versus FP16. Use device_map="auto" for automatic device splitting. Consider a smaller model in the same family.

Fine-tuned model hallucinates more than base model
Reduce the learning rate (the model is overfitting to the training distribution). Add more diverse training data. Train for fewer epochs. Evaluate with temperature=0 for deterministic comparison.

Cannot decide between two architectures
Run a controlled A/B test: same data, same compute budget, same evaluation suite. If the difference is within noise, choose the one with the better ecosystem and deployment tooling.

MoE model routing is unbalanced
Add an auxiliary load-balancing loss during training. Monitor expert utilization rates. Ensure batch diversity so the router sees varied inputs at each step.
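For the unbalanced-routing issue, the usual remedy is a Switch-Transformer-style auxiliary loss: num_experts · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ is the mean router probability for expert i. The loss reaches its minimum of 1.0 under uniform routing. A pure-Python sketch (real implementations compute this on the router tensors and scale it by a small coefficient before adding it to the task loss):

```python
def load_balancing_loss(router_probs, expert_assignments, num_experts):
    """router_probs: per-token probability lists over experts (post-softmax).
    expert_assignments: top-1 expert index chosen for each token."""
    n_tokens = len(expert_assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [expert_assignments.count(i) / n_tokens for i in range(num_experts)]
    # P_i: mean router probability mass on expert i
    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing over 4 experts -> loss == 1.0 (the minimum)
uniform = [[0.25, 0.25, 0.25, 0.25]] * 8
balanced = load_balancing_loss(uniform, [0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
# Collapsed routing (every token to expert 0) -> loss == 4.0
collapsed = load_balancing_loss([[1.0, 0.0, 0.0, 0.0]] * 8, [0] * 8, num_experts=4)
print(balanced, collapsed)
```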