Model Architecture Complete
Powerful skill covering state-space models, transformers, and model complexity tradeoffs. Includes structured workflows, validation checks, and reusable patterns for AI research.
Complete Model Architecture Guide
Overview
Selecting and implementing the right model architecture is one of the most consequential decisions in any deep learning project. This guide covers the landscape of modern neural network architectures, from transformer-based large language models to efficient alternatives such as state-space models, along with practical considerations for architecture selection, modification, and deployment. Whether you are pretraining from scratch, fine-tuning an existing model, or designing a custom architecture for a specific domain, this guide provides the decision frameworks, implementation patterns, and operational knowledge needed to make informed choices.
When to Use
- Architecture selection: You are starting a new project and need to choose between transformers, SSMs, RNNs, or hybrid approaches based on your requirements
- Model customization: You want to modify an existing architecture (add layers, change attention patterns, adjust dimensions) for your specific use case
- Pretraining planning: You are planning a pretraining run and need to understand how architecture choices affect compute cost, memory, and downstream performance
- Fine-tuning strategy: You need to decide between full fine-tuning, LoRA, QLoRA, prefix tuning, or other parameter-efficient methods
- Performance optimization: You need to understand the compute and memory tradeoffs of different architectural components
- Scaling decisions: You are deciding model size, depth, width, and context length based on available hardware
Quick Start
```bash
# Install LitGPT for clean architecture implementations
pip install 'litgpt[extra]'

# Explore available architectures
litgpt download list

# Download and inspect a model
litgpt download microsoft/phi-2
python -c "
from litgpt import LLM
llm = LLM.load('microsoft/phi-2')
print(llm.generate('Hello world', max_new_tokens=20))
"
```
```python
# Or use HuggingFace for the broadest model support
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Inspect architecture
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
```
Core Concepts
Architecture Families
Transformer-based (Decoder-only)
The dominant architecture for generative language models. All tokens attend to all previous tokens via causal self-attention.
```python
# Key components of a transformer block (schematic)
class TransformerBlock:
    """
    Input -> LayerNorm -> Multi-Head Attention -> Residual
          -> LayerNorm -> MLP -> Residual
    """
    def forward(self, x):
        # Pre-norm architecture (modern standard)
        h = x + self.attention(self.ln1(x))
        out = h + self.mlp(self.ln2(h))
        return out
```
Major model families:
- LLaMA / Llama 3: RoPE embeddings, SwiGLU activation, GQA attention
- GPT / GPT-NeoX: Learned position embeddings, GELU activation
- Phi: Small but capable, optimized training data
- Mistral / Mixtral: Sliding window attention, Mixture of Experts
- Gemma: Google's open models with multi-query attention
- Qwen: Alibaba's multilingual models with long context
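The schematic block sketched earlier can be made runnable. Below is a minimal pre-norm decoder block using `torch.nn.MultiheadAttention`, purely for illustration: real LLM blocks typically use RMSNorm, RoPE, and fused attention kernels, and the dimensions here are arbitrary.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Minimal pre-norm transformer block (illustrative; no RoPE, no KV cache)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        # Causal mask: True entries are masked, so each token sees only the past
        n = x.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                      # residual around attention
        return x + self.mlp(self.ln2(x))      # residual around MLP

x = torch.randn(2, 16, 64)   # (batch, seq, d_model)
y = PreNormBlock()(x)
print(y.shape)               # same shape as the input
```

The block preserves the input shape, which is what lets dozens of such blocks be stacked.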
State-Space Models (SSMs)
Linear-time alternatives to transformers for sequence modeling.
```python
# RWKV: combines transformer training with RNN inference
# Key advantage: O(1) memory per token during inference
from rwkv.model import RWKV

model = RWKV(model='RWKV-6-World-7B', strategy='cuda fp16')

# Process tokens sequentially with constant memory
state = None
for token in token_sequence:
    out, state = model.forward([token], state)
    # state size is constant regardless of sequence length
```
- RWKV: RNN-transformer hybrid, linear time complexity, infinite context
- Mamba: Selective state-space model, hardware-aware design
- RetNet: Retention mechanism for parallel training, recurrent inference
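To see why recurrent inference needs only constant memory, consider a toy scalar state-space recurrence. This is purely illustrative: real SSMs like Mamba use vector-valued states and input-dependent (selective) parameters, but the O(1)-state property is the same.

```python
# Toy linear state-space model: h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t
# The entire "memory" of the past is the single state h, however long the input.
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # O(1) state update per token
        ys.append(c * h)
    return ys

print(ssm_scan([1.0, 0.0, 0.0]))  # impulse response decays geometrically
```

Contrast with attention, where generating token t requires keys and values for all t-1 previous tokens.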
Encoder-Decoder
Bidirectional encoding with autoregressive decoding, primarily for translation and summarization.
```python
# T5-style architecture
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")
# Encoder sees full input bidirectionally
# Decoder generates output autoregressively
```
Attention Mechanisms
```python
import torch
import torch.nn.functional as F

# Standard Multi-Head Attention
def multi_head_attention(Q, K, V, num_heads):
    """Standard O(n^2) attention."""
    d_k = Q.size(-1) // num_heads
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, V)

# Grouped-Query Attention (GQA) - used in Llama 3, Mistral
# Reduces KV cache size by sharing key-value heads
def grouped_query_attention(Q, K, V, num_q_heads, num_kv_heads):
    """GQA: num_kv_heads < num_q_heads, KV heads are shared."""
    group_size = num_q_heads // num_kv_heads
    K = K.repeat_interleave(group_size, dim=1)
    V = V.repeat_interleave(group_size, dim=1)
    return multi_head_attention(Q, K, V, num_q_heads)

# Multi-Query Attention (MQA) - used in Falcon, PaLM
# A single KV head shared across all query heads:
# equivalent to GQA with num_kv_heads=1
```
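The practical payoff of GQA and MQA is a smaller KV cache. A quick back-of-envelope calculator makes the savings concrete; the Llama-3.1-8B-style numbers below (32 layers, head dim 128, 8 KV heads vs 32 query heads, 8K context, fp16) are used for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of the KV cache in bytes; leading 2 accounts for both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")  # 4x smaller cache
```

With 8 KV heads instead of 32, the cache shrinks from 4 GiB to 1 GiB per sequence, which directly translates into larger batch sizes at serving time.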
Position Encoding Strategies
```python
import torch

# Rotary Position Embeddings (RoPE) - LLaMA, Mistral, Qwen
def apply_rotary_emb(x, cos, sin):
    """RoPE: encodes position via rotation in embedding space."""
    x_rot = torch.stack([-x[..., 1::2], x[..., ::2]], dim=-1).flatten(-2)
    return x * cos + x_rot * sin

# ALiBi (Attention with Linear Biases) - BLOOM, MPT
def alibi_bias(seq_len, num_heads):
    """ALiBi: adds distance-based bias to attention scores."""
    # Head-specific slopes 2^(-8h/H) for h = 1..H
    slopes = torch.tensor([2 ** (-8 * (i + 1) / num_heads) for i in range(num_heads)])
    positions = torch.arange(seq_len)
    bias = -slopes[:, None, None] * (positions[None, None, :] - positions[None, :, None]).abs()
    return bias

# Learned absolute position embeddings - GPT-2, BERT
class LearnedPositionEncoding(torch.nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.embedding = torch.nn.Embedding(max_len, d_model)
```
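A sketch of how a RoPE helper like the one above is typically driven: build per-position cos/sin tables, then rotate. The interleaved-pair convention and `base=10000.0` follow common practice, but exact shapes and conventions vary between implementations. Two properties are worth checking: rotation preserves vector norms, and position 0 is the identity rotation.

```python
import torch

def apply_rotary_emb(x, cos, sin):
    # Pairs (x0, x1) are rotated: (x0*cos - x1*sin, x1*cos + x0*sin)
    x_rot = torch.stack([-x[..., 1::2], x[..., ::2]], dim=-1).flatten(-2)
    return x * cos + x_rot * sin

def rope_cos_sin(seq_len, head_dim, base=10000.0):
    # One rotation angle per dimension pair; repeat so both pair members share it
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    angles = angles.repeat_interleave(2, dim=-1)  # (seq_len, head_dim)
    return angles.cos(), angles.sin()

x = torch.randn(8, 16)          # (seq_len, head_dim)
cos, sin = rope_cos_sin(8, 16)
y = apply_rotary_emb(x, cos, sin)
print(torch.allclose(x.norm(dim=-1), y.norm(dim=-1), atol=1e-5))  # norms preserved
print(torch.allclose(x[0], y[0], atol=1e-6))                      # position 0 unchanged
```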
Model Scaling Decisions
```python
# Chinchilla scaling laws: optimal ratio of parameters to training tokens
# For compute budget C:
#   optimal parameters N ~ C^0.5
#   optimal tokens     D ~ C^0.5
#   ratio D/N ~ 20 (train on ~20 tokens per parameter)

# Example scaling table
scaling_configs = {
    "125M": {"n_layer": 12, "n_head": 12, "n_embd": 768, "tokens": "2.5B"},
    "350M": {"n_layer": 24, "n_head": 16, "n_embd": 1024, "tokens": "7B"},
    "1.3B": {"n_layer": 24, "n_head": 32, "n_embd": 2048, "tokens": "26B"},
    "6.7B": {"n_layer": 32, "n_head": 32, "n_embd": 4096, "tokens": "134B"},
    "13B": {"n_layer": 40, "n_head": 40, "n_embd": 5120, "tokens": "260B"},
    "70B": {"n_layer": 80, "n_head": 64, "n_embd": 8192, "tokens": "1.4T"},
}
```
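Since training compute is roughly C ≈ 6·N·D FLOPs, the D/N ≈ 20 rule pins down a compute-optimal size directly. A small helper, treating both the 6·N·D approximation and the ratio of 20 as the usual rules of thumb rather than exact laws:

```python
import math

def chinchilla_optimal(flops, tokens_per_param=20.0):
    """Compute-optimal params N and tokens D for budget C, using C ~ 6*N*D."""
    # With D = 20*N: C = 6*N*(20*N) = 120*N^2  =>  N = sqrt(C/120)
    n = math.sqrt(flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

n, d = chinchilla_optimal(1e21)
print(f"~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")  # roughly a 2.9B model on 58B tokens
```

This gives a starting point; in practice, inference-cost considerations often push toward smaller models trained well past the compute-optimal token count.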
Parameter-Efficient Fine-Tuning (PEFT)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# LoRA configuration
lora_config = LoraConfig(
    r=16,                                 # Rank of adaptation matrices
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # Typically 0.1-1% of total parameters

# QLoRA: LoRA + 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
)
model = get_peft_model(model, lora_config)
# Fine-tune an 8B model on a 16GB GPU
```
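To sanity-check the "0.1-1% trainable" claim, you can count LoRA parameters directly: each adapted weight of shape (d_out, d_in) gains two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r). The layer shapes below are Llama-3.1-8B-style (hidden size 4096, GQA `v_proj` output 1024, 32 layers) and are used for illustration only.

```python
def lora_param_count(d_in, d_out, r):
    # A: (r, d_in) and B: (d_out, r) -> r * (d_in + d_out) extra trainable params
    return r * (d_in + d_out)

hidden, layers, r = 4096, 32, 16
per_layer = lora_param_count(hidden, 4096, r)   # q_proj: 4096 -> 4096
per_layer += lora_param_count(hidden, 1024, r)  # v_proj: 4096 -> 1024 (GQA)
total = layers * per_layer
print(f"{total / 1e6:.1f}M trainable params")   # ~6.8M, under 0.1% of an 8B model
```

Targeting more modules (`k_proj`, `o_proj`, the MLP projections) or raising `r` scales this count linearly, which is why the trainable fraction spans roughly 0.1-1%.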
Configuration Reference
| Architecture Feature | Transformer | RWKV | Mamba |
|---|---|---|---|
| Training complexity | O(n^2) | O(n) | O(n) |
| Inference per token | O(n) KV cache | O(1) constant | O(1) constant |
| Memory (inference) | Grows with context | Constant | Constant |
| Parallelizable training | Yes | Yes | Yes |
| Long context | Limited by memory | Unlimited | Unlimited |
| Ecosystem maturity | Highest | Growing | Growing |

| PEFT Method | Trainable Params | Memory Savings | Quality |
|---|---|---|---|
| Full fine-tuning | 100% | None | Highest |
| LoRA (r=16) | 0.1-1% | 60-80% | Near-full |
| QLoRA | 0.1-1% + 4-bit base | 80-90% | Good |
| Prefix tuning | <0.1% | 90%+ | Moderate |
| IA3 | <0.01% | 95%+ | Lower |
Best Practices
- Follow Chinchilla scaling laws: Train on approximately 20 tokens per parameter for compute-optimal results. A model trained on too few tokens wastes parameters; a model too small for its token budget wastes compute.
- Use GQA over MHA for inference efficiency: Grouped-Query Attention shrinks the KV cache in proportion to the reduction in KV heads, directly improving inference throughput.
- Start with LoRA for fine-tuning: Default to LoRA (r=16) for any fine-tuning task, and move to full fine-tuning only when LoRA quality is demonstrably insufficient for your use case.
- Prefer pre-norm over post-norm: Pre-normalization (LayerNorm before attention/MLP) gives more stable training dynamics and is the standard in modern architectures.
- Use RoPE for position encoding: Rotary Position Embeddings support context-length extrapolation and are the default choice for new transformer architectures.
- Use SwiGLU activation in MLPs: SwiGLU provides better quality than GELU or ReLU at marginal extra compute cost, and is used in LLaMA, Mistral, and most modern models.
- Consider SSMs for very long sequences: If your application requires 100K+ token contexts with constant memory, evaluate RWKV or Mamba before defaulting to transformers.
- Profile before optimizing: Measure actual throughput and memory usage before making architecture changes; use `torch.profiler` or `nsys` to identify real bottlenecks.
- Match architecture to data scale: Small datasets (under 10K examples) benefit more from PEFT on large models than from training smaller models from scratch.
- Test context length at inference: Validate model quality at the actual context lengths your application uses, not just the maximum supported length.
Troubleshooting
Model quality degrades at long context lengths
Check position-encoding support for the target length. RoPE models may need NTK-aware scaling or YaRN for lengths beyond training. Evaluate perplexity at the target length.
Out of memory during training
Apply gradient checkpointing (`model.gradient_checkpointing_enable()`). Switch to LoRA or QLoRA. Reduce batch size and increase gradient accumulation. Use FSDP for multi-GPU sharding.
LoRA fine-tuning not matching full fine-tuning quality
Increase LoRA rank from 16 to 32 or 64. Target more modules (add `k_proj` and the MLP layers). Increase training steps. Verify the learning rate is appropriate (typically 1e-4 for LoRA).
KV cache consuming too much memory during inference
Switch from MHA to GQA or MQA attention. Use a quantized KV cache (FP8). Reduce the maximum context length. Consider paged attention (vLLM).
Training loss spikes or diverges
Reduce the learning rate. Add warmup steps (5-10% of total steps). Check for data-quality issues (corrupted examples, extreme length outliers). Use gradient clipping (`max_norm=1.0`).
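The warmup-plus-clipping recipe can be sketched in a few lines. Linear warmup into cosine decay is one common schedule, not the only one, and the step counts below are illustrative.

```python
import math
import torch

model = torch.nn.Linear(16, 16)  # stand-in for a real model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
total_steps, warmup_steps = 1000, 100  # ~10% warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                        # linear warmup from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay to 0

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

# One training step with gradient clipping
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clamp spike gradients
opt.step()
sched.step()
```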
State-space model quality not matching transformers
SSMs may underperform transformers on tasks requiring precise long-range retrieval. Consider hybrid architectures that combine SSM layers with attention layers.
Multi-GPU training hangs or crashes
Verify the NCCL installation and GPU interconnect. Set `NCCL_DEBUG=INFO` for diagnostics. Ensure all GPUs have identical VRAM. Check that the batch size is divisible by the device count.