
Distributed Training Complete

Boost productivity with this expert guide to fully sharded and distributed training. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

Complete Guide to Distributed Training

Overview

A comprehensive skill covering all aspects of distributed deep learning training — from data parallelism fundamentals to advanced 3D parallelism strategies. This guide unifies the approaches across frameworks (PyTorch DDP, HuggingFace Accelerate, DeepSpeed, FSDP, Megatron-Core) and helps you choose the right strategy for your model size, hardware, and performance requirements.

When to Use

  • Training models across multiple GPUs or nodes
  • Model doesn't fit in single GPU memory
  • Need to reduce training time for large datasets
  • Choosing between parallelism strategies
  • Optimizing training throughput and efficiency
  • Setting up multi-node training infrastructure

Quick Start

```bash
# Simplest path: HuggingFace Accelerate
pip install accelerate
accelerate config              # Interactive setup wizard
accelerate launch train.py

# PyTorch native DDP
torchrun --nproc_per_node=4 train.py

# DeepSpeed
pip install deepspeed
deepspeed train.py --deepspeed ds_config.json
```

Parallelism Strategies

Data Parallelism (DP/DDP)

```python
# PyTorch DDP — replicate model, split data
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group('nccl')
model = DDP(model.to(rank), device_ids=[rank])
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()   # Gradients synchronized automatically
    optimizer.step()
```
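The `rank` and `world_size` used above come from the launcher: `torchrun` exports them as the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables. A minimal sketch of reading them (the helper name `dist_env` is hypothetical):

```python
import os

def dist_env(default_world_size=1):
    """Read the process-group layout that torchrun exports via
    RANK / WORLD_SIZE / LOCAL_RANK environment variables."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", default_world_size))
    local_rank = int(os.environ.get("LOCAL_RANK", rank))
    return rank, world_size, local_rank

# Example: values a 4-GPU single-node launch would export for rank 2
os.environ.update({"RANK": "2", "WORLD_SIZE": "4", "LOCAL_RANK": "2"})
print(dist_env())  # → (2, 4, 2)
```

`LOCAL_RANK` is what you pass as `device_ids=[local_rank]`; `RANK` is the global index across all nodes.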

Fully Sharded Data Parallel (FSDP)

```python
# PyTorch FSDP — shard parameters across GPUs
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
model = FSDP(
    model,
    mixed_precision=mp_policy,
    auto_wrap_policy=transformer_auto_wrap_policy,  # from torch.distributed.fsdp.wrap
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)
```

Model Parallelism (Tensor + Pipeline)

Tensor Parallel: Split layers horizontally
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”    ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│  GPU 0       │    │  GPU 1       │
│  Rows 0-2047 │    │  Rows 2048+  │
│  of Linear   │    │  of Linear   │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
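Why the row split works: each GPU computes its slice of the output, and concatenating the slices recovers the full result exactly. A dependency-free sketch with a toy 4Ɨ2 weight matrix standing in for the sharded `Linear`:

```python
def matvec(W, x):
    """Plain matrix-vector product: one output element per weight row."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]  # toy 4x2 weight matrix
x = [1, 1]

full = matvec(W, x)            # single-GPU result
shard0 = matvec(W[:2], x)      # "GPU 0": rows 0-1
shard1 = matvec(W[2:], x)      # "GPU 1": rows 2-3
assert shard0 + shard1 == full # concatenating shards recovers the output
print(full)  # → [3, 7, 11, 15]
```

In a real transformer the next layer needs the full activation, which is why tensor parallelism inserts an all-gather (or all-reduce, for column-split layers) after the sharded matmul.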

Pipeline Parallel: Split layers vertically
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”    ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│  GPU 0       │    │  GPU 1       │
│  Layers 0-15 │───→│  Layers 16-31│
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
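Pipeline parallelism pays for its memory savings with idle time ("bubble") while stages wait for each other. For a GPipe-style schedule with `p` stages and `m` micro-batches, the commonly cited idle fraction is `(p - 1) / (m + p - 1)`, so more micro-batches shrink the bubble:

```python
def bubble_fraction(p, m):
    """Idle fraction of a GPipe-style pipeline: p stages, m micro-batches."""
    return (p - 1) / (m + p - 1)

print(bubble_fraction(4, 4))   # 3/7 ≈ 0.43 — few micro-batches, large bubble
print(bubble_fraction(4, 32))  # 3/35 ≈ 0.09 — more micro-batches shrink it
```

This is why pipeline-parallel configs pair a small per-GPU micro-batch with many accumulation steps.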

Strategy Selection Guide

| Model Size | GPUs Available | Recommended Strategy |
|---|---|---|
| < 1B | 1-8 | DDP + bf16 |
| 1-7B | 2-8 | FSDP or DeepSpeed ZeRO-2 |
| 7-13B | 4-8 | FSDP or DeepSpeed ZeRO-2/3 |
| 13-70B | 8-32 | DeepSpeed ZeRO-3 or Megatron TP+PP |
| 70B+ | 32+ | Megatron 3D parallelism |
| MoE | 64+ | Megatron + Expert parallelism |
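The size cutoffs above follow from a standard back-of-envelope: mixed-precision Adam training holds roughly 16 bytes per parameter of model state (bf16 weights and gradients plus fp32 Adam states, the accounting used in the ZeRO paper), and ZeRO-3/FSDP divides that state across data-parallel GPUs. A rough calculator (activations and fragmentation are extra):

```python
def model_state_gib(params_b, n_gpus=1, bytes_per_param=16, sharded=True):
    """Approximate per-GPU model-state memory (weights + grads + Adam states)
    in GiB for a model with params_b billion parameters."""
    total_bytes = params_b * 1e9 * bytes_per_param
    per_gpu = total_bytes / n_gpus if sharded else total_bytes
    return per_gpu / 2**30

print(round(model_state_gib(7), 1))            # 7B unsharded: ~104 GiB — no single GPU fits it
print(round(model_state_gib(7, n_gpus=8), 1))  # ZeRO-3 over 8 GPUs: ~13 GiB per GPU
```

This is why a 7B model needs sharding (or offload) even on 80 GB GPUs, while < 1B models train comfortably under plain DDP.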

Framework Comparison

| Feature | PyTorch DDP | FSDP | DeepSpeed | Accelerate | Megatron |
|---|---|---|---|---|---|
| Ease of use | Medium | Medium | Easy (config) | Easiest | Hard |
| Max model size | GPU memory | Node memory | Unlimited | Depends on backend | Unlimited |
| Performance | Good | Good | Very good | Good | Best |
| Custom models | Full support | Full support | Full support | Full support | Limited |
| HF integration | Manual | Good | Excellent | Built-in | Limited |
| Community | Large | Growing | Large | Large | Small |

Performance Optimization

Communication Optimization

```python
# Overlap communication with computation

# DeepSpeed — in ds_config.json:
#   { "zero_optimization": { "overlap_comm": true } }

# PyTorch DDP
model = DDP(model, device_ids=[rank], find_unused_parameters=False)
# Set find_unused_parameters=False when all params are used every forward pass

# NCCL environment tuning
import os
os.environ['NCCL_ALGO'] = 'Ring'            # or 'Tree' for large clusters
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'   # Network interface
```

Data Loading

```python
# Efficient distributed data loading
dataloader = DataLoader(
    dataset,
    batch_size=per_gpu_batch,
    sampler=DistributedSampler(dataset),
    num_workers=4,
    pin_memory=True,
    prefetch_factor=2,
    persistent_workers=True,
)
```
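Note that `batch_size` here is per GPU: the effective optimization batch also multiplies in the number of data-parallel ranks and any gradient-accumulation steps. A tiny helper (hypothetical name) to keep the arithmetic honest:

```python
def global_batch_size(per_gpu_batch, world_size, grad_accum_steps=1):
    """Effective batch per optimizer step = per-GPU batch
    x data-parallel ranks x gradient-accumulation steps."""
    return per_gpu_batch * world_size * grad_accum_steps

print(global_batch_size(32, 8, 4))  # → 1024
```

Getting this number right matters for the learning-rate scaling rule discussed under Best Practices.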

Best Practices

  1. Start simple, scale up — Begin with DDP, move to FSDP/DeepSpeed only when needed
  2. Profile before optimizing — Measure GPU utilization, communication overhead, data loading time
  3. Match TP to NVLink topology — Keep tensor parallel within NVLink-connected GPUs
  4. Use gradient accumulation — Simulate larger batches without more memory
  5. Checkpoint frequently — Save every 1000 steps; large runs are expensive to restart
  6. Use bf16 universally — Simpler than fp16 (no loss scaling) with similar memory savings
  7. Monitor GPU memory — Use nvidia-smi or torch.cuda.memory_summary() to track usage
  8. Scale learning rate — Linear scaling rule: scale the LR in proportion to the increase in global batch size
  9. Warm up learning rate — Always warm up for 1-5% of total steps in distributed training
  10. Test at small scale first — Verify training loop with 1-2 GPUs before launching full runs
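Practice 4 (gradient accumulation) rests on an exact identity: summing per-micro-batch gradient contributions and dividing by the total example count reproduces the full-batch gradient. A dependency-free check on a toy 1-D linear model:

```python
def grad(w, xs, ys):
    """Mean-squared-error gradient of y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5

full = grad(w, xs, ys)  # one large batch of 4
# Two micro-batches of 2: re-weight each mean gradient by its size,
# then divide by the total count — exactly what accumulation does
acc = (grad(w, xs[:2], ys[:2]) * 2 + grad(w, xs[2:], ys[2:]) * 2) / 4
assert abs(full - acc) < 1e-12
```

In PyTorch the same effect comes from calling `loss.backward()` on each micro-batch (with the loss divided by the number of accumulation steps) and stepping the optimizer once per accumulation cycle.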

Troubleshooting

NCCL errors on multi-node

```bash
# Set correct network interface
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0   # Enable InfiniBand if available
export NCCL_DEBUG=INFO     # Verbose debugging
```

Gradient explosion after scaling

```python
# Apply gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# And warm up the learning rate
from transformers import get_linear_schedule_with_warmup
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps,
)
```
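The warmup phase is just a linear ramp from 0 to the base LR over the warmup steps; a minimal sketch of that phase (decay after warmup omitted for brevity; `linear_warmup_lr` is a hypothetical helper, not a library function):

```python
def linear_warmup_lr(step, base_lr, warmup_steps):
    """LR during linear warmup; holds the base LR once warmup completes."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

print(linear_warmup_lr(500, 3e-4, 1000))   # halfway through warmup → 1.5e-4
print(linear_warmup_lr(1000, 3e-4, 1000))  # warmup done → 3e-4
```

Ramping the LR gives optimizer statistics (and the all-reduced gradients) time to stabilize before full-size updates, which is why warmup and clipping together usually cure post-scaling divergence.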

Uneven GPU memory usage

```python
# With pipeline parallelism, first and last stages use more memory.
# Balance by assigning fewer layers to the first/last stages,
# or use virtual pipeline stages (interleaved schedule) for better load balancing.
```