# Complete Guide to Distributed Training

## Overview

A comprehensive skill covering all aspects of distributed deep learning training, from data parallelism fundamentals to advanced 3D parallelism strategies. This guide unifies the approaches across frameworks (PyTorch DDP, HuggingFace Accelerate, DeepSpeed, FSDP, Megatron-Core) and helps you choose the right strategy for your model size, hardware, and performance requirements.
## When to Use
- Training models across multiple GPUs or nodes
- Model doesn't fit in single GPU memory
- Need to reduce training time for large datasets
- Choosing between parallelism strategies
- Optimizing training throughput and efficiency
- Setting up multi-node training infrastructure
## Quick Start

```bash
# Simplest path: HuggingFace Accelerate
pip install accelerate
accelerate config            # Interactive setup wizard
accelerate launch train.py

# PyTorch native DDP
torchrun --nproc_per_node=4 train.py

# DeepSpeed
pip install deepspeed
deepspeed train.py --deepspeed ds_config.json
```
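All three launchers communicate the process topology to your script through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). A minimal sketch of reading them in `train.py` — the helper and dataclass names here are illustrative, not part of any launcher's API:

```python
import os
from dataclasses import dataclass

@dataclass
class DistContext:
    rank: int        # Global rank across all nodes
    local_rank: int  # GPU index on this node
    world_size: int  # Total number of processes

def get_dist_context() -> DistContext:
    """Read the topology exported by torchrun/accelerate/deepspeed;
    fall back to single-process defaults when launched with plain python."""
    return DistContext(
        rank=int(os.environ.get("RANK", 0)),
        local_rank=int(os.environ.get("LOCAL_RANK", 0)),
        world_size=int(os.environ.get("WORLD_SIZE", 1)),
    )
```

Defaulting to a single-process context keeps the same script runnable for quick local debugging without any launcher.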
## Parallelism Strategies

### Data Parallelism (DP/DDP)

```python
# PyTorch DDP - replicate model, split data
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group('nccl')
model = DDP(model.to(rank), device_ids=[rank])
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # Reshuffle shards across ranks each epoch
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()       # Gradients all-reduced automatically
        optimizer.step()
```
### Fully Sharded Data Parallel (FSDP)

```python
# PyTorch FSDP - shard parameters across GPUs
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
model = FSDP(
    model,
    mixed_precision=mp_policy,
    auto_wrap_policy=transformer_auto_wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)
```
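The `auto_wrap_policy` above must be constructed before use: `transformer_auto_wrap_policy` from `torch.distributed.fsdp.wrap` is partially applied with your model's repeated layer class. A sketch, where `TransformerBlock` is a hypothetical stand-in for whatever block class your model repeats:

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class TransformerBlock(nn.Module):
    """Stand-in for your model's repeated layer class (e.g. a decoder block)."""
    pass

# FSDP will wrap each TransformerBlock instance as its own shard unit,
# so parameters are gathered/freed one block at a time during forward/backward
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},
)
```

Pass the resulting `auto_wrap_policy` to the `FSDP(...)` constructor shown above.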
### Model Parallelism (Tensor + Pipeline)

Tensor Parallel: split layers horizontally

```
+--------------+   +--------------+
|    GPU 0     |   |    GPU 1     |
| Rows 0-2047  |   |  Rows 2048+  |
|  of Linear   |   |  of Linear   |
+--------------+   +--------------+
```

Pipeline Parallel: split layers vertically

```
+--------------+    +--------------+
|    GPU 0     |--->|    GPU 1     |
| Layers 0-15  |    | Layers 16-31 |
+--------------+    +--------------+
```
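The row split in the tensor-parallel diagram can be checked numerically. A single-process sketch (no real GPUs, just tensor slices standing in for the two workers) showing that concatenating the per-shard outputs reproduces the full linear layer:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)    # A batch of 4 activations, hidden size 8
W = torch.randn(6, 8)    # Full weight matrix: 6 output rows

# Each "GPU" holds a slice of the output rows
W0, W1 = W[:3], W[3:]    # rows 0-2 on "GPU 0", rows 3-5 on "GPU 1"
y0 = x @ W0.T            # partial output computed on "GPU 0"
y1 = x @ W1.T            # partial output computed on "GPU 1"

# Concatenating the shards (an all-gather in a real setup) matches the full layer
assert torch.allclose(torch.cat([y0, y1], dim=1), x @ W.T)
```

Each worker only stores and multiplies its own rows, which is what cuts per-GPU memory; the communication cost is the gather of the partial outputs.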
## Strategy Selection Guide
| Model Size | GPUs Available | Recommended Strategy |
|---|---|---|
| < 1B | 1-8 | DDP + bf16 |
| 1-7B | 2-8 | FSDP or DeepSpeed ZeRO-2 |
| 7-13B | 4-8 | FSDP or DeepSpeed ZeRO-2/3 |
| 13-70B | 8-32 | DeepSpeed ZeRO-3 or Megatron TP+PP |
| 70B+ | 32+ | Megatron 3D parallelism |
| MoE | 64+ | Megatron + Expert parallelism |
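The table above can be encoded as a simple lookup. The thresholds mirror the table; the function itself is purely illustrative and its output should be treated as a starting point, not a verdict:

```python
def suggest_strategy(params_b: float, moe: bool = False) -> str:
    """Suggest a parallelism strategy from model size in billions of parameters.
    Thresholds follow the strategy selection table."""
    if moe:
        return "Megatron + Expert parallelism"
    if params_b < 1:
        return "DDP + bf16"
    if params_b <= 7:
        return "FSDP or DeepSpeed ZeRO-2"
    if params_b <= 13:
        return "FSDP or DeepSpeed ZeRO-2/3"
    if params_b <= 70:
        return "DeepSpeed ZeRO-3 or Megatron TP+PP"
    return "Megatron 3D parallelism"

print(suggest_strategy(0.3))   # DDP + bf16
print(suggest_strategy(13))    # FSDP or DeepSpeed ZeRO-2/3
print(suggest_strategy(175))   # Megatron 3D parallelism
```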
## Framework Comparison
| Feature | PyTorch DDP | FSDP | DeepSpeed | Accelerate | Megatron |
|---|---|---|---|---|---|
| Ease of use | Medium | Medium | Easy (config) | Easiest | Hard |
| Max model size | GPU memory | Node memory | Unlimited | Depends on backend | Unlimited |
| Performance | Good | Good | Very good | Good | Best |
| Custom models | Full support | Full support | Full support | Full support | Limited |
| HF integration | Manual | Good | Excellent | Built-in | Limited |
| Community | Large | Growing | Large | Large | Small |
## Performance Optimization

### Communication Optimization

```python
# Overlap communication with computation

# DeepSpeed (in ds_config.json):
#   { "zero_optimization": { "overlap_comm": true } }

# PyTorch DDP
model = DDP(model, device_ids=[rank], find_unused_parameters=False)
# Set find_unused_parameters=False when all params are used every forward pass

# NCCL environment tuning
import os
os.environ['NCCL_ALGO'] = 'Ring'           # or 'Tree' for large clusters
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'  # Network interface
```
### Data Loading

```python
# Efficient distributed data loading
dataloader = DataLoader(
    dataset,
    batch_size=per_gpu_batch,
    sampler=DistributedSampler(dataset),
    num_workers=4,
    pin_memory=True,
    prefetch_factor=2,
    persistent_workers=True,
)
```
## Best Practices

- Start simple, scale up - begin with DDP and move to FSDP/DeepSpeed only when needed
- Profile before optimizing - measure GPU utilization, communication overhead, and data loading time
- Match TP to NVLink topology - keep tensor parallelism within NVLink-connected GPUs
- Use gradient accumulation - simulate larger batches without more memory
- Checkpoint frequently - save every ~1000 steps; large runs are expensive to restart
- Use bf16 universally - simpler than fp16 (no loss scaling) with similar memory savings
- Monitor GPU memory - use `nvidia-smi` or `torch.cuda.memory_summary()` to track usage
- Scale the learning rate - linear scaling rule: multiply the LR by the same factor the total batch size grew
- Warm up the learning rate - always warm up for 1-5% of total steps in distributed training
- Test at small scale first - verify the training loop on 1-2 GPUs before launching full runs
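The gradient accumulation and linear LR scaling practices above interact: the effective global batch is per-GPU batch × world size × accumulation steps, and the linear scaling rule scales the LR by the same factor that batch grew over your single-GPU baseline. A small sketch (function name and baseline values are illustrative):

```python
def scaled_lr(base_lr: float, base_batch: int,
              per_gpu_batch: int, world_size: int, grad_accum: int) -> float:
    """Linear scaling rule: LR grows proportionally to the effective global batch."""
    effective_batch = per_gpu_batch * world_size * grad_accum
    return base_lr * effective_batch / base_batch

# Baseline: lr=1e-4 tuned at batch 256. Scaled run: 32 per GPU x 8 GPUs x 2 accum = 512
print(scaled_lr(1e-4, 256, per_gpu_batch=32, world_size=8, grad_accum=2))  # 2e-4
```

Pair the scaled LR with warmup; jumping straight to a linearly scaled LR is a common cause of early divergence.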
## Troubleshooting

### NCCL errors on multi-node

```bash
# Set the correct network interface
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0   # Enable InfiniBand if available
export NCCL_DEBUG=INFO     # Verbose debugging
```
### Gradient explosion after scaling

```python
# Apply gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# And warm up the learning rate (num_training_steps is required)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=num_training_steps
)
```
### Uneven GPU memory usage

With pipeline parallelism, the first and last stages use more memory. Balance the load by assigning fewer layers to those stages, or use virtual pipeline stages for better load balancing.