
Distributed Training Complete

Boost productivity with this expert guide to fully sharded and distributed training. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

Complete Guide to Distributed Training

Overview

A comprehensive skill covering all aspects of distributed deep learning training — from data parallelism fundamentals to advanced 3D parallelism strategies. This guide unifies the approaches across frameworks (PyTorch DDP, HuggingFace Accelerate, DeepSpeed, FSDP, Megatron-Core) and helps you choose the right strategy for your model size, hardware, and performance requirements.

When to Use

  • Training models across multiple GPUs or nodes
  • Model doesn't fit in single GPU memory
  • Need to reduce training time for large datasets
  • Choosing between parallelism strategies
  • Optimizing training throughput and efficiency
  • Setting up multi-node training infrastructure

Quick Start

```bash
# Simplest path: HuggingFace Accelerate
pip install accelerate
accelerate config              # Interactive setup wizard
accelerate launch train.py

# PyTorch native DDP
torchrun --nproc_per_node=4 train.py

# DeepSpeed
pip install deepspeed
deepspeed train.py --deepspeed ds_config.json
```

Parallelism Strategies

Data Parallelism (DP/DDP)

```python
# PyTorch DDP — replicate model, split data
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group('nccl')
model = DDP(model.to(rank), device_ids=[rank])
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()   # Gradients synchronized automatically
    optimizer.step()
```
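The `rank` and `world_size` used above come from the launcher: `torchrun` exports them as the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables. A minimal sketch of reading them (the helper name `dist_env` is hypothetical):

```python
import os

def dist_env(default_world_size=1):
    """Read the process-group layout that torchrun exports via
    RANK / WORLD_SIZE / LOCAL_RANK environment variables."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", default_world_size))
    local_rank = int(os.environ.get("LOCAL_RANK", rank))
    return rank, world_size, local_rank

# Example: values a 4-GPU single-node launch would export for rank 2
os.environ.update({"RANK": "2", "WORLD_SIZE": "4", "LOCAL_RANK": "2"})
print(dist_env())  # → (2, 4, 2)
```

`LOCAL_RANK` is what you pass as `device_ids=[local_rank]`; `RANK` is the global index across all nodes.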

Fully Sharded Data Parallel (FSDP)

```python
# PyTorch FSDP — shard parameters across GPUs
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
model = FSDP(
    model,
    mixed_precision=mp_policy,
    auto_wrap_policy=transformer_auto_wrap_policy,  # from torch.distributed.fsdp.wrap
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)
```

Model Parallelism (Tensor + Pipeline)

Tensor Parallel: Split layers horizontally
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”    ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│  GPU 0       │    │  GPU 1       │
│  Rows 0-2047 │    │  Rows 2048+  │
│  of Linear   │    │  of Linear   │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
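Why the row split works: each GPU computes its slice of the output, and concatenating the slices recovers the full result exactly. A dependency-free sketch with a toy 4Ɨ2 weight matrix standing in for the sharded `Linear`:

```python
def matvec(W, x):
    """Plain matrix-vector product: one output element per weight row."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]  # toy 4x2 weight matrix
x = [1, 1]

full = matvec(W, x)            # single-GPU result
shard0 = matvec(W[:2], x)      # "GPU 0": rows 0-1
shard1 = matvec(W[2:], x)      # "GPU 1": rows 2-3
assert shard0 + shard1 == full # concatenating shards recovers the output
print(full)  # → [3, 7, 11, 15]
```

In a real transformer the next layer needs the full activation, which is why tensor parallelism inserts an all-gather (or all-reduce, for column-split layers) after the sharded matmul.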

Pipeline Parallel: Split layers vertically
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”    ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│  GPU 0       │    │  GPU 1       │
│  Layers 0-15 │───→│  Layers 16-31│
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
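Pipeline parallelism pays for its memory savings with idle time ("bubble") while stages wait for each other. For a GPipe-style schedule with `p` stages and `m` micro-batches, the commonly cited idle fraction is `(p - 1) / (m + p - 1)`, so more micro-batches shrink the bubble:

```python
def bubble_fraction(p, m):
    """Idle fraction of a GPipe-style pipeline: p stages, m micro-batches."""
    return (p - 1) / (m + p - 1)

print(bubble_fraction(4, 4))   # 3/7 ≈ 0.43 — few micro-batches, large bubble
print(bubble_fraction(4, 32))  # 3/35 ≈ 0.09 — more micro-batches shrink it
```

This is why pipeline-parallel configs pair a small per-GPU micro-batch with many accumulation steps.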

Strategy Selection Guide

| Model Size | GPUs Available | Recommended Strategy |
|---|---|---|
| < 1B | 1-8 | DDP + bf16 |
| 1-7B | 2-8 | FSDP or DeepSpeed ZeRO-2 |
| 7-13B | 4-8 | FSDP or DeepSpeed ZeRO-2/3 |
| 13-70B | 8-32 | DeepSpeed ZeRO-3 or Megatron TP+PP |
| 70B+ | 32+ | Megatron 3D parallelism |
| MoE | 64+ | Megatron + Expert parallelism |
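The size cutoffs above follow from a standard back-of-envelope: mixed-precision Adam training holds roughly 16 bytes per parameter of model state (bf16 weights and gradients plus fp32 Adam states, the accounting used in the ZeRO paper), and ZeRO-3/FSDP divides that state across data-parallel GPUs. A rough calculator (activations and fragmentation are extra):

```python
def model_state_gib(params_b, n_gpus=1, bytes_per_param=16, sharded=True):
    """Approximate per-GPU model-state memory (weights + grads + Adam states)
    in GiB for a model with params_b billion parameters."""
    total_bytes = params_b * 1e9 * bytes_per_param
    per_gpu = total_bytes / n_gpus if sharded else total_bytes
    return per_gpu / 2**30

print(round(model_state_gib(7), 1))            # 7B unsharded: ~104 GiB — no single GPU fits it
print(round(model_state_gib(7, n_gpus=8), 1))  # ZeRO-3 over 8 GPUs: ~13 GiB per GPU
```

This is why a 7B model needs sharding (or offload) even on 80 GB GPUs, while < 1B models train comfortably under plain DDP.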

Framework Comparison

| Feature | PyTorch DDP | FSDP | DeepSpeed | Accelerate | Megatron |
|---|---|---|---|---|---|
| Ease of use | Medium | Medium | Easy (config) | Easiest | Hard |
| Max model size | GPU memory | Node memory | Unlimited | Depends on backend | Unlimited |
| Performance | Good | Good | Very good | Good | Best |
| Custom models | Full support | Full support | Full support | Full support | Limited |
| HF integration | Manual | Good | Excellent | Built-in | Limited |
| Community | Large | Growing | Large | Large | Small |

Performance Optimization

Communication Optimization

```python
# Overlap communication with computation

# DeepSpeed — in ds_config.json:
#   { "zero_optimization": { "overlap_comm": true } }

# PyTorch DDP
model = DDP(model, device_ids=[rank], find_unused_parameters=False)
# Set find_unused_parameters=False when all params are used every forward pass

# NCCL environment tuning
import os
os.environ['NCCL_ALGO'] = 'Ring'            # or 'Tree' for large clusters
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'   # Network interface
```

Data Loading

```python
# Efficient distributed data loading
dataloader = DataLoader(
    dataset,
    batch_size=per_gpu_batch,
    sampler=DistributedSampler(dataset),
    num_workers=4,
    pin_memory=True,
    prefetch_factor=2,
    persistent_workers=True,
)
```
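Note that `batch_size` here is per GPU: the effective optimization batch also multiplies in the number of data-parallel ranks and any gradient-accumulation steps. A tiny helper (hypothetical name) to keep the arithmetic honest:

```python
def global_batch_size(per_gpu_batch, world_size, grad_accum_steps=1):
    """Effective batch per optimizer step = per-GPU batch
    x data-parallel ranks x gradient-accumulation steps."""
    return per_gpu_batch * world_size * grad_accum_steps

print(global_batch_size(32, 8, 4))  # → 1024
```

Getting this number right matters for the learning-rate scaling rule discussed under Best Practices.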

Best Practices

  1. Start simple, scale up — Begin with DDP, move to FSDP/DeepSpeed only when needed
  2. Profile before optimizing — Measure GPU utilization, communication overhead, data loading time
  3. Match TP to NVLink topology — Keep tensor parallel within NVLink-connected GPUs
  4. Use gradient accumulation — Simulate larger batches without more memory
  5. Checkpoint frequently — Save every 1000 steps; large runs are expensive to restart
  6. Use bf16 universally — Simpler than fp16 (no loss scaling) with similar memory savings
  7. Monitor GPU memory — Use nvidia-smi or torch.cuda.memory_summary() to track usage
  8. Scale learning rate — Linear scaling rule: scale the LR in proportion to the increase in global batch size
  9. Warm up learning rate — Always warm up for 1-5% of total steps in distributed training
  10. Test at small scale first — Verify training loop with 1-2 GPUs before launching full runs
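Practice 4 (gradient accumulation) rests on an exact identity: summing per-micro-batch gradient contributions and dividing by the total example count reproduces the full-batch gradient. A dependency-free check on a toy 1-D linear model:

```python
def grad(w, xs, ys):
    """Mean-squared-error gradient of y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5

full = grad(w, xs, ys)  # one large batch of 4
# Two micro-batches of 2: re-weight each mean gradient by its size,
# then divide by the total count — exactly what accumulation does
acc = (grad(w, xs[:2], ys[:2]) * 2 + grad(w, xs[2:], ys[2:]) * 2) / 4
assert abs(full - acc) < 1e-12
```

In PyTorch the same effect comes from calling `loss.backward()` on each micro-batch (with the loss divided by the number of accumulation steps) and stepping the optimizer once per accumulation cycle.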

Troubleshooting

NCCL errors on multi-node

```bash
# Set correct network interface
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0   # Enable InfiniBand if available
export NCCL_DEBUG=INFO     # Verbose debugging
```

Gradient explosion after scaling

```python
# Apply gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# And warm up the learning rate
from transformers import get_linear_schedule_with_warmup
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps,
)
```
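The warmup phase is just a linear ramp from 0 to the base LR over the warmup steps; a minimal sketch of that phase (decay after warmup omitted for brevity; `linear_warmup_lr` is a hypothetical helper, not a library function):

```python
def linear_warmup_lr(step, base_lr, warmup_steps):
    """LR during linear warmup; holds the base LR once warmup completes."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

print(linear_warmup_lr(500, 3e-4, 1000))   # halfway through warmup → 1.5e-4
print(linear_warmup_lr(1000, 3e-4, 1000))  # warmup done → 3e-4
```

Ramping the LR gives optimizer statistics (and the all-reduced gradients) time to stabilize before full-size updates, which is why warmup and clipping together usually cure post-scaling divergence.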

Uneven GPU memory usage

```python
# With pipeline parallelism, first and last stages use more memory.
# Balance by assigning fewer layers to the first/last stages,
# or use virtual pipeline stages (interleaved schedule) for better load balancing.
```