# Distributed Training with HuggingFace Accelerate
## Overview
A comprehensive skill for distributed model training using HuggingFace Accelerate — the library that lets you run the same PyTorch training code on any distributed setup with minimal changes. Covers multi-GPU, multi-node, mixed precision, DeepSpeed integration, FSDP, and TPU training with just 4 lines of code changes to your existing scripts.
## When to Use
- Converting single-GPU PyTorch scripts to multi-GPU
- Training across multiple machines/nodes
- Adding mixed precision (fp16/bf16) training
- Integrating DeepSpeed ZeRO stages without rewriting code
- Using FSDP (Fully Sharded Data Parallel) with PyTorch
- Training on TPU pods
- Needing a simple API that hides distributed complexity
## Quick Start

```bash
# Install
pip install accelerate

# Interactive configuration
accelerate config
# → Answers questions about your setup (GPUs, nodes, precision, etc.)

# Launch training
accelerate launch train.py

# Or specify directly
accelerate launch --num_processes=4 --mixed_precision=bf16 train.py
```
```python
# Convert ANY PyTorch script — just 4 lines
import torch
from accelerate import Accelerator  # Line 1

accelerator = Accelerator()  # Line 2

model = torch.nn.TransformerEncoder(...)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

# Line 3: prepare() wraps everything for distributed training
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(10):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)  # Line 4: replaces loss.backward()
        optimizer.step()
```
## Core Concepts
### 1. The Accelerator Object

```python
from accelerate import Accelerator

accelerator = Accelerator(
    mixed_precision='bf16',         # 'no', 'fp16', 'bf16'
    gradient_accumulation_steps=4,  # Effective batch = batch_size * 4 * num_GPUs
    log_with='wandb',               # Logging integration
    project_dir='./outputs',        # Save directory
)

# Access device info
print(f"Device: {accelerator.device}")
print(f"Num processes: {accelerator.num_processes}")
print(f"Process index: {accelerator.process_index}")
print(f"Is main process: {accelerator.is_main_process}")
```
### 2. Gradient Accumulation

```python
accelerator = Accelerator(gradient_accumulation_steps=4)

for batch in dataloader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# Accelerate handles accumulation automatically —
# optimizer.step() only runs every 4 steps
```
### 3. Mixed Precision Training

```python
# Automatic mixed precision — just set the flag
accelerator = Accelerator(mixed_precision='bf16')

# Or fp16 — Accelerate manages the GradScaler for you
from accelerate import Accelerator
from accelerate.utils import set_seed

set_seed(42)
accelerator = Accelerator(mixed_precision='fp16')

# Accelerate automatically:
# - Casts the model forward pass to fp16/bf16
# - Keeps optimizer states in fp32
# - Handles gradient scaling for fp16
```
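Why fp16 needs gradient scaling while bf16 does not comes down to exponent range: fp16 tops out near 65504, while bf16 reuses fp32's 8-bit exponent and so rarely overflows. A quick standard-library illustration (no Accelerate involved; `to_fp16` is a hypothetical helper for demonstration):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision ('e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(65504.0))  # largest normal fp16 value survives the round-trip

# Values beyond fp16's range cannot even be packed:
try:
    struct.pack('e', 1e5)
except OverflowError:
    print("1e5 overflows fp16, which is why fp16 training needs loss scaling")
```

In bf16 the same 1e5 is representable (with reduced mantissa precision), so no loss scaling is required.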
### 4. Saving & Loading Checkpoints

```python
# Save a full training checkpoint (model, optimizer, RNG states)
# Call save_state on EVERY process — under DeepSpeed/FSDP each rank
# holds a shard; Accelerate writes main-process-only files itself
accelerator.save_state('./checkpoint-epoch-5')

# Load checkpoint
accelerator.load_state('./checkpoint-epoch-5')

# Save the unwrapped model for inference
accelerator.wait_for_everyone()
unwrapped = accelerator.unwrap_model(model)
accelerator.save(unwrapped.state_dict(), './model_weights.pt')
```
### 5. DeepSpeed Integration

```yaml
# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  offload_optimizer_device: cpu
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  zero3_init_flag: false
mixed_precision: bf16
num_processes: 4
```

```bash
# Launch with DeepSpeed
accelerate launch --config_file accelerate_config.yaml train.py
```
### 6. FSDP (Fully Sharded Data Parallel)

```yaml
# FSDP config
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
mixed_precision: bf16
num_processes: 8
```
## Configuration Reference
| Parameter | Values | Description |
|---|---|---|
| `mixed_precision` | `no`, `fp16`, `bf16` | Precision mode |
| `gradient_accumulation_steps` | 1-128 | Steps before each optimizer update |
| `distributed_type` | `NO`, `MULTI_GPU`, `DEEPSPEED`, `FSDP`, `TPU` | Distribution strategy |
| `num_processes` | 1-N | Number of GPU processes |
| `num_machines` | 1-N | Number of nodes |
| `deepspeed_config.zero_stage` | 0, 1, 2, 3 | ZeRO optimization level |
| `fsdp_sharding_strategy` | `FULL_SHARD`, `SHARD_GRAD_OP`, `NO_SHARD` | FSDP sharding strategy |
| `dynamo_backend` | `NO`, `INDUCTOR`, `EAGER` | torch.compile backend |
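To show how these keys fit together, a plain single-node multi-GPU (DDP) config, in the same format as the DeepSpeed and FSDP examples above, might look like this (values are illustrative, not a recommended setup):

```yaml
# Illustrative accelerate config: single node, 4 GPUs, DDP + bf16
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_processes: 4
num_machines: 1
dynamo_backend: NO
```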
## Training Strategy Decision Tree
| GPUs | Model Size | Strategy | Config |
|---|---|---|---|
| 1 | <10B | Standard + bf16 | mixed_precision: bf16 |
| 2-8 | <10B | DDP + bf16 | distributed_type: MULTI_GPU |
| 2-8 | 10-30B | FSDP or ZeRO-2 | Shard optimizer + gradients |
| 2-8 | 30-70B | ZeRO-3 or FSDP Full | Shard everything |
| 8+ | 70B+ | ZeRO-3 + CPU offload | Multi-node with offloading |
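The decision table above can be sketched as a small helper. The thresholds and strategy names are this guide's heuristics, not an Accelerate API:

```python
def pick_strategy(num_gpus: int, model_size_b: float) -> str:
    """Suggest a training strategy for a model of `model_size_b`
    billion parameters on `num_gpus` GPUs, per the table above."""
    if num_gpus == 1:
        return "standard + bf16"
    if model_size_b < 10:
        return "DDP + bf16"
    if model_size_b < 30:
        return "FSDP or ZeRO-2 (shard optimizer + gradients)"
    if model_size_b < 70 or num_gpus < 8:
        return "ZeRO-3 or FSDP FULL_SHARD (shard everything)"
    return "ZeRO-3 + CPU offload (multi-node)"

print(pick_strategy(4, 13))  # → FSDP or ZeRO-2 (shard optimizer + gradients)
```

The boundaries are fuzzy in practice: activation memory, sequence length, and interconnect bandwidth can push a setup one row up or down.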
## Best Practices

- Start with `accelerate config` — Interactive wizard prevents configuration errors
- Use bf16 over fp16 — Better numerical stability, no gradient scaling needed
- Scale batch size linearly — Effective batch = per-GPU batch × num_GPUs × gradient_accumulation
- Save final artifacts only on the main process — Avoid duplicate writes with `if accelerator.is_main_process:`
- Use gradient accumulation — Simulate larger batches without more memory
- Test on single GPU first — Debug your script before scaling out
- Profile before optimizing — Use `accelerate benchmark` to find bottlenecks
- Pin memory in dataloaders — Set `pin_memory=True` for faster CPU→GPU transfers
- Use `accelerator.print()` — Only prints on main process, avoids duplicate logs
- Unwrap model for inference — Use `accelerator.unwrap_model()` to get the clean model
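The batch-scaling rule above is simple enough to sanity-check with a tiny helper (`effective_batch` is a hypothetical function for illustration, not part of Accelerate):

```python
def effective_batch(per_gpu_batch: int, num_gpus: int, grad_accum: int) -> int:
    """Effective global batch = per-GPU batch x num GPUs x accumulation steps."""
    return per_gpu_batch * num_gpus * grad_accum

print(effective_batch(4, 4, 8))   # → 128
print(effective_batch(32, 8, 1))  # → 256
```

If you halve the per-GPU batch to escape OOM, double `gradient_accumulation_steps` to keep the effective batch (and your learning-rate schedule) unchanged.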
## Troubleshooting
### NCCL timeout errors

```bash
# Increase timeout
export NCCL_TIMEOUT=1800  # 30 minutes

# Or set in config
accelerate launch --main_process_timeout=1800 train.py
```
### OOM on multi-GPU

```python
# Reduce per-GPU batch and increase gradient accumulation
accelerator = Accelerator(gradient_accumulation_steps=8)
# per-GPU batch=4, accumulation=8, 4 GPUs → effective batch = 4*8*4 = 128
```
### Metrics not aggregating correctly

```python
# Gather metrics across processes
loss_gathered = accelerator.gather(loss)
if accelerator.is_main_process:
    avg_loss = loss_gathered.mean().item()
    print(f"Average loss: {avg_loss}")
```
### Hanging on multi-node

```bash
# Ensure network connectivity and set correct addresses
accelerate launch \
  --num_machines=2 \
  --machine_rank=0 \
  --main_process_ip=10.0.0.1 \
  --main_process_port=29500 \
  train.py
```