# Distributed Training with HuggingFace Accelerate
## Overview
A comprehensive skill for distributed model training using HuggingFace Accelerate — the library that lets you run the same PyTorch training code on any distributed setup with minimal changes. Covers multi-GPU, multi-node, mixed precision, DeepSpeed integration, FSDP, and TPU training with just 4 lines of code changes to your existing scripts.
## When to Use
- Converting single-GPU PyTorch scripts to multi-GPU
- Training across multiple machines/nodes
- Adding mixed precision (fp16/bf16) training
- Integrating DeepSpeed ZeRO stages without rewriting code
- Using FSDP (Fully Sharded Data Parallel) with PyTorch
- Training on TPU pods
- Needing a simple API that hides distributed complexity
## Quick Start

```bash
# Install
pip install accelerate

# Interactive configuration
accelerate config
# → Answers questions about your setup (GPUs, nodes, precision, etc.)

# Launch training
accelerate launch train.py

# Or specify directly
accelerate launch --num_processes=4 --mixed_precision=bf16 train.py
```
```python
# Convert ANY PyTorch script — just 4 lines
import torch
from accelerate import Accelerator  # Line 1

accelerator = Accelerator()  # Line 2

model = torch.nn.TransformerEncoder(...)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

# Line 3: prepare() wraps everything for distributed training
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(10):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)  # Line 4: replaces loss.backward()
        optimizer.step()
```
## Core Concepts
### 1. The Accelerator Object

```python
from accelerate import Accelerator

accelerator = Accelerator(
    mixed_precision='bf16',         # 'no', 'fp16', 'bf16'
    gradient_accumulation_steps=4,  # Effective batch = batch_size * 4 * num_GPUs
    log_with='wandb',               # Logging integration
    project_dir='./outputs',        # Save directory
)

# Access device info
print(f"Device: {accelerator.device}")
print(f"Num processes: {accelerator.num_processes}")
print(f"Process index: {accelerator.process_index}")
print(f"Is main process: {accelerator.is_main_process}")
```
### 2. Gradient Accumulation

```python
accelerator = Accelerator(gradient_accumulation_steps=4)

for batch in dataloader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# Accelerate handles accumulation automatically —
# optimizer.step() only runs every 4 steps
```
### 3. Mixed Precision Training

```python
# Automatic mixed precision — just set the flag
accelerator = Accelerator(mixed_precision='bf16')

# Or fp16 — Accelerate manages the GradScaler for you
from accelerate import Accelerator
from accelerate.utils import set_seed

set_seed(42)
accelerator = Accelerator(mixed_precision='fp16')

# Accelerate automatically:
# - Casts the model forward pass to fp16/bf16
# - Keeps optimizer states in fp32
# - Handles gradient scaling for fp16
```
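Why fp16 needs gradient scaling while bf16 does not comes down to exponent range: fp16 tops out near 65504, while bf16 reuses fp32's 8-bit exponent and so rarely overflows. A quick standard-library illustration (no Accelerate involved; `to_fp16` is a hypothetical helper for demonstration):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision ('e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(65504.0))  # largest normal fp16 value survives the round-trip

# Values beyond fp16's range cannot even be packed:
try:
    struct.pack('e', 1e5)
except OverflowError:
    print("1e5 overflows fp16, which is why fp16 training needs loss scaling")
```

In bf16 the same 1e5 is representable (with reduced mantissa precision), so no loss scaling is required.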
### 4. Saving & Loading Checkpoints

```python
# Save a full training checkpoint (model, optimizer, RNG states)
# Call save_state on EVERY process — under DeepSpeed/FSDP each rank
# holds a shard; Accelerate writes main-process-only files itself
accelerator.save_state('./checkpoint-epoch-5')

# Load checkpoint
accelerator.load_state('./checkpoint-epoch-5')

# Save the unwrapped model for inference
accelerator.wait_for_everyone()
unwrapped = accelerator.unwrap_model(model)
accelerator.save(unwrapped.state_dict(), './model_weights.pt')
```
### 5. DeepSpeed Integration

```yaml
# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  offload_optimizer_device: cpu
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  zero3_init_flag: false
mixed_precision: bf16
num_processes: 4
```

```bash
# Launch with DeepSpeed
accelerate launch --config_file accelerate_config.yaml train.py
```
### 6. FSDP (Fully Sharded Data Parallel)

```yaml
# FSDP config
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
mixed_precision: bf16
num_processes: 8
```
## Configuration Reference
| Parameter | Values | Description |
|---|---|---|
| `mixed_precision` | `no`, `fp16`, `bf16` | Precision mode |
| `gradient_accumulation_steps` | 1-128 | Steps before each optimizer update |
| `distributed_type` | `NO`, `MULTI_GPU`, `DEEPSPEED`, `FSDP`, `TPU` | Distribution strategy |
| `num_processes` | 1-N | Number of GPU processes |
| `num_machines` | 1-N | Number of nodes |
| `deepspeed_config.zero_stage` | 0, 1, 2, 3 | ZeRO optimization level |
| `fsdp_sharding_strategy` | `FULL_SHARD`, `SHARD_GRAD_OP`, `NO_SHARD` | FSDP sharding strategy |
| `dynamo_backend` | `NO`, `INDUCTOR`, `EAGER` | torch.compile backend |
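To show how these keys fit together, a plain single-node multi-GPU (DDP) config, in the same format as the DeepSpeed and FSDP examples above, might look like this (values are illustrative, not a recommended setup):

```yaml
# Illustrative accelerate config: single node, 4 GPUs, DDP + bf16
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_processes: 4
num_machines: 1
dynamo_backend: NO
```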
## Training Strategy Decision Tree
| GPUs | Model Size | Strategy | Config |
|---|---|---|---|
| 1 | <10B | Standard + bf16 | mixed_precision: bf16 |
| 2-8 | <10B | DDP + bf16 | distributed_type: MULTI_GPU |
| 2-8 | 10-30B | FSDP or ZeRO-2 | Shard optimizer + gradients |
| 2-8 | 30-70B | ZeRO-3 or FSDP Full | Shard everything |
| 8+ | 70B+ | ZeRO-3 + CPU offload | Multi-node with offloading |
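The decision table above can be sketched as a small helper. The thresholds and strategy names are this guide's heuristics, not an Accelerate API:

```python
def pick_strategy(num_gpus: int, model_size_b: float) -> str:
    """Suggest a training strategy for a model of `model_size_b`
    billion parameters on `num_gpus` GPUs, per the table above."""
    if num_gpus == 1:
        return "standard + bf16"
    if model_size_b < 10:
        return "DDP + bf16"
    if model_size_b < 30:
        return "FSDP or ZeRO-2 (shard optimizer + gradients)"
    if model_size_b < 70 or num_gpus < 8:
        return "ZeRO-3 or FSDP FULL_SHARD (shard everything)"
    return "ZeRO-3 + CPU offload (multi-node)"

print(pick_strategy(4, 13))  # → FSDP or ZeRO-2 (shard optimizer + gradients)
```

The boundaries are fuzzy in practice: activation memory, sequence length, and interconnect bandwidth can push a setup one row up or down.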
## Best Practices

- Start with `accelerate config` — Interactive wizard prevents configuration errors
- Use bf16 over fp16 — Better numerical stability, no gradient scaling needed
- Scale batch size linearly — Effective batch = per-GPU batch × num_GPUs × gradient_accumulation
- Save final artifacts only on the main process — Avoid duplicate writes with `if accelerator.is_main_process:`
- Use gradient accumulation — Simulate larger batches without more memory
- Test on single GPU first — Debug your script before scaling out
- Profile before optimizing — Use `accelerate benchmark` to find bottlenecks
- Pin memory in dataloaders — Set `pin_memory=True` for faster CPU→GPU transfers
- Use `accelerator.print()` — Only prints on main process, avoids duplicate logs
- Unwrap model for inference — Use `accelerator.unwrap_model()` to get the clean model
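The batch-scaling rule above is simple enough to sanity-check with a tiny helper (`effective_batch` is a hypothetical function for illustration, not part of Accelerate):

```python
def effective_batch(per_gpu_batch: int, num_gpus: int, grad_accum: int) -> int:
    """Effective global batch = per-GPU batch x num GPUs x accumulation steps."""
    return per_gpu_batch * num_gpus * grad_accum

print(effective_batch(4, 4, 8))   # → 128
print(effective_batch(32, 8, 1))  # → 256
```

If you halve the per-GPU batch to escape OOM, double `gradient_accumulation_steps` to keep the effective batch (and your learning-rate schedule) unchanged.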
## Troubleshooting
### NCCL timeout errors

```bash
# Increase timeout
export NCCL_TIMEOUT=1800  # 30 minutes

# Or set in config
accelerate launch --main_process_timeout=1800 train.py
```
### OOM on multi-GPU

```python
# Reduce per-GPU batch and increase gradient accumulation
accelerator = Accelerator(gradient_accumulation_steps=8)
# per-GPU batch=4, accumulation=8, 4 GPUs → effective batch = 4*8*4 = 128
```
### Metrics not aggregating correctly

```python
# Gather metrics across processes
loss_gathered = accelerator.gather(loss)
if accelerator.is_main_process:
    avg_loss = loss_gathered.mean().item()
    print(f"Average loss: {avg_loss}")
```
### Hanging on multi-node

```bash
# Ensure network connectivity and set correct addresses
accelerate launch \
  --num_machines=2 \
  --machine_rank=0 \
  --main_process_ip=10.0.0.1 \
  --main_process_port=29500 \
  train.py
```