
Distributed Training Accelerate Studio

A comprehensive skill for simplifying distributed training pipelines. Includes structured workflows, validation checks, and reusable patterns for AI research.


Distributed Training with HuggingFace Accelerate

Overview

A comprehensive skill for distributed model training using HuggingFace Accelerate — the library that lets you run the same PyTorch training code on any distributed setup with minimal changes. Covers multi-GPU, multi-node, mixed precision, DeepSpeed integration, FSDP, and TPU training with just 4 lines of code changes to your existing scripts.

When to Use

  • Converting single-GPU PyTorch scripts to multi-GPU
  • Training across multiple machines/nodes
  • Adding mixed precision (fp16/bf16) training
  • Integrating DeepSpeed ZeRO stages without rewriting code
  • Using FSDP (Fully Sharded Data Parallel) with PyTorch
  • Training on TPU pods
  • Wrapping distributed complexity behind a simple API

Quick Start

```shell
# Install
pip install accelerate

# Interactive configuration
accelerate config
# → Answers questions about your setup (GPUs, nodes, precision, etc.)

# Launch training
accelerate launch train.py

# Or specify options directly
accelerate launch --num_processes=4 --mixed_precision=bf16 train.py
```
```python
# Convert ANY PyTorch script — just 4 lines
import torch
from accelerate import Accelerator  # Line 1

accelerator = Accelerator()  # Line 2

model = torch.nn.TransformerEncoder(...)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

# Line 3: prepare() wraps everything for distributed training
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(10):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)  # Line 4: replaces loss.backward()
        optimizer.step()
```

Core Concepts

1. The Accelerator Object

```python
from accelerate import Accelerator

accelerator = Accelerator(
    mixed_precision='bf16',          # 'no', 'fp16', 'bf16'
    gradient_accumulation_steps=4,   # Effective batch = batch_size * 4 * num_GPUs
    log_with='wandb',                # Logging integration
    project_dir='./outputs',         # Save directory
)

# Access device info
print(f"Device: {accelerator.device}")
print(f"Num processes: {accelerator.num_processes}")
print(f"Process index: {accelerator.process_index}")
print(f"Is main process: {accelerator.is_main_process}")
```

2. Gradient Accumulation

```python
accelerator = Accelerator(gradient_accumulation_steps=4)

for batch in dataloader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

# Accelerate handles accumulation automatically —
# optimizer.step() only runs every 4 steps
```

3. Mixed Precision Training

```python
# Automatic mixed precision — just set the flag
from accelerate import Accelerator
from accelerate.utils import set_seed

set_seed(42)
accelerator = Accelerator(mixed_precision='bf16')  # or 'fp16'

# Accelerate automatically:
# - Casts the model forward pass to fp16/bf16 (autocast)
# - Keeps optimizer states in fp32
# - Handles gradient scaling when using fp16
```

4. Saving & Loading Checkpoints

```python
# Save a full training checkpoint (model, optimizer, scheduler, RNG states).
# save_state() is a collective call — run it on EVERY process; guarding it
# with is_main_process can deadlock sharded setups (FSDP/DeepSpeed).
accelerator.save_state('./checkpoint-epoch-5')

# Load checkpoint
accelerator.load_state('./checkpoint-epoch-5')

# Save the unwrapped model for inference
accelerator.wait_for_everyone()
unwrapped = accelerator.unwrap_model(model)
accelerator.save(unwrapped.state_dict(), './model_weights.pt')
```

5. DeepSpeed Integration

```yaml
# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  offload_optimizer_device: cpu
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  zero3_init_flag: false
mixed_precision: bf16
num_processes: 4
```

```shell
# Launch with DeepSpeed
accelerate launch --config_file accelerate_config.yaml train.py
```

6. FSDP (Fully Sharded Data Parallel)

```yaml
# FSDP config
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
mixed_precision: bf16
num_processes: 8
```

Configuration Reference

| Parameter | Values | Description |
|---|---|---|
| `mixed_precision` | `no`, `fp16`, `bf16` | Precision mode |
| `gradient_accumulation_steps` | 1–128 | Steps before each optimizer update |
| `distributed_type` | `NO`, `MULTI_GPU`, `DEEPSPEED`, `FSDP`, `TPU` | Distribution strategy |
| `num_processes` | 1–N | Number of GPU processes |
| `num_machines` | 1–N | Number of nodes |
| `deepspeed_config.zero_stage` | 0, 1, 2, 3 | ZeRO optimization level |
| `fsdp_sharding_strategy` | `FULL_SHARD`, `SHARD_GRAD_OP`, `NO_SHARD` | FSDP sharding strategy |
| `dynamo_backend` | `NO`, `EAGER`, `INDUCTOR` | torch.compile backend |
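As a sketch of the validation checks this skill promises, the table above can be turned into a small config validator. `ALLOWED` and `validate_config` are illustrative names for this document, not part of the Accelerate API:

```python
# Illustrative validator for the enumerated parameters in the table above.
ALLOWED = {
    "mixed_precision": {"no", "fp16", "bf16"},
    "distributed_type": {"NO", "MULTI_GPU", "DEEPSPEED", "FSDP", "TPU"},
    "dynamo_backend": {"NO", "EAGER", "INDUCTOR"},
}

def validate_config(cfg: dict) -> list:
    """Return a list of error messages; an empty list means the config passes."""
    errors = []
    for key, allowed in ALLOWED.items():
        if key in cfg and cfg[key] not in allowed:
            errors.append(f"{key}: {cfg[key]!r} not in {sorted(allowed)}")
    return errors

print(validate_config({"mixed_precision": "bf16", "distributed_type": "FSDP"}))  # → []
```

Running such a check before `accelerate launch` catches typos like `fp8` in `mixed_precision` before a multi-node job fails at startup.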

Training Strategy Decision Tree

| GPUs | Model Size | Strategy | Config |
|---|---|---|---|
| 1 | <10B | Standard + bf16 | `mixed_precision: bf16` |
| 2–8 | <10B | DDP + bf16 | `distributed_type: MULTI_GPU` |
| 2–8 | 10–30B | FSDP or ZeRO-2 | Shard optimizer + gradients |
| 2–8 | 30–70B | ZeRO-3 or FSDP Full | Shard everything |
| 8+ | 70B+ | ZeRO-3 + CPU offload | Multi-node with offloading |
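The decision table above is mechanical enough to encode directly. The helper below mirrors those rules of thumb; `pick_strategy` is a hypothetical name for illustration, not an Accelerate API, and the thresholds are the table's heuristics, not hard limits:

```python
# Illustrative strategy picker mirroring the decision table above.
def pick_strategy(num_gpus: int, model_size_b: float) -> str:
    """Map GPU count and model size (in billions of params) to a strategy."""
    if model_size_b >= 70:
        return "ZeRO-3 + CPU offload"
    if model_size_b >= 30:
        return "ZeRO-3 or FSDP FULL_SHARD"
    if model_size_b >= 10:
        return "FSDP or ZeRO-2"
    if num_gpus > 1:
        return "DDP + bf16"
    return "Standard + bf16"

print(pick_strategy(4, 13))  # → FSDP or ZeRO-2
```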

Best Practices

  1. Start with accelerate config — Interactive wizard prevents configuration errors
  2. Use bf16 over fp16 — Better numerical stability, no gradient scaling needed
  3. Scale batch size linearly — Effective batch = per-GPU batch × num_GPUs × gradient_accumulation
  4. Guard raw file writes with is_main_process — avoids duplicate or corrupt writes (collective calls like accelerator.save_state() still run on all processes)
  5. Use gradient accumulation — Simulate larger batches without more memory
  6. Test on single GPU first — Debug your script before scaling out
  7. Profile before optimizing — Use accelerate benchmark to find bottlenecks
  8. Pin memory in dataloaders — Set pin_memory=True for faster CPU→GPU transfers
  9. Use accelerator.print() — Only prints on main process, avoids duplicate logs
  10. Unwrap model for inference — Use accelerator.unwrap_model() to get the clean model
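The effective-batch arithmetic from points 3 and 5 can be sketched as a one-line helper; `effective_batch_size` is a hypothetical name for illustration, not an Accelerate API:

```python
# Hypothetical helper: effective global batch size under data parallelism.
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    # Each optimizer step consumes per_gpu_batch samples on each of num_gpus
    # workers, accumulated over grad_accum_steps forward/backward passes.
    return per_gpu_batch * num_gpus * grad_accum_steps

print(effective_batch_size(8, 4, 4))  # → 128
```

Keeping this number fixed while trading per-GPU batch for accumulation steps (as in the OOM recipe below) preserves training dynamics at lower memory cost.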

Troubleshooting

NCCL timeout errors

```python
# Increase the NCCL collective timeout (default is ~10 minutes)
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=1800))  # 30 minutes
accelerator = Accelerator(kwargs_handlers=[kwargs])
```

OOM on multi-GPU

```python
# Reduce per-GPU batch and increase gradient accumulation
accelerator = Accelerator(gradient_accumulation_steps=8)
# per-GPU batch=4, accumulation=8, 4 GPUs → effective batch = 4*8*4 = 128
```

Metrics not aggregating correctly

```python
# Gather metrics across processes before averaging
loss_gathered = accelerator.gather(loss)
if accelerator.is_main_process:
    avg_loss = loss_gathered.mean().item()
    print(f"Average loss: {avg_loss}")
```

Hanging on multi-node

```shell
# Ensure network connectivity and set correct addresses
accelerate launch \
  --num_machines=2 \
  --machine_rank=0 \
  --main_process_ip=10.0.0.1 \
  --main_process_port=29500 \
  train.py
```