
Distributed Training DeepSpeed Kit

A comprehensive skill providing expert guidance on distributed training with DeepSpeed. Includes structured workflows, validation checks, and reusable patterns for AI research.


Distributed Training with Microsoft DeepSpeed

Overview

A comprehensive skill for large-scale model training using Microsoft DeepSpeed — the optimization library that enables training models with trillions of parameters through ZeRO (Zero Redundancy Optimizer), efficient inference, and advanced compression techniques. DeepSpeed makes training 100B+ parameter models feasible on commodity hardware through intelligent memory partitioning and communication optimization.

When to Use

  • Training models too large to fit on a single GPU
  • Reducing GPU memory usage with ZeRO optimization stages
  • Offloading optimizer states and parameters to CPU/NVMe
  • Mixed precision training with automatic loss scaling
  • Pipeline parallelism for very large models
  • Serving large models with DeepSpeed-Inference
  • Compressing models with distillation and pruning

Quick Start

```bash
# Install
pip install deepspeed

# Check system compatibility
ds_report

# Train with DeepSpeed
deepspeed train.py --deepspeed ds_config.json

# Or use with HuggingFace Transformers
deepspeed train.py \
  --deepspeed ds_config.json \
  --model_name_or_path meta-llama/Llama-3-8B \
  --output_dir ./output
```
```python
# Minimal DeepSpeed training loop
import deepspeed
import torch

model = MyModel()  # your torch.nn.Module

ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),  # required so DeepSpeed can build the optimizer
    config=ds_config,
)

for batch in dataloader:
    loss = model_engine(batch)    # forward pass (model returns the loss)
    model_engine.backward(loss)   # DeepSpeed-managed backward
    model_engine.step()           # optimizer step + gradient accumulation handling
```

ZeRO Optimization Stages

Stage 0: Disabled (DDP equivalent)

```json
{
  "zero_optimization": {
    "stage": 0
  }
}
```

Standard data parallelism. Each GPU holds a full copy of the parameters, gradients, and optimizer states.

Stage 1: Optimizer State Partitioning

```json
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  }
}
```

Partitions optimizer states (e.g., Adam momentum/variance) across data-parallel ranks. Roughly a 4x reduction in per-GPU memory for optimizer states on typical GPU counts.

Stage 2: Gradient + Optimizer Partitioning

```json
{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  }
}
```

Partitions both gradients and optimizer states across ranks. Roughly an 8x per-GPU memory reduction relative to standard data parallelism.

Stage 3: Full Parameter Partitioning

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  }
}
```

Partitions parameters as well as gradients and optimizer states — the model itself is sharded, so you can train models larger than a single GPU's memory.
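The partitioning idea behind Stage 3 can be sketched with a toy example (plain Python, purely illustrative — not DeepSpeed internals): a flat parameter buffer is split into near-equal shards, one per data-parallel rank, so each rank stores only about 1/N of the values.

```python
def partition(params, num_ranks):
    """Split a flat list of parameter values into one shard per rank."""
    shard_size = (len(params) + num_ranks - 1) // num_ranks  # ceiling division
    return [params[r * shard_size:(r + 1) * shard_size] for r in range(num_ranks)]

params = list(range(10))        # pretend these are 10 parameter values
shards = partition(params, 4)   # 4 GPUs -> 4 shards of at most 3 values each

# Nothing is duplicated or lost: the shards reassemble into the original buffer
assert sum(shards, []) == params
```

During forward/backward, the real implementation all-gathers just the shards needed for the current layer and frees them afterward, which is what the prefetch and persistence knobs above tune.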

Memory Usage by ZeRO Stage

| Component | Stage 0 | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|
| Parameters (fp16) | 2Ψ | 2Ψ | 2Ψ | 2Ψ/N |
| Gradients (fp16) | 2Ψ | 2Ψ | 2Ψ/N | 2Ψ/N |
| Optimizer states (fp32) | 12Ψ | 12Ψ/N | 12Ψ/N | 12Ψ/N |
| Total per GPU | 16Ψ | 4Ψ + 12Ψ/N | 2Ψ + 14Ψ/N | 16Ψ/N |

N = number of data-parallel GPUs, Ψ = number of model parameters. Assumes mixed-precision Adam: 2-byte fp16 parameters and gradients, plus 12 bytes per parameter of fp32 optimizer state (master copy, momentum, variance). Activations are not included.
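The table's formulas can be turned into a quick back-of-the-envelope estimator (a sketch under the same assumptions: mixed-precision Adam, 2-byte params/grads, 12 bytes/param of optimizer state; activations excluded):

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory (GB) for model states under a ZeRO stage.

    Assumes mixed-precision Adam: 2-byte fp16 parameters, 2-byte fp16
    gradients, and 12 bytes/param of fp32 optimizer state. Activations and
    framework overhead are not included.
    """
    params, grads, opt = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        opt /= num_gpus     # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus   # Stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus  # Stage 3: also partition parameters
    return (params + grads + opt) / 1e9

# Example: a 7B-parameter model on 8 GPUs
for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7e9, 8, stage):.1f} GB/GPU")
```

For 7B parameters on 8 GPUs this gives 112 GB/GPU at Stage 0 versus 14 GB/GPU at Stage 3, which is why Stage 0 is infeasible on any single current GPU while Stage 3 fits comfortably.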

Full Training Configuration

```json
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4,
  "steps_per_print": 100,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 50000
    }
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false
  }
}
```

Common Workflows

HuggingFace Transformers Integration

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./output',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    deepspeed='ds_config.json',  # Just point to the config file
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
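When DeepSpeed is driven through the HF Trainer, many config values can be set to `"auto"` so the Trainer fills them in from `TrainingArguments` and you avoid specifying the same number twice. A minimal fragment (exact supported keys depend on your `transformers` version):

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "gradient_clipping": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": { "stage": 2 }
}
```

This avoids the most common integration error: a `train_batch_size` in the JSON that contradicts `per_device_train_batch_size` and `gradient_accumulation_steps` on the command line.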

Multi-Node Training

```bash
# Launch from node 0 (master)
deepspeed --num_gpus=8 --num_nodes=2 --master_addr=10.0.0.1 \
  --master_port=29500 --hostfile=hostfile train.py --deepspeed ds_config.json

# hostfile
# 10.0.0.1 slots=8
# 10.0.0.2 slots=8
```
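The hostfile format (one `<host> slots=<n>` line per node) is simple enough to sanity-check before launching. A small sketch in plain Python (not part of DeepSpeed) that parses it and computes the expected world size:

```python
def parse_hostfile(text):
    """Parse DeepSpeed-style hostfile lines ('<host> slots=<n>') into a dict."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        host, slots_field = line.split()
        hosts[host] = int(slots_field.split("=", 1)[1])
    return hosts

hostfile = """\
10.0.0.1 slots=8
10.0.0.2 slots=8
"""
hosts = parse_hostfile(hostfile)
world_size = sum(hosts.values())  # total GPU processes across all nodes
```

Here `world_size` is 16, which should match `--num_gpus` × `--num_nodes` in the launch command; a mismatch usually means a stale hostfile.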

NVMe Offloading for Maximum Model Size

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "max_in_cpu": 1e9
    }
  }
}
```

Best Practices

  1. Start with ZeRO Stage 2 — Best balance of memory savings and performance for most use cases
  2. Use bf16 over fp16 — Better stability, no loss scaling issues on modern GPUs
  3. Tune micro batch size — Find the largest batch that fits in GPU memory, then use gradient accumulation
  4. Enable overlap_comm — Overlaps gradient communication with backward pass for better throughput
  5. Use activation checkpointing — Trade compute for memory on very large models
  6. Profile with wall_clock_breakdown — Identify bottlenecks in compute, communication, and data loading
  7. Monitor GPU utilization — Target >80% GPU utilization; low utilization indicates communication bottleneck
  8. Use NVMe offload as last resort — CPU offload first, NVMe only when CPU memory is also insufficient
  9. Match batch sizes carefully — `train_batch_size = micro_batch × gradient_accum × num_GPUs`
  10. Test checkpointing early — Verify you can save and resume before starting long training runs
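Practice 9 can be enforced with a quick pre-launch check. A sketch mirroring the relation DeepSpeed itself validates (the function name and error message are illustrative, not DeepSpeed's API):

```python
def check_batch_config(train_batch_size, micro_batch_per_gpu,
                       grad_accum_steps, num_gpus):
    """Verify train_batch_size == micro_batch * grad_accum * num_gpus."""
    expected = micro_batch_per_gpu * grad_accum_steps * num_gpus
    if train_batch_size != expected:
        raise ValueError(
            f"train_batch_size={train_batch_size} but "
            f"{micro_batch_per_gpu} x {grad_accum_steps} x {num_gpus} = {expected}"
        )
    return True

check_batch_config(64, 4, 4, 4)  # consistent: 4 * 4 * 4 = 64
```

Running this against your `ds_config.json` and launch settings before a multi-day job is cheaper than having DeepSpeed abort at initialization.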

Troubleshooting

CUDA out of memory with ZeRO-3

Reduce the prefetch and persistence buffers:

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 1e8
  }
}
```

Training speed drops with CPU offload

Pin host memory and enable fast initialization for the CPU-offloaded optimizer:

```json
{
  "zero_optimization": {
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "fast_init": true
    }
  }
}
```

Checkpoint loading fails

```python
# Use tag-based loading for ZeRO checkpoints
model_engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
model_engine.load_checkpoint('./checkpoints', tag='step_10000')
```