Distributed Training with Microsoft DeepSpeed
Overview
A comprehensive skill for large-scale model training using Microsoft DeepSpeed — the optimization library that enables training models with trillions of parameters through ZeRO (Zero Redundancy Optimizer), efficient inference, and advanced compression techniques. DeepSpeed makes training 100B+ parameter models feasible on commodity hardware through intelligent memory partitioning and communication optimization.
When to Use
- Training models too large to fit on a single GPU
- Reducing GPU memory usage with ZeRO optimization stages
- Offloading optimizer states and parameters to CPU/NVMe
- Mixed precision training with automatic loss scaling
- Pipeline parallelism for very large models
- Serving large models with DeepSpeed-Inference
- Compressing models with distillation and pruning
Quick Start
```shell
# Install
pip install deepspeed

# Check system compatibility
ds_report

# Train with DeepSpeed
deepspeed train.py --deepspeed ds_config.json

# Or use with HuggingFace Transformers
deepspeed train.py \
  --deepspeed ds_config.json \
  --model_name_or_path meta-llama/Llama-3-8B \
  --output_dir ./output
```
```python
# Minimal DeepSpeed training loop
import deepspeed
import torch

model = MyModel()

ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

# deepspeed.initialize wraps the model and builds the optimizer from the config
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for batch in dataloader:
    loss = model_engine(batch)    # forward pass returns the loss
    model_engine.backward(loss)   # DeepSpeed-managed backward (handles loss scaling)
    model_engine.step()           # optimizer step, LR schedule, and gradient zeroing
```
ZeRO Optimization Stages
Stage 0: Disabled (DDP equivalent)
```json
{
  "zero_optimization": {
    "stage": 0
  }
}
```
Standard data parallelism. Each GPU holds full model, optimizer, gradients.
Stage 1: Optimizer State Partitioning
```json
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  }
}
```
Partitions optimizer states (e.g., Adam momentum and variance) across GPUs, shrinking them by a factor of N; total memory drops by up to ~4x for large N.
Stage 2: Gradient + Optimizer Partitioning
```json
{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  }
}
```
Partitions both gradients and optimizer states; total memory drops by up to ~8x for large N.
Stage 3: Full Parameter Partitioning
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  }
}
```
Partitions everything — can train models larger than single GPU memory.
Memory Usage by ZeRO Stage
| Component | Stage 0 | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|
| Parameters (fp16) | 2Ψ | 2Ψ | 2Ψ | 2Ψ/N |
| Gradients (fp16) | 2Ψ | 2Ψ | 2Ψ/N | 2Ψ/N |
| Optimizer states (fp32) | 12Ψ | 12Ψ/N | 12Ψ/N | 12Ψ/N |
| Total per GPU | 16Ψ | 4Ψ + 12Ψ/N | 2Ψ + 14Ψ/N | 16Ψ/N |
N = number of GPUs, Ψ = number of model parameters. Entries are bytes per GPU for mixed-precision Adam: 2 bytes each for fp16 parameters and gradients, plus 12 bytes per parameter for fp32 optimizer states (master copy, momentum, variance).
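The totals in the table are simple arithmetic and can be sketched in a few lines of plain Python (assuming the mixed-precision Adam layout above: 2 + 2 + 12 bytes per parameter):

```python
def zero_bytes_per_gpu(psi: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in bytes for a given ZeRO stage.

    psi: number of model parameters; assumes fp16 params/grads (2 bytes each)
    and fp32 Adam states (12 bytes per parameter), as in the table.
    """
    params = 2 * psi / n_gpus if stage == 3 else 2 * psi
    grads = 2 * psi / n_gpus if stage >= 2 else 2 * psi
    opt = 12 * psi if stage == 0 else 12 * psi / n_gpus
    return params + grads + opt

# A 1B-parameter model on 4 GPUs:
for stage in range(4):
    print(stage, zero_bytes_per_gpu(1e9, 4, stage) / 1e9, "GB")
# stage 0 -> 16.0, stage 1 -> 7.0, stage 2 -> 5.5, stage 3 -> 4.0
```

This ignores activations, temporary buffers, and fragmentation, so treat it as a lower bound when sizing hardware.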
Full Training Configuration
```json
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4,
  "steps_per_print": 100,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 50000
    }
  },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false
  }
}
```
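The three batch-size fields are not independent: DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world size. A quick sanity check on the values above (plain Python, no DeepSpeed needed):

```python
ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
}

# Samples processed per GPU per optimizer step
per_gpu_per_step = (ds_config["train_micro_batch_size_per_gpu"]
                    * ds_config["gradient_accumulation_steps"])

# World size implied by the batch-size fields
implied_world_size = ds_config["train_batch_size"] // per_gpu_per_step
print(implied_world_size)  # 4: this config assumes 4 GPUs
```

If the launcher's actual world size does not match, DeepSpeed will raise an error at initialization, so it is worth checking before submitting a long job.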
Common Workflows
HuggingFace Transformers Integration
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./output',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    deepspeed='ds_config.json',  # Just point to the config
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
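When the config file is driven through the Trainer like this, HuggingFace also accepts "auto" for most values, which it fills in from TrainingArguments so the same settings are not duplicated in two places. A minimal fragment of such a config:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": { "stage": 2 }
}
```

Here batch sizes and precision come from per_device_train_batch_size, gradient_accumulation_steps, and bf16 in TrainingArguments.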
Multi-Node Training
```shell
# hostfile: one line per node
10.0.0.1 slots=8
10.0.0.2 slots=8

# Launch from node 0 (master)
deepspeed --num_nodes=2 --num_gpus=8 \
  --hostfile=hostfile \
  --master_addr=10.0.0.1 --master_port=29500 \
  train.py --deepspeed ds_config.json
```
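For larger clusters, the hostfile is easier to generate than to maintain by hand. A throwaway sketch (make_hostfile is a hypothetical helper, not a DeepSpeed API):

```python
def make_hostfile(nodes, slots=8):
    """Render a DeepSpeed hostfile: one 'ip slots=k' line per node."""
    return "\n".join(f"{ip} slots={slots}" for ip in nodes) + "\n"

print(make_hostfile(["10.0.0.1", "10.0.0.2"]), end="")
# 10.0.0.1 slots=8
# 10.0.0.2 slots=8
```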
NVMe Offloading for Maximum Model Size
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "max_in_cpu": 1e9
    }
  }
}
```
Best Practices
- Start with ZeRO Stage 2 — Best balance of memory savings and performance for most use cases
- Use bf16 over fp16 — Better stability, no loss scaling issues on modern GPUs
- Tune micro batch size — Find the largest batch that fits in GPU memory, then use gradient accumulation
- Enable overlap_comm — Overlaps gradient communication with backward pass for better throughput
- Use activation checkpointing — Trade compute for memory on very large models
- Profile with wall_clock_breakdown — Identify bottlenecks in compute, communication, and data loading
- Monitor GPU utilization — Target >80% GPU utilization; low utilization indicates communication bottleneck
- Use NVMe offload as last resort — CPU offload first, NVMe only when CPU memory is also insufficient
- Match batch sizes carefully — train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs
- Test checkpointing early — Verify you can save and resume before starting long training runs
Troubleshooting
CUDA out of memory with ZeRO-3
Reduce the prefetch and persistence buffers:
```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 1e8
  }
}
```
Training speed drops with CPU offload
Pin host memory so host-device transfers can overlap, and enable fast optimizer initialization:
```json
{
  "zero_optimization": {
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "fast_init": true
    }
  }
}
```
Checkpoint loading fails
```python
# Use tag-based loading for ZeRO checkpoints
import deepspeed

model_engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
model_engine.load_checkpoint('./checkpoints', tag='step_10000')
```