Distributed Training with Microsoft DeepSpeed
Overview
A comprehensive skill for large-scale model training using Microsoft DeepSpeed — the optimization library that enables training models with trillions of parameters through ZeRO (Zero Redundancy Optimizer), efficient inference, and advanced compression techniques. DeepSpeed makes training 100B+ parameter models feasible on commodity hardware through intelligent memory partitioning and communication optimization.
When to Use
- Training models too large to fit on a single GPU
- Reducing GPU memory usage with ZeRO optimization stages
- Offloading optimizer states and parameters to CPU/NVMe
- Mixed precision training with automatic loss scaling
- Pipeline parallelism for very large models
- Serving large models with DeepSpeed-Inference
- Compressing models with distillation and pruning
Quick Start
```shell
# Install
pip install deepspeed

# Check system compatibility
ds_report

# Train with DeepSpeed
deepspeed train.py --deepspeed ds_config.json

# Or use with HuggingFace Transformers
deepspeed train.py \
  --deepspeed ds_config.json \
  --model_name_or_path meta-llama/Llama-3-8B \
  --output_dir ./output
```
```python
# Minimal DeepSpeed training loop
import deepspeed
import torch

model = MyModel()

ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

# deepspeed.initialize wraps the model and builds the optimizer from the config
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for batch in dataloader:
    loss = model_engine(batch)    # forward pass returns the loss
    model_engine.backward(loss)   # DeepSpeed-managed backward (handles loss scaling)
    model_engine.step()           # optimizer step, LR schedule, and gradient zeroing
```
ZeRO Optimization Stages
Stage 0: Disabled (DDP equivalent)
```json
{
  "zero_optimization": {
    "stage": 0
  }
}
```
Standard data parallelism. Each GPU holds full model, optimizer, gradients.
Stage 1: Optimizer State Partitioning
```json
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  }
}
```
Partitions optimizer states (e.g., Adam momentum and variance) across GPUs, shrinking them by a factor of N; total memory drops by up to ~4x for large N.
Stage 2: Gradient + Optimizer Partitioning
```json
{
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  }
}
```
Partitions both gradients and optimizer states; total memory drops by up to ~8x for large N.
Stage 3: Full Parameter Partitioning
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  }
}
```
Partitions everything — can train models larger than single GPU memory.
Memory Usage by ZeRO Stage
| Component | Stage 0 | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|---|
| Parameters (fp16) | 2Ψ | 2Ψ | 2Ψ | 2Ψ/N |
| Gradients (fp16) | 2Ψ | 2Ψ | 2Ψ/N | 2Ψ/N |
| Optimizer states (fp32) | 12Ψ | 12Ψ/N | 12Ψ/N | 12Ψ/N |
| Total per GPU | 16Ψ | 4Ψ + 12Ψ/N | 2Ψ + 14Ψ/N | 16Ψ/N |
N = number of GPUs, Ψ = number of model parameters. Entries are bytes per GPU for mixed-precision Adam: 2 bytes each for fp16 parameters and gradients, plus 12 bytes per parameter for fp32 optimizer states (master copy, momentum, variance).
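The totals in the table are simple arithmetic and can be sketched in a few lines of plain Python (assuming the mixed-precision Adam layout above: 2 + 2 + 12 bytes per parameter):

```python
def zero_bytes_per_gpu(psi: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in bytes for a given ZeRO stage.

    psi: number of model parameters; assumes fp16 params/grads (2 bytes each)
    and fp32 Adam states (12 bytes per parameter), as in the table.
    """
    params = 2 * psi / n_gpus if stage == 3 else 2 * psi
    grads = 2 * psi / n_gpus if stage >= 2 else 2 * psi
    opt = 12 * psi if stage == 0 else 12 * psi / n_gpus
    return params + grads + opt

# A 1B-parameter model on 4 GPUs:
for stage in range(4):
    print(stage, zero_bytes_per_gpu(1e9, 4, stage) / 1e9, "GB")
# stage 0 -> 16.0, stage 1 -> 7.0, stage 2 -> 5.5, stage 3 -> 4.0
```

This ignores activations, temporary buffers, and fragmentation, so treat it as a lower bound when sizing hardware.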
Full Training Configuration
```json
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4,
  "steps_per_print": 100,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 50000
    }
  },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false
  }
}
```
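The three batch-size fields are not independent: DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world size. A quick sanity check on the values above (plain Python, no DeepSpeed needed):

```python
ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
}

# Samples processed per GPU per optimizer step
per_gpu_per_step = (ds_config["train_micro_batch_size_per_gpu"]
                    * ds_config["gradient_accumulation_steps"])

# World size implied by the batch-size fields
implied_world_size = ds_config["train_batch_size"] // per_gpu_per_step
print(implied_world_size)  # 4: this config assumes 4 GPUs
```

If the launcher's actual world size does not match, DeepSpeed will raise an error at initialization, so it is worth checking before submitting a long job.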
Common Workflows
HuggingFace Transformers Integration
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./output',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    deepspeed='ds_config.json',  # Just point to the config
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
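When the config file is driven through the Trainer like this, HuggingFace also accepts "auto" for most values, which it fills in from TrainingArguments so the same settings are not duplicated in two places. A minimal fragment of such a config:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": { "stage": 2 }
}
```

Here batch sizes and precision come from per_device_train_batch_size, gradient_accumulation_steps, and bf16 in TrainingArguments.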
Multi-Node Training
```shell
# hostfile: one line per node
10.0.0.1 slots=8
10.0.0.2 slots=8

# Launch from node 0 (master)
deepspeed --num_nodes=2 --num_gpus=8 \
  --hostfile=hostfile \
  --master_addr=10.0.0.1 --master_port=29500 \
  train.py --deepspeed ds_config.json
```
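For larger clusters, the hostfile is easier to generate than to maintain by hand. A throwaway sketch (make_hostfile is a hypothetical helper, not a DeepSpeed API):

```python
def make_hostfile(nodes, slots=8):
    """Render a DeepSpeed hostfile: one 'ip slots=k' line per node."""
    return "\n".join(f"{ip} slots={slots}" for ip in nodes) + "\n"

print(make_hostfile(["10.0.0.1", "10.0.0.2"]), end="")
# 10.0.0.1 slots=8
# 10.0.0.2 slots=8
```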
NVMe Offloading for Maximum Model Size
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "max_in_cpu": 1e9
    }
  }
}
```
Best Practices
- Start with ZeRO Stage 2 — Best balance of memory savings and performance for most use cases
- Use bf16 over fp16 — Better stability, no loss scaling issues on modern GPUs
- Tune micro batch size — Find the largest batch that fits in GPU memory, then use gradient accumulation
- Enable overlap_comm — Overlaps gradient communication with backward pass for better throughput
- Use activation checkpointing — Trade compute for memory on very large models
- Profile with wall_clock_breakdown — Identify bottlenecks in compute, communication, and data loading
- Monitor GPU utilization — Target >80% GPU utilization; low utilization indicates communication bottleneck
- Use NVMe offload as last resort — CPU offload first, NVMe only when CPU memory is also insufficient
- Match batch sizes carefully — train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs
- Test checkpointing early — Verify you can save and resume before starting long training runs
Troubleshooting
CUDA out of memory with ZeRO-3
Reduce the prefetch and persistence buffers:
```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 1e8
  }
}
```
Training speed drops with CPU offload
Pin host memory so host-device transfers can overlap, and enable fast optimizer initialization:
```json
{
  "zero_optimization": {
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "fast_init": true
    }
  }
}
```
Checkpoint loading fails
```python
# Use tag-based loading for ZeRO checkpoints
import deepspeed

model_engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
model_engine.load_checkpoint('./checkpoints', tag='step_10000')
```