# Distributed Training Megatron Studio

All-in-one skill covering training large language models. Includes structured workflows, validation checks, and reusable patterns for AI research.
## Distributed Training with NVIDIA Megatron-Core

### Overview

A comprehensive skill for training large language models (2B to 462B+ parameters) using NVIDIA's Megatron-Core framework. Megatron-Core provides advanced 3D parallelism (tensor, pipeline, data), sequence parallelism, expert parallelism for MoE models, and FP8 training support, achieving up to 47% Model FLOP Utilization (MFU) on H100 GPUs.
### When to Use
- Training LLMs from scratch at scale (>7B parameters)
- Need 3D parallelism: tensor + pipeline + data parallel
- Training Mixture-of-Experts (MoE) models
- Maximizing GPU utilization on NVIDIA hardware
- FP8 training on H100/H200 GPUs
- Pre-training foundational models
### Quick Start

```bash
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or install via pip
pip install megatron-core

# Launch training: 8 GPUs with 3D parallelism
torchrun --nproc_per_node=8 train.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2 \
    --num-layers 32 \
    --hidden-size 4096 \
    --num-attention-heads 32 \
    --seq-length 4096 \
    --micro-batch-size 4 \
    --global-batch-size 256
```
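The flags above fully determine the data-parallel size and the number of gradient-accumulation steps. A minimal sketch of that arithmetic (the `derive_parallelism` helper is ours, not a Megatron-Core API):

```python
def derive_parallelism(world_size, tp, pp, micro_batch, global_batch):
    """Derive the data-parallel size and gradient-accumulation steps
    implied by a Megatron-style launch (hypothetical helper)."""
    assert world_size % (tp * pp) == 0, "world size must be divisible by TP x PP"
    dp = world_size // (tp * pp)
    assert global_batch % (micro_batch * dp) == 0, \
        "global batch must be divisible by micro-batch x DP"
    accum_steps = global_batch // (micro_batch * dp)
    return dp, accum_steps

# The launch above: 8 GPUs, TP=2, PP=2, micro-batch 4, global batch 256
dp, accum = derive_parallelism(8, tp=2, pp=2, micro_batch=4, global_batch=256)
print(dp, accum)  # DP = 2 replicas, 32 gradient-accumulation steps
```

If any divisibility constraint fails, Megatron refuses the launch, so checking this arithmetic before submitting a multi-node job saves queue time.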
### 3D Parallelism Explained

```
┌──────────────────────────────────────────────────┐
│             3D Parallelism Overview              │
│                                                  │
│  Data Parallel   GPU 0-1    GPU 2-3    GPU 4-5   │
│  (replicas)      ┌──────┐   ┌──────┐   ┌──────┐  │
│                  │DP = 0│   │DP = 1│   │DP = 2│  │
│                  └──────┘   └──────┘   └──────┘  │
│                                                  │
│  Pipeline        Layer      Layer      Layer     │
│  Parallel        0-15       0-15       0-15      │
│  (stages)        16-31      16-31      16-31     │
│                                                  │
│  Tensor          Head       Head       Head      │
│  Parallel        0-15       0-15       0-15      │
│  (split layers)  16-31      16-31      16-31     │
└──────────────────────────────────────────────────┘
```
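One way to read the diagram: with world size TP × PP × DP, each global rank maps to a coordinate in the 3D grid. A sketch under the assumption that TP varies fastest, then PP, then DP (one common convention; Megatron-Core's actual group ordering is configurable, so treat this as illustrative):

```python
def rank_coords(rank, tp, pp, dp):
    """Map a global rank to (dp, pp, tp) grid coordinates, assuming TP
    varies fastest, then PP, then DP (an illustrative convention, not
    necessarily Megatron-Core's default ordering)."""
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

# 8 GPUs, TP=2, PP=2, DP=2: ranks 0 and 1 hold tensor shards of
# pipeline stage 0 in data-parallel replica 0
for rank in range(8):
    print(rank, rank_coords(rank, tp=2, pp=2, dp=2))
```

This ordering keeps tensor-parallel peers adjacent, which matters because TP traffic is the heaviest and should stay on NVLink.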
#### Tensor Parallelism

```python
# Splits individual layers across GPUs
# Example: 4096 hidden -> 2 GPUs x 2048 each
import torch

from megatron.core.tensor_parallel import ColumnParallelLinear, RowParallelLinear

class ParallelMLP(torch.nn.Module):
    def __init__(self, hidden_size, ffn_hidden_size):
        super().__init__()
        # Column-parallel: splits the output dimension across TP ranks
        self.dense_h_to_4h = ColumnParallelLinear(
            hidden_size,
            ffn_hidden_size,
            gather_output=False,
        )
        # Row-parallel: splits the input dimension; all-reduces the result
        self.dense_4h_to_h = RowParallelLinear(
            ffn_hidden_size,
            hidden_size,
            input_is_parallel=True,
        )
```
#### Pipeline Parallelism

```python
# Splits model layers across GPUs in stages
from megatron.core.pipeline_parallel import get_forward_backward_func

# Configure pipeline schedule
forward_backward_func = get_forward_backward_func()

# Interleaved schedule for better GPU utilization
losses = forward_backward_func(
    forward_step_func=forward_step,
    data_iterator=data_iterator,
    model=model,
    num_microbatches=num_microbatches,
    seq_length=seq_length,
    micro_batch_size=micro_batch_size,
)
```
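Why the microbatch count matters: with p pipeline stages and m microbatches, the 1F1B schedule idles for a "bubble" fraction of (p − 1)/(m + p − 1), so more microbatches mean less idle time. A quick sketch (the helper name is ours):

```python
def bubble_fraction(pipeline_stages, num_microbatches):
    """Idle ('bubble') fraction of the 1F1B pipeline schedule:
    (p - 1) / (m + p - 1)."""
    p, m = pipeline_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# With 4 stages, going from 4 to 32 microbatches shrinks the bubble
print(bubble_fraction(4, 4))   # 3/7  ~= 0.43
print(bubble_fraction(4, 32))  # 3/35 ~= 0.086
```

Interleaved (virtual-stage) schedules reduce the bubble further by giving each GPU several smaller stages.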
### Parallelism Configuration Guide
| Model Size | GPUs | TP | PP | DP | Notes |
|---|---|---|---|---|---|
| 7B | 8 | 1 | 1 | 8 | Pure data parallel |
| 13B | 8 | 2 | 1 | 4 | TP across NVLink pairs |
| 30B | 16 | 4 | 2 | 2 | TP within node, PP across |
| 70B | 32 | 4 | 4 | 2 | Full 3D parallelism |
| 175B | 64 | 8 | 8 | 1 | Maximum TP within node |
| 462B (MoE) | 256 | 8 | 8 | 4 | + Expert parallelism |
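A quick consistency check on the table: in every row the GPU count must equal TP × PP × DP. A small sketch restating the rows above:

```python
# Restating the table rows: (model, GPUs, TP, PP, DP)
configs = [
    ("7B",         8,   1, 1, 8),
    ("13B",        8,   2, 1, 4),
    ("30B",        16,  4, 2, 2),
    ("70B",        32,  4, 4, 2),
    ("175B",       64,  8, 8, 1),
    ("462B (MoE)", 256, 8, 8, 4),
]
for name, gpus, tp, pp, dp in configs:
    assert gpus == tp * pp * dp, f"{name}: GPUs != TP x PP x DP"
print("all rows consistent")
```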
### Model Configuration

```python
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    num_query_groups=8,  # GQA (Grouped Query Attention)
    ffn_hidden_size=11008,
    max_position_embeddings=4096,
    use_rotary_position_embeddings=True,
    fp16=False,
    bf16=True,
    fp8=True,  # H100 FP8 training
    attention_dropout=0.0,
    hidden_dropout=0.0,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    sequence_parallel=True,  # Save memory on activations
    recompute_granularity='selective',
)
```
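For a config like the one above, the implied dense parameter count can be estimated. This hypothetical helper assumes GQA attention and a SwiGLU-style MLP with three projections, and ignores biases, norms, and embedding tying; the 32k vocabulary is also an assumption:

```python
def approx_params(num_layers, hidden, heads, query_groups, ffn_hidden,
                  vocab=32_000):
    """Rough decoder-only parameter count (hypothetical helper).
    Assumes GQA attention and a SwiGLU MLP (three projections);
    ignores biases, norms, and embedding tying."""
    head_dim = hidden // heads
    kv_dim = query_groups * head_dim       # K/V width shared across groups
    attn = hidden * hidden                 # Q projection
    attn += 2 * hidden * kv_dim            # K and V projections
    attn += hidden * hidden                # output projection
    mlp = 3 * hidden * ffn_hidden          # gate, up, and down projections
    return num_layers * (attn + mlp) + vocab * hidden

# The config above: 32 layers, 4096 hidden, 32 heads, 8 KV groups, 11008 FFN
print(f"{approx_params(32, 4096, 32, 8, 11008) / 1e9:.1f}B")  # ~5.8B
```

An estimate like this is mainly useful for sizing memory and picking a row in the parallelism table before launching anything.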
### Best Practices

- Keep TP within the NVLink domain: tensor-parallel communication is heavy, so keep the TP size <= the number of NVLink-connected GPUs per node (usually 8)
- Use sequence parallelism with TP: reduces activation memory in proportion to the TP degree
- Interleave pipeline stages: virtual pipeline stages improve GPU utilization from ~50% to ~80%
- Use BF16 or FP8: BF16 is the standard; FP8 on H100s gives up to 2x throughput
- Tune the micro-batch size: larger micro-batches improve GPU utilization but cost more memory
- Enable selective recomputation: recompute attention instead of storing all activations
- Monitor MFU: target >40% MFU; below 30% indicates an inefficiency
- Use flash attention: reduces memory and increases speed for long sequences
- Profile with nsys: NVIDIA Nsight Systems shows the communication vs. compute balance
- Checkpoint frequently: large training runs are expensive; save at least every 1000 steps
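The MFU guideline can be made concrete: MFU is achieved model FLOPs per second divided by the hardware peak, using the common ~6·N training-FLOPs-per-token approximation for a dense model with N parameters. A sketch (the helper and the ~989 TFLOPs H100 dense BF16 peak are our assumptions):

```python
def mfu(params, tokens_per_second, num_gpus, peak_flops_per_gpu=989e12):
    """Model FLOP Utilization: achieved FLOPs/s over hardware peak, using
    the common ~6*N training FLOPs-per-token approximation. The default
    peak is H100 dense BF16 (~989 TFLOPs), an assumption."""
    achieved_flops = 6 * params * tokens_per_second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)

# e.g. a 70B dense model sustaining 55,000 tokens/s on 64 H100s
print(f"{mfu(70e9, 55_000, 64):.1%}")  # ~36.5%, below the >40% target
```

Tracking this number per run makes regressions from a bad parallelism layout or a slow interconnect visible immediately.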
### Troubleshooting

#### Low GPU utilization

```bash
# Check communication overhead
nsys profile --trace=cuda,nvtx torchrun --nproc_per_node=8 train.py

# If communication dominates, reduce TP size or increase the micro-batch size
```
#### OOM with pipeline parallelism

```bash
# Reduce micro-batch size and increase accumulation
--micro-batch-size 1 --global-batch-size 256

# Or enable activation recomputation
--recompute-granularity full --recompute-method uniform
```
#### Slow data loading bottleneck

```bash
# Use multiple workers and prefetch
--num-workers 8 --dataloader-type cyclic --data-cache-path /local_ssd/cache
```