
Distributed Training Megatron Studio

All-in-one skill covering training large language models. Includes structured workflows, validation checks, and reusable patterns for AI research.


Distributed Training with NVIDIA Megatron-Core

Overview

A comprehensive skill for training large language models (2B to 462B+ parameters) using NVIDIA's Megatron-Core framework. Megatron-Core provides advanced 3D parallelism (tensor, pipeline, data), sequence parallelism, expert parallelism for MoE models, and FP8 training support — achieving up to 47% Model FLOP Utilization (MFU) on H100 GPUs.

When to Use

  • Training LLMs from scratch at scale (>7B parameters)
  • Need 3D parallelism: tensor + pipeline + data parallel
  • Training Mixture-of-Experts (MoE) models
  • Maximizing GPU utilization on NVIDIA hardware
  • FP8 training on H100/H200 GPUs
  • Pre-training foundational models

Quick Start

```bash
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or install via pip
pip install megatron-core

# Launch training — 8 GPUs with 3D parallelism
torchrun --nproc_per_node=8 train.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2 \
    --num-layers 32 \
    --hidden-size 4096 \
    --num-attention-heads 32 \
    --seq-length 4096 \
    --micro-batch-size 4 \
    --global-batch-size 256
```

3D Parallelism Explained

                    ┌─────────────────────────────────────┐
                    │        3D Parallelism Overview       │
                    │                                     │
   Data Parallel    │  GPU 0─1    GPU 2─3    GPU 4─5     │
   (replicas)       │  ┌──────┐  ┌──────┐  ┌──────┐     │
                    │  │DP = 0│  │DP = 1│  │DP = 2│     │
                    │  └──────┘  └──────┘  └──────┘     │
                    │     │          │          │         │
   Pipeline         │  Layer    Layer      Layer         │
   Parallel         │  0-15    0-15       0-15          │
   (stages)         │  16-31   16-31      16-31         │
                    │     │          │          │         │
   Tensor           │  Head     Head       Head          │
   Parallel         │  0-15    0-15       0-15          │
   (split layers)   │  16-31   16-31      16-31         │
                    └─────────────────────────────────────┘
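One way to see how the three dimensions compose: every GPU's global rank maps to a unique (TP, DP, PP) coordinate, and the three group sizes must multiply to the world size. The sketch below is a hypothetical helper for intuition only — it assumes tensor parallel is the innermost grouping, and it is not Megatron-Core's actual API (Megatron builds these process groups internally during `mpu` initialization):

```python
def rank_to_coords(rank, tp, pp, world_size):
    """Map a global rank to (tp_rank, dp_rank, pp_rank).

    Assumption: TP is the innermost (fastest-varying) dimension,
    then DP, with PP outermost.
    """
    assert world_size % (tp * pp) == 0, "world size must be divisible by TP * PP"
    dp = world_size // (tp * pp)
    tp_rank = rank % tp                  # varies fastest: adjacent ranks share a TP group
    dp_rank = (rank // tp) % dp          # replicas of the same shard
    pp_rank = rank // (tp * dp)          # pipeline stage
    return tp_rank, dp_rank, pp_rank

# 8 GPUs with TP=2, PP=2 (so DP=2), matching the Quick Start launch
for r in range(8):
    print(r, rank_to_coords(r, tp=2, pp=2, world_size=8))
```

Placing TP innermost keeps tensor-parallel peers on adjacent ranks, which typically land on NVLink-connected GPUs within a node — exactly where TP's heavy communication belongs.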

Tensor Parallelism

```python
# Splits individual layers across GPUs
# Example: 4096 hidden → 2 GPUs × 2048 each
import torch
from megatron.core.tensor_parallel import ColumnParallelLinear, RowParallelLinear

class ParallelMLP(torch.nn.Module):
    def __init__(self, hidden_size, ffn_hidden_size):
        super().__init__()
        # Column-parallel: splits the output dimension across TP ranks
        self.dense_h_to_4h = ColumnParallelLinear(
            hidden_size,
            ffn_hidden_size,
            gather_output=False,
        )
        # Row-parallel: splits the input dimension, all-reduces the output
        self.dense_4h_to_h = RowParallelLinear(
            ffn_hidden_size,
            hidden_size,
            input_is_parallel=True,
        )
```

Pipeline Parallelism

```python
# Splits model layers across GPUs in stages
from megatron.core.pipeline_parallel import get_forward_backward_func

# Configure pipeline schedule
forward_backward_func = get_forward_backward_func()

# Interleaved schedule for better GPU utilization
losses = forward_backward_func(
    forward_step_func=forward_step,
    data_iterator=data_iterator,
    model=model,
    num_microbatches=num_microbatches,
    seq_length=seq_length,
    micro_batch_size=micro_batch_size,
)
```
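Why the number of microbatches matters: with the standard 1F1B schedule, the fraction of time pipeline stages sit idle (the "bubble") is `(p - 1) / (m + p - 1)` for `p` stages and `m` microbatches, so more microbatches shrink the bubble. A quick sketch of that arithmetic:

```python
def pipeline_bubble_fraction(pp_stages, num_microbatches):
    """Idle fraction of a 1F1B pipeline schedule: (p - 1) / (m + p - 1)."""
    p, m = pp_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# 4 pipeline stages, 12 microbatches → 20% of step time is bubble
print(pipeline_bubble_fraction(4, 12))   # 0.2
# Doubling microbatches to 24 cuts the bubble to ~11%
print(pipeline_bubble_fraction(4, 24))
```

Interleaved (virtual) pipeline stages reduce the bubble further by giving each GPU several smaller stages, at the cost of more communication.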

Parallelism Configuration Guide

| Model Size | GPUs | TP | PP | DP | Notes |
|---|---|---|---|---|---|
| 7B | 8 | 1 | 1 | 8 | Pure data parallel |
| 13B | 8 | 2 | 1 | 4 | TP across NVLink pairs |
| 30B | 16 | 4 | 2 | 2 | TP within node, PP across |
| 70B | 32 | 4 | 4 | 2 | Full 3D parallelism |
| 175B | 64 | 8 | 8 | 1 | Maximum TP within node |
| 462B (MoE) | 256 | 8 | 8 | 4 | + Expert parallelism |
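The degrees in the table are not independent: data-parallel size is always derived as `world_size / (TP × PP)`, and the product must divide the GPU count exactly. A small sanity-check helper (hypothetical, for illustration):

```python
def derive_dp(world_size, tp, pp):
    """Derive data-parallel size; TP * PP must divide the world size."""
    assert world_size % (tp * pp) == 0, "TP * PP must divide world size"
    return world_size // (tp * pp)

# Check rows from the table above
print(derive_dp(8, 1, 1))     # 8  (7B: pure data parallel)
print(derive_dp(8, 2, 1))     # 4  (13B)
print(derive_dp(32, 4, 4))    # 2  (70B: full 3D parallelism)
print(derive_dp(256, 8, 8))   # 4  (462B MoE)
```

Running this check before launch catches misconfigured `--tensor-model-parallel-size` / `--pipeline-model-parallel-size` combinations early, rather than failing during process-group initialization.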

Model Configuration

```python
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    num_query_groups=8,                   # GQA (Grouped Query Attention)
    ffn_hidden_size=11008,
    max_position_embeddings=4096,
    use_rotary_position_embeddings=True,
    fp16=False,
    bf16=True,
    fp8=True,                             # H100 FP8 training
    attention_dropout=0.0,
    hidden_dropout=0.0,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    sequence_parallel=True,               # Save memory on activations
    recompute_granularity='selective',
)
```

Best Practices

  1. Keep TP within NVLink domain — Tensor parallel communication is heavy; keep TP size ≀ NVLink GPUs per node (usually 8)
  2. Use sequence parallelism with TP — Reduces activation memory proportional to TP degree
  3. Interleave pipeline stages — Virtual pipeline stages improve GPU utilization from ~50% to ~80%
  4. Use BF16 or FP8 — BF16 is standard; FP8 on H100s gives 2x throughput
  5. Tune micro-batch size — Larger micro-batch = better GPU utilization but more memory
  6. Enable selective recomputation — Recompute attention instead of storing all activations
  7. Monitor MFU — Target >40% MFU; below 30% indicates inefficiency
  8. Use flash attention — Reduces memory and increases speed for long sequences
  9. Profile with nsys — NVIDIA Nsight Systems shows communication vs compute balance
  10. Checkpoint frequently — Large training runs are expensive; save every 1000 steps minimum
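For practice 7, MFU can be estimated from throughput alone: a dense transformer spends roughly 6 FLOPs per parameter per token (forward + backward), so achieved FLOP/s ≈ `6 × params × tokens/sec`, divided by aggregate peak hardware FLOP/s. A hedged back-of-envelope sketch (the 6N approximation ignores attention FLOPs, so it slightly understates work for long sequences):

```python
def estimate_mfu(num_params, tokens_per_sec, num_gpus, peak_flops_per_gpu):
    """Model FLOP Utilization: achieved model FLOP/s over aggregate peak FLOP/s.

    Uses the standard ~6 FLOPs per parameter per token approximation
    (forward pass ~2N, backward pass ~4N) for dense models.
    """
    achieved_flops = 6 * num_params * tokens_per_sec
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Example: 7B-parameter model on 8 GPUs; peak FLOP/s per GPU depends on
# hardware and precision (check your GPU's datasheet for the BF16/FP8 number)
mfu = estimate_mfu(7e9, tokens_per_sec=80_000, num_gpus=8, peak_flops_per_gpu=989e12)
print(f"MFU: {mfu:.1%}")
```

If this estimate falls below the ~30% threshold from practice 7, profile with nsys (practice 9) to see whether communication or data loading is the culprit.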

Troubleshooting

Low GPU utilization

```bash
# Check communication overhead
nsys profile --trace=cuda,nvtx torchrun --nproc_per_node=8 train.py

# If communication dominates, reduce TP size or increase micro-batch
```

OOM with pipeline parallelism

```bash
# Reduce micro-batch size and increase accumulation
--micro-batch-size 1 --global-batch-size 256

# Or enable activation recomputation
--recompute-granularity full --recompute-method uniform
```

Slow data loading bottleneck

```bash
# Use multiple workers and prefetch
--num-workers 8 --dataloader-type cyclic --data-cache-path /local_ssd/cache
```