# Distributed Training Megatron Studio

All-in-one skill covering training large language models. Includes structured workflows, validation checks, and reusable patterns for AI research.
## Distributed Training with NVIDIA Megatron-Core

### Overview

A comprehensive skill for training large language models (2B to 462B+ parameters) using NVIDIA's Megatron-Core framework. Megatron-Core provides advanced 3D parallelism (tensor, pipeline, data), sequence parallelism, expert parallelism for MoE models, and FP8 training support, achieving up to 47% Model FLOP Utilization (MFU) on H100 GPUs.
### When to Use
- Training LLMs from scratch at scale (>7B parameters)
- Need 3D parallelism: tensor + pipeline + data parallel
- Training Mixture-of-Experts (MoE) models
- Maximizing GPU utilization on NVIDIA hardware
- FP8 training on H100/H200 GPUs
- Pre-training foundational models
### Quick Start

```bash
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or install via pip
pip install megatron-core

# Launch training: 8 GPUs with 3D parallelism
torchrun --nproc_per_node=8 train.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2 \
    --num-layers 32 \
    --hidden-size 4096 \
    --num-attention-heads 32 \
    --seq-length 4096 \
    --micro-batch-size 4 \
    --global-batch-size 256
```
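The flags above fully determine the data-parallel size and the number of gradient-accumulation steps. A minimal sketch of that arithmetic (the `derive_parallelism` helper is ours, not a Megatron-Core API):

```python
def derive_parallelism(world_size, tp, pp, micro_batch, global_batch):
    """Derive the data-parallel size and gradient-accumulation steps
    implied by a Megatron-style launch (hypothetical helper)."""
    assert world_size % (tp * pp) == 0, "world size must be divisible by TP x PP"
    dp = world_size // (tp * pp)
    assert global_batch % (micro_batch * dp) == 0, \
        "global batch must be divisible by micro-batch x DP"
    accum_steps = global_batch // (micro_batch * dp)
    return dp, accum_steps

# The launch above: 8 GPUs, TP=2, PP=2, micro-batch 4, global batch 256
dp, accum = derive_parallelism(8, tp=2, pp=2, micro_batch=4, global_batch=256)
print(dp, accum)  # DP = 2 replicas, 32 gradient-accumulation steps
```

If any divisibility constraint fails, Megatron refuses the launch, so checking this arithmetic before submitting a multi-node job saves queue time.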
### 3D Parallelism Explained

```
┌──────────────────────────────────────────────────┐
│             3D Parallelism Overview              │
│                                                  │
│  Data Parallel   GPU 0-1    GPU 2-3    GPU 4-5   │
│  (replicas)      ┌──────┐   ┌──────┐   ┌──────┐  │
│                  │DP = 0│   │DP = 1│   │DP = 2│  │
│                  └──────┘   └──────┘   └──────┘  │
│                                                  │
│  Pipeline        Layer      Layer      Layer     │
│  Parallel        0-15       0-15       0-15      │
│  (stages)        16-31      16-31      16-31     │
│                                                  │
│  Tensor          Head       Head       Head      │
│  Parallel        0-15       0-15       0-15      │
│  (split layers)  16-31      16-31      16-31     │
└──────────────────────────────────────────────────┘
```
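One way to read the diagram: with world size TP × PP × DP, each global rank maps to a coordinate in the 3D grid. A sketch under the assumption that TP varies fastest, then PP, then DP (one common convention; Megatron-Core's actual group ordering is configurable, so treat this as illustrative):

```python
def rank_coords(rank, tp, pp, dp):
    """Map a global rank to (dp, pp, tp) grid coordinates, assuming TP
    varies fastest, then PP, then DP (an illustrative convention, not
    necessarily Megatron-Core's default ordering)."""
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

# 8 GPUs, TP=2, PP=2, DP=2: ranks 0 and 1 hold tensor shards of
# pipeline stage 0 in data-parallel replica 0
for rank in range(8):
    print(rank, rank_coords(rank, tp=2, pp=2, dp=2))
```

This ordering keeps tensor-parallel peers adjacent, which matters because TP traffic is the heaviest and should stay on NVLink.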
#### Tensor Parallelism

```python
# Splits individual layers across GPUs
# Example: 4096 hidden -> 2 GPUs x 2048 each
import torch

from megatron.core.tensor_parallel import ColumnParallelLinear, RowParallelLinear

class ParallelMLP(torch.nn.Module):
    def __init__(self, hidden_size, ffn_hidden_size):
        super().__init__()
        # Column-parallel: splits the output dimension across TP ranks
        self.dense_h_to_4h = ColumnParallelLinear(
            hidden_size,
            ffn_hidden_size,
            gather_output=False,
        )
        # Row-parallel: splits the input dimension; all-reduces the result
        self.dense_4h_to_h = RowParallelLinear(
            ffn_hidden_size,
            hidden_size,
            input_is_parallel=True,
        )
```
#### Pipeline Parallelism

```python
# Splits model layers across GPUs in stages
from megatron.core.pipeline_parallel import get_forward_backward_func

# Configure pipeline schedule
forward_backward_func = get_forward_backward_func()

# Interleaved schedule for better GPU utilization
losses = forward_backward_func(
    forward_step_func=forward_step,
    data_iterator=data_iterator,
    model=model,
    num_microbatches=num_microbatches,
    seq_length=seq_length,
    micro_batch_size=micro_batch_size,
)
```
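Why the microbatch count matters: with p pipeline stages and m microbatches, the 1F1B schedule idles for a "bubble" fraction of (p − 1)/(m + p − 1), so more microbatches mean less idle time. A quick sketch (the helper name is ours):

```python
def bubble_fraction(pipeline_stages, num_microbatches):
    """Idle ('bubble') fraction of the 1F1B pipeline schedule:
    (p - 1) / (m + p - 1)."""
    p, m = pipeline_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# With 4 stages, going from 4 to 32 microbatches shrinks the bubble
print(bubble_fraction(4, 4))   # 3/7  ~= 0.43
print(bubble_fraction(4, 32))  # 3/35 ~= 0.086
```

Interleaved (virtual-stage) schedules reduce the bubble further by giving each GPU several smaller stages.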
### Parallelism Configuration Guide
| Model Size | GPUs | TP | PP | DP | Notes |
|---|---|---|---|---|---|
| 7B | 8 | 1 | 1 | 8 | Pure data parallel |
| 13B | 8 | 2 | 1 | 4 | TP across NVLink pairs |
| 30B | 16 | 4 | 2 | 2 | TP within node, PP across |
| 70B | 32 | 4 | 4 | 2 | Full 3D parallelism |
| 175B | 64 | 8 | 8 | 1 | Maximum TP within node |
| 462B (MoE) | 256 | 8 | 8 | 4 | + Expert parallelism |
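A quick consistency check on the table: in every row the GPU count must equal TP × PP × DP. A small sketch restating the rows above:

```python
# Restating the table rows: (model, GPUs, TP, PP, DP)
configs = [
    ("7B",         8,   1, 1, 8),
    ("13B",        8,   2, 1, 4),
    ("30B",        16,  4, 2, 2),
    ("70B",        32,  4, 4, 2),
    ("175B",       64,  8, 8, 1),
    ("462B (MoE)", 256, 8, 8, 4),
]
for name, gpus, tp, pp, dp in configs:
    assert gpus == tp * pp * dp, f"{name}: GPUs != TP x PP x DP"
print("all rows consistent")
```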
### Model Configuration

```python
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=32,
    hidden_size=4096,
    num_attention_heads=32,
    num_query_groups=8,  # GQA (Grouped Query Attention)
    ffn_hidden_size=11008,
    max_position_embeddings=4096,
    use_rotary_position_embeddings=True,
    fp16=False,
    bf16=True,
    fp8=True,  # H100 FP8 training
    attention_dropout=0.0,
    hidden_dropout=0.0,
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    sequence_parallel=True,  # Save memory on activations
    recompute_granularity='selective',
)
```
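For a config like the one above, the implied dense parameter count can be estimated. This hypothetical helper assumes GQA attention and a SwiGLU-style MLP with three projections, and ignores biases, norms, and embedding tying; the 32k vocabulary is also an assumption:

```python
def approx_params(num_layers, hidden, heads, query_groups, ffn_hidden,
                  vocab=32_000):
    """Rough decoder-only parameter count (hypothetical helper).
    Assumes GQA attention and a SwiGLU MLP (three projections);
    ignores biases, norms, and embedding tying."""
    head_dim = hidden // heads
    kv_dim = query_groups * head_dim       # K/V width shared across groups
    attn = hidden * hidden                 # Q projection
    attn += 2 * hidden * kv_dim            # K and V projections
    attn += hidden * hidden                # output projection
    mlp = 3 * hidden * ffn_hidden          # gate, up, and down projections
    return num_layers * (attn + mlp) + vocab * hidden

# The config above: 32 layers, 4096 hidden, 32 heads, 8 KV groups, 11008 FFN
print(f"{approx_params(32, 4096, 32, 8, 11008) / 1e9:.1f}B")  # ~5.8B
```

An estimate like this is mainly useful for sizing memory and picking a row in the parallelism table before launching anything.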
### Best Practices

- Keep TP within the NVLink domain: tensor-parallel communication is heavy, so keep the TP size <= the number of NVLink-connected GPUs per node (usually 8)
- Use sequence parallelism with TP: reduces activation memory in proportion to the TP degree
- Interleave pipeline stages: virtual pipeline stages improve GPU utilization from ~50% to ~80%
- Use BF16 or FP8: BF16 is the standard; FP8 on H100s gives up to 2x throughput
- Tune the micro-batch size: larger micro-batches improve GPU utilization but cost more memory
- Enable selective recomputation: recompute attention instead of storing all activations
- Monitor MFU: target >40% MFU; below 30% indicates an inefficiency
- Use flash attention: reduces memory and increases speed for long sequences
- Profile with nsys: NVIDIA Nsight Systems shows the communication vs. compute balance
- Checkpoint frequently: large training runs are expensive; save at least every 1000 steps
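The MFU guideline can be made concrete: MFU is achieved model FLOPs per second divided by the hardware peak, using the common ~6·N training-FLOPs-per-token approximation for a dense model with N parameters. A sketch (the helper and the ~989 TFLOPs H100 dense BF16 peak are our assumptions):

```python
def mfu(params, tokens_per_second, num_gpus, peak_flops_per_gpu=989e12):
    """Model FLOP Utilization: achieved FLOPs/s over hardware peak, using
    the common ~6*N training FLOPs-per-token approximation. The default
    peak is H100 dense BF16 (~989 TFLOPs), an assumption."""
    achieved_flops = 6 * params * tokens_per_second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)

# e.g. a 70B dense model sustaining 55,000 tokens/s on 64 H100s
print(f"{mfu(70e9, 55_000, 64):.1%}")  # ~36.5%, below the >40% target
```

Tracking this number per run makes regressions from a bad parallelism layout or a slow interconnect visible immediately.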
### Troubleshooting

#### Low GPU utilization

```bash
# Check communication overhead
nsys profile --trace=cuda,nvtx torchrun --nproc_per_node=8 train.py

# If communication dominates, reduce TP size or increase the micro-batch size
```
#### OOM with pipeline parallelism

```bash
# Reduce micro-batch size and increase accumulation
--micro-batch-size 1 --global-batch-size 256

# Or enable activation recomputation
--recompute-granularity full --recompute-method uniform
```
#### Slow data loading bottleneck

```bash
# Use multiple workers and prefetch
--num-workers 8 --dataloader-type cyclic --data-cache-path /local_ssd/cache
```