Skill · ai research · v1.0.0 · MIT

Enterprise-grade skill providing guidance for AI research: structured workflows, validation checks, and reusable patterns.

Miles Post-Training Engine

Enterprise-grade reinforcement learning framework for large-scale model post-training, optimized for MoE models (1TB+), FP8/INT4 quantization-aware training, and bit-wise identical train-inference alignment.

When to Use

Choose Miles when:

  • Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
  • Need FP8 or INT4 quantization-aware training
  • Require bit-wise identical results between training and inference
  • Production RL at enterprise scale with fault tolerance
  • Multi-node distributed training across 100+ GPUs

Consider alternatives when:

  • Smaller models (< 70B parameters) → use TRL with GRPO
  • Research experiments without production requirements → use OpenRLHF
  • PyTorch-native approach without Ray → use torchforge
  • Simple RLHF/DPO fine-tuning → use TRL directly

Quick Start

Installation

```bash
# Clone and install
git clone https://github.com/miles-ai/miles.git
cd miles
pip install -e .

# With distributed support
pip install -e ".[distributed]"
```

Basic Training Run

```python
from miles import MilesTrainer, MilesConfig

config = MilesConfig(
    model_name="deepseek-ai/DeepSeek-V3",
    training_method="grpo",
    precision="fp8",
    num_nodes=4,
    gpus_per_node=8,
    batch_size=256,
    reward_model="deepseek-ai/DeepSeek-V3-RM",
)

trainer = MilesTrainer(config)
trainer.train(dataset="your-dataset")
```

Multi-Node Launch

```bash
# Launch across 4 nodes with 8 GPUs each
miles launch \
  --config configs/deepseek-v3-grpo.yaml \
  --nodes 4 \
  --gpus-per-node 8 \
  --precision fp8
```

Core Concepts

MoE Training Stability

Miles addresses critical challenges in Mixture-of-Experts training:

  • Expert load balancing during RL training
  • Gradient scaling across experts
  • Router stability under policy updates
  • Memory-efficient expert parallelism
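The expert load-balancing concern above is commonly addressed with a Switch-Transformer-style auxiliary loss added to the RL objective. Miles does not document its exact formula here, so the following is a generic plain-Python illustration; `router_probs` and `expert_assignments` are hypothetical inputs:

```python
def load_balance_loss(router_probs, expert_assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: num_experts * sum_i(f_i * p_i),
    where f_i is the fraction of tokens routed to expert i and p_i is the
    mean router probability mass on expert i. It is minimized (value 1.0)
    when routing is perfectly uniform, and grows as routing collapses."""
    num_tokens = len(expert_assignments)
    loss = 0.0
    for e in range(num_experts):
        # f_e: fraction of tokens dispatched to expert e
        f = sum(1 for a in expert_assignments if a == e) / num_tokens
        # p_e: mean router probability assigned to expert e
        p = sum(probs[e] for probs in router_probs) / num_tokens
        loss += f * p
    return num_experts * loss

# Perfectly balanced routing over 2 experts gives the minimum value 1.0
probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
assign = [0, 1, 0, 1]
print(load_balance_loss(probs, assign, num_experts=2))  # → 1.0
```

In practice this term is scaled by a small coefficient and added to the policy loss; increasing that coefficient is the usual remedy when utilization skews (see Common Issues below).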

FP8 Training Pipeline

```python
config = MilesConfig(
    precision="fp8",
    fp8_config={
        "amax_history_len": 1024,
        "amax_compute_algo": "max",
        "fp8_format": "e4m3",
    },
)
```
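The `amax_history_len` and `amax_compute_algo` settings follow the common delayed-scaling recipe for FP8 (popularized by NVIDIA's Transformer Engine): the per-tensor scale factor is derived from a rolling history of absolute maxima rather than the current tensor alone. A minimal sketch, assuming the E4M3 maximum representable value of 448:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def fp8_scale(amax_history, algo="max"):
    """Delayed scaling: pick a representative amax from the history,
    then choose a scale so that amax maps onto the FP8 dynamic range."""
    if algo == "max":
        amax = max(amax_history)
    else:  # e.g. "most_recent"
        amax = amax_history[-1]
    return E4M3_MAX / amax

# Recent absolute maxima peaked at 112, so values are scaled up 4x
# before being cast to FP8:
history = [100.0, 112.0, 96.0]
print(fp8_scale(history))  # → 4.0
```

A longer history smooths the scale over time; this is why a too-short `amax_history_len` can cause the scale oscillation described under Common Issues.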

Train-Inference Alignment

Miles guarantees bit-wise identical outputs between training and inference by:

  • Using identical numerical implementations across both pipelines
  • Synchronizing RNG states across distributed workers
  • Matching attention implementations exactly
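"Bit-wise identical" is a stricter guarantee than numerical closeness: two floats that print the same can still differ in their last mantissa bit. A library-agnostic way to check this property (not Miles' bundled verification script, which is not shown here) is to compare raw IEEE-754 bit patterns:

```python
import struct

def bitwise_equal(xs, ys):
    """Compare two sequences of floats by their raw IEEE-754 bit patterns.
    Unlike math.isclose, this rejects any difference, even a single ULP."""
    if len(xs) != len(ys):
        return False
    return all(
        struct.pack("<d", x) == struct.pack("<d", y)
        for x, y in zip(xs, ys)
    )

# Classic example: 0.1 + 0.2 is close to 0.3 but not bit-wise equal to it
train_logits = [0.1 + 0.2, 1.0]
infer_logits = [0.3, 1.0]
print(bitwise_equal(train_logits, infer_logits))  # → False
```

Reordered reductions, fused kernels, or differing attention implementations all produce exactly this kind of last-bit drift, which is why the three measures above target the numerical implementation itself.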

Configuration

| Parameter | Default | Description |
|---|---|---|
| `training_method` | `"grpo"` | RL algorithm (`grpo`, `ppo`, `dpo`, `reinforce`) |
| `precision` | `"bf16"` | Training precision (`fp8`, `bf16`, `fp32`) |
| `num_nodes` | `1` | Number of compute nodes |
| `gpus_per_node` | `8` | GPUs per node |
| `batch_size` | `128` | Global batch size |
| `max_seq_len` | `4096` | Maximum sequence length |
| `checkpoint_interval` | `1000` | Steps between checkpoints |
| `fault_tolerance` | `True` | Auto-recovery from failures |
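The launch command in Quick Start reads these parameters from a YAML file. The schema is not documented here, so the sketch below simply assumes the YAML keys mirror the Python parameter names above:

```yaml
# configs/deepseek-v3-grpo.yaml — illustrative only; keys assumed to
# match the MilesConfig parameters in the table above
model_name: deepseek-ai/DeepSeek-V3
training_method: grpo
precision: fp8
num_nodes: 4
gpus_per_node: 8
batch_size: 256
max_seq_len: 4096
checkpoint_interval: 1000
fault_tolerance: true
```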

Best Practices

  1. Use FP8 for 1TB+ models — reduces memory by 2x vs BF16 with minimal accuracy loss
  2. Enable fault tolerance for long training runs — auto-recovery saves days of re-training
  3. Monitor expert load balance during MoE training to catch routing collapse early
  4. Use the built-in evaluation suite to track reward model agreement throughout training
  5. Start with smaller batch sizes and scale up — ensures stability before committing resources
  6. Validate train-inference alignment with the included verification scripts

Common Issues

Expert routing collapse: Increase the load balance loss coefficient. Monitor per-expert utilization and restart with adjusted auxiliary loss if any expert drops below 5% utilization.
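The 5% utilization threshold above can be monitored with a few lines over per-expert token counts (a generic sketch, not a Miles API):

```python
def underutilized_experts(token_counts, threshold=0.05):
    """Return indices of experts receiving less than `threshold` of all
    routed tokens — an early warning sign of routing collapse."""
    total = sum(token_counts)
    return [
        i for i, count in enumerate(token_counts)
        if count / total < threshold
    ]

# Expert 2 receives 2 of 400 tokens (0.5%), so it gets flagged
print(underutilized_experts([200, 150, 2, 48]))  # → [2]
```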

FP8 training divergence: Check amax history length — too short causes scale oscillation. Switch to e4m3 format if e5m2 shows instability. Fall back to BF16 for the first 1000 steps before switching to FP8.
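The BF16-warmup fallback above amounts to a simple precision schedule; the 1000-step threshold is the one suggested in the text, not a universal constant:

```python
def precision_for_step(step, fp8_start=1000):
    """Run the first `fp8_start` steps in BF16 so amax statistics
    stabilize before switching the run over to FP8."""
    return "bf16" if step < fp8_start else "fp8"

print([precision_for_step(s) for s in (0, 999, 1000)])  # → ['bf16', 'bf16', 'fp8']
```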

Out of memory on large MoE models: Use expert parallelism across GPUs. Reduce max_seq_len or enable gradient checkpointing. Consider pipeline parallelism for models exceeding single-node memory.
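As a back-of-the-envelope check for the expert-parallelism advice above: sharding expert weights across GPUs divides their memory footprint by the expert-parallel degree. The numbers below are illustrative, not DeepSeek-V3's actual layout:

```python
def expert_memory_per_gpu_gb(total_expert_params, bytes_per_param, ep_degree):
    """Expert weights are sharded across `ep_degree` GPUs, so each GPU
    holds roughly 1/ep_degree of the expert parameters (ignoring
    activations, optimizer state, and non-expert weights)."""
    return total_expert_params * bytes_per_param / ep_degree / 1e9

# 600B expert parameters at 1 byte each (FP8), 16-way expert parallelism:
print(expert_memory_per_gpu_gb(600e9, 1, 16))  # → 37.5  (GB per GPU)
```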
