Skill · ai research · v1.0.0 · MIT

Enterprise-grade skill providing guidance for AI research: structured workflows, validation checks, and reusable patterns.

Miles Post-Training Engine

Enterprise-grade reinforcement learning framework for large-scale model post-training, optimized for MoE models (1TB+), FP8/INT4 quantization-aware training, and bit-wise identical train-inference alignment.

When to Use

Choose Miles when:

  • Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
  • Need FP8 or INT4 quantization-aware training
  • Require bit-wise identical results between training and inference
  • Production RL at enterprise scale with fault tolerance
  • Multi-node distributed training across 100+ GPUs

Consider alternatives when:

  • Smaller models (< 70B parameters) → use TRL with GRPO
  • Research experiments without production requirements → use OpenRLHF
  • PyTorch-native approach without Ray → use torchforge
  • Simple RLHF/DPO fine-tuning → use TRL directly

Quick Start

Installation

```bash
# Clone and install
git clone https://github.com/miles-ai/miles.git
cd miles
pip install -e .

# With distributed support
pip install -e ".[distributed]"
```

Basic Training Run

```python
from miles import MilesTrainer, MilesConfig

config = MilesConfig(
    model_name="deepseek-ai/DeepSeek-V3",
    training_method="grpo",
    precision="fp8",
    num_nodes=4,
    gpus_per_node=8,
    batch_size=256,
    reward_model="deepseek-ai/DeepSeek-V3-RM",
)

trainer = MilesTrainer(config)
trainer.train(dataset="your-dataset")
```

Multi-Node Launch

```bash
# Launch across 4 nodes with 8 GPUs each
miles launch \
  --config configs/deepseek-v3-grpo.yaml \
  --nodes 4 \
  --gpus-per-node 8 \
  --precision fp8
```

Core Concepts

MoE Training Stability

Miles addresses critical challenges in Mixture-of-Experts training:

  • Expert load balancing during RL training
  • Gradient scaling across experts
  • Router stability under policy updates
  • Memory-efficient expert parallelism
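The expert load-balancing concern above is commonly addressed with a Switch-Transformer-style auxiliary loss added to the RL objective. Miles does not document its exact formula here, so the following is a generic plain-Python illustration; `router_probs` and `expert_assignments` are hypothetical inputs:

```python
def load_balance_loss(router_probs, expert_assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: num_experts * sum_i(f_i * p_i),
    where f_i is the fraction of tokens routed to expert i and p_i is the
    mean router probability mass on expert i. It is minimized (value 1.0)
    when routing is perfectly uniform, and grows as routing collapses."""
    num_tokens = len(expert_assignments)
    loss = 0.0
    for e in range(num_experts):
        # f_e: fraction of tokens dispatched to expert e
        f = sum(1 for a in expert_assignments if a == e) / num_tokens
        # p_e: mean router probability assigned to expert e
        p = sum(probs[e] for probs in router_probs) / num_tokens
        loss += f * p
    return num_experts * loss

# Perfectly balanced routing over 2 experts gives the minimum value 1.0
probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
assign = [0, 1, 0, 1]
print(load_balance_loss(probs, assign, num_experts=2))  # → 1.0
```

In practice this term is scaled by a small coefficient and added to the policy loss; increasing that coefficient is the usual remedy when utilization skews (see Common Issues below).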

FP8 Training Pipeline

```python
config = MilesConfig(
    precision="fp8",
    fp8_config={
        "amax_history_len": 1024,
        "amax_compute_algo": "max",
        "fp8_format": "e4m3",
    },
)
```
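The `amax_history_len` and `amax_compute_algo` settings follow the common delayed-scaling recipe for FP8 (popularized by NVIDIA's Transformer Engine): the per-tensor scale factor is derived from a rolling history of absolute maxima rather than the current tensor alone. A minimal sketch, assuming the E4M3 maximum representable value of 448:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def fp8_scale(amax_history, algo="max"):
    """Delayed scaling: pick a representative amax from the history,
    then choose a scale so that amax maps onto the FP8 dynamic range."""
    if algo == "max":
        amax = max(amax_history)
    else:  # e.g. "most_recent"
        amax = amax_history[-1]
    return E4M3_MAX / amax

# Recent absolute maxima peaked at 112, so values are scaled up 4x
# before being cast to FP8:
history = [100.0, 112.0, 96.0]
print(fp8_scale(history))  # → 4.0
```

A longer history smooths the scale over time; this is why a too-short `amax_history_len` can cause the scale oscillation described under Common Issues.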

Train-Inference Alignment

Miles guarantees bit-wise identical outputs between training and inference by:

  • Using identical numerical implementations across both pipelines
  • Synchronizing RNG states across distributed workers
  • Matching attention implementations exactly
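"Bit-wise identical" is a stricter guarantee than numerical closeness: two floats that print the same can still differ in their last mantissa bit. A library-agnostic way to check this property (not Miles' bundled verification script, which is not shown here) is to compare raw IEEE-754 bit patterns:

```python
import struct

def bitwise_equal(xs, ys):
    """Compare two sequences of floats by their raw IEEE-754 bit patterns.
    Unlike math.isclose, this rejects any difference, even a single ULP."""
    if len(xs) != len(ys):
        return False
    return all(
        struct.pack("<d", x) == struct.pack("<d", y)
        for x, y in zip(xs, ys)
    )

# Classic example: 0.1 + 0.2 is close to 0.3 but not bit-wise equal to it
train_logits = [0.1 + 0.2, 1.0]
infer_logits = [0.3, 1.0]
print(bitwise_equal(train_logits, infer_logits))  # → False
```

Reordered reductions, fused kernels, or differing attention implementations all produce exactly this kind of last-bit drift, which is why the three measures above target the numerical implementation itself.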

Configuration

| Parameter | Default | Description |
|---|---|---|
| `training_method` | `"grpo"` | RL algorithm (`grpo`, `ppo`, `dpo`, `reinforce`) |
| `precision` | `"bf16"` | Training precision (`fp8`, `bf16`, `fp32`) |
| `num_nodes` | `1` | Number of compute nodes |
| `gpus_per_node` | `8` | GPUs per node |
| `batch_size` | `128` | Global batch size |
| `max_seq_len` | `4096` | Maximum sequence length |
| `checkpoint_interval` | `1000` | Steps between checkpoints |
| `fault_tolerance` | `True` | Auto-recovery from failures |
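The launch command in Quick Start reads these parameters from a YAML file. The schema is not documented here, so the sketch below simply assumes the YAML keys mirror the Python parameter names above:

```yaml
# configs/deepseek-v3-grpo.yaml — illustrative only; keys assumed to
# match the MilesConfig parameters in the table above
model_name: deepseek-ai/DeepSeek-V3
training_method: grpo
precision: fp8
num_nodes: 4
gpus_per_node: 8
batch_size: 256
max_seq_len: 4096
checkpoint_interval: 1000
fault_tolerance: true
```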

Best Practices

  1. Use FP8 for 1TB+ models — reduces memory by 2x vs BF16 with minimal accuracy loss
  2. Enable fault tolerance for long training runs — auto-recovery saves days of re-training
  3. Monitor expert load balance during MoE training to catch routing collapse early
  4. Use the built-in evaluation suite to track reward model agreement throughout training
  5. Start with smaller batch sizes and scale up — ensures stability before committing resources
  6. Validate train-inference alignment with the included verification scripts

Common Issues

Expert routing collapse: Increase the load balance loss coefficient. Monitor per-expert utilization and restart with adjusted auxiliary loss if any expert drops below 5% utilization.
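The 5% utilization threshold above can be monitored with a few lines over per-expert token counts (a generic sketch, not a Miles API):

```python
def underutilized_experts(token_counts, threshold=0.05):
    """Return indices of experts receiving less than `threshold` of all
    routed tokens — an early warning sign of routing collapse."""
    total = sum(token_counts)
    return [
        i for i, count in enumerate(token_counts)
        if count / total < threshold
    ]

# Expert 2 receives 2 of 400 tokens (0.5%), so it gets flagged
print(underutilized_experts([200, 150, 2, 48]))  # → [2]
```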

FP8 training divergence: Check amax history length — too short causes scale oscillation. Switch to e4m3 format if e5m2 shows instability. Fall back to BF16 for the first 1000 steps before switching to FP8.
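The BF16-warmup fallback above amounts to a simple precision schedule; the 1000-step threshold is the one suggested in the text, not a universal constant:

```python
def precision_for_step(step, fp8_start=1000):
    """Run the first `fp8_start` steps in BF16 so amax statistics
    stabilize before switching the run over to FP8."""
    return "bf16" if step < fp8_start else "fp8"

print([precision_for_step(s) for s in (0, 999, 1000)])  # → ['bf16', 'bf16', 'fp8']
```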

Out of memory on large MoE models: Use expert parallelism across GPUs. Reduce max_seq_len or enable gradient checkpointing. Consider pipeline parallelism for models exceeding single-node memory.
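As a back-of-the-envelope check for the expert-parallelism advice above: sharding expert weights across GPUs divides their memory footprint by the expert-parallel degree. The numbers below are illustrative, not DeepSeek-V3's actual layout:

```python
def expert_memory_per_gpu_gb(total_expert_params, bytes_per_param, ep_degree):
    """Expert weights are sharded across `ep_degree` GPUs, so each GPU
    holds roughly 1/ep_degree of the expert parameters (ignoring
    activations, optimizer state, and non-expert weights)."""
    return total_expert_params * bytes_per_param / ep_degree / 1e9

# 600B expert parameters at 1 byte each (FP8), 16-way expert parallelism:
print(expert_memory_per_gpu_gb(600e9, 1, 16))  # → 37.5  (GB per GPU)
```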
