
OpenRLHF Post-Training Toolkit

Ray-based RLHF framework optimized for distributed training with vLLM inference acceleration. Provides efficient implementations of PPO, DPO, GRPO, and REINFORCE algorithms at scale.

When to Use

Choose OpenRLHF when:

  • Need distributed RLHF/PPO training across multiple GPUs
  • Want vLLM-accelerated generation during training
  • Training at scale with Ray orchestration
  • Need a battle-tested open-source RLHF implementation

Consider alternatives when:

  • Single-GPU fine-tuning → use TRL directly
  • Enterprise-scale MoE training → use Miles
  • PyTorch-native without Ray → use torchforge
  • Only need DPO/SFT → use TRL (simpler setup)

Quick Start

Installation

```bash
# Docker (recommended)
docker run --runtime=nvidia -it --rm --shm-size="10g" \
  --cap-add=SYS_ADMIN \
  -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash

# Install with vLLM support
pip install openrlhf[vllm]
```

PPO Training

```bash
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "/openrlhf"}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 \
  --ref_num_gpus_per_node 1 \
  --reward_num_nodes 1 \
  --reward_num_gpus_per_node 1 \
  --critic_num_nodes 1 \
  --critic_num_gpus_per_node 1 \
  --actor_num_nodes 1 \
  --actor_num_gpus_per_node 1 \
  --vllm_num_engines 1 \
  --vllm_tensor_parallel_size 1 \
  --pretrain meta-llama/Llama-3.1-8B \
  --reward_pretrain meta-llama/Llama-3.1-8B-RM \
  --save_path ./checkpoint \
  --micro_train_batch_size 8 \
  --train_batch_size 128 \
  --micro_rollout_batch_size 32 \
  --rollout_batch_size 1024 \
  --max_epochs 1 \
  --prompt_max_len 1024 \
  --generate_max_len 1024 \
  --learning_rate 5e-7 \
  --init_kl_coef 0.01
```

DPO Training

```bash
deepspeed --module openrlhf.cli.train_dpo \
  --save_path ./dpo-checkpoint \
  --save_steps -1 \
  --logging_steps 1 \
  --micro_train_batch_size 2 \
  --train_batch_size 128 \
  --pretrain meta-llama/Llama-3.1-8B \
  --bf16 \
  --max_epochs 1 \
  --max_len 2048 \
  --dataset your_preference_dataset \
  --learning_rate 5e-7 \
  --beta 0.1
```

Core Concepts

Architecture

OpenRLHF uses Ray to orchestrate four model roles across GPUs:

  • Actor — the policy model being trained
  • Critic — value estimation for advantage computation
  • Reward Model — scores generated responses
  • Reference Model — KL divergence anchor (frozen)

vLLM handles generation, providing 2-4x speedup over naive autoregressive decoding.
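The interplay between the four roles can be sketched for a single response. This is a toy calculation in plain Python, not OpenRLHF code (all names here are illustrative): the reward model's score is penalized by the KL divergence to the reference model, and subtracting the critic's value estimate yields the advantage that drives the actor's PPO update.

```python
# Toy illustration of how the four roles combine for one response:
# reward model scores it, reference model anchors a KL penalty,
# critic's value estimate turns the result into an advantage.

def per_token_kl(logp_actor, logp_ref):
    """Simple per-token KL estimate from actor and reference log-probs."""
    return [a - r for a, r in zip(logp_actor, logp_ref)]

def shaped_reward(rm_score, logp_actor, logp_ref, kl_coef=0.01):
    """Sequence-level reward minus the KL penalty summed over tokens."""
    kl = sum(per_token_kl(logp_actor, logp_ref))
    return rm_score - kl_coef * kl

# Hypothetical per-token log-probs for a 3-token response.
logp_actor = [-1.2, -0.8, -2.0]
logp_ref   = [-1.3, -1.0, -1.9]

reward = shaped_reward(rm_score=0.75, logp_actor=logp_actor, logp_ref=logp_ref)
advantage = reward - 0.60   # 0.60 = critic's value estimate for this prompt
print(reward, advantage)
```

The `kl_coef` here plays the role of `init_kl_coef` in the training command: a larger value keeps the actor closer to the reference model.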

Supported Algorithms

| Algorithm | Use Case | Memory | Complexity |
|---|---|---|---|
| PPO | Full RLHF with learned rewards | High (4 models) | Complex |
| GRPO | Group-relative optimization | Medium (2 models) | Moderate |
| DPO | Direct preference optimization | Low (1 model) | Simple |
| REINFORCE | Basic policy gradient | Medium (2 models) | Moderate |
| SFT | Supervised fine-tuning | Low (1 model) | Simple |
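DPO's low memory footprint in the table comes from its closed-form loss: no reward model or critic is needed, only policy and reference log-probs for a chosen/rejected pair. A minimal sketch of the standard DPO objective (not the OpenRLHF implementation; all values are hypothetical):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * implicit reward margin)."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical sequence log-probs where the policy already prefers "chosen".
loss = dpo_loss(policy_chosen=-12.0, policy_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0, beta=0.1)
print(loss)
```

The `beta` parameter matches the `--beta 0.1` flag in the DPO command above: it scales how strongly the loss pushes the policy's implicit reward margin apart.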

Ray Resource Allocation

```yaml
# Multi-node PPO configuration
actor:
  num_nodes: 2
  gpus_per_node: 4
critic:
  num_nodes: 1
  gpus_per_node: 4
reward:
  num_nodes: 1
  gpus_per_node: 2
reference:
  num_nodes: 1
  gpus_per_node: 2
vllm:
  num_engines: 4
  tensor_parallel_size: 2
```
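A quick sanity check when sizing a cluster for a config like the one above, assuming each role and each vLLM engine gets dedicated GPUs (i.e. no colocation):

```python
# GPU accounting for the multi-node PPO configuration above.
# Training roles: num_nodes * gpus_per_node each.
roles = {
    "actor":     2 * 4,
    "critic":    1 * 4,
    "reward":    1 * 2,
    "reference": 1 * 2,
}
# vLLM generation is separate: num_engines * tensor_parallel_size.
vllm_gpus = 4 * 2

total = sum(roles.values()) + vllm_gpus
print(total)
```

With dedicated vLLM GPUs, this configuration needs 24 GPUs in total; the cluster must provide at least that many before the Ray job can be scheduled.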

Configuration

| Parameter | Default | Description |
|---|---|---|
| pretrain | (required) | Base model path |
| reward_pretrain | (required) | Reward model path |
| train_batch_size | 128 | Global training batch size |
| micro_train_batch_size | 8 | Per-GPU batch size |
| rollout_batch_size | 1024 | Prompts per rollout |
| max_epochs | 1 | Training epochs |
| learning_rate | 5e-7 | Actor learning rate |
| init_kl_coef | 0.01 | Initial KL penalty |
| vllm_num_engines | 1 | vLLM generation workers |
| vllm_tensor_parallel_size | 1 | GPUs per vLLM engine |
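The global and per-GPU batch sizes relate through gradient accumulation. A rough sketch of the arithmetic, using the defaults above and a hypothetical 4-GPU actor group:

```python
# How the global batch is assembled from per-GPU micro-batches:
# each optimizer step accumulates gradients over several micro-batches
# on every actor GPU until the global batch size is reached.
train_batch_size = 128        # global batch per optimizer step
micro_train_batch_size = 8    # per-GPU batch per forward/backward pass
num_actor_gpus = 4            # hypothetical actor group size

accumulation_steps = train_batch_size // (micro_train_batch_size * num_actor_gpus)
print(accumulation_steps)
```

If the division does not come out even, the effective global batch no longer matches `train_batch_size`, so it is worth keeping these three numbers consistent when scaling the actor group.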

Best Practices

  1. Use Docker for reproducible environments — GPU driver and library versions matter
  2. Start with DPO if you have preference data — simpler than PPO, often comparable results
  3. Scale vLLM engines to match generation throughput with training throughput
  4. Monitor KL divergence — if it exceeds 15, reduce learning rate or increase init_kl_coef
  5. Use gradient checkpointing for models above 13B to fit in GPU memory
  6. Save checkpoints frequently — distributed training failures lose all unsaved progress
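The KL check in practice 4 can be automated with the same simple per-token estimator sketched earlier; the function names and alert threshold here are illustrative, not part of the OpenRLHF API:

```python
def mean_kl(logp_actor, logp_ref):
    """Mean per-token KL estimate: mean(logp_actor - logp_ref)."""
    diffs = [a - r for a, r in zip(logp_actor, logp_ref)]
    return sum(diffs) / len(diffs)

def kl_alert(batches, threshold=15.0):
    """True if any batch's mean KL exceeds the threshold from practice 4."""
    return any(mean_kl(a, r) > threshold for a, r in batches)

healthy = [([-1.0, -2.0], [-1.1, -2.2])]    # mean KL = 0.15
drifted = [([-1.0, -2.0], [-17.0, -19.0])]  # mean KL = 16.5
print(kl_alert(healthy), kl_alert(drifted))
```

When the alert fires, the remedies from practice 4 apply: reduce the learning rate or raise `init_kl_coef`.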

Common Issues

Ray cluster not connecting: Ensure Ray head node is running (ray start --head --port=6379). Check firewall rules between nodes. Verify RAY_ADDRESS environment variable.

vLLM OOM during generation: Reduce rollout_batch_size or increase vllm_tensor_parallel_size. vLLM needs dedicated GPU memory separate from training.

Training instability: Lower learning rate to 1e-7. Increase init_kl_coef to 0.05. Ensure reward model produces well-calibrated scores (not all same value).
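The calibration point above is easy to check before training: if the reward model assigns nearly identical scores to every response, the policy gets no learning signal. A minimal pre-flight check (illustrative names, not an OpenRLHF utility):

```python
import statistics

def is_calibrated(scores, min_std=1e-3):
    """Reject reward scores with near-zero spread across responses."""
    return statistics.pstdev(scores) > min_std

print(is_calibrated([0.5, 0.5, 0.5]))  # degenerate: all identical
print(is_calibrated([0.1, 0.7, 0.4]))  # scores actually discriminate
```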
