# OpenRLHF Post-Training Toolkit
Ray-based RLHF framework optimized for distributed training with vLLM inference acceleration. Provides efficient implementations of PPO, DPO, GRPO, and REINFORCE algorithms at scale.
## When to Use
Choose OpenRLHF when:
- Need distributed RLHF/PPO training across multiple GPUs
- Want vLLM-accelerated generation during training
- Training at scale with Ray orchestration
- Need a battle-tested open-source RLHF implementation
Consider alternatives when:
- Single-GPU fine-tuning → use TRL directly
- Enterprise-scale MoE training → use Miles
- PyTorch-native without Ray → use torchforge
- Only need DPO/SFT → use TRL (simpler setup)
## Quick Start

### Installation

```shell
# Docker (recommended)
docker run --runtime=nvidia -it --rm --shm-size="10g" \
  --cap-add=SYS_ADMIN \
  -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash

# Install with vLLM support
pip install openrlhf[vllm]
```
### PPO Training

```shell
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "/openrlhf"}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 \
  --ref_num_gpus_per_node 1 \
  --reward_num_nodes 1 \
  --reward_num_gpus_per_node 1 \
  --critic_num_nodes 1 \
  --critic_num_gpus_per_node 1 \
  --actor_num_nodes 1 \
  --actor_num_gpus_per_node 1 \
  --vllm_num_engines 1 \
  --vllm_tensor_parallel_size 1 \
  --pretrain meta-llama/Llama-3.1-8B \
  --reward_pretrain meta-llama/Llama-3.1-8B-RM \
  --save_path ./checkpoint \
  --micro_train_batch_size 8 \
  --train_batch_size 128 \
  --micro_rollout_batch_size 32 \
  --rollout_batch_size 1024 \
  --max_epochs 1 \
  --prompt_max_len 1024 \
  --generate_max_len 1024 \
  --learning_rate 5e-7 \
  --init_kl_coef 0.01
```
### DPO Training

```shell
deepspeed --module openrlhf.cli.train_dpo \
  --save_path ./dpo-checkpoint \
  --save_steps -1 \
  --logging_steps 1 \
  --micro_train_batch_size 2 \
  --train_batch_size 128 \
  --pretrain meta-llama/Llama-3.1-8B \
  --bf16 \
  --max_epochs 1 \
  --max_len 2048 \
  --dataset your_preference_dataset \
  --learning_rate 5e-7 \
  --beta 0.1
```
## Core Concepts

### Architecture
OpenRLHF uses Ray to orchestrate four model roles across GPUs:
- Actor — the policy model being trained
- Critic — value estimation for advantage computation
- Reward Model — scores generated responses
- Reference Model — KL divergence anchor (frozen)
vLLM handles generation, providing 2-4x speedup over naive autoregressive decoding.
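The way these roles combine can be sketched with a toy per-token reward computation: the actor's log-probs are penalized for drifting from the reference model, and the reward model's scalar score lands on the final token. This is an illustrative sketch with made-up numbers, not OpenRLHF's internal API:

```python
def shaped_rewards(actor_logprobs, ref_logprobs, reward_score, kl_coef=0.01):
    """Combine a scalar reward-model score with a per-token KL penalty.

    actor_logprobs / ref_logprobs: log-probs of the sampled tokens.
    reward_score: scalar from the reward model for the whole response.
    """
    kl = [a - r for a, r in zip(actor_logprobs, ref_logprobs)]  # per-token KL estimate
    rewards = [-kl_coef * k for k in kl]   # KL penalty at every token
    rewards[-1] += reward_score            # RM score added at the final token
    return rewards

r = shaped_rewards([-1.0, -2.0], [-1.1, -1.9], reward_score=0.5)
```

The KL term is what anchors the actor to the frozen reference model; raising `kl_coef` pulls the policy back toward its starting point.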
### Supported Algorithms
| Algorithm | Use Case | Memory | Complexity |
|---|---|---|---|
| PPO | Full RLHF with learned rewards | High (4 models) | Complex |
| GRPO | Group-relative optimization | Medium (2 models) | Moderate |
| DPO | Direct preference optimization | Low (1 model) | Simple |
| REINFORCE | Basic policy gradient | Medium (2 models) | Moderate |
| SFT | Supervised fine-tuning | Low (1 model) | Simple |
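DPO's low memory footprint comes from folding the reward model away entirely: the loss is computed directly from log-probs of a chosen/rejected pair under the policy and the reference model. A minimal pure-Python sketch of the pairwise loss (the log-prob values below are made up for illustration):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from summed response log-probs."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.5)
```

`beta` (0.1 in the Quick Start command) controls how strongly the implicit reward margin is scaled; a larger value makes the loss more sensitive to small preference gaps.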
### Ray Resource Allocation

```yaml
# Multi-node PPO configuration
actor:
  num_nodes: 2
  gpus_per_node: 4
critic:
  num_nodes: 1
  gpus_per_node: 4
reward:
  num_nodes: 1
  gpus_per_node: 2
reference:
  num_nodes: 1
  gpus_per_node: 2
vllm:
  num_engines: 4
  tensor_parallel_size: 2
```
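This layout can be sanity-checked by totaling the GPUs each role requests, since the cluster must provide the sum. A quick back-of-envelope helper (not part of OpenRLHF):

```python
# GPUs per role = num_nodes * gpus_per_node (engines * TP size for vLLM)
roles = {
    "actor":     2 * 4,
    "critic":    1 * 4,
    "reward":    1 * 2,
    "reference": 1 * 2,
    "vllm":      4 * 2,
}
total_gpus = sum(roles.values())
print(total_gpus)  # 24
```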
## Configuration
| Parameter | Default | Description |
|---|---|---|
| `pretrain` | — | Base model path |
| `reward_pretrain` | — | Reward model path |
| `train_batch_size` | 128 | Global training batch size |
| `micro_train_batch_size` | 8 | Per-GPU batch size |
| `rollout_batch_size` | 1024 | Prompts per rollout |
| `max_epochs` | 1 | Training epochs |
| `learning_rate` | 5e-7 | Actor learning rate |
| `init_kl_coef` | 0.01 | Initial KL penalty coefficient |
| `vllm_num_engines` | 1 | vLLM generation workers |
| `vllm_tensor_parallel_size` | 1 | GPUs per vLLM engine |
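The batch-size parameters relate through gradient accumulation; assuming the usual DeepSpeed arithmetic (global batch = micro batch × data-parallel ranks × accumulation steps), the accumulation count is implied rather than set directly:

```python
def accumulation_steps(train_batch_size, micro_train_batch_size, dp_world_size):
    """Gradient-accumulation steps implied by the batch-size settings."""
    per_step = micro_train_batch_size * dp_world_size
    assert train_batch_size % per_step == 0, "global batch must divide evenly"
    return train_batch_size // per_step

steps = accumulation_steps(128, 8, 4)  # table defaults on 4 actor GPUs -> 4
```

Keeping the division exact matters: a `train_batch_size` that is not a multiple of `micro_train_batch_size * world_size` will fail at startup.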
## Best Practices
- Use Docker for reproducible environments — GPU driver and library versions matter
- Start with DPO if you have preference data — simpler than PPO, often comparable results
- Scale vLLM engines to match generation throughput with training throughput
- Monitor KL divergence — if it exceeds 15, reduce learning rate or increase `init_kl_coef`
- Use gradient checkpointing for models above 13B to fit in GPU memory
- Save checkpoints frequently — distributed training failures lose all unsaved progress
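The KL guidance above can be automated with a simple controller that raises the penalty when divergence exceeds the target. The threshold and update rule here are illustrative, not OpenRLHF's built-in scheduler:

```python
def adjust_kl_coef(kl_coef, observed_kl, target_kl=15.0, factor=1.5):
    """Raise the KL penalty coefficient when divergence exceeds the target."""
    if observed_kl > target_kl:
        return kl_coef * factor
    return kl_coef

coef = adjust_kl_coef(0.01, observed_kl=18.0)  # KL too high -> penalty raised
```

Checking this once per rollout batch is usually enough; reacting to per-step noise tends to over-correct.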
## Common Issues
**Ray cluster not connecting:**
Ensure the Ray head node is running (`ray start --head --port=6379`), check firewall rules between nodes, and verify the `RAY_ADDRESS` environment variable.
**vLLM OOM during generation:**
Reduce `rollout_batch_size` or increase `vllm_tensor_parallel_size`; vLLM needs dedicated GPU memory separate from the training processes.
**Training instability:**
Lower the learning rate to 1e-7, increase `init_kl_coef` to 0.05, and ensure the reward model produces well-calibrated scores (not all the same value).
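The degenerate all-same-score case can be caught before training with a quick variance check on a sample of reward-model outputs (a hypothetical helper, not part of the framework):

```python
import statistics

def check_reward_spread(scores, min_std=1e-3):
    """Return False when reward scores have near-zero variance."""
    return statistics.pstdev(scores) >= min_std

ok = check_reward_spread([0.2, -0.5, 1.1, 0.3])    # healthy spread
flat = check_reward_spread([0.5, 0.5, 0.5])        # degenerate: all identical
```

A flat reward signal gives PPO zero learning gradient after advantage normalization, so it is worth failing fast here.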