# OpenRLHF Post-Training Toolkit
Ray-based RLHF framework optimized for distributed training with vLLM inference acceleration. Provides efficient implementations of PPO, DPO, GRPO, and REINFORCE algorithms at scale.
## When to Use
Choose OpenRLHF when:
- Need distributed RLHF/PPO training across multiple GPUs
- Want vLLM-accelerated generation during training
- Training at scale with Ray orchestration
- Need a battle-tested open-source RLHF implementation
Consider alternatives when:
- Single-GPU fine-tuning → use TRL directly
- Enterprise-scale MoE training → use Miles
- PyTorch-native without Ray → use torchforge
- Only need DPO/SFT → use TRL (simpler setup)
## Quick Start

### Installation

```shell
# Docker (recommended)
docker run --runtime=nvidia -it --rm --shm-size="10g" \
  --cap-add=SYS_ADMIN \
  -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash

# Install with vLLM support
pip install openrlhf[vllm]
```
### PPO Training

```shell
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "/openrlhf"}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 \
  --ref_num_gpus_per_node 1 \
  --reward_num_nodes 1 \
  --reward_num_gpus_per_node 1 \
  --critic_num_nodes 1 \
  --critic_num_gpus_per_node 1 \
  --actor_num_nodes 1 \
  --actor_num_gpus_per_node 1 \
  --vllm_num_engines 1 \
  --vllm_tensor_parallel_size 1 \
  --pretrain meta-llama/Llama-3.1-8B \
  --reward_pretrain meta-llama/Llama-3.1-8B-RM \
  --save_path ./checkpoint \
  --micro_train_batch_size 8 \
  --train_batch_size 128 \
  --micro_rollout_batch_size 32 \
  --rollout_batch_size 1024 \
  --max_epochs 1 \
  --prompt_max_len 1024 \
  --generate_max_len 1024 \
  --learning_rate 5e-7 \
  --init_kl_coef 0.01
```
### DPO Training

```shell
deepspeed --module openrlhf.cli.train_dpo \
  --save_path ./dpo-checkpoint \
  --save_steps -1 \
  --logging_steps 1 \
  --micro_train_batch_size 2 \
  --train_batch_size 128 \
  --pretrain meta-llama/Llama-3.1-8B \
  --bf16 \
  --max_epochs 1 \
  --max_len 2048 \
  --dataset your_preference_dataset \
  --learning_rate 5e-7 \
  --beta 0.1
```
## Core Concepts

### Architecture
OpenRLHF uses Ray to orchestrate four model roles across GPUs:
- Actor — the policy model being trained
- Critic — value estimation for advantage computation
- Reward Model — scores generated responses
- Reference Model — KL divergence anchor (frozen)
vLLM handles generation, providing 2-4x speedup over naive autoregressive decoding.
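The way these roles combine can be sketched with a toy per-token reward computation: the actor's log-probs are penalized for drifting from the reference model, and the reward model's scalar score lands on the final token. This is an illustrative sketch with made-up numbers, not OpenRLHF's internal API:

```python
def shaped_rewards(actor_logprobs, ref_logprobs, reward_score, kl_coef=0.01):
    """Combine a scalar reward-model score with a per-token KL penalty.

    actor_logprobs / ref_logprobs: log-probs of the sampled tokens.
    reward_score: scalar from the reward model for the whole response.
    """
    kl = [a - r for a, r in zip(actor_logprobs, ref_logprobs)]  # per-token KL estimate
    rewards = [-kl_coef * k for k in kl]   # KL penalty at every token
    rewards[-1] += reward_score            # RM score added at the final token
    return rewards

r = shaped_rewards([-1.0, -2.0], [-1.1, -1.9], reward_score=0.5)
```

The KL term is what anchors the actor to the frozen reference model; raising `kl_coef` pulls the policy back toward its starting point.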
### Supported Algorithms
| Algorithm | Use Case | Memory | Complexity |
|---|---|---|---|
| PPO | Full RLHF with learned rewards | High (4 models) | Complex |
| GRPO | Group-relative optimization | Medium (2 models) | Moderate |
| DPO | Direct preference optimization | Low (1 model) | Simple |
| REINFORCE | Basic policy gradient | Medium (2 models) | Moderate |
| SFT | Supervised fine-tuning | Low (1 model) | Simple |
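DPO's low memory footprint comes from folding the reward model away entirely: the loss is computed directly from log-probs of a chosen/rejected pair under the policy and the reference model. A minimal pure-Python sketch of the pairwise loss (the log-prob values below are made up for illustration):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from summed response log-probs."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.5)
```

`beta` (0.1 in the Quick Start command) controls how strongly the implicit reward margin is scaled; a larger value makes the loss more sensitive to small preference gaps.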
### Ray Resource Allocation

```yaml
# Multi-node PPO configuration
actor:
  num_nodes: 2
  gpus_per_node: 4
critic:
  num_nodes: 1
  gpus_per_node: 4
reward:
  num_nodes: 1
  gpus_per_node: 2
reference:
  num_nodes: 1
  gpus_per_node: 2
vllm:
  num_engines: 4
  tensor_parallel_size: 2
```
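This layout can be sanity-checked by totaling the GPUs each role requests, since the cluster must provide the sum. A quick back-of-envelope helper (not part of OpenRLHF):

```python
# GPUs per role = num_nodes * gpus_per_node (engines * TP size for vLLM)
roles = {
    "actor":     2 * 4,
    "critic":    1 * 4,
    "reward":    1 * 2,
    "reference": 1 * 2,
    "vllm":      4 * 2,
}
total_gpus = sum(roles.values())
print(total_gpus)  # 24
```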
## Configuration
| Parameter | Default | Description |
|---|---|---|
| `pretrain` | — | Base model path |
| `reward_pretrain` | — | Reward model path |
| `train_batch_size` | 128 | Global training batch size |
| `micro_train_batch_size` | 8 | Per-GPU batch size |
| `rollout_batch_size` | 1024 | Prompts per rollout |
| `max_epochs` | 1 | Training epochs |
| `learning_rate` | 5e-7 | Actor learning rate |
| `init_kl_coef` | 0.01 | Initial KL penalty coefficient |
| `vllm_num_engines` | 1 | vLLM generation workers |
| `vllm_tensor_parallel_size` | 1 | GPUs per vLLM engine |
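The batch-size parameters relate through gradient accumulation; assuming the usual DeepSpeed arithmetic (global batch = micro batch × data-parallel ranks × accumulation steps), the accumulation count is implied rather than set directly:

```python
def accumulation_steps(train_batch_size, micro_train_batch_size, dp_world_size):
    """Gradient-accumulation steps implied by the batch-size settings."""
    per_step = micro_train_batch_size * dp_world_size
    assert train_batch_size % per_step == 0, "global batch must divide evenly"
    return train_batch_size // per_step

steps = accumulation_steps(128, 8, 4)  # table defaults on 4 actor GPUs -> 4
```

Keeping the division exact matters: a `train_batch_size` that is not a multiple of `micro_train_batch_size * world_size` will fail at startup.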
## Best Practices
- Use Docker for reproducible environments — GPU driver and library versions matter
- Start with DPO if you have preference data — simpler than PPO, often comparable results
- Scale vLLM engines to match generation throughput with training throughput
- Monitor KL divergence — if it exceeds 15, reduce learning rate or increase `init_kl_coef`
- Use gradient checkpointing for models above 13B to fit in GPU memory
- Save checkpoints frequently — distributed training failures lose all unsaved progress
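The KL guidance above can be automated with a simple controller that raises the penalty when divergence exceeds the target. The threshold and update rule here are illustrative, not OpenRLHF's built-in scheduler:

```python
def adjust_kl_coef(kl_coef, observed_kl, target_kl=15.0, factor=1.5):
    """Raise the KL penalty coefficient when divergence exceeds the target."""
    if observed_kl > target_kl:
        return kl_coef * factor
    return kl_coef

coef = adjust_kl_coef(0.01, observed_kl=18.0)  # KL too high -> penalty raised
```

Checking this once per rollout batch is usually enough; reacting to per-step noise tends to over-correct.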
## Common Issues
**Ray cluster not connecting:**
Ensure the Ray head node is running (`ray start --head --port=6379`), check firewall rules between nodes, and verify the `RAY_ADDRESS` environment variable.
**vLLM OOM during generation:**
Reduce `rollout_batch_size` or increase `vllm_tensor_parallel_size`; vLLM needs dedicated GPU memory separate from the training processes.
**Training instability:**
Lower the learning rate to 1e-7, increase `init_kl_coef` to 0.05, and ensure the reward model produces well-calibrated scores (not all the same value).
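The degenerate all-same-score case can be caught before training with a quick variance check on a sample of reward-model outputs (a hypothetical helper, not part of the framework):

```python
import statistics

def check_reward_spread(scores, min_std=1e-3):
    """Return False when reward scores have near-zero variance."""
    return statistics.pstdev(scores) >= min_std

ok = check_reward_spread([0.2, -0.5, 1.1, 0.3])    # healthy spread
flat = check_reward_spread([0.5, 0.5, 0.5])        # degenerate: all identical
```

A flat reward signal gives PPO zero learning gradient after advantage normalization, so it is worth failing fast here.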