torchforge: a skill providing guidance for PyTorch-native post-training. Includes structured workflows, validation checks, and reusable patterns for AI research.

Clipticsai research · v1.0.0 · MIT

Advanced Post-Training with torchforge

torchforge is a PyTorch-native agentic reinforcement learning library from Meta that separates infrastructure concerns from algorithm design, enabling rapid RL research without Ray dependencies.

When to Use

Choose torchforge when:

  • You need clean separation between RL algorithms and infrastructure
  • You want PyTorch-native abstractions without Ray overhead
  • You are doing rapid algorithm experimentation and research
  • You are building custom RL training pipelines
  • You need fine-grained control over training loops

Consider alternatives when:

  • Production-scale distributed training → use OpenRLHF (Ray-based)
  • Enterprise MoE training → use Miles
  • Simple DPO/SFT → use TRL (lighter weight)
  • Need vLLM-accelerated generation → use OpenRLHF

Quick Start

Installation

```bash
pip install torchforge

# With all dependencies
pip install torchforge[all]
```

Basic RL Training

```python
from torchforge import RLTrainer, RLConfig, RewardFunction
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

class FormatReward(RewardFunction):
    def compute(self, completions, prompts):
        # Reward completions that use the expected <answer> tags
        rewards = []
        for c in completions:
            if "<answer>" in c and "</answer>" in c:
                rewards.append(1.0)
            else:
                rewards.append(-0.5)
        return rewards

config = RLConfig(
    algorithm="grpo",
    learning_rate=1e-6,
    num_generations=4,
    max_completion_length=512,
    output_dir="./torchforge-output",
)

trainer = RLTrainer(
    model=model,
    tokenizer=tokenizer,
    config=config,
    reward_functions=[FormatReward()],
    train_dataset=dataset,
)
trainer.train()
```

Core Concepts

Algorithm-Infrastructure Separation

torchforge's key design principle is decoupling:

  • Algorithm layer — defines reward computation, advantage estimation, loss functions
  • Infrastructure layer — handles distributed training, weight synchronization, checkpointing
```python
# Algorithm concern: define your RL logic
class CustomAlgorithm(Algorithm):
    def compute_loss(self, batch):
        advantages = self.compute_advantages(batch.rewards)
        return self.policy_gradient_loss(batch.log_probs, advantages)

# Infrastructure concern: handled automatically
trainer = RLTrainer(
    algorithm=CustomAlgorithm(),
    distributed_config=DistributedConfig(
        strategy="fsdp",
        num_gpus=4,
    ),
)
```

Supported Algorithms

| Algorithm | Description | Infrastructure |
|-----------|-------------|----------------|
| GRPO | Group relative policy optimization | FSDP/DDP |
| PPO | Proximal policy optimization | FSDP/DDP |
| REINFORCE | Basic policy gradient | DDP |
| Custom | User-defined algorithms | Configurable |

Distributed Strategies

```python
from torchforge import DistributedConfig

# FSDP (recommended for large models)
dist_config = DistributedConfig(
    strategy="fsdp",
    num_gpus=8,
    sharding_strategy="full",
    mixed_precision="bf16",
)

# DDP (simpler; for models that fit in memory)
dist_config = DistributedConfig(
    strategy="ddp",
    num_gpus=4,
)
```

Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| algorithm | "grpo" | RL algorithm to use |
| learning_rate | 1e-6 | Policy learning rate |
| num_generations | 4 | Completions per prompt |
| max_completion_length | 512 | Max generation length |
| kl_coef | 0.1 | KL divergence penalty |
| distributed_strategy | "fsdp" | Distribution strategy |
| mixed_precision | "bf16" | Training precision |
| gradient_checkpointing | True | Memory optimization |
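The parameters above can be combined in one config object. A minimal sketch, assuming the `RLConfig` constructor accepts keyword arguments named exactly as in the table:

```python
from torchforge import RLConfig

config = RLConfig(
    algorithm="grpo",              # default; see the algorithms table above
    learning_rate=1e-6,
    num_generations=4,
    max_completion_length=512,
    kl_coef=0.1,                   # KL divergence penalty
    distributed_strategy="fsdp",
    mixed_precision="bf16",
    gradient_checkpointing=True,   # trades compute for memory
)
```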

Best Practices

  1. Start with GRPO — simpler than PPO, no critic model needed
  2. Use FSDP for models above 7B — full sharding across GPUs
  3. Write clean reward functions — the algorithm-infrastructure separation lets you focus here
  4. Enable gradient checkpointing for models above 13B
  5. Prototype on single GPU then scale — torchforge makes the transition seamless
  6. Use the built-in logging to track reward curves, KL divergence, and gradient norms
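For gradient norms in particular (practice 6), torchforge's logging API is not shown here, so a library-agnostic sketch in plain PyTorch can serve as a fallback; the toy `nn.Linear` model stands in for the policy:

```python
import torch
import torch.nn as nn

# Toy model standing in for the policy network
model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Total L2 norm over all parameter gradients, computed after backward()
grad_norm = torch.norm(
    torch.stack(
        [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    )
)
print(f"grad norm: {grad_norm.item():.4f}")
```

A sudden spike in this value during training is often the first visible sign of an unstable reward signal or a too-large learning rate.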

Common Issues

FSDP memory fragmentation: Set sharding_strategy="full" and enable activation checkpointing. If still OOM, reduce batch size or sequence length.

Custom algorithm not converging: Check advantage normalization — raw rewards often need standardization. Verify gradient flow through your custom loss function with torch.autograd.detect_anomaly().
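The advantage standardization mentioned above can be sketched in plain PyTorch; the helper name `normalize_advantages` is illustrative, not a torchforge API:

```python
import torch

def normalize_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standardize raw rewards to zero mean / unit variance before
    # using them as advantages in the policy-gradient loss.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([2.0, -1.0, 0.5, 3.0])
advantages = normalize_advantages(rewards)

# To check gradient flow through a custom loss, wrap the backward pass:
# with torch.autograd.detect_anomaly():
#     loss.backward()
```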

Slow single-GPU training: torchforge's infrastructure layer has overhead for single-GPU. For quick experiments, use TRL directly and migrate to torchforge when you need distributed training.
