Advanced Post-Training with torchforge
torchforge is a PyTorch-native agentic reinforcement learning library from Meta that separates infrastructure concerns from algorithm design, enabling rapid RL research without a Ray dependency.
When to Use
Choose torchforge when:
- Need clean separation between RL algorithms and infrastructure
- Want PyTorch-native abstractions without Ray overhead
- Want rapid algorithm experimentation and research
- Building custom RL training pipelines
- Need fine-grained control over training loops
Consider alternatives when:
- Production-scale distributed training → use OpenRLHF (Ray-based)
- Enterprise MoE training → use Miles
- Simple DPO/SFT → use TRL (lighter weight)
- Need vLLM-accelerated generation → use OpenRLHF
Quick Start
Installation
```bash
pip install torchforge

# With all dependencies
pip install torchforge[all]
```
Basic RL Training
```python
from torchforge import RLTrainer, RLConfig, RewardFunction
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

class FormatReward(RewardFunction):
    def compute(self, completions, prompts):
        rewards = []
        for c in completions:
            if "<answer>" in c and "</answer>" in c:
                rewards.append(1.0)
            else:
                rewards.append(-0.5)
        return rewards

config = RLConfig(
    algorithm="grpo",
    learning_rate=1e-6,
    num_generations=4,
    max_completion_length=512,
    output_dir="./torchforge-output",
)

trainer = RLTrainer(
    model=model,
    tokenizer=tokenizer,
    config=config,
    reward_functions=[FormatReward()],
    train_dataset=dataset,  # your prompt dataset
)
trainer.train()
```
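The `dataset` passed to `RLTrainer` above is left undefined. A minimal sketch of a prompt dataset is below; the exact schema torchforge expects is an assumption here, so adapt the field names to your setup:

```python
# Toy prompt dataset as a list of dicts; the "prompt" key is an assumed schema
dataset = [
    {"prompt": "Solve 2 + 2. Wrap the result in <answer></answer> tags."},
    {"prompt": "Name the capital of France inside <answer></answer> tags."},
]
```

In practice you would load prompts from a file or a Hugging Face dataset rather than hard-coding them.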
Core Concepts
Algorithm-Infrastructure Separation
torchforge's key design principle is decoupling:
- Algorithm layer — defines reward computation, advantage estimation, loss functions
- Infrastructure layer — handles distributed training, weight synchronization, checkpointing
```python
# Algorithm concern: define your RL logic
class CustomAlgorithm(Algorithm):
    def compute_loss(self, batch):
        advantages = self.compute_advantages(batch.rewards)
        return self.policy_gradient_loss(batch.log_probs, advantages)

# Infrastructure concern: handled automatically
trainer = RLTrainer(
    algorithm=CustomAlgorithm(),
    distributed_config=DistributedConfig(
        strategy="fsdp",
        num_gpus=4,
    ),
)
```
Supported Algorithms
| Algorithm | Description | Infrastructure |
|---|---|---|
| GRPO | Group relative policy optimization | FSDP/DDP |
| PPO | Proximal policy optimization | FSDP/DDP |
| REINFORCE | Basic policy gradient | DDP |
| Custom | User-defined algorithms | Configurable |
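GRPO's group-relative trick can be sketched in plain PyTorch: the rewards for each prompt's `num_generations` completions are standardized within their group, which is why no critic model is needed. This is an illustrative sketch, not torchforge's internal code:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              num_generations: int,
                              eps: float = 1e-8) -> torch.Tensor:
    """Standardize rewards within each prompt's group of completions.

    rewards: flat tensor of shape (num_prompts * num_generations,)
    """
    groups = rewards.view(-1, num_generations)   # (num_prompts, num_generations)
    mean = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    advantages = (groups - mean) / (std + eps)   # zero-mean, unit-std per group
    return advantages.view(-1)

# Example: 2 prompts x 4 completions each
rewards = torch.tensor([1.0, -0.5, 1.0, -0.5, 0.0, 0.0, 1.0, 1.0])
adv = group_relative_advantages(rewards, num_generations=4)
```

Each completion is scored relative to its siblings from the same prompt, so a mediocre completion to an easy prompt is not rewarded more than a strong completion to a hard one.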
Distributed Strategies
```python
from torchforge import DistributedConfig

# FSDP (recommended for large models)
dist_config = DistributedConfig(
    strategy="fsdp",
    num_gpus=8,
    sharding_strategy="full",
    mixed_precision="bf16",
)

# DDP (simpler; for models that fit in memory)
dist_config = DistributedConfig(
    strategy="ddp",
    num_gpus=4,
)
```
Configuration
| Parameter | Default | Description |
|---|---|---|
| algorithm | "grpo" | RL algorithm to use |
| learning_rate | 1e-6 | Policy learning rate |
| num_generations | 4 | Completions per prompt |
| max_completion_length | 512 | Max generation length (tokens) |
| kl_coef | 0.1 | KL divergence penalty coefficient |
| distributed_strategy | "fsdp" | Distribution strategy |
| mixed_precision | "bf16" | Training precision |
| gradient_checkpointing | True | Memory optimization |
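The kl_coef penalty keeps the policy close to a frozen reference model. Conceptually it scales a per-token KL estimate computed from log-probabilities; a minimal sketch using the simple log-ratio estimator (illustrative, not torchforge's internals):

```python
import torch

def kl_penalty(policy_logprobs: torch.Tensor,
               ref_logprobs: torch.Tensor,
               kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token KL penalty from log-probs of the sampled tokens.

    Both inputs have shape (batch, seq_len); the expectation of
    log(pi) - log(ref) under pi estimates KL(pi || ref).
    """
    kl = policy_logprobs - ref_logprobs
    return kl_coef * kl

# Token 0: policy agrees with reference; token 1: policy drifted
policy_lp = torch.log(torch.tensor([[0.5, 0.25]]))
ref_lp = torch.log(torch.tensor([[0.5, 0.5]]))
penalty = kl_penalty(policy_lp, ref_lp, kl_coef=0.1)
```

This penalty is typically subtracted from (or folded into) the per-token reward, so raising kl_coef trades reward-seeking for staying near the reference model.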
Best Practices
- Start with GRPO — simpler than PPO, no critic model needed
- Use FSDP for models above 7B — full sharding across GPUs
- Write clean reward functions — the algorithm-infrastructure separation lets you focus here
- Enable gradient checkpointing for models above 13B
- Prototype on single GPU then scale — torchforge makes the transition seamless
- Use the built-in logging to track reward curves, KL divergence, and gradient norms
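The "write clean reward functions" advice is easiest to follow when reward logic lives in small, testable functions. A sketch combining the format check from the Quick Start with a mild length penalty (the weighting is illustrative):

```python
def combined_reward(completion: str, max_len: int = 512) -> float:
    """Format check plus a mild penalty for overlong completions."""
    reward = 1.0 if ("<answer>" in completion and "</answer>" in completion) else -0.5
    # Discourage rambling: small penalty per character beyond max_len
    overflow = max(0, len(completion) - max_len)
    reward -= 0.001 * overflow
    return reward
```

Because the function is pure, it can be unit-tested on fixed strings before any GPU time is spent.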
Common Issues
FSDP memory fragmentation:
Set sharding_strategy="full" and enable activation checkpointing. If you still run out of memory, reduce the batch size or sequence length.
Custom algorithm not converging:
Check advantage normalization: raw rewards often need standardization. Verify gradient flow through your custom loss function with `torch.autograd.detect_anomaly()`.
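Both checks above can be exercised in a few lines of plain PyTorch: standardize the raw rewards, then run the backward pass under anomaly detection so the first op producing NaN/Inf gradients is reported:

```python
import torch

def standardize(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Zero-mean, unit-variance rewards; raw scores often span very different scales."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([0.0, 10.0, 100.0, -5.0])
advantages = standardize(rewards)

# detect_anomaly makes backward raise at the op that produced NaN/Inf
with torch.autograd.detect_anomaly():
    log_probs = torch.randn(4, requires_grad=True)
    loss = -(log_probs * advantages).mean()
    loss.backward()
```

Note that anomaly detection slows the backward pass noticeably, so enable it only while debugging.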
Slow single-GPU training:
torchforge's infrastructure layer adds overhead on a single GPU. For quick experiments, use TRL directly and migrate to torchforge when you need distributed training.