torchforge: a skill providing guidance for PyTorch-native post-training. Includes structured workflows, validation checks, and reusable patterns for AI research.

Clipticsai research · v1.0.0 · MIT

Advanced Post-Training with torchforge

torchforge is a PyTorch-native agentic reinforcement learning library from Meta that separates infrastructure concerns from algorithm design, enabling rapid RL research without Ray dependencies.

When to Use

Choose torchforge when:

  • You need clean separation between RL algorithms and infrastructure
  • You want PyTorch-native abstractions without Ray overhead
  • You are doing rapid algorithm experimentation and research
  • You are building custom RL training pipelines
  • You need fine-grained control over training loops

Consider alternatives when:

  • Production-scale distributed training → use OpenRLHF (Ray-based)
  • Enterprise MoE training → use Miles
  • Simple DPO/SFT → use TRL (lighter weight)
  • Need vLLM-accelerated generation → use OpenRLHF

Quick Start

Installation

```bash
pip install torchforge

# With all dependencies
pip install torchforge[all]
```

Basic RL Training

```python
from torchforge import RLTrainer, RLConfig, RewardFunction
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

class FormatReward(RewardFunction):
    def compute(self, completions, prompts):
        # Reward completions that use the expected <answer> tags
        rewards = []
        for c in completions:
            if "<answer>" in c and "</answer>" in c:
                rewards.append(1.0)
            else:
                rewards.append(-0.5)
        return rewards

config = RLConfig(
    algorithm="grpo",
    learning_rate=1e-6,
    num_generations=4,
    max_completion_length=512,
    output_dir="./torchforge-output",
)

trainer = RLTrainer(
    model=model,
    tokenizer=tokenizer,
    config=config,
    reward_functions=[FormatReward()],
    train_dataset=dataset,
)
trainer.train()
```

Core Concepts

Algorithm-Infrastructure Separation

torchforge's key design principle is decoupling:

  • Algorithm layer — defines reward computation, advantage estimation, loss functions
  • Infrastructure layer — handles distributed training, weight synchronization, checkpointing
```python
# Algorithm concern: define your RL logic
class CustomAlgorithm(Algorithm):
    def compute_loss(self, batch):
        advantages = self.compute_advantages(batch.rewards)
        return self.policy_gradient_loss(batch.log_probs, advantages)

# Infrastructure concern: handled automatically
trainer = RLTrainer(
    algorithm=CustomAlgorithm(),
    distributed_config=DistributedConfig(
        strategy="fsdp",
        num_gpus=4,
    ),
)
```

Supported Algorithms

| Algorithm | Description | Infrastructure |
|-----------|-------------|----------------|
| GRPO | Group relative policy optimization | FSDP/DDP |
| PPO | Proximal policy optimization | FSDP/DDP |
| REINFORCE | Basic policy gradient | DDP |
| Custom | User-defined algorithms | Configurable |

Distributed Strategies

```python
from torchforge import DistributedConfig

# FSDP (recommended for large models)
dist_config = DistributedConfig(
    strategy="fsdp",
    num_gpus=8,
    sharding_strategy="full",
    mixed_precision="bf16",
)

# DDP (simpler; for models that fit in memory)
dist_config = DistributedConfig(
    strategy="ddp",
    num_gpus=4,
)
```

Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| algorithm | "grpo" | RL algorithm to use |
| learning_rate | 1e-6 | Policy learning rate |
| num_generations | 4 | Completions per prompt |
| max_completion_length | 512 | Max generation length |
| kl_coef | 0.1 | KL divergence penalty |
| distributed_strategy | "fsdp" | Distribution strategy |
| mixed_precision | "bf16" | Training precision |
| gradient_checkpointing | True | Memory optimization |
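The parameters above can be combined in one config object. A minimal sketch, assuming the `RLConfig` constructor accepts keyword arguments named exactly as in the table:

```python
from torchforge import RLConfig

config = RLConfig(
    algorithm="grpo",              # default; see the algorithms table above
    learning_rate=1e-6,
    num_generations=4,
    max_completion_length=512,
    kl_coef=0.1,                   # KL divergence penalty
    distributed_strategy="fsdp",
    mixed_precision="bf16",
    gradient_checkpointing=True,   # trades compute for memory
)
```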

Best Practices

  1. Start with GRPO — simpler than PPO, no critic model needed
  2. Use FSDP for models above 7B — full sharding across GPUs
  3. Write clean reward functions — the algorithm-infrastructure separation lets you focus here
  4. Enable gradient checkpointing for models above 13B
  5. Prototype on single GPU then scale — torchforge makes the transition seamless
  6. Use the built-in logging to track reward curves, KL divergence, and gradient norms
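For gradient norms in particular (practice 6), torchforge's logging API is not shown here, so a library-agnostic sketch in plain PyTorch can serve as a fallback; the toy `nn.Linear` model stands in for the policy:

```python
import torch
import torch.nn as nn

# Toy model standing in for the policy network
model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Total L2 norm over all parameter gradients, computed after backward()
grad_norm = torch.norm(
    torch.stack(
        [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    )
)
print(f"grad norm: {grad_norm.item():.4f}")
```

A sudden spike in this value during training is often the first visible sign of an unstable reward signal or a too-large learning rate.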

Common Issues

FSDP memory fragmentation: Set sharding_strategy="full" and enable activation checkpointing. If still OOM, reduce batch size or sequence length.

Custom algorithm not converging: Check advantage normalization — raw rewards often need standardization. Verify gradient flow through your custom loss function with torch.autograd.detect_anomaly().
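The advantage standardization mentioned above can be sketched in plain PyTorch; the helper name `normalize_advantages` is illustrative, not a torchforge API:

```python
import torch

def normalize_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standardize raw rewards to zero mean / unit variance before
    # using them as advantages in the policy-gradient loss.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([2.0, -1.0, 0.5, 3.0])
advantages = normalize_advantages(rewards)

# To check gradient flow through a custom loss, wrap the backward pass:
# with torch.autograd.detect_anomaly():
#     loss.backward()
```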

Slow single-GPU training: torchforge's infrastructure layer has overhead for single-GPU. For quick experiments, use TRL directly and migrate to torchforge when you need distributed training.
