# Comprehensive PufferLib Module
Build and train reinforcement learning agents using PufferLib, a library that simplifies RL across diverse environments with unified API wrappers. This skill covers environment wrapping, agent training, policy architecture, performance optimization, and distributed RL workflows.
## When to Use This Skill
Choose Comprehensive PufferLib Module when you need to:
- Train RL agents across diverse environments (Atari, MuJoCo, custom) with a consistent API
- Simplify environment vectorization and observation/action space handling
- Optimize RL training performance with efficient batching and GPU utilization
- Compare RL algorithms across different environments with minimal code changes
Consider alternatives when:
- You need production RL deployment without training (use ONNX Runtime or TorchScript)
- You need model-based RL or world models (use DreamerV3 or MBPO implementations)
- You need multi-agent RL specifically (use PettingZoo with RLlib)
## Quick Start
```shell
pip install pufferlib torch
```
```python
import numpy as np

import pufferlib
import pufferlib.environments
import pufferlib.vectorization

# Create a vectorized environment
env_creator = pufferlib.environments.make_env_creator("CartPole-v1")
envs = pufferlib.vectorization.Serial(
    env_creator=env_creator,
    num_envs=4,
)

# Reset and step
obs, infos = envs.reset()
print(f"Observation shape: {obs.shape}")
print(f"Num environments: {envs.num_envs}")

# Take random actions
actions = np.array([envs.single_action_space.sample() for _ in range(4)])
obs, rewards, dones, truncs, infos = envs.step(actions)
print(f"Rewards: {rewards}")
```
## Core Concepts
### Key Components
| Component | Purpose | Description |
|---|---|---|
| `env_creator` | Environment factory | Wraps gym/gymnasium environments |
| `Serial` | Sequential vectorization | Runs N envs sequentially |
| `Multiprocessing` | Parallel vectorization | Runs N envs in parallel processes |
| `Policy` | Neural network policy | Customizable actor-critic architecture |
| `CleanPuffeRL` | Training loop | PPO-based training with logging |
### Training an RL Agent
```python
import torch

import pufferlib
import pufferlib.environments
import pufferlib.vectorization
import pufferlib.models
import pufferlib.cleanrl

# Setup
env_creator = pufferlib.environments.make_env_creator("CartPole-v1")

# Create vectorized environments
envs = pufferlib.vectorization.Multiprocessing(
    env_creator=env_creator,
    num_envs=8,
    num_workers=4,
)

# Define policy network
class CartPolePolicy(pufferlib.models.Default):
    def __init__(self, env):
        super().__init__(env)
        obs_size = env.single_observation_space.shape[0]
        self.actor = torch.nn.Sequential(
            torch.nn.Linear(obs_size, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, env.single_action_space.n),
        )
        self.critic = torch.nn.Sequential(
            torch.nn.Linear(obs_size, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

# Configure training
config = pufferlib.cleanrl.CleanPuffeRL(
    env_creator=env_creator,
    policy_cls=CartPolePolicy,
    vectorization=pufferlib.vectorization.Multiprocessing,
    num_envs=8,
    total_timesteps=100_000,
    learning_rate=3e-4,
    num_steps=128,
    num_minibatches=4,
    update_epochs=4,
)

# Train
config.train()
```
### Custom Environment Wrapping
```python
import gymnasium as gym
import numpy as np

import pufferlib

class CustomEnvWrapper(pufferlib.PufferEnv):
    """Wrap a custom environment for PufferLib compatibility."""

    def __init__(self, env_config=None):
        self.env = gym.make("LunarLander-v3")
        super().__init__(self.env)

    def reset(self, seed=None):
        obs, info = self.env.reset(seed=seed)
        return self._process_obs(obs), info

    def step(self, action):
        obs, reward, done, trunc, info = self.env.step(action)
        # Custom reward shaping
        shaped_reward = reward + 0.1 * (1.0 - abs(obs[0]))
        return self._process_obs(obs), shaped_reward, done, trunc, info

    def _process_obs(self, obs):
        return np.float32(obs)

# Use custom wrapper
env_creator = lambda: CustomEnvWrapper()
envs = pufferlib.vectorization.Serial(
    env_creator=env_creator,
    num_envs=4,
)
```
## Configuration
| Parameter | Description | Default |
|---|---|---|
| `num_envs` | Number of parallel environments | 8 |
| `num_workers` | Number of worker processes | 4 |
| `total_timesteps` | Total training frames | 1_000_000 |
| `learning_rate` | Optimizer learning rate | 3e-4 |
| `num_steps` | Steps per rollout per env | 128 |
| `gamma` | Discount factor | 0.99 |
| `gae_lambda` | GAE lambda parameter | 0.95 |
| `clip_coef` | PPO clip coefficient | 0.2 |
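As a sanity check on these defaults, the rollout batch arithmetic can be worked out directly (plain Python, no PufferLib required):

```python
# PPO batch geometry from the defaults in the table above.
num_envs = 8           # parallel environments
num_steps = 128        # steps collected per env per rollout
num_minibatches = 4    # each epoch splits the batch into this many chunks

batch_size = num_envs * num_steps               # transitions per rollout
minibatch_size = batch_size // num_minibatches  # transitions per gradient step
print(batch_size, minibatch_size)  # 1024 256
```

Only `minibatch_size` worth of activations needs to sit in GPU memory during a gradient step, which is why tuning `num_steps` and `num_minibatches` is the first lever for memory problems.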
## Best Practices
- **Start with Serial vectorization for debugging** — Use `Serial` mode first to verify that your environment wrapper works correctly. Switch to `Multiprocessing` only after confirming correctness, as parallel bugs are harder to diagnose.
- **Scale environments before the GPU** — Increasing `num_envs` from 4 to 32 often improves training throughput more than upgrading the GPU. Fill the GPU batch with enough environment data before investing in larger models.
- **Normalize observations and rewards** — Raw observations and rewards with large ranges cause training instability. Apply observation normalization (running mean/std) and reward scaling to keep values in a reasonable range for the neural network.
- **Use reward shaping carefully** — Shaped rewards can accelerate early training but can create local optima that prevent the agent from finding the true optimal policy. Start with the original reward function, add shaping only if training gets stuck, and remove the shaping terms gradually.
- **Log training metrics comprehensively** — Track episode returns, episode lengths, policy entropy, value loss, and clip fraction. Declining entropy with stable returns indicates convergence; a high clip fraction suggests the learning rate is too high.
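The normalization advice above can be sketched as a minimal running mean/std tracker in plain NumPy. This is a generic pattern, not a PufferLib API; the class name `RunningNorm` is illustrative, and real codebases ship equivalents (e.g. Gymnasium's `NormalizeObservation` wrapper):

```python
import numpy as np

class RunningNorm:
    """Running mean/variance tracker (parallel-variance combination)."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps  # avoids division by zero before the first update
        self.eps = eps

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        b_mean = batch.mean(axis=0)
        b_var = batch.var(axis=0)
        b_count = batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        # Combine the stored statistics with the batch statistics.
        self.mean = self.mean + delta * b_count / total
        m2 = (self.var * self.count + b_var * b_count
              + delta ** 2 * self.count * b_count / total)
        self.var = m2 / total
        self.count = total

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + self.eps)

# Normalized data should come out near zero mean and unit variance.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(4096, 4))
norm = RunningNorm(shape=(4,))
norm.update(data)
z = norm.normalize(data)
```

In a training loop you would call `update` on each incoming observation batch and feed `normalize(obs)` to the policy, so the statistics track the data distribution as the policy changes.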
## Common Issues
**Training reward stays flat** — The policy isn't learning. Check that observations are correctly normalized, the learning rate isn't too small, and the environment isn't returning constant rewards. Verify that the action-space mapping is correct: swapped actions prevent any meaningful learning.
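One quick way to rule out a constant reward signal is a random rollout check. The helper below is a hypothetical diagnostic (not part of PufferLib), demonstrated against a tiny stub environment rather than a real one:

```python
import numpy as np

def reward_signal_ok(env, sample_action, num_steps=200):
    """Roll out random actions and flag a constant reward signal,
    which leaves PPO with nothing to learn from."""
    env.reset()
    rewards = []
    for _ in range(num_steps):
        obs, reward, done, trunc, info = env.step(sample_action())
        rewards.append(float(reward))
        if done or trunc:
            env.reset()
    return float(np.std(rewards)) > 0.0

# Tiny gym-style stub, used only to demonstrate the check.
class StubEnv:
    def reset(self, seed=None):
        return np.zeros(2, dtype=np.float32), {}

    def step(self, action):
        # Reward depends on the action, so it varies across a rollout.
        return np.zeros(2, dtype=np.float32), float(action), False, False, {}

rng = np.random.default_rng(0)
print(reward_signal_ok(StubEnv(), lambda: int(rng.integers(0, 2))))  # True
```

Run the same check against your real environment before training; if it returns `False`, the problem is in the environment, not the agent.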
**Multiprocessing environments crash silently** — Worker-process errors may not propagate to the main process. Run in `Serial` mode first to catch exceptions, then switch to `Multiprocessing`. Check that your environment and its dependencies are picklable (required for multiprocessing).
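Picklability can be verified up front with a round trip through the standard library's `pickle` module; `is_picklable` here is an illustrative helper, not a PufferLib function:

```python
import pickle

def is_picklable(obj):
    """Round-trip through pickle; multiprocessing workers need this to succeed."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

# A plain-data config survives the round trip ...
print(is_picklable({"env_name": "CartPole-v1", "seed": 42}))  # True
# ... but objects holding live handles or local functions do not.
print(is_picklable(lambda: None))  # False
```

Checking the environment creator this way before launching workers turns a silent worker crash into an immediate, visible failure.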
**GPU memory errors during training** — Reduce `num_steps` or increase `num_minibatches` to shrink the minibatch held in GPU memory. Alternatively, reduce the policy network size or switch to `torch.float16` for forward passes. Monitor GPU memory with `torch.cuda.memory_summary()`.