# Comprehensive PufferLib Module
Build and train reinforcement learning agents using PufferLib, a library that simplifies RL across diverse environments with unified API wrappers. This skill covers environment wrapping, agent training, policy architecture, performance optimization, and distributed RL workflows.
## When to Use This Skill
Choose Comprehensive PufferLib Module when you need to:
- Train RL agents across diverse environments (Atari, MuJoCo, custom) with a consistent API
- Simplify environment vectorization and observation/action space handling
- Optimize RL training performance with efficient batching and GPU utilization
- Compare RL algorithms across different environments with minimal code changes
Consider alternatives when:
- You need production RL deployment without training (use ONNX Runtime or TorchScript)
- You need model-based RL or world models (use DreamerV3 or MBPO implementations)
- You need multi-agent RL specifically (use PettingZoo with RLlib)
## Quick Start
```shell
pip install pufferlib torch
```
```python
import numpy as np

import pufferlib
import pufferlib.environments
import pufferlib.vectorization

# Create a vectorized environment
env_creator = pufferlib.environments.make_env_creator("CartPole-v1")
envs = pufferlib.vectorization.Serial(
    env_creator=env_creator,
    num_envs=4,
)

# Reset and step
obs, infos = envs.reset()
print(f"Observation shape: {obs.shape}")
print(f"Num environments: {envs.num_envs}")

# Take random actions
actions = np.array([envs.single_action_space.sample() for _ in range(4)])
obs, rewards, dones, truncs, infos = envs.step(actions)
print(f"Rewards: {rewards}")
```
## Core Concepts
### Key Components
| Component | Purpose | Description |
|---|---|---|
| `env_creator` | Environment factory | Wraps gym/gymnasium environments |
| `Serial` | Sequential vectorization | Runs N envs sequentially |
| `Multiprocessing` | Parallel vectorization | Runs N envs in parallel processes |
| `Policy` | Neural network policy | Customizable actor-critic architecture |
| `CleanPuffeRL` | Training loop | PPO-based training with logging |
### Training an RL Agent
```python
import torch

import pufferlib
import pufferlib.environments
import pufferlib.vectorization
import pufferlib.models
import pufferlib.cleanrl

# Setup
env_creator = pufferlib.environments.make_env_creator("CartPole-v1")

# Create vectorized environments
envs = pufferlib.vectorization.Multiprocessing(
    env_creator=env_creator,
    num_envs=8,
    num_workers=4,
)

# Define policy network
class CartPolePolicy(pufferlib.models.Default):
    def __init__(self, env):
        super().__init__(env)
        obs_size = env.single_observation_space.shape[0]
        self.actor = torch.nn.Sequential(
            torch.nn.Linear(obs_size, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, env.single_action_space.n),
        )
        self.critic = torch.nn.Sequential(
            torch.nn.Linear(obs_size, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

# Configure training
config = pufferlib.cleanrl.CleanPuffeRL(
    env_creator=env_creator,
    policy_cls=CartPolePolicy,
    vectorization=pufferlib.vectorization.Multiprocessing,
    num_envs=8,
    total_timesteps=100_000,
    learning_rate=3e-4,
    num_steps=128,
    num_minibatches=4,
    update_epochs=4,
)

# Train
config.train()
```
### Custom Environment Wrapping
```python
import gymnasium as gym
import numpy as np

import pufferlib

class CustomEnvWrapper(pufferlib.PufferEnv):
    """Wrap a custom environment for PufferLib compatibility."""

    def __init__(self, env_config=None):
        self.env = gym.make("LunarLander-v3")
        super().__init__(self.env)

    def reset(self, seed=None):
        obs, info = self.env.reset(seed=seed)
        return self._process_obs(obs), info

    def step(self, action):
        obs, reward, done, trunc, info = self.env.step(action)
        # Custom reward shaping
        shaped_reward = reward + 0.1 * (1.0 - abs(obs[0]))
        return self._process_obs(obs), shaped_reward, done, trunc, info

    def _process_obs(self, obs):
        return np.float32(obs)

# Use custom wrapper
env_creator = lambda: CustomEnvWrapper()
envs = pufferlib.vectorization.Serial(
    env_creator=env_creator,
    num_envs=4,
)
```
## Configuration
| Parameter | Description | Default |
|---|---|---|
| `num_envs` | Number of parallel environments | 8 |
| `num_workers` | Number of worker processes | 4 |
| `total_timesteps` | Total training frames | 1_000_000 |
| `learning_rate` | Optimizer learning rate | 3e-4 |
| `num_steps` | Steps per rollout per env | 128 |
| `gamma` | Discount factor | 0.99 |
| `gae_lambda` | GAE lambda parameter | 0.95 |
| `clip_coef` | PPO clip coefficient | 0.2 |
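As a sanity check on these defaults, the rollout batch arithmetic can be worked out directly (plain Python, no PufferLib required):

```python
# PPO batch geometry from the defaults in the table above.
num_envs = 8           # parallel environments
num_steps = 128        # steps collected per env per rollout
num_minibatches = 4    # each epoch splits the batch into this many chunks

batch_size = num_envs * num_steps               # transitions per rollout
minibatch_size = batch_size // num_minibatches  # transitions per gradient step
print(batch_size, minibatch_size)  # 1024 256
```

Only `minibatch_size` worth of activations needs to sit in GPU memory during a gradient step, which is why tuning `num_steps` and `num_minibatches` is the first lever for memory problems.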
## Best Practices
- **Start with Serial vectorization for debugging** — Use `Serial` mode first to verify that your environment wrapper works correctly. Switch to `Multiprocessing` only after confirming correctness, as parallel bugs are harder to diagnose.
- **Scale environments before the GPU** — Increasing `num_envs` from 4 to 32 often improves training throughput more than upgrading the GPU. Fill the GPU batch with enough environment data before investing in larger models.
- **Normalize observations and rewards** — Raw observations and rewards with large ranges cause training instability. Apply observation normalization (running mean/std) and reward scaling to keep values in a reasonable range for the neural network.
- **Use reward shaping carefully** — Shaped rewards can accelerate early training but can create local optima that prevent the agent from finding the true optimal policy. Start with the original reward function, add shaping only if training gets stuck, and remove the shaping terms gradually.
- **Log training metrics comprehensively** — Track episode returns, episode lengths, policy entropy, value loss, and clip fraction. Declining entropy with stable returns indicates convergence; a high clip fraction suggests the learning rate is too high.
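The normalization advice above can be sketched as a minimal running mean/std tracker in plain NumPy. This is a generic pattern, not a PufferLib API; the class name `RunningNorm` is illustrative, and real codebases ship equivalents (e.g. Gymnasium's `NormalizeObservation` wrapper):

```python
import numpy as np

class RunningNorm:
    """Running mean/variance tracker (parallel-variance combination)."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps  # avoids division by zero before the first update
        self.eps = eps

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        b_mean = batch.mean(axis=0)
        b_var = batch.var(axis=0)
        b_count = batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        # Combine the stored statistics with the batch statistics.
        self.mean = self.mean + delta * b_count / total
        m2 = (self.var * self.count + b_var * b_count
              + delta ** 2 * self.count * b_count / total)
        self.var = m2 / total
        self.count = total

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + self.eps)

# Normalized data should come out near zero mean and unit variance.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(4096, 4))
norm = RunningNorm(shape=(4,))
norm.update(data)
z = norm.normalize(data)
```

In a training loop you would call `update` on each incoming observation batch and feed `normalize(obs)` to the policy, so the statistics track the data distribution as the policy changes.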
## Common Issues
**Training reward stays flat** — The policy isn't learning. Check that observations are correctly normalized, the learning rate isn't too small, and the environment isn't returning constant rewards. Verify that the action-space mapping is correct: swapped actions prevent any meaningful learning.
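One quick way to rule out a constant reward signal is a random rollout check. The helper below is a hypothetical diagnostic (not part of PufferLib), demonstrated against a tiny stub environment rather than a real one:

```python
import numpy as np

def reward_signal_ok(env, sample_action, num_steps=200):
    """Roll out random actions and flag a constant reward signal,
    which leaves PPO with nothing to learn from."""
    env.reset()
    rewards = []
    for _ in range(num_steps):
        obs, reward, done, trunc, info = env.step(sample_action())
        rewards.append(float(reward))
        if done or trunc:
            env.reset()
    return float(np.std(rewards)) > 0.0

# Tiny gym-style stub, used only to demonstrate the check.
class StubEnv:
    def reset(self, seed=None):
        return np.zeros(2, dtype=np.float32), {}

    def step(self, action):
        # Reward depends on the action, so it varies across a rollout.
        return np.zeros(2, dtype=np.float32), float(action), False, False, {}

rng = np.random.default_rng(0)
print(reward_signal_ok(StubEnv(), lambda: int(rng.integers(0, 2))))  # True
```

Run the same check against your real environment before training; if it returns `False`, the problem is in the environment, not the agent.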
**Multiprocessing environments crash silently** — Worker-process errors may not propagate to the main process. Run in `Serial` mode first to catch exceptions, then switch to `Multiprocessing`. Check that your environment and its dependencies are picklable (required for multiprocessing).
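Picklability can be verified up front with a round trip through the standard library's `pickle` module; `is_picklable` here is an illustrative helper, not a PufferLib function:

```python
import pickle

def is_picklable(obj):
    """Round-trip through pickle; multiprocessing workers need this to succeed."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

# A plain-data config survives the round trip ...
print(is_picklable({"env_name": "CartPole-v1", "seed": 42}))  # True
# ... but objects holding live handles or local functions do not.
print(is_picklable(lambda: None))  # False
```

Checking the environment creator this way before launching workers turns a silent worker crash into an immediate, visible failure.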
**GPU memory errors during training** — Reduce `num_steps` or increase `num_minibatches` to shrink the minibatch held in GPU memory. Alternatively, reduce the policy network size or switch to `torch.float16` for forward passes. Monitor GPU memory with `torch.cuda.memory_summary()`.