
Comprehensive PufferLib Module

Build and train reinforcement learning agents using PufferLib, a library that simplifies RL across diverse environments with unified API wrappers. This skill covers environment wrapping, agent training, policy architecture, performance optimization, and distributed RL workflows.

When to Use This Skill

Choose Comprehensive PufferLib Module when you need to:

  • Train RL agents across diverse environments (Atari, MuJoCo, custom) with a consistent API
  • Simplify environment vectorization and observation/action space handling
  • Optimize RL training performance with efficient batching and GPU utilization
  • Compare RL algorithms across different environments with minimal code changes

Consider alternatives when:

  • You need production RL deployment without training (use ONNX Runtime or TorchScript)
  • You need model-based RL or world models (use DreamerV3 or MBPO implementations)
  • You need multi-agent RL specifically (use PettingZoo with RLlib)

Quick Start

```shell
pip install pufferlib torch
```

```python
import numpy as np

import pufferlib
import pufferlib.environments
import pufferlib.vectorization

# Create a vectorized environment
env_creator = pufferlib.environments.make_env_creator("CartPole-v1")
envs = pufferlib.vectorization.Serial(
    env_creator=env_creator,
    num_envs=4,
)

# Reset and step
obs, infos = envs.reset()
print(f"Observation shape: {obs.shape}")
print(f"Num environments: {envs.num_envs}")

# Take random actions
actions = np.array([envs.single_action_space.sample() for _ in range(4)])
obs, rewards, dones, truncs, infos = envs.step(actions)
print(f"Rewards: {rewards}")
```

Core Concepts

Key Components

| Component | Purpose | Description |
| --- | --- | --- |
| env_creator | Environment factory | Wraps gym/gymnasium environments |
| Serial | Sequential vectorization | Runs N envs sequentially |
| Multiprocessing | Parallel vectorization | Runs N envs in parallel processes |
| Policy | Neural network policy | Customizable actor-critic architecture |
| CleanPuffeRL | Training loop | PPO-based training with logging |

Training an RL Agent

```python
import torch

import pufferlib
import pufferlib.cleanrl
import pufferlib.environments
import pufferlib.models
import pufferlib.vectorization

# Setup
env_creator = pufferlib.environments.make_env_creator("CartPole-v1")

# Create vectorized environments
envs = pufferlib.vectorization.Multiprocessing(
    env_creator=env_creator,
    num_envs=8,
    num_workers=4,
)

# Define policy network
class CartPolePolicy(pufferlib.models.Default):
    def __init__(self, env):
        super().__init__(env)
        obs_size = env.single_observation_space.shape[0]
        self.actor = torch.nn.Sequential(
            torch.nn.Linear(obs_size, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, env.single_action_space.n),
        )
        self.critic = torch.nn.Sequential(
            torch.nn.Linear(obs_size, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

# Configure training
trainer = pufferlib.cleanrl.CleanPuffeRL(
    env_creator=env_creator,
    policy_cls=CartPolePolicy,
    vectorization=pufferlib.vectorization.Multiprocessing,
    num_envs=8,
    total_timesteps=100_000,
    learning_rate=3e-4,
    num_steps=128,
    num_minibatches=4,
    update_epochs=4,
)

# Train
trainer.train()
```

Custom Environment Wrapping

```python
import gymnasium as gym
import numpy as np

import pufferlib

class CustomEnvWrapper(pufferlib.PufferEnv):
    """Wrap a custom environment for PufferLib compatibility."""

    def __init__(self, env_config=None):
        self.env = gym.make("LunarLander-v3")
        super().__init__(self.env)

    def reset(self, seed=None):
        obs, info = self.env.reset(seed=seed)
        return self._process_obs(obs), info

    def step(self, action):
        obs, reward, done, trunc, info = self.env.step(action)
        # Custom reward shaping: small bonus for staying near the center (x close to 0)
        shaped_reward = reward + 0.1 * (1.0 - abs(obs[0]))
        return self._process_obs(obs), shaped_reward, done, trunc, info

    def _process_obs(self, obs):
        return np.float32(obs)

# Use the custom wrapper
env_creator = lambda: CustomEnvWrapper()
envs = pufferlib.vectorization.Serial(
    env_creator=env_creator,
    num_envs=4,
)
```

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| num_envs | Number of parallel environments | 8 |
| num_workers | Number of worker processes | 4 |
| total_timesteps | Total environment steps to train for | 1_000_000 |
| learning_rate | Optimizer learning rate | 3e-4 |
| num_steps | Steps per rollout per env | 128 |
| gamma | Discount factor | 0.99 |
| gae_lambda | GAE lambda parameter | 0.95 |
| clip_coef | PPO clip coefficient | 0.2 |
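To make gamma and gae_lambda concrete, here is a minimal NumPy sketch of how PPO-style trainers compute advantages with Generalized Advantage Estimation. The rollout numbers are made up for illustration; PufferLib handles this internally, so this is a teaching aid, not its API.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one rollout of num_steps."""
    num_steps = len(rewards)
    advantages = np.zeros(num_steps)
    last_gae = 0.0
    for t in reversed(range(num_steps)):
        # Bootstrap with the next value unless the episode ended at step t
        next_value = values[t + 1] if t + 1 < num_steps else 0.0
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    return advantages

# A tiny 4-step rollout ending in a terminal state
rewards = np.array([1.0, 1.0, 1.0, 0.0])
values = np.array([0.5, 0.5, 0.5, 0.5])
dones = np.array([0.0, 0.0, 0.0, 1.0])
adv = compute_gae(rewards, values, dones)
```

Lowering gae_lambda toward 0 trades variance for bias by trusting the value function more; gamma controls how far future rewards propagate back.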

Best Practices

  1. Start with Serial vectorization for debugging — Use Serial mode first to verify your environment wrapper works correctly. Switch to Multiprocessing only after confirming correctness, as parallel bugs are harder to diagnose.

  2. Scale environments before GPU — Increasing num_envs from 4 to 32 often improves wall-clock training throughput more than upgrading the GPU. Fill the GPU batch with enough environment data before investing in larger models.

  3. Normalize observations and rewards — Raw observations and rewards with large ranges cause training instability. Apply observation normalization (running mean/std) and reward scaling to keep values in a reasonable range for the neural network.

  4. Use reward shaping carefully — Shaped rewards can accelerate early training but create local optima that prevent finding the true optimal policy. Start with the original reward function and only add shaping if training gets stuck, removing shaping terms gradually.

  5. Log training metrics comprehensively — Track episode returns, episode lengths, policy entropy, value loss, and clip fraction. Declining entropy with stable returns indicates convergence. High clip fraction suggests the learning rate is too high.
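Practice 3 (observation normalization) can be implemented with a simple running mean/std tracker. This is a standalone NumPy sketch, not a PufferLib API; the class name and shapes are illustrative.

```python
import numpy as np

class RunningNormalizer:
    """Tracks a running mean/variance and normalizes observations online."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        # Parallel-variance update from a batch of observations
        batch = np.asarray(batch, dtype=np.float64)
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]
        total = self.count + batch_count
        delta = batch_mean - self.mean
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / total) / total
        self.count = total

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + self.eps)

# Observations with a large offset and scale, as warned about above
rng = np.random.default_rng(0)
norm = RunningNormalizer(shape=(4,))
batch = rng.normal(loc=5.0, scale=10.0, size=(256, 4))
norm.update(batch)
normed = norm.normalize(batch)
```

Call update() on each rollout's observations and normalize() before feeding them to the policy; freeze the statistics at evaluation time so the policy sees the same input distribution it trained on.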

Common Issues

Training reward stays flat — The policy isn't learning. Check that observations are correctly normalized, the learning rate isn't too small, and the environment isn't returning constant rewards. Verify the action space mapping is correct — swapped actions prevent any meaningful learning.

Multiprocessing environments crash silently — Worker process errors may not propagate to the main process. Run with Serial mode first to catch exceptions, then switch to Multiprocessing. Check that your environment and its dependencies are picklable (required for multiprocessing).

GPU memory errors during training — Reduce num_steps or num_minibatches to decrease the batch size held in GPU memory. Alternatively, reduce the policy network size or switch to torch.float16 for forward passes. Monitor GPU memory with torch.cuda.memory_summary().
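The batch sizes behind those memory errors follow directly from the rollout configuration. A quick sketch of the arithmetic, using the defaults from the Configuration table, shows why reducing num_steps (or raising num_minibatches) shrinks what must fit on the GPU:

```python
# Rollout batch arithmetic for the default configuration
num_envs = 8
num_steps = 128
num_minibatches = 4

# Transitions collected per policy update
batch_size = num_envs * num_steps
# Transitions held on the GPU per gradient step
minibatch_size = batch_size // num_minibatches

print(batch_size)      # 1024
print(minibatch_size)  # 256

# Halving num_steps halves the rollout that must be stored
assert num_envs * (num_steps // 2) == batch_size // 2
```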
