Introduction
In the previous episodes, we explored the theoretical foundations of reinforcement learning through Markov Decision Processes, built custom Gym environments, and compared different algorithmic approaches. Now it’s time to bridge theory and practice using Stable Baselines3 (SB3), the industry-standard library for production-ready RL training.
Stable Baselines3 provides reliable, well-tested implementations of state-of-the-art RL algorithms with a consistent API. This episode covers the architecture of SB3, hyperparameter tuning strategies, training best practices, and debugging techniques that will help you train robust agents efficiently.
Stable Baselines3 Architecture and Supported Algorithms
Core Design Philosophy
SB3 is built on three key principles:
- Consistency: All algorithms share a common interface (learn(), predict(), save(), load())
- Modularity: Components like policies, buffers, and callbacks are interchangeable
- Reliability: Extensive testing ensures reproducible results
Supported Algorithms
SB3 implements major RL algorithm families:
| Algorithm | Type | Action Space | Best For | Stability |
|---|---|---|---|---|
| PPO | On-policy | Discrete/Continuous | General-purpose, robotics | High |
| SAC | Off-policy | Continuous only | Sample efficiency, fine control | High |
| TD3 | Off-policy | Continuous only | Robotics, low variance | Medium |
| A2C | On-policy | Discrete/Continuous | Fast prototyping | Medium |
| DQN | Off-policy | Discrete only | Atari games, simple tasks | Medium |
Installation (pip install stable-baselines3) and basic usage:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
# Create vectorized environment (4 parallel instances)
env = make_vec_env("CartPole-v1", n_envs=4)
# Initialize PPO agent with default hyperparameters
model = PPO("MlpPolicy", env, verbose=1)
# Train for 100k timesteps
model.learn(total_timesteps=100000)
# Save trained model
model.save("ppo_cartpole")
# Load and evaluate on a single (non-vectorized) environment
model = PPO.load("ppo_cartpole")
eval_env = gym.make("CartPole-v1")
obs, info = eval_env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    if terminated or truncated:
        obs, info = eval_env.reset()
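The same pattern carries over to the other algorithms in the table above: because they share the learn()/predict()/save()/load() interface, switching is usually a one-line change. A minimal sketch (short runs on Pendulum-v1, chosen because SAC requires a continuous action space):
import gymnasium as gym
from stable_baselines3 import A2C, PPO, SAC
env = gym.make("Pendulum-v1")
for algo_cls in (PPO, A2C, SAC):
    # Identical training and persistence calls for every algorithm
    model = algo_cls("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=10000)  # Short run, just to exercise the API
    model.save(f"{algo_cls.__name__.lower()}_pendulum")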
Hyperparameter Tuning: The Science of RL Training
Hyperparameters dramatically affect training performance. Here’s a systematic approach to tuning the most critical parameters.
Learning Rate Schedules
The learning rate controls gradient descent step size. A decaying schedule often improves stability:
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np
def linear_schedule(initial_value: float):
"""
Linear learning rate schedule.
Args:
initial_value: Starting learning rate
Returns:
Schedule function that takes progress (0 to 1) and returns current LR
"""
def func(progress_remaining: float) -> float:
# progress_remaining goes from 1 to 0
return progress_remaining * initial_value
return func
# Apply linear decay from 3e-4 to 0
model = PPO(
"MlpPolicy",
env,
learning_rate=linear_schedule(3e-4),
verbose=1
)
For exponential decay:
def exponential_schedule(initial_value: float, decay_rate: float = 0.96):
def func(progress_remaining: float) -> float:
# Exponential decay: lr = initial * decay^(1 - progress)
return initial_value * (decay_rate ** (1 - progress_remaining))
return func
from stable_baselines3 import SAC
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=exponential_schedule(1e-3, decay_rate=0.95)
)
Critical Hyperparameters Explained
1. Batch Size and n_steps
For on-policy algorithms (PPO, A2C):
- n_steps: Number of steps to collect before each update
- batch_size: Minibatch size for gradient updates
- Total experience per update: n_envs × n_steps
model = PPO(
"MlpPolicy",
env,
n_steps=2048, # Collect 2048 steps per env before update
batch_size=64, # Process in minibatches of 64
n_epochs=10, # 10 gradient descent passes per update
verbose=1
)
Rule of thumb: Larger n_steps → more stable gradients but slower learning. Set batch_size to divide n_steps × n_envs evenly.
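To make the arithmetic concrete, here is a quick check of the buffer/minibatch relationship for the configuration above, assuming 4 parallel environments as in the earlier example (plain Python, no SB3 required):
n_envs, n_steps, batch_size, n_epochs = 4, 2048, 64, 10
rollout_size = n_envs * n_steps                      # 8192 transitions collected per update
assert rollout_size % batch_size == 0                # 64 divides 8192 evenly
minibatches_per_epoch = rollout_size // batch_size   # 128 minibatches
gradient_steps = minibatches_per_epoch * n_epochs    # 1280 gradient steps per update
print(minibatches_per_epoch, gradient_steps)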
2. Discount Factor (Gamma)
The discount factor determines how much future rewards matter:
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Where:
– $G_t$: Total discounted return from timestep $t$
– $R_{t+k+1}$: Reward at timestep $t+k+1$
– $\gamma \in [0, 1)$: Discount factor
# Short-term tasks (quick rewards needed)
model = PPO("MlpPolicy", env, gamma=0.95)
# Long-term planning (robotics, strategy games)
model = PPO("MlpPolicy", env, gamma=0.99)
3. Entropy Coefficient
Entropy bonus encourages exploration by rewarding policy randomness:
$$L = L^{PG} + c_{vf}\, L^{VF} - c_{ent}\, H(\pi)$$

Where:
– $L^{PG}$: Policy gradient loss
– $L^{VF}$: Value function loss (weighted by the vf_coef $c_{vf}$)
– $H(\pi)$: Policy entropy (higher = more exploration)
– $c_{ent}$: Entropy coefficient (typically 0.01)
# High exploration (complex environments)
model = PPO("MlpPolicy", env, ent_coef=0.02)
# Low exploration (near-optimal policy known)
model = PPO("MlpPolicy", env, ent_coef=0.0)
Hyperparameter Configuration Example
Here’s a production-ready PPO configuration for a continuous control task:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
# Vectorize environment for parallel sampling
env = make_vec_env("Pendulum-v1", n_envs=8)
model = PPO(
policy="MlpPolicy",
env=env,
learning_rate=linear_schedule(3e-4), # Decay from 3e-4 to 0
n_steps=2048, # 2048 steps per env
batch_size=64, # Minibatch size
n_epochs=10, # 10 optimization epochs
gamma=0.99, # Strong future discounting
gae_lambda=0.95, # GAE smoothing parameter
clip_range=0.2, # PPO clipping range
ent_coef=0.01, # Moderate exploration
vf_coef=0.5, # Value function loss weight
max_grad_norm=0.5, # Gradient clipping
verbose=1,
tensorboard_log="./ppo_pendulum_tensorboard/"
)
Training Best Practices
Observation Normalization with VecNormalize
Neural networks train best when inputs are normalized. VecNormalize automatically standardizes observations:
from stable_baselines3.common.vec_env import VecNormalize
from stable_baselines3.common.env_util import make_vec_env
# Create and wrap environment
env = make_vec_env("Pendulum-v1", n_envs=4)
env = VecNormalize(
env,
norm_obs=True, # Normalize observations
norm_reward=True, # Normalize rewards
clip_obs=10.0, # Clip normalized obs to [-10, 10]
clip_reward=10.0, # Clip normalized rewards
gamma=0.99 # Discount for reward normalization
)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
# IMPORTANT: Save normalization statistics with model
model.save("ppo_normalized")
env.save("vec_normalize_stats.pkl")
# Load for inference
model = PPO.load("ppo_normalized")
env = make_vec_env("Pendulum-v1", n_envs=1)
env = VecNormalize.load("vec_normalize_stats.pkl", env)
env.training = False # Disable updates during evaluation
env.norm_reward = False # Don't normalize rewards during eval
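With normalization frozen, you can sanity-check the loaded model using SB3's built-in evaluation helper:
from stable_baselines3.common.evaluation import evaluate_policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")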
Reward Scaling and Shaping
If your environment returns very large or small rewards, scale them:
import gymnasium as gym
import numpy as np
class RewardScalingWrapper(gym.RewardWrapper):
"""Scale rewards by constant factor."""
def __init__(self, env, scale=0.01):
super().__init__(env)
self.scale = scale
def reward(self, reward):
return reward * self.scale
# Apply to environment
env = gym.make("YourEnv-v0")
env = RewardScalingWrapper(env, scale=0.01)
Action Clipping
Prevent extreme actions that might destabilize training:
class ActionClippingWrapper(gym.ActionWrapper):
"""Clip actions to safe range."""
def __init__(self, env, min_action=-1.0, max_action=1.0):
super().__init__(env)
self.min_action = min_action
self.max_action = max_action
def action(self, action):
return np.clip(action, self.min_action, self.max_action)
env = ActionClippingWrapper(env, min_action=-0.5, max_action=0.5)
Callbacks: Monitoring and Control
Callbacks hook into the training loop for logging, checkpointing, and early stopping.
EvalCallback: Periodic Evaluation
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
# Training environment
train_env = make_vec_env("CartPole-v1", n_envs=4)
# Separate evaluation environment (no training noise)
eval_env = make_vec_env("CartPole-v1", n_envs=1)
eval_callback = EvalCallback(
eval_env,
best_model_save_path="./logs/best_model/",
log_path="./logs/results/",
eval_freq=10000, # Evaluate every 10k steps
n_eval_episodes=10, # Average over 10 episodes
deterministic=True, # Use deterministic policy
render=False
)
model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=200000, callback=eval_callback)
CheckpointCallback: Regular Model Saving
from stable_baselines3.common.callbacks import CheckpointCallback
checkpoint_callback = CheckpointCallback(
save_freq=50000, # Save every 50k steps
save_path="./checkpoints/",
name_prefix="ppo_model"
)
model.learn(total_timesteps=500000, callback=checkpoint_callback)
# Creates: ppo_model_50000_steps.zip, ppo_model_100000_steps.zip, ...
Custom Callback: Advanced Logging
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np
class CustomLoggingCallback(BaseCallback):
"""
Custom callback for logging additional training metrics.
"""
def __init__(self, verbose=0):
super().__init__(verbose)
self.episode_rewards = []
    def _on_step(self) -> bool:
        # Record rewards of episodes that finished at this step (Monitor adds an "episode" entry to info)
        for info in self.locals.get("infos", []):
            if "episode" in info:
                self.episode_rewards.append(info["episode"]["r"])
        # Access training info
        if len(self.model.ep_info_buffer) > 0:
            # Log mean episode reward
            mean_reward = np.mean([ep_info['r'] for ep_info in self.model.ep_info_buffer])
            self.logger.record('custom/mean_ep_reward', mean_reward)
            # Log mean episode length
            mean_length = np.mean([ep_info['l'] for ep_info in self.model.ep_info_buffer])
            self.logger.record('custom/mean_ep_length', mean_length)
        # Continue training (return False to stop)
        return True
def _on_training_end(self) -> None:
print(f"Training finished. Total episodes: {len(self.episode_rewards)}")
custom_callback = CustomLoggingCallback(verbose=1)
model.learn(total_timesteps=100000, callback=custom_callback)
Combining Multiple Callbacks
from stable_baselines3.common.callbacks import CallbackList
callbacks = CallbackList([
eval_callback,
checkpoint_callback,
custom_callback
])
model.learn(total_timesteps=500000, callback=callbacks)
TensorBoard Integration
Visualize training metrics in real-time:
model = PPO(
"MlpPolicy",
env,
verbose=1,
tensorboard_log="./tensorboard_logs/"
)
model.learn(total_timesteps=100000, tb_log_name="ppo_run_1")
Launch TensorBoard:
tensorboard --logdir ./tensorboard_logs/
# Open browser at http://localhost:6006
Key metrics to monitor:
- rollout/ep_rew_mean: Average episode reward (primary success metric)
- train/entropy_loss: Exploration level (should decrease gradually)
- train/policy_loss: Policy optimization progress
- train/value_loss: Value function accuracy
- train/learning_rate: Current learning rate (if using schedules)
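If you prefer to inspect these scalars from Python rather than the web UI, the tensorboard package can read the event files directly. A sketch, assuming the log directory created by the run above (the exact run folder name, e.g. ppo_run_1_1, depends on how many runs you have launched):
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
acc = EventAccumulator("./tensorboard_logs/ppo_run_1_1")  # Adjust to your run folder
acc.Reload()
rewards = [(event.step, event.value) for event in acc.Scalars("rollout/ep_rew_mean")]
print(rewards[-5:])  # Last few (timestep, mean episode reward) pairs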
Model Persistence and Transfer Learning
Saving and Loading Models
# Train and save
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
model.save("ppo_model")
# Load and continue training
loaded_model = PPO.load("ppo_model", env=env)
loaded_model.learn(total_timesteps=50000) # Train 50k more steps
Transfer Learning: Pre-training on Easier Tasks
# Phase 1: Train on simple environment
easy_env = make_vec_env("CartPole-v1", n_envs=4)
model = PPO("MlpPolicy", easy_env, verbose=1)
model.learn(total_timesteps=100000)
model.save("pretrained_model")
# Phase 2: Fine-tune on harder environment
hard_env = make_vec_env("CartPole-v1", n_envs=4)  # Could be a harder custom env with the same observation/action spaces
model = PPO.load("pretrained_model", env=hard_env)
model.learn(total_timesteps=200000) # Fine-tune with more steps
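An optional refinement for the fine-tuning step (a sketch; custom_objects replaces stored attributes at load time, and reset_num_timesteps keeps the timestep counter running across phases):
model = PPO.load(
    "pretrained_model",
    env=hard_env,
    custom_objects={"learning_rate": 1e-4},  # Lower learning rate for fine-tuning
)
model.learn(total_timesteps=200000, reset_num_timesteps=False)  # Continue the timestep count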
Multi-Environment Training and Hardware Acceleration
Parallel Environment Sampling
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Use subprocesses for CPU-intensive environments
env = make_vec_env(
"Pendulum-v1",
n_envs=16, # 16 parallel environments
vec_env_cls=SubprocVecEnv # Each in separate process
)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500000)  # Up to ~16x faster sampling, depending on the environment
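One caveat: SubprocVecEnv spawns worker processes, so on platforms that use the spawn start method (Windows, and macOS by default) the code above should live under a main guard:
if __name__ == "__main__":
    env = make_vec_env("Pendulum-v1", n_envs=16, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=500000)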
GPU Acceleration
import torch
# Use the GPU if available, otherwise fall back to CPU (SB3's default device="auto" does the same)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PPO(
"MlpPolicy",
env,
device=device, # Use GPU for network updates
verbose=1
)
print(f"Training on: {model.device}")
Debugging Training: Interpreting Key Metrics
Reward Curves Analysis
Healthy training:
– Steady upward trend in rollout/ep_rew_mean
– Low variance after initial exploration phase
Warning signs:
| Symptom | Possible Cause | Solution |
|---|---|---|
| Flat reward curve | Learning rate too low | Increase LR to 3e-4 or 1e-3 |
| Reward spikes then crashes | Learning rate too high | Decrease LR, add gradient clipping |
| Oscillating rewards | Unstable value function | Lower vf_coef, increase n_steps |
| No improvement after 100k steps | Wrong hyperparameters | Try different algorithm (SAC for continuous) |
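The evaluations.npz file written by EvalCallback (via log_path="./logs/results/" earlier) is a convenient source for these curves. A sketch, assuming matplotlib is installed:
import numpy as np
import matplotlib.pyplot as plt
data = np.load("./logs/results/evaluations.npz")
timesteps = data["timesteps"]
mean_rewards = data["results"].mean(axis=1)  # Average over the n_eval_episodes per evaluation
plt.plot(timesteps, mean_rewards)
plt.xlabel("Timesteps")
plt.ylabel("Mean evaluation reward")
plt.show()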
KL Divergence Monitoring
KL divergence measures policy change between updates:

$$D_{KL}\big(\pi_{\text{old}} \,\|\, \pi_{\text{new}}\big) = \mathbb{E}_{a \sim \pi_{\text{old}}}\!\left[\log \frac{\pi_{\text{old}}(a \mid s)}{\pi_{\text{new}}(a \mid s)}\right]$$
For PPO, monitor train/approx_kl:
- Healthy range: 0.01 – 0.05 (gradual policy updates)
- Too low (< 0.001): Learning too conservatively; increase the learning rate
- Too high (> 0.1): Policy changing too fast; decrease the learning rate or reduce clip_range
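If approx_kl keeps drifting too high, PPO also offers a target_kl parameter that cuts the current round of gradient updates short once the divergence grows too large relative to the target:
model = PPO(
    "MlpPolicy",
    env,
    target_kl=0.05,  # Stop the remaining update epochs early when approx_kl grows well past this target
    verbose=1,
)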
Explained Variance Interpretation
Explained variance measures value function accuracy:
$$EV = 1 - \frac{\mathrm{Var}(G - V)}{\mathrm{Var}(G)}$$

Where:
– $G$: Actual returns
– $V$: Value function predictions
– EV = 1: Perfect predictions
– EV = 0: No better than predicting the mean
– EV < 0: Worse than predicting the mean
Interpretation:
- EV > 0.8: Excellent value function
- 0.5 < EV < 0.8: Acceptable, training progressing
- EV < 0.5: Value function struggling, increase network size or decrease gamma
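A common response to persistently low explained variance is a larger value network. In recent SB3 versions this is set through policy_kwargs with the net_arch dict form; the layer sizes below are illustrative, not a recommendation:
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=dict(pi=[64, 64], vf=[256, 256])),  # Wider critic than actor
    verbose=1,
)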
Debugging Code Example
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
class DiagnosticCallback(BaseCallback):
"""Monitor training health metrics."""
    def _on_step(self) -> bool:
        # The logger keeps the most recently recorded scalars in name_to_value
        metrics = self.model.logger.name_to_value
        # Check for NaN in the policy loss
        policy_loss = metrics.get('train/policy_gradient_loss', 0.0)
        if np.isnan(policy_loss):
            print("WARNING: NaN detected in policy loss!")
            return False  # Stop training
        # Check KL divergence
        approx_kl = metrics.get('train/approx_kl', 0.0)
        if approx_kl > 0.1:
            print(f"WARNING: High KL divergence ({approx_kl:.4f}). Policy changing too fast.")
        return True
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./logs/")
model.learn(total_timesteps=100000, callback=DiagnosticCallback())
Conclusion
Stable Baselines3 transforms RL research into production-ready systems through its robust implementation and extensive tooling. The key takeaways:
- Algorithm selection matters: Use PPO for general tasks, SAC for sample efficiency, DQN for discrete spaces
- Hyperparameter tuning is critical: Start with defaults, then adjust learning rate, n_steps, and gamma based on your task horizon
- Preprocessing is essential: Always use VecNormalize for continuous control, and scale rewards appropriately
- Monitor training closely: Use TensorBoard and callbacks to catch instabilities early
- Debug systematically: Analyze reward curves, KL divergence, and explained variance to diagnose issues
In the next episode, we’ll dive deep into reward engineering—the art of shaping agent behaviors for complex tasks like financial trading and robotic manipulation. You’ll learn how to design reward functions that guide agents toward desired behaviors without introducing unintended consequences.