Part 4: Stable Baselines3: Practical Tips for Training Robust Agents

Updated Feb 6, 2026

Introduction

In the previous episodes, we explored the theoretical foundations of reinforcement learning through Markov Decision Processes, built custom Gym environments, and compared different algorithmic approaches. Now it’s time to bridge theory and practice using Stable Baselines3 (SB3), the industry-standard library for production-ready RL training.

Stable Baselines3 provides reliable, well-tested implementations of state-of-the-art RL algorithms with a consistent API. This episode covers the architecture of SB3, hyperparameter tuning strategies, training best practices, and debugging techniques that will help you train robust agents efficiently.

Stable Baselines3 Architecture and Supported Algorithms

Core Design Philosophy

SB3 is built on three key principles:

  1. Consistency: All algorithms share a common interface (learn(), predict(), save(), load())
  2. Modularity: Components like policies, buffers, and callbacks are interchangeable
  3. Reliability: Extensive testing ensures reproducible results

Supported Algorithms

SB3 implements major RL algorithm families:

| Algorithm | Type       | Action Space        | Best For                        | Stability |
|-----------|------------|---------------------|---------------------------------|-----------|
| PPO       | On-policy  | Discrete/Continuous | General-purpose, robotics       | High      |
| SAC       | Off-policy | Continuous only     | Sample efficiency, fine control | High      |
| TD3       | Off-policy | Continuous only     | Robotics, low variance          | Medium    |
| A2C       | On-policy  | Discrete/Continuous | Fast prototyping                | Medium    |
| DQN       | Off-policy | Discrete only       | Atari games, simple tasks       | Medium    |

Installation and basic usage:

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Create vectorized environment (4 parallel instances)
env = make_vec_env("CartPole-v1", n_envs=4)

# Initialize PPO agent with default hyperparameters
model = PPO("MlpPolicy", env, verbose=1)

# Train for 100k timesteps
model.learn(total_timesteps=100000)

# Save trained model
model.save("ppo_cartpole")

# Load and evaluate
model = PPO.load("ppo_cartpole")
obs = env.reset()  # VecEnv reset returns only the batched observations
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, infos = env.step(action)  # VecEnv step returns a 4-tuple

Hyperparameter Tuning: The Science of RL Training

Hyperparameters dramatically affect training performance. Here’s a systematic approach to tuning the most critical parameters.

Learning Rate Schedules

The learning rate $\alpha$ controls the gradient descent step size. A decaying schedule often improves stability:

from stable_baselines3.common.callbacks import BaseCallback
import numpy as np

def linear_schedule(initial_value: float):
    """
    Linear learning rate schedule.

    Args:
        initial_value: Starting learning rate

    Returns:
        Schedule function that takes progress (0 to 1) and returns current LR
    """
    def func(progress_remaining: float) -> float:
        # progress_remaining goes from 1 to 0
        return progress_remaining * initial_value
    return func

# Apply linear decay from 3e-4 to 0
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=linear_schedule(3e-4),
    verbose=1
)

For exponential decay:

def exponential_schedule(initial_value: float, decay_rate: float = 0.96):
    def func(progress_remaining: float) -> float:
        # Exponential decay: lr = initial * decay^(1 - progress)
        return initial_value * (decay_rate ** (1 - progress_remaining))
    return func

from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    env,  # SAC requires a continuous action space
    learning_rate=exponential_schedule(1e-3, decay_rate=0.95)
)

Critical Hyperparameters Explained

1. Batch Size and n_steps

For on-policy algorithms (PPO, A2C):

  • n_steps: Number of steps to collect before each update
  • batch_size: Minibatch size for gradient updates
  • Total experience per update: n_envs × n_steps

model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,        # Collect 2048 steps per env before update
    batch_size=64,       # Process in minibatches of 64
    n_epochs=10,         # 10 gradient descent passes per update
    verbose=1
)

Rule of thumb: larger n_steps → more stable gradient estimates, but fewer policy updates per unit of experience. Choose batch_size so that it evenly divides n_envs × n_steps, as in the sanity check below.
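Before launching a long run, it is worth checking that these settings are consistent with each other. A minimal sketch, using illustrative values that mirror the snippets above:

n_envs = 4           # Matches n_envs passed to make_vec_env
n_steps = 2048
batch_size = 64

rollout_size = n_envs * n_steps  # Transitions collected per update (8192 here)
assert rollout_size % batch_size == 0, "batch_size should evenly divide n_envs * n_steps"
print(f"{rollout_size // batch_size} minibatches per optimization epoch")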

2. Discount Factor (Gamma)

The discount factor $\gamma$ determines how much future rewards matter:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

Where:
– $R_t$: Total discounted return from timestep $t$
– $r_{t+k}$: Reward at timestep $t+k$
– $\gamma \in [0, 1]$: Discount factor

# Short-term tasks (quick rewards needed)
model = PPO("MlpPolicy", env, gamma=0.95)

# Long-term planning (robotics, strategy games)
model = PPO("MlpPolicy", env, gamma=0.99)

3. Entropy Coefficient

An entropy bonus $H(\pi)$ encourages exploration by rewarding policy randomness:

$$L_{total} = L_{policy} + c_1 L_{value} - c_2 H(\pi)$$

Where:
– $L_{policy}$: Policy gradient loss
– $L_{value}$: Value function loss
– $H(\pi)$: Policy entropy (higher = more exploration)
– $c_2$: Entropy coefficient (typically 0.01)

# High exploration (complex environments)
model = PPO("MlpPolicy", env, ent_coef=0.02)

# Low exploration (near-optimal policy known)
model = PPO("MlpPolicy", env, ent_coef=0.0)

Hyperparameter Configuration Example

Here’s a production-ready PPO configuration for a continuous control task:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Vectorize environment for parallel sampling
env = make_vec_env("Pendulum-v1", n_envs=8)

model = PPO(
    policy="MlpPolicy",
    env=env,
    learning_rate=linear_schedule(3e-4),  # Decay from 3e-4 to 0
    n_steps=2048,                          # 2048 steps per env
    batch_size=64,                         # Minibatch size
    n_epochs=10,                           # 10 optimization epochs
    gamma=0.99,                            # Long effective horizon (weak discounting)
    gae_lambda=0.95,                       # GAE smoothing parameter
    clip_range=0.2,                        # PPO clipping range
    ent_coef=0.01,                         # Moderate exploration
    vf_coef=0.5,                           # Value function loss weight
    max_grad_norm=0.5,                     # Gradient clipping
    verbose=1,
    tensorboard_log="./ppo_pendulum_tensorboard/"
)

Training Best Practices

Observation Normalization with VecNormalize

Neural networks train best when inputs are normalized. VecNormalize automatically standardizes observations:

from stable_baselines3.common.vec_env import VecNormalize
from stable_baselines3.common.env_util import make_vec_env

# Create and wrap environment
env = make_vec_env("Pendulum-v1", n_envs=4)
env = VecNormalize(
    env,
    norm_obs=True,          # Normalize observations
    norm_reward=True,       # Normalize rewards
    clip_obs=10.0,          # Clip normalized obs to [-10, 10]
    clip_reward=10.0,       # Clip normalized rewards
    gamma=0.99              # Discount for reward normalization
)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

# IMPORTANT: Save normalization statistics with model
model.save("ppo_normalized")
env.save("vec_normalize_stats.pkl")

# Load for inference
model = PPO.load("ppo_normalized")
env = make_vec_env("Pendulum-v1", n_envs=1)
env = VecNormalize.load("vec_normalize_stats.pkl", env)
env.training = False  # Disable updates during evaluation
env.norm_reward = False  # Don't normalize rewards during eval
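To score the loaded model quantitatively, SB3's evaluate_policy helper averages returns over several episodes. A minimal sketch on the evaluation environment prepared above:

from stable_baselines3.common.evaluation import evaluate_policy

# Average episode return over 10 deterministic rollouts
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")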

Reward Scaling and Shaping

If your environment returns very large or small rewards, scale them:

import gymnasium as gym
import numpy as np

class RewardScalingWrapper(gym.RewardWrapper):
    """Scale rewards by constant factor."""
    def __init__(self, env, scale=0.01):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        return reward * self.scale

# Apply to environment
env = gym.make("YourEnv-v0")
env = RewardScalingWrapper(env, scale=0.01)

Action Clipping

Prevent extreme actions that might destabilize training:

class ActionClippingWrapper(gym.ActionWrapper):
    """Clip actions to safe range."""
    def __init__(self, env, min_action=-1.0, max_action=1.0):
        super().__init__(env)
        self.min_action = min_action
        self.max_action = max_action

    def action(self, action):
        return np.clip(action, self.min_action, self.max_action)

env = ActionClippingWrapper(env, min_action=-0.5, max_action=0.5)

Callbacks: Monitoring and Control

Callbacks hook into the training loop for logging, checkpointing, and early stopping.

EvalCallback: Periodic Evaluation

from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

# Training environment
train_env = make_vec_env("CartPole-v1", n_envs=4)

# Separate evaluation environment (no training noise)
eval_env = make_vec_env("CartPole-v1", n_envs=1)

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./logs/best_model/",
    log_path="./logs/results/",
    eval_freq=10000,         # Evaluate every 10k steps
    n_eval_episodes=10,      # Average over 10 episodes
    deterministic=True,      # Use deterministic policy
    render=False
)

model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=200000, callback=eval_callback)
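EvalCallback can also stop training early once the agent is good enough, via its callback_on_new_best hook. A sketch using StopTrainingOnRewardThreshold (475 is an illustrative threshold for CartPole-v1):

from stable_baselines3.common.callbacks import StopTrainingOnRewardThreshold

# Fires whenever EvalCallback records a new best mean reward; stops once it crosses the threshold
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=475, verbose=1)

eval_callback = EvalCallback(
    eval_env,
    callback_on_new_best=stop_callback,
    eval_freq=10000,
    n_eval_episodes=10,
    deterministic=True
)

model.learn(total_timesteps=200000, callback=eval_callback)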

CheckpointCallback: Regular Model Saving

from stable_baselines3.common.callbacks import CheckpointCallback

checkpoint_callback = CheckpointCallback(
    save_freq=50000,              # Save every 50k steps
    save_path="./checkpoints/",
    name_prefix="ppo_model"
)

model.learn(total_timesteps=500000, callback=checkpoint_callback)
# Creates: ppo_model_50000_steps.zip, ppo_model_100000_steps.zip, ...

Custom Callback: Advanced Logging

from stable_baselines3.common.callbacks import BaseCallback
import numpy as np

class CustomLoggingCallback(BaseCallback):
    """
    Custom callback for logging additional training metrics.
    """
    def __init__(self, verbose=0):
        super().__init__(verbose)

    def _on_step(self) -> bool:
        # Access training info
        if len(self.model.ep_info_buffer) > 0:
            # Log mean episode reward
            mean_reward = np.mean([ep_info['r'] for ep_info in self.model.ep_info_buffer])
            self.logger.record('custom/mean_ep_reward', mean_reward)

            # Log mean episode length
            mean_length = np.mean([ep_info['l'] for ep_info in self.model.ep_info_buffer])
            self.logger.record('custom/mean_ep_length', mean_length)

        # Continue training (return False to stop)
        return True

    def _on_training_end(self) -> None:
        print(f"Training finished after {self.num_timesteps} timesteps.")

custom_callback = CustomLoggingCallback(verbose=1)
model.learn(total_timesteps=100000, callback=custom_callback)

Combining Multiple Callbacks

from stable_baselines3.common.callbacks import CallbackList

callbacks = CallbackList([
    eval_callback,
    checkpoint_callback,
    custom_callback
])

model.learn(total_timesteps=500000, callback=callbacks)

TensorBoard Integration

Visualize training metrics in real-time:

model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    tensorboard_log="./tensorboard_logs/"
)

model.learn(total_timesteps=100000, tb_log_name="ppo_run_1")

Launch TensorBoard:

tensorboard --logdir ./tensorboard_logs/
# Open browser at http://localhost:6006

Key metrics to monitor:

  • rollout/ep_rew_mean: Average episode reward (primary success metric)
  • train/entropy_loss: Negative policy entropy; it rises toward zero as exploration gradually decreases
  • train/policy_loss: Policy optimization progress
  • train/value_loss: Value function accuracy
  • train/learning_rate: Current learning rate (if using schedules)

Model Persistence and Transfer Learning

Saving and Loading Models

# Train and save
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
model.save("ppo_model")

# Load and continue training
loaded_model = PPO.load("ppo_model", env=env)
loaded_model.learn(total_timesteps=50000)  # Train 50k more steps
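By default, learn() resets the internal timestep counter on each call. To make TensorBoard curves continue from the first run instead of restarting at zero, pass reset_num_timesteps=False:

# Continue the step counter from 100k instead of restarting at 0
loaded_model.learn(total_timesteps=50000, reset_num_timesteps=False)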

Transfer Learning: Pre-training on Easier Tasks

# Phase 1: Train on simple environment
easy_env = make_vec_env("CartPole-v1", n_envs=4)
model = PPO("MlpPolicy", easy_env, verbose=1)
model.learn(total_timesteps=100000)
model.save("pretrained_model")

# Phase 2: Fine-tune on harder environment
hard_env = make_vec_env("CartPole-v1", n_envs=4)  # Could be custom env
model = PPO.load("pretrained_model", env=hard_env)
model.learn(total_timesteps=200000)  # Fine-tune with more steps
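When fine-tuning, it often helps to lower the learning rate so the pre-trained policy is not overwritten too aggressively. One way to do that (a sketch; 1e-4 is an illustrative value) is to override the saved hyperparameter through the custom_objects argument of load():

# Replace the stored learning rate with a smaller one for gentler fine-tuning
model = PPO.load(
    "pretrained_model",
    env=hard_env,
    custom_objects={"learning_rate": 1e-4}
)
model.learn(total_timesteps=200000)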

Multi-Environment Training and Hardware Acceleration

Parallel Environment Sampling

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# Use subprocesses for CPU-intensive environments
env = make_vec_env(
    "Pendulum-v1",
    n_envs=16,              # 16 parallel environments
    vec_env_cls=SubprocVecEnv  # Each in separate process
)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500000)  # Sampling throughput scales with n_envs, minus process overhead

GPU Acceleration

import torch

# Use the GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = PPO(
    "MlpPolicy",
    env,
    device=device,  # Use GPU for network updates
    verbose=1
)

print(f"Training on: {model.device}")

Debugging Training: Interpreting Key Metrics

Reward Curves Analysis

Healthy training:
– Steady upward trend in rollout/ep_rew_mean
– Low variance after initial exploration phase

Warning signs:

| Symptom | Possible Cause | Solution |
|---------|----------------|----------|
| Flat reward curve | Learning rate too low | Increase LR to 3e-4 or 1e-3 |
| Reward spikes then crashes | Learning rate too high | Decrease LR, add gradient clipping |
| Oscillating rewards | Unstable value function | Lower vf_coef, increase n_steps |
| No improvement after 100k steps | Wrong hyperparameters | Try a different algorithm (SAC for continuous) |

KL Divergence Monitoring

The KL divergence $D_{KL}$ measures how much the policy changes between updates:

$$D_{KL}(\pi_{old} \,\|\, \pi_{new}) = \sum_a \pi_{old}(a|s) \log \frac{\pi_{old}(a|s)}{\pi_{new}(a|s)}$$

For PPO, monitor train/approx_kl:

  • Healthy range: 0.01 – 0.05 (gradual policy updates)
  • Too low (< 0.001): Learning too conservatively; increase the learning rate
  • Too high (> 0.1): Policy changing too fast; decrease the learning rate or reduce clip_range (or set target_kl, as sketched below)
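PPO can enforce this bound automatically: the target_kl parameter aborts the remaining optimization epochs of an update once the approximate KL grows past the target. A short sketch (0.03 is an illustrative value):

# Stop the current update's optimization epochs early if approx_kl exceeds the target
model = PPO(
    "MlpPolicy",
    env,
    target_kl=0.03,
    verbose=1
)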

Explained Variance Interpretation

Explained variance measures value function accuracy:

$$\text{EV} = 1 - \frac{\text{Var}(R - V)}{\text{Var}(R)}$$

Where:
– $R$: Actual returns
– $V$: Value function predictions
– EV = 1: Perfect predictions
– EV = 0: No better than predicting the mean
– EV < 0: Worse than predicting the mean

Interpretation:

  • EV > 0.8: Excellent value function
  • 0.5 < EV < 0.8: Acceptable, training progressing
  • EV < 0.5: Value function struggling, increase network size or decrease gamma
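If the value function keeps underfitting, the usual first lever is a larger network. A sketch that widens the default MLP via policy_kwargs (the layer sizes are illustrative):

# Two hidden layers of 256 units for both the policy and value networks
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=[256, 256]),
    verbose=1
)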

Debugging Code Example

import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class DiagnosticCallback(BaseCallback):
    """Monitor training health metrics."""
    def _on_step(self) -> bool:
        # The logger's name_to_value dict holds the most recently recorded metrics
        metrics = self.model.logger.name_to_value

        # Check for NaN in the policy gradient loss
        policy_loss = metrics.get('train/policy_gradient_loss', 0.0)
        if np.isnan(policy_loss):
            print("WARNING: NaN detected in policy loss!")
            return False  # Stop training

        # Check KL divergence
        approx_kl = metrics.get('train/approx_kl', 0.0)
        if approx_kl > 0.1:
            print(f"WARNING: High KL divergence ({approx_kl:.4f}). Policy changing too fast.")

        return True

model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./logs/")
model.learn(total_timesteps=100000, callback=DiagnosticCallback())

Conclusion

Stable Baselines3 transforms RL research into production-ready systems through its robust implementation and extensive tooling. The key takeaways:

  1. Algorithm selection matters: Use PPO for general tasks, SAC for sample efficiency, DQN for discrete spaces
  2. Hyperparameter tuning is critical: Start with defaults, then adjust learning rate, n_steps, and gamma based on your task horizon
  3. Preprocessing is essential: Always use VecNormalize for continuous control, scale rewards appropriately
  4. Monitor training closely: Use TensorBoard and callbacks to catch instabilities early
  5. Debug systematically: Analyze reward curves, KL divergence, and explained variance to diagnose issues

In the next episode, we’ll dive deep into reward engineering—the art of shaping agent behaviors for complex tasks like financial trading and robotic manipulation. You’ll learn how to design reward functions that guide agents toward desired behaviors without introducing unintended consequences.

Deep Reinforcement Learning: From Theory to Custom Environments Series (4/6)
