Introduction
In the previous episodes, we explored the theoretical foundations of reinforcement learning through Markov Decision Processes, built custom Gym environments, and compared different algorithmic approaches. Now it’s time to bridge theory and practice using Stable Baselines3 (SB3), the industry-standard library for production-ready RL training.
Stable Baselines3 provides reliable, well-tested implementations of state-of-the-art RL algorithms with a consistent API. This episode covers the architecture of SB3, hyperparameter tuning strategies, training best practices, and debugging techniques that will help you train robust agents efficiently.
Stable Baselines3 Architecture and Supported Algorithms
Core Design Philosophy
SB3 is built on three key principles:
- Consistency: All algorithms share a common interface (learn(), predict(), save(), load())
- Modularity: Components like policies, buffers, and callbacks are interchangeable
- Reliability: Extensive testing ensures reproducible results
Supported Algorithms
SB3 implements major RL algorithm families:
| Algorithm | Type | Action Space | Best For | Stability |
|---|---|---|---|---|
| PPO | On-policy | Discrete/Continuous | General-purpose, robotics | High |
| SAC | Off-policy | Continuous only | Sample efficiency, fine control | High |
| TD3 | Off-policy | Continuous only | Robotics, low variance | Medium |
| A2C | On-policy | Discrete/Continuous | Fast prototyping | Medium |
| DQN | Off-policy | Discrete only | Atari games, simple tasks | Medium |
Installation (pip install stable-baselines3) and basic usage:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
# Create vectorized environment (4 parallel instances)
env = make_vec_env("CartPole-v1", n_envs=4)
# Initialize PPO agent with default hyperparameters
model = PPO("MlpPolicy", env, verbose=1)
# Train for 100k timesteps
model.learn(total_timesteps=100000)
# Save trained model
model.save("ppo_cartpole")
# Load and evaluate on a single (non-vectorized) environment
model = PPO.load("ppo_cartpole")
eval_env = gym.make("CartPole-v1")
obs, info = eval_env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    if terminated or truncated:
        obs, info = eval_env.reset()
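The same pattern carries over to the other algorithms in the table above: because they share the learn()/predict()/save()/load() interface, switching is usually a one-line change. A minimal sketch (short runs on Pendulum-v1, chosen because SAC requires a continuous action space):
import gymnasium as gym
from stable_baselines3 import A2C, PPO, SAC
env = gym.make("Pendulum-v1")
for algo_cls in (PPO, A2C, SAC):
    # Identical training and persistence calls for every algorithm
    model = algo_cls("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=10000)  # Short run, just to exercise the API
    model.save(f"{algo_cls.__name__.lower()}_pendulum")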
Hyperparameter Tuning: The Science of RL Training
Hyperparameters dramatically affect training performance. Here’s a systematic approach to tuning the most critical parameters.
Learning Rate Schedules
The learning rate controls gradient descent step size. A decaying schedule often improves stability:
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np
def linear_schedule(initial_value: float):
"""
Linear learning rate schedule.
Args:
initial_value: Starting learning rate
Returns:
Schedule function that takes progress (0 to 1) and returns current LR
"""
def func(progress_remaining: float) -> float:
# progress_remaining goes from 1 to 0
return progress_remaining * initial_value
return func
# Apply linear decay from 3e-4 to 0
model = PPO(
"MlpPolicy",
env,
learning_rate=linear_schedule(3e-4),
verbose=1
)
For exponential decay:
def exponential_schedule(initial_value: float, decay_rate: float = 0.96):
def func(progress_remaining: float) -> float:
# Exponential decay: lr = initial * decay^(1 - progress)
return initial_value * (decay_rate ** (1 - progress_remaining))
return func
from stable_baselines3 import SAC
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=exponential_schedule(1e-3, decay_rate=0.95)
)
Critical Hyperparameters Explained
1. Batch Size and n_steps
For on-policy algorithms (PPO, A2C):
- n_steps: Number of steps to collect before each update
- batch_size: Minibatch size for gradient updates
- Total experience per update: n_envs × n_steps
model = PPO(
"MlpPolicy",
env,
n_steps=2048, # Collect 2048 steps per env before update
batch_size=64, # Process in minibatches of 64
n_epochs=10, # 10 gradient descent passes per update
verbose=1
)
Rule of thumb: Larger n_steps → more stable gradients but slower learning. Set batch_size to divide n_steps × n_envs evenly.
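To make the arithmetic concrete, here is a quick check of the buffer/minibatch relationship for the configuration above, assuming 4 parallel environments as in the earlier example (plain Python, no SB3 required):
n_envs, n_steps, batch_size, n_epochs = 4, 2048, 64, 10
rollout_size = n_envs * n_steps                      # 8192 transitions collected per update
assert rollout_size % batch_size == 0                # 64 divides 8192 evenly
minibatches_per_epoch = rollout_size // batch_size   # 128 minibatches
gradient_steps = minibatches_per_epoch * n_epochs    # 1280 gradient steps per update
print(minibatches_per_epoch, gradient_steps)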
2. Discount Factor (Gamma)
The discount factor determines how much future rewards matter:
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Where:
– $G_t$: Total discounted return from timestep $t$
– $R_{t+k+1}$: Reward at timestep $t+k+1$
– $\gamma \in [0, 1)$: Discount factor
# Short-term tasks (quick rewards needed)
model = PPO("MlpPolicy", env, gamma=0.95)
# Long-term planning (robotics, strategy games)
model = PPO("MlpPolicy", env, gamma=0.99)
3. Entropy Coefficient
Entropy bonus encourages exploration by rewarding policy randomness:
$$L = L^{PG} + c_{vf}\, L^{VF} - c_{ent}\, H(\pi)$$

Where:
– $L^{PG}$: Policy gradient loss
– $L^{VF}$: Value function loss (weighted by the vf_coef $c_{vf}$)
– $H(\pi)$: Policy entropy (higher = more exploration)
– $c_{ent}$: Entropy coefficient (typically 0.01)
# High exploration (complex environments)
model = PPO("MlpPolicy", env, ent_coef=0.02)
# Low exploration (near-optimal policy known)
model = PPO("MlpPolicy", env, ent_coef=0.0)
Hyperparameter Configuration Example
Here’s a production-ready PPO configuration for a continuous control task:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
# Vectorize environment for parallel sampling
env = make_vec_env("Pendulum-v1", n_envs=8)
model = PPO(
policy="MlpPolicy",
env=env,
learning_rate=linear_schedule(3e-4), # Decay from 3e-4 to 0
n_steps=2048, # 2048 steps per env
batch_size=64, # Minibatch size
n_epochs=10, # 10 optimization epochs
gamma=0.99, # Strong future discounting
gae_lambda=0.95, # GAE smoothing parameter
clip_range=0.2, # PPO clipping range
ent_coef=0.01, # Moderate exploration
vf_coef=0.5, # Value function loss weight
max_grad_norm=0.5, # Gradient clipping
verbose=1,
tensorboard_log="./ppo_pendulum_tensorboard/"
)
Training Best Practices
Observation Normalization with VecNormalize
Neural networks train best when inputs are normalized. VecNormalize automatically standardizes observations:
from stable_baselines3.common.vec_env import VecNormalize
from stable_baselines3.common.env_util import make_vec_env
# Create and wrap environment
env = make_vec_env("Pendulum-v1", n_envs=4)
env = VecNormalize(
env,
norm_obs=True, # Normalize observations
norm_reward=True, # Normalize rewards
clip_obs=10.0, # Clip normalized obs to [-10, 10]
clip_reward=10.0, # Clip normalized rewards
gamma=0.99 # Discount for reward normalization
)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
# IMPORTANT: Save normalization statistics with model
model.save("ppo_normalized")
env.save("vec_normalize_stats.pkl")
# Load for inference
model = PPO.load("ppo_normalized")
env = make_vec_env("Pendulum-v1", n_envs=1)
env = VecNormalize.load("vec_normalize_stats.pkl", env)
env.training = False # Disable updates during evaluation
env.norm_reward = False # Don't normalize rewards during eval
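With normalization frozen, you can sanity-check the loaded model using SB3's built-in evaluation helper:
from stable_baselines3.common.evaluation import evaluate_policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")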
Reward Scaling and Shaping
If your environment returns very large or small rewards, scale them:
import gymnasium as gym
import numpy as np
class RewardScalingWrapper(gym.RewardWrapper):
"""Scale rewards by constant factor."""
def __init__(self, env, scale=0.01):
super().__init__(env)
self.scale = scale
def reward(self, reward):
return reward * self.scale
# Apply to environment
env = gym.make("YourEnv-v0")
env = RewardScalingWrapper(env, scale=0.01)
Action Clipping
Prevent extreme actions that might destabilize training:
class ActionClippingWrapper(gym.ActionWrapper):
"""Clip actions to safe range."""
def __init__(self, env, min_action=-1.0, max_action=1.0):
super().__init__(env)
self.min_action = min_action
self.max_action = max_action
def action(self, action):
return np.clip(action, self.min_action, self.max_action)
env = ActionClippingWrapper(env, min_action=-0.5, max_action=0.5)
Callbacks: Monitoring and Control
Callbacks hook into the training loop for logging, checkpointing, and early stopping.
EvalCallback: Periodic Evaluation
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
# Training environment
train_env = make_vec_env("CartPole-v1", n_envs=4)
# Separate evaluation environment (no training noise)
eval_env = make_vec_env("CartPole-v1", n_envs=1)
eval_callback = EvalCallback(
eval_env,
best_model_save_path="./logs/best_model/",
log_path="./logs/results/",
eval_freq=10000, # Evaluate every 10k steps
n_eval_episodes=10, # Average over 10 episodes
deterministic=True, # Use deterministic policy
render=False
)
model = PPO("MlpPolicy", train_env, verbose=1)
model.learn(total_timesteps=200000, callback=eval_callback)
CheckpointCallback: Regular Model Saving
from stable_baselines3.common.callbacks import CheckpointCallback
checkpoint_callback = CheckpointCallback(
save_freq=50000, # Save every 50k steps
save_path="./checkpoints/",
name_prefix="ppo_model"
)
model.learn(total_timesteps=500000, callback=checkpoint_callback)
# Creates: ppo_model_50000_steps.zip, ppo_model_100000_steps.zip, ...
Custom Callback: Advanced Logging
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np
class CustomLoggingCallback(BaseCallback):
"""
Custom callback for logging additional training metrics.
"""
def __init__(self, verbose=0):
super().__init__(verbose)
self.episode_rewards = []
    def _on_step(self) -> bool:
        # Record rewards of episodes that finished at this step (Monitor adds an "episode" entry to info)
        for info in self.locals.get("infos", []):
            if "episode" in info:
                self.episode_rewards.append(info["episode"]["r"])
        # Access training info
        if len(self.model.ep_info_buffer) > 0:
            # Log mean episode reward
            mean_reward = np.mean([ep_info['r'] for ep_info in self.model.ep_info_buffer])
            self.logger.record('custom/mean_ep_reward', mean_reward)
            # Log mean episode length
            mean_length = np.mean([ep_info['l'] for ep_info in self.model.ep_info_buffer])
            self.logger.record('custom/mean_ep_length', mean_length)
        # Continue training (return False to stop)
        return True
def _on_training_end(self) -> None:
print(f"Training finished. Total episodes: {len(self.episode_rewards)}")
custom_callback = CustomLoggingCallback(verbose=1)
model.learn(total_timesteps=100000, callback=custom_callback)
Combining Multiple Callbacks
from stable_baselines3.common.callbacks import CallbackList
callbacks = CallbackList([
eval_callback,
checkpoint_callback,
custom_callback
])
model.learn(total_timesteps=500000, callback=callbacks)
TensorBoard Integration
Visualize training metrics in real-time:
model = PPO(
"MlpPolicy",
env,
verbose=1,
tensorboard_log="./tensorboard_logs/"
)
model.learn(total_timesteps=100000, tb_log_name="ppo_run_1")
Launch TensorBoard:
tensorboard --logdir ./tensorboard_logs/
# Open browser at http://localhost:6006
Key metrics to monitor:
- rollout/ep_rew_mean: Average episode reward (primary success metric)
- train/entropy_loss: Exploration level (should decrease gradually)
- train/policy_loss: Policy optimization progress
- train/value_loss: Value function accuracy
- train/learning_rate: Current learning rate (if using schedules)
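If you prefer to inspect these scalars from Python rather than the web UI, the tensorboard package can read the event files directly. A sketch, assuming the log directory created by the run above (the exact run folder name, e.g. ppo_run_1_1, depends on how many runs you have launched):
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
acc = EventAccumulator("./tensorboard_logs/ppo_run_1_1")  # Adjust to your run folder
acc.Reload()
rewards = [(event.step, event.value) for event in acc.Scalars("rollout/ep_rew_mean")]
print(rewards[-5:])  # Last few (timestep, mean episode reward) pairs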
Model Persistence and Transfer Learning
Saving and Loading Models
# Train and save
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
model.save("ppo_model")
# Load and continue training
loaded_model = PPO.load("ppo_model", env=env)
loaded_model.learn(total_timesteps=50000) # Train 50k more steps
Transfer Learning: Pre-training on Easier Tasks
# Phase 1: Train on simple environment
easy_env = make_vec_env("CartPole-v1", n_envs=4)
model = PPO("MlpPolicy", easy_env, verbose=1)
model.learn(total_timesteps=100000)
model.save("pretrained_model")
# Phase 2: Fine-tune on harder environment
hard_env = make_vec_env("CartPole-v1", n_envs=4)  # Could be a harder custom env with the same observation/action spaces
model = PPO.load("pretrained_model", env=hard_env)
model.learn(total_timesteps=200000) # Fine-tune with more steps
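An optional refinement for the fine-tuning step (a sketch; custom_objects replaces stored attributes at load time, and reset_num_timesteps keeps the timestep counter running across phases):
model = PPO.load(
    "pretrained_model",
    env=hard_env,
    custom_objects={"learning_rate": 1e-4},  # Lower learning rate for fine-tuning
)
model.learn(total_timesteps=200000, reset_num_timesteps=False)  # Continue the timestep count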
Multi-Environment Training and Hardware Acceleration
Parallel Environment Sampling
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# Use subprocesses for CPU-intensive environments
env = make_vec_env(
"Pendulum-v1",
n_envs=16, # 16 parallel environments
vec_env_cls=SubprocVecEnv # Each in separate process
)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500000)  # Up to ~16x faster sampling, depending on the environment
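One caveat: SubprocVecEnv spawns worker processes, so on platforms that use the spawn start method (Windows, and macOS by default) the code above should live under a main guard:
if __name__ == "__main__":
    env = make_vec_env("Pendulum-v1", n_envs=16, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=500000)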
GPU Acceleration
import torch
# Use the GPU if available, otherwise fall back to CPU (SB3's default device="auto" does the same)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PPO(
"MlpPolicy",
env,
device=device, # Use GPU for network updates
verbose=1
)
print(f"Training on: {model.device}")
Debugging Training: Interpreting Key Metrics
Reward Curves Analysis
Healthy training:
– Steady upward trend in rollout/ep_rew_mean
– Low variance after initial exploration phase
Warning signs:
| Symptom | Possible Cause | Solution |
|---|---|---|
| Flat reward curve | Learning rate too low | Increase LR to 3e-4 or 1e-3 |
| Reward spikes then crashes | Learning rate too high | Decrease LR, add gradient clipping |
| Oscillating rewards | Unstable value function | Lower vf_coef, increase n_steps |
| No improvement after 100k steps | Wrong hyperparameters | Try different algorithm (SAC for continuous) |
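The evaluations.npz file written by EvalCallback (via log_path="./logs/results/" earlier) is a convenient source for these curves. A sketch, assuming matplotlib is installed:
import numpy as np
import matplotlib.pyplot as plt
data = np.load("./logs/results/evaluations.npz")
timesteps = data["timesteps"]
mean_rewards = data["results"].mean(axis=1)  # Average over the n_eval_episodes per evaluation
plt.plot(timesteps, mean_rewards)
plt.xlabel("Timesteps")
plt.ylabel("Mean evaluation reward")
plt.show()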
KL Divergence Monitoring
KL divergence measures policy change between updates:

$$D_{KL}\big(\pi_{\text{old}} \,\|\, \pi_{\text{new}}\big) = \mathbb{E}_{a \sim \pi_{\text{old}}}\!\left[\log \frac{\pi_{\text{old}}(a \mid s)}{\pi_{\text{new}}(a \mid s)}\right]$$
For PPO, monitor train/approx_kl:
- Healthy range: 0.01 – 0.05 (gradual policy updates)
- Too low (< 0.001): Learning too conservatively; increase the learning rate
- Too high (> 0.1): Policy changing too fast; decrease the learning rate or reduce clip_range
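If approx_kl keeps drifting too high, PPO also offers a target_kl parameter that cuts the current round of gradient updates short once the divergence grows too large relative to the target:
model = PPO(
    "MlpPolicy",
    env,
    target_kl=0.05,  # Stop the remaining update epochs early when approx_kl grows well past this target
    verbose=1,
)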
Explained Variance Interpretation
Explained variance measures value function accuracy:
$$EV = 1 - \frac{\mathrm{Var}(G - V)}{\mathrm{Var}(G)}$$

Where:
– $G$: Actual returns
– $V$: Value function predictions
– EV = 1: Perfect predictions
– EV = 0: No better than predicting the mean
– EV < 0: Worse than predicting the mean
Interpretation:
- EV > 0.8: Excellent value function
- 0.5 < EV < 0.8: Acceptable, training progressing
- EV < 0.5: Value function struggling, increase network size or decrease gamma
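A common response to persistently low explained variance is a larger value network. In recent SB3 versions this is set through policy_kwargs with the net_arch dict form; the layer sizes below are illustrative, not a recommendation:
model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=dict(pi=[64, 64], vf=[256, 256])),  # Wider critic than actor
    verbose=1,
)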
Debugging Code Example
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
class DiagnosticCallback(BaseCallback):
"""Monitor training health metrics."""
    def _on_step(self) -> bool:
        # The logger keeps the most recently recorded scalars in name_to_value
        metrics = self.model.logger.name_to_value
        # Check for NaN in the policy loss
        policy_loss = metrics.get('train/policy_gradient_loss', 0.0)
        if np.isnan(policy_loss):
            print("WARNING: NaN detected in policy loss!")
            return False  # Stop training
        # Check KL divergence
        approx_kl = metrics.get('train/approx_kl', 0.0)
        if approx_kl > 0.1:
            print(f"WARNING: High KL divergence ({approx_kl:.4f}). Policy changing too fast.")
        return True
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./logs/")
model.learn(total_timesteps=100000, callback=DiagnosticCallback())
Conclusion
Stable Baselines3 transforms RL research into production-ready systems through its robust implementation and extensive tooling. The key takeaways:
- Algorithm selection matters: Use PPO for general tasks, SAC for sample efficiency, DQN for discrete spaces
- Hyperparameter tuning is critical: Start with defaults, then adjust learning rate, n_steps, and gamma based on your task horizon
- Preprocessing is essential: Always use VecNormalize for continuous control, and scale rewards appropriately
- Monitor training closely: Use TensorBoard and callbacks to catch instabilities early
- Debug systematically: Analyze reward curves, KL divergence, and explained variance to diagnose issues
In the next episode, we’ll dive deep into reward engineering—the art of shaping agent behaviors for complex tasks like financial trading and robotic manipulation. You’ll learn how to design reward functions that guide agents toward desired behaviors without introducing unintended consequences.