Part 3: Policy Gradient vs. Q-Learning: Choosing the Right Agent

Updated Feb 6, 2026

Introduction

In the previous episodes, we explored the theoretical foundations of Reinforcement Learning through Markov Decision Processes and built our first custom Gym environment. Now we face a critical decision: which algorithm should we use to train our agent?

The RL algorithm landscape can be overwhelming, but most modern methods fall into two fundamental paradigms: value-based methods (like Q-Learning and DQN) and policy gradient methods (like REINFORCE and PPO). Understanding the differences between these approaches—and knowing when to use each—is essential for any RL practitioner.

In this episode, we’ll dissect both paradigms, explore their mathematical foundations, examine hybrid actor-critic methods, and provide practical guidance for choosing the right algorithm for your problem.

Value-Based Methods: Learning the Value of Actions

The Q-Learning Foundation

Value-based methods learn a value function that estimates how good it is to be in a given state (or to take a specific action in that state). The agent then derives its policy by simply choosing actions with the highest estimated value.

The core idea is captured in the optimal Q-function Q^*(s, a), which represents the expected cumulative reward for taking action a in state s and following the optimal policy thereafter:

Q^*(s, a) = \mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a\right]

Where:
Q^*(s, a) is the optimal action-value function
r_t is the reward at timestep t
\gamma \in [0, 1] is the discount factor (how much we value future rewards)
\mathbb{E}[\cdot] denotes the expected value
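
As a concrete illustration of deriving a policy from Q-values, here is a minimal ε-greedy action-selection sketch (the function and the example Q-estimates are illustrative, not taken from any particular library):

import numpy as np

def select_action(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit: highest estimated value

# Example: Q-estimates for three actions in some state
q_estimates = np.array([1.2, 3.5, -0.4])
action = select_action(q_estimates, epsilon=0.1)  # returns 1 most of the time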

From Tabular Q-Learning to Deep Q-Networks (DQN)

Classic Q-Learning uses a table to store Q-values for each state-action pair. However, this becomes infeasible for high-dimensional state spaces (like images). Deep Q-Networks (DQN) solve this by using a neural network to approximate the Q-function:

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Deep Q-Network for estimating action values"""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(DQN, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)  # Output Q-value for each action
        )

    def forward(self, state):
        return self.network(state)

The network is trained using the Bellman equation as a loss function:

L(\theta) = \mathbb{E}_{(s,a,r,s')}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]

Where:
\theta are the network parameters
\theta^- are the target network parameters (explained below)
(s, a, r, s') is a transition tuple from the replay buffer
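
To make the loss concrete, here is a hedged sketch of how one minibatch update might be computed with the networks defined in this post (the batch layout and hyperparameters are assumptions; target_net is the target network introduced below):

import torch
import torch.nn.functional as F

def dqn_loss(policy_net, target_net, batch, gamma=0.99):
    """Mean squared TD error over a minibatch of transitions.

    Assumes `batch` unpacks into tensors: states [B, state_dim], actions [B] (long),
    rewards [B], next_states [B, state_dim], dones [B] (1.0 if terminal).
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta) for the actions actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # r + gamma * max_a' Q(s', a'; theta^-), with no bootstrap on terminal states
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)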

Key DQN Innovations

1. Experience Replay

DQN stores transitions in a replay buffer and samples random minibatches for training. This breaks correlation between consecutive samples and improves data efficiency:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples"""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

2. Target Networks

Using the same network to compute both the current Q-value and the target creates instability. DQN uses a separate target network Q(s, a; \theta^-) that is updated periodically:

# Update target network every N steps
if step % target_update_freq == 0:
    target_network.load_state_dict(policy_network.state_dict())

Advanced Value-Based Methods

Double DQN (DDQN)

Standard DQN tends to overestimate Q-values due to the max operation. DDQN addresses this by decoupling action selection from action evaluation:

Y_t^{\text{DDQN}} = r_t + \gamma\, Q\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta_t),\ \theta_t^-\right)

The online network chooses the action, but the target network evaluates it.
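
In code, the switch from DQN to Double DQN is essentially a one-line change in how the target is formed. A minimal sketch, reusing the batch tensors assumed in the loss sketch above:

# Online network selects the action, target network evaluates it
with torch.no_grad():
    best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    targets = rewards + gamma * (1.0 - dones) * next_q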

Dueling DQN

Dueling DQN decomposes the Q-function into a state-value function V(s) and an advantage function A(s, a):

Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')\right)

Where:
V(s) represents the value of being in state s
A(s, a) represents the advantage of taking action a over the average
|\mathcal{A}| is the number of possible actions

This architecture learns which states are valuable independently of the action choice, improving learning efficiency.
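
A minimal sketch of the dueling architecture in PyTorch, mirroring the DQN class above (layer sizes are illustrative):

import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))"""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(DuelingDQN, self).__init__()
        self.feature = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU()
        )
        self.value_head = nn.Linear(hidden_dim, 1)                # V(s)
        self.advantage_head = nn.Linear(hidden_dim, action_dim)   # A(s, a)

    def forward(self, state):
        features = self.feature(state)
        value = self.value_head(features)
        advantage = self.advantage_head(features)
        # Subtract the mean advantage so V and A are separately identifiable
        return value + advantage - advantage.mean(dim=-1, keepdim=True)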

Policy Gradient Methods: Directly Optimizing the Policy

The Policy Gradient Theorem

Unlike value-based methods, policy gradient approaches directly parameterize the policy \pi(a|s; \theta) and optimize it using gradient ascent on the expected return:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \gamma^t r_t\right]

Where:
\tau = (s_0, a_0, r_0, s_1, \ldots) is a trajectory
\pi_\theta is the policy parameterized by \theta

The policy gradient theorem shows how to compute the gradient of this objective:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]

Where:
G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k} is the return from timestep t
\nabla_\theta \log \pi_\theta(a_t|s_t) is the gradient of the log-probability of the chosen action

REINFORCE: The Vanilla Policy Gradient

REINFORCE is the simplest policy gradient algorithm, directly implementing the policy gradient theorem:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class PolicyNetwork(nn.Module):
    """Stochastic policy for continuous or discrete actions"""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)  # For discrete actions
        )

    def forward(self, state):
        return self.network(state)

def train_reinforce(policy, optimizer, trajectories, gamma=0.99):
    """Update policy using REINFORCE algorithm"""
    policy_loss = []

    for trajectory in trajectories:
        states, actions, rewards = trajectory

        # Compute discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)

        # Normalize returns (reduces variance)
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)

        # Compute policy gradient
        for state, action, G in zip(states, actions, returns):
            action_probs = policy(state)
            log_prob = torch.log(action_probs[action])
            policy_loss.append(-log_prob * G)  # Negative for gradient ascent

    optimizer.zero_grad()
    loss = torch.stack(policy_loss).sum()
    loss.backward()
    optimizer.step()

Reducing Variance: The Advantage Function

REINFORCE suffers from high variance because returns can vary significantly. We can reduce variance by subtracting a baseline b(s):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t^i|s_t^i) \cdot \left(G_t^i - b(s_t^i)\right)

A natural choice for the baseline is the state-value function V(s), giving us the advantage function:

A(s, a) = Q(s, a) - V(s)

The advantage represents how much better action a is compared to the average action in state s.
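
As a sketch of how a learned baseline could plug into the REINFORCE update above (the value network and helper below are illustrative additions, not part of the original code):

import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim = 4                        # e.g., CartPole observation size (assumption)
value_net = nn.Sequential(           # a simple learned baseline V(s)
    nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, 1)
)

def advantages_with_baseline(states, returns):
    """A_t = G_t - V(s_t); detach V so the policy gradient does not flow into it."""
    values = value_net(states).squeeze(-1)
    advantages = returns - values.detach()
    value_loss = F.mse_loss(values, returns)   # the baseline is trained by regression
    return advantages, value_loss

The policy loss then weights each log-probability by the advantage instead of the raw return, while value_loss is minimized separately.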

Proximal Policy Optimization (PPO)

PPO is the current gold standard for policy gradient methods. It prevents destructively large policy updates using a clipped surrogate objective:

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t\right)\right]

Where:
r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} is the probability ratio
\hat{A}_t is the estimated advantage
\epsilon is the clipping parameter (typically 0.2)

The clipping ensures the new policy doesn’t deviate too far from the old one. In practice, Stable Baselines3 provides a well-tested implementation:

from stable_baselines3 import PPO
import gymnasium as gym

# Training a PPO agent (from Stable Baselines3)
env = gym.make('CartPole-v1')
model = PPO(
    'MlpPolicy',
    env,
    learning_rate=3e-4,
    n_steps=2048,        # Rollout length
    batch_size=64,
    n_epochs=10,         # Optimization epochs per rollout
    clip_range=0.2,      # PPO clipping parameter
    verbose=1
)

model.learn(total_timesteps=100000)
model.save('ppo_cartpole')
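
Under the hood, the clipped surrogate objective itself is only a few lines. A hedged sketch of how it might be computed, assuming log-probabilities under the old and new policies and advantage estimates are already available as tensors:

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective (a loss to be minimized)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()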

Advantage Actor-Critic (A2C/A3C)

A2C combines policy gradients with value function learning. It maintains two networks:
Actor: the policy \pi_\theta(a|s)
Critic: the value function V_\phi(s)

The critic provides the baseline for advantage estimation:

A(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)

This is called Temporal Difference (TD) learning because it uses the difference between consecutive value estimates.
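
A minimal sketch of an A2C-style update built on this one-step TD advantage (the function and its inputs are illustrative assumptions, not the exact Stable Baselines3 implementation):

import torch
import torch.nn.functional as F

def a2c_losses(log_probs, values, next_values, rewards, dones, gamma=0.99):
    """Actor and critic losses from one batch of transitions."""
    # One-step TD target and advantage: r_t + gamma * V(s_{t+1}) - V(s_t)
    with torch.no_grad():
        td_target = rewards + gamma * (1.0 - dones) * next_values
    advantages = td_target - values.detach()

    actor_loss = -(log_probs * advantages).mean()    # policy gradient term
    critic_loss = F.mse_loss(values, td_target)      # value regression term
    return actor_loss, critic_loss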

Actor-Critic Methods: The Best of Both Worlds

Actor-critic methods combine value-based and policy-based approaches to achieve better sample efficiency and stability.

Soft Actor-Critic (SAC)

SAC is an off-policy actor-critic algorithm designed for continuous control. It maximizes both the expected return and the policy entropy:

J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^T \gamma^t \left(r_t + \alpha\, \mathcal{H}(\pi(\cdot|s_t))\right)\right]

Where:
\mathcal{H}(\pi(\cdot|s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s_t)] is the policy entropy
\alpha is the temperature parameter controlling exploration

The entropy term encourages exploration and prevents premature convergence to suboptimal policies.

from stable_baselines3 import SAC

# SAC is ideal for continuous control tasks
env = gym.make('Pendulum-v1')
model = SAC(
    'MlpPolicy',
    env,
    learning_rate=3e-4,
    buffer_size=1000000,
    learning_starts=100,
    batch_size=256,
    tau=0.005,            # Target network soft update rate
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    ent_coef='auto',      # Automatic entropy tuning
    verbose=1
)

model.learn(total_timesteps=50000)

Twin Delayed DDPG (TD3)

TD3 improves upon DDPG by addressing overestimation bias and variance. It uses three key tricks (see the sketch after this list):
1. Twin critics: two Q-networks, taking the minimum to reduce overestimation
2. Delayed policy updates: update the actor less frequently than critics
3. Target policy smoothing: add noise to target actions to regularize
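
A hedged sketch of how the TD3 critic target could be formed, combining twin critics and target policy smoothing (all names and noise parameters are illustrative):

import torch

def td3_target(actor_target, critic1_target, critic2_target,
               rewards, next_states, dones,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Clipped double-Q target used to train both critics."""
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise
        next_actions = actor_target(next_states)
        noise = (torch.randn_like(next_actions) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-max_action, max_action)

        # Twin critics: take the minimum to curb overestimation
        q1 = critic1_target(next_states, next_actions)
        q2 = critic2_target(next_states, next_actions)
        next_q = torch.min(q1, q2)

        return rewards + gamma * (1.0 - dones) * next_q

The actor itself is updated only every few critic updates (the "delayed" part), typically once per two critic steps.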

Comparison: When to Use Which Algorithm

Algorithm Selection Table

Criterion                  | DQN/DDQN                      | PPO                        | SAC/TD3
Action Space               | Discrete only                 | Both                       | Continuous only
Sample Efficiency          | High (off-policy)             | Medium (on-policy)         | High (off-policy)
Stability                  | Medium                        | High                       | High
Hyperparameter Sensitivity | High                          | Low                        | Medium
Computational Cost         | Low                           | Medium                     | High
Exploration                | ε-greedy                      | Stochastic policy          | Entropy maximization
Use Cases                  | Atari games, discrete control | Robotics, general-purpose  | Continuous control, robotics

Discrete vs. Continuous Actions

Discrete actions (e.g., “left”, “right”, “jump”):
– Use DQN variants for simple environments
– Use PPO or A2C for more complex tasks or when stability is critical

Continuous actions (e.g., motor torques, steering angles):
– Use SAC for sample-efficient learning with strong exploration
– Use TD3 for slightly faster training with less hyperparameter tuning
– Use PPO when you need stability and can afford more samples

Sample Efficiency vs. Stability Trade-off

Off-policy algorithms (DQN, SAC, TD3) learn from any past experience stored in a replay buffer. This makes them sample-efficient but potentially less stable.

On-policy algorithms (PPO, A2C) only learn from recently collected data. This makes them more stable but less sample-efficient.

Practical Code Example: Algorithm Comparison

import gymnasium as gym
from stable_baselines3 import DQN, PPO, SAC
from stable_baselines3.common.evaluation import evaluate_policy
import time

def benchmark_algorithm(env_name, algorithm_class, total_timesteps=50000):
    """Benchmark an RL algorithm on a given environment"""
    env = gym.make(env_name)

    start_time = time.time()
    model = algorithm_class('MlpPolicy', env, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    training_time = time.time() - start_time

    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)

    print(f"{algorithm_class.__name__} on {env_name}:")
    print(f"  Training time: {training_time:.2f}s")
    print(f"  Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")

    env.close()
    return mean_reward, training_time

# Compare algorithms on CartPole (discrete actions)
print("=== Discrete Action Environment ===")
benchmark_algorithm('CartPole-v1', DQN, total_timesteps=50000)
benchmark_algorithm('CartPole-v1', PPO, total_timesteps=50000)

# Compare algorithms on Pendulum (continuous actions)
print("\n=== Continuous Action Environment ===")
benchmark_algorithm('Pendulum-v1', PPO, total_timesteps=50000)
benchmark_algorithm('Pendulum-v1', SAC, total_timesteps=50000)

Common Pitfalls and Solutions

1. Reward Hacking

Problem: The agent finds unintended ways to maximize reward that don’t align with the true objective.

Example: In a racing game, the agent might drive in circles to collect repeated checkpoint rewards instead of completing the track.

Solutions:
– Carefully design reward functions (see Episode 5 on Reward Engineering)
– Use reward shaping to guide behavior (see the sketch after this list)
– Add auxiliary tasks or constraints
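
One concrete form of reward shaping is potential-based shaping, which adds \gamma \Phi(s') - \Phi(s) to the environment reward and is known to preserve the optimal policy. A minimal sketch (the distance-to-goal potential here is a hypothetical example):

def shaped_reward(reward, state, next_state, potential, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)

# Hypothetical potential: negative distance to a goal position at x = 10
def distance_potential(state):
    return -abs(10.0 - state[0])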

2. Catastrophic Forgetting

Problem: On-policy algorithms forget how to handle previously mastered states when the policy changes.

Solutions:
– Use off-policy algorithms (DQN, SAC) with large replay buffers
– Implement experience replay prioritization
– Use smaller learning rates for policy updates

3. Exploration-Exploitation Dilemma

Problem: The agent gets stuck in local optima by exploiting known strategies without exploring better alternatives.

Solutions:
– For DQN: use ε-greedy with decaying ε
– For policy gradients: ensure sufficient policy entropy (use SAC or add entropy bonus)
– Use curiosity-driven exploration or intrinsic rewards

# Example: ε-greedy exploration schedule
import numpy as np

class EpsilonSchedule:
    """Exponentially decaying epsilon for exploration"""
    def __init__(self, start=1.0, end=0.01, decay_steps=10000):
        self.start = start
        self.end = end
        self.decay_steps = decay_steps

    def get_epsilon(self, step):
        epsilon = self.end + (self.start - self.end) * \
                  np.exp(-step / self.decay_steps)
        return epsilon

# Usage
schedule = EpsilonSchedule(start=1.0, end=0.01, decay_steps=10000)
for step in [0, 1000, 5000, 10000, 20000]:
    print(f"Step {step}: ε = {schedule.get_epsilon(step):.4f}")

4. Overestimation Bias

Problem: Q-learning tends to overestimate action values, leading to suboptimal policies.

Solutions:
– Use Double DQN instead of standard DQN
– Use TD3’s twin critics for continuous control
– Apply target network smoothing

Mathematical Foundations Summary

Temporal Difference Learning

TD learning updates value estimates based on other value estimates (bootstrapping):

V(s_t) \leftarrow V(s_t) + \alpha \left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]

Where:
\alpha is the learning rate
r_t + \gamma V(s_{t+1}) - V(s_t) is the TD error
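
For intuition, here is a minimal tabular TD(0) sketch for estimating V under a fixed policy (the environment is assumed to have a small discrete observation space; names are illustrative):

import numpy as np

def td0_value_estimation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate V(s) for a fixed policy with one-step TD updates."""
    V = np.zeros(env.observation_space.n)

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # TD error: r_t + gamma * V(s_{t+1}) - V(s_t) (no bootstrap past terminal states)
            td_error = reward + gamma * V[next_state] * (not terminated) - V[state]
            V[state] += alpha * td_error
            state = next_state
    return V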

Policy Gradient Theorem (Detailed)

The gradient of the expected return with respect to policy parameters is:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s, a)\right]

Where:
d^\pi(s) is the stationary distribution of states under policy \pi
Q^\pi(s, a) is the action-value function under policy \pi

This elegant result shows we can optimize the policy without computing the gradient of the environment dynamics—only the gradient of the policy itself.

Practical Implementation Tips

  1. Start with PPO: It’s the most robust general-purpose algorithm
  2. Use Stable Baselines3: Don’t implement from scratch unless necessary
  3. Tune hyperparameters systematically: Learning rate, batch size, and network architecture matter most
  4. Monitor training metrics: Track episode reward, value loss, policy loss, and entropy
  5. Visualize learned policies: Use env.render() or record videos to debug behavior

Here’s an example of wiring up evaluation and checkpoint callbacks with Stable Baselines3:

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
from stable_baselines3.common.monitor import Monitor

# Setup monitoring and checkpoints
env = gym.make('LunarLander-v2')
env = Monitor(env, './logs/')  # Log episode statistics

eval_callback = EvalCallback(
    env,
    best_model_save_path='./best_model/',
    log_path='./eval_logs/',
    eval_freq=5000,
    deterministic=True,
    render=False
)

checkpoint_callback = CheckpointCallback(
    save_freq=10000,
    save_path='./checkpoints/'
)

model = PPO('MlpPolicy', env, verbose=1)
model.learn(
    total_timesteps=200000,
    callback=[eval_callback, checkpoint_callback]
)

Conclusion

Choosing between policy gradient and Q-learning methods is one of the most important decisions in RL. Here’s the quick decision tree:

  • Discrete actions + simple environment → DQN/DDQN
  • Discrete actions + complex environment → PPO
  • Continuous actions + need sample efficiency → SAC
  • Continuous actions + need stability → PPO or TD3
  • Not sure → PPO (best all-around choice)

The key takeaway is that no single algorithm dominates all scenarios. Value-based methods excel at discrete action spaces with high sample efficiency. Policy gradient methods provide stability and handle continuous actions naturally. Actor-critic methods combine both strengths but add complexity.

In the next episode, we’ll dive into Stable Baselines3 in depth, covering practical tips for hyperparameter tuning, custom feature extractors, and training robust agents that generalize well. We’ll also explore advanced techniques like reward scaling, observation normalization, and curriculum learning.

By understanding the fundamental trade-offs between these algorithm families, you’re now equipped to make informed decisions about which method to apply to your specific RL problem. Remember: the best algorithm is the one that solves your problem reliably—not necessarily the one with the most citations or the flashiest name.

Deep Reinforcement Learning: From Theory to Custom Environments Series (3/6)
