Policy Gradient Methods: PPO and A3C for Complex Game Environments

Updated Feb 6, 2026

Why Value-Based Methods Hit a Wall

DQN works brilliantly for Atari games with discrete action spaces—press left, right, jump, or fire. But what happens when your game character needs to decide how hard to accelerate (any value from 0.0 to 1.0) and which direction to steer (continuous angle from -180° to +180°) at the same time?

You could discretize the action space into a grid. Maybe 10 acceleration levels × 36 steering angles = 360 discrete actions. But now your Q-network has to output 360 values every frame, and most of those Q-values will be garbage because the state space exploded. Worse, the “optimal” action might fall between your grid points—steering at 27° when you only have 0°, 10°, 20°, 30° available.

This is where policy gradient methods shine. Instead of learning which action is best (the value), they learn how to act directly (the policy). The network outputs a probability distribution over actions, and you sample from it. For continuous actions, it might output the mean and variance of a Gaussian distribution—suddenly, your agent can steer at any angle, not just the 36 you hardcoded.


The Policy Gradient Theorem (And Why It’s Not Obvious)

The core idea sounds simple: if an action led to a good outcome, increase its probability. If it led to a bad outcome, decrease it. But “good” and “bad” are relative—getting +10 reward might be terrible if you usually get +50.

The math formalizes this as the policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$$

Here, $\theta$ are your network weights, $\pi_\theta(a_t \mid s_t)$ is the probability of taking action $a_t$ in state $s_t$, and $A_t$ is the advantage—how much better this action was compared to the average. The gradient tells you how to adjust weights to make good actions more likely.

The trick is that $\log \pi_\theta(a_t \mid s_t)$ term. The probability itself is already differentiable; taking the log keeps the gradient numerically well-behaved and turns the product of probabilities along a trajectory into a sum, which plays nicely with backpropagation. When $A_t$ is positive (good outcome), the gradient pushes weights to increase $\log \pi$, which means increasing the probability. When $A_t$ is negative, it does the opposite.

But here’s the catch: this gradient estimator has massive variance. One lucky episode where you randomly tried something that worked will dominate the gradient, even if that action is usually terrible. Early policy gradient methods (REINFORCE, from 1992) were notoriously sample-inefficient—you’d need millions of episodes to train anything non-trivial.
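In code, the theorem collapses to one small loss term. Here is a minimal sketch for discrete actions (the function and argument names are placeholders for illustration, not from any particular library):

import torch

def policy_gradient_loss(logits, actions, advantages):
    """REINFORCE-style loss: logits (batch, num_actions) from the policy net,
    actions (batch,) actually taken, advantages (batch,) estimates of A_t."""
    log_probs = torch.log_softmax(logits, dim=-1)
    action_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Minus sign: optimizers minimize, but we want to *increase* the
    # log-probability of actions that came with a positive advantage.
    return -(action_log_probs * advantages.detach()).mean()

This is essentially the actor_loss you'll see in the A3C and PPO listings below; everything else in those algorithms is about producing a lower-variance $A_t$ to feed into it.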

Actor-Critic: The Hybrid That Actually Works

The variance problem comes from using the full return $G_t = \sum_{k=t}^{T} r_k$ as your advantage estimate. If your episode lasts 1000 steps and reward is sparse, $G_t$ is basically noise.

Actor-critic methods split the work. The actor is your policy network (“what should I do?”). The critic is a value network that estimates $V(s)$—the expected return from state $s$ under the current policy. Now your advantage becomes:

$$A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

This is the TD error from Part 2, but used to guide the policy instead of updating Q-values directly. The critic provides a baseline that dramatically reduces variance because it’s learned from many episodes, not just the current one.
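As a sketch (assuming a critic module that maps a state tensor to a scalar value estimate, like the one in the listing below), the one-step advantage is just:

import torch

def td_advantage(critic, state, reward, next_state, done, gamma=0.99):
    """One-step TD-error advantage: r_t + gamma * V(s_{t+1}) - V(s_t)."""
    with torch.no_grad():
        next_value = critic(next_state)  # bootstrap target, no gradient
    # done = 1.0 at terminal states so we don't bootstrap past the episode end
    return reward + gamma * next_value * (1.0 - done) - critic(state)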

A3C (Asynchronous Advantage Actor-Critic) took this further by running multiple agents in parallel, each exploring different parts of the game. Every few steps, they send their gradients to a central network. This parallelism both speeds up training and decorrelates the data—if one agent gets stuck in a local optimum (always hiding in a corner), the others are probably doing something different.

Here’s a simplified A3C training loop (the real thing involves threading and shared memory, which is a pain):

import torch
import torch.nn as nn
import torch.optim as optim
import gym
import numpy as np

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU()
        )
        self.actor = nn.Linear(128, action_dim)  # policy logits
        self.critic = nn.Linear(128, 1)  # value estimate

    def forward(self, state):
        features = self.shared(state)
        logits = self.actor(features)
        value = self.critic(features)
        return logits, value

def compute_returns(rewards, values, gamma=0.99):
    """Compute discounted returns and advantages."""
    returns = []
    advantages = []
    R = values[-1]  # bootstrap from last value

    # rewards has T entries, values has T+1 (the extra entry is the bootstrap),
    # so pair every reward with the value of the state it was collected from
    for r, v in zip(reversed(rewards), reversed(values[:-1])):
        R = r + gamma * R
        returns.insert(0, R)
        advantages.insert(0, R - v)

    return returns, advantages

# Training loop (single worker, no async for clarity)
env = gym.make('CartPole-v1')
model = ActorCritic(state_dim=4, action_dim=2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for episode in range(500):
    state, _ = env.reset()
    states, actions, rewards, values = [], [], [], []
    done = False

    # Rollout for N steps
    for step in range(200):
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        logits, value = model(state_tensor)

        probs = torch.softmax(logits, dim=-1)
        action = torch.multinomial(probs, 1).item()

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        values.append(value.item())

        state = next_state
        if done:
            break

    # Bootstrap final value
    if not done:
        with torch.no_grad():
            _, final_value = model(torch.FloatTensor(state).unsqueeze(0))
            values.append(final_value.item())
    else:
        values.append(0.0)

    returns, advantages = compute_returns(rewards, values)

    # Compute loss
    states_tensor = torch.FloatTensor(states)
    actions_tensor = torch.LongTensor(actions)
    returns_tensor = torch.FloatTensor(returns)
    advantages_tensor = torch.FloatTensor(advantages)

    logits, values_pred = model(states_tensor)
    log_probs = torch.log_softmax(logits, dim=-1)
    action_log_probs = log_probs.gather(1, actions_tensor.unsqueeze(1)).squeeze()

    actor_loss = -(action_log_probs * advantages_tensor).mean()
    critic_loss = ((values_pred.squeeze() - returns_tensor) ** 2).mean()
    loss = actor_loss + 0.5 * critic_loss  # weighted combination

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if episode % 50 == 0:
        print(f"Episode {episode}, Total Reward: {sum(rewards):.1f}, Loss: {loss.item():.3f}")

This runs on CartPole (a simple balancing task), but the structure scales to complex games. The shared feature extractor is where you’d plug in a CNN for visual input, like in Atari.
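For example, an Atari-style variant might swap the two Linear layers for the classic DQN convolutional stack. A sketch, assuming the usual preprocessing of 4 stacked 84×84 grayscale frames scaled to [0, 1]:

import torch
import torch.nn as nn

class ConvActorCritic(nn.Module):
    """Same actor-critic split as above, but with a CNN feature extractor
    for stacked 84x84 grayscale frames (assumed preprocessing)."""
    def __init__(self, action_dim, in_channels=4):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.actor = nn.Linear(512, action_dim)
        self.critic = nn.Linear(512, 1)

    def forward(self, frames):  # frames: (batch, 4, 84, 84)
        features = self.shared(frames)
        return self.actor(features), self.critic(features)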

One gotcha: compute_returns() bootstraps from the last value estimate if the episode didn’t finish (line where R = values[-1]). If you forget this and always set R = 0, your agent thinks unfinished episodes are worthless and never learns long-term strategies. Took me longer than I’d like to admit to catch that bug once.

PPO: The Algorithm Everyone Actually Uses

A3C was a breakthrough in 2016, but it’s fiddly to implement (asyncio in Python? threads in PyTorch? good luck) and the parallel workers can step on each other’s gradients. PPO (Proximal Policy Optimization) arrived in 2017 and immediately became the default choice.

The core insight: policy gradient updates can be unstable. If your gradient is too large, the policy changes drastically, and suddenly your agent forgets everything and starts flailing randomly. A3C has no safeguard against this.

PPO adds a clipping mechanism to the loss function:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \cdot A_t, \, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot A_t \right) \right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio between the new policy and the old one. If $r_t$ is between $1-\epsilon$ and $1+\epsilon$ (typically $\epsilon = 0.2$), the update proceeds normally. If $r_t$ drifts outside that range—meaning the policy is changing too fast—the gradient gets clipped.

This makes PPO incredibly robust. You can use larger learning rates, bigger batches, and multiple epochs per batch of data without the policy collapsing. It’s also embarrassingly parallel: collect data from N environments, shuffle it all together, run multiple SGD epochs on the combined batch.
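Collecting that data doesn't require threads; gym's vector API steps N copies of the environment in lockstep. A rough sketch, assuming gym >= 0.26 (or gymnasium) and using a stand-in policy head where the real actor-critic would go:

import gym
import torch
import torch.nn as nn

# Eight CartPole copies stepped together; the vector env auto-resets
# finished episodes, so the data stream never stops.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])

# Stand-in policy head (the real actor-critic is defined in the listing below).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

obs, _ = envs.reset(seed=0)
for _ in range(256):  # 256 steps x 8 envs = 2048 transitions per batch
    with torch.no_grad():
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        actions = torch.distributions.Categorical(logits=logits).sample()
    obs, rewards, terminated, truncated, _ = envs.step(actions.numpy())
    # ...store obs, actions, log-probs, rewards, and dones for the PPO update...

The listing below sticks to a single environment to keep the focus on the update itself; swapping in a vector env only changes this collection step.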

Here’s a minimal PPO implementation (stripped down for clarity, real implementations add entropy bonuses, generalized advantage estimation, etc.):

import torch
import torch.nn as nn
import torch.optim as optim
import gym
import numpy as np

class PPOActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh()
        )
        self.actor = nn.Linear(64, action_dim)
        self.critic = nn.Linear(64, 1)

    def forward(self, state):
        features = self.shared(state)
        return self.actor(features), self.critic(features)

    def get_action(self, state):
        logits, value = self.forward(state)
        probs = torch.softmax(logits, dim=-1)
        action = torch.multinomial(probs, 1)
        log_prob = torch.log(probs.gather(1, action))
        return action.item(), log_prob, value

def ppo_update(model, optimizer, states, actions, old_log_probs, returns, advantages, clip_eps=0.2, epochs=4):
    for _ in range(epochs):
        logits, values = model(states)
        probs = torch.softmax(logits, dim=-1)
        new_log_probs = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze()

        ratio = torch.exp(new_log_probs - old_log_probs)
        clipped_ratio = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)

        actor_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
        critic_loss = ((values.squeeze() - returns) ** 2).mean()

        loss = actor_loss + 0.5 * critic_loss

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # gradient clipping
        optimizer.step()

# Training
env = gym.make('CartPole-v1')
model = PPOActorCritic(state_dim=4, action_dim=2)
optimizer = optim.Adam(model.parameters(), lr=3e-4)

for iteration in range(100):
    states, actions, log_probs, rewards, values, dones = [], [], [], [], [], []

    state, _ = env.reset()
    for _ in range(2048):  # collect batch
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        action, log_prob, value = model.get_action(state_tensor)

        next_state, reward, terminated, truncated, _ = env.step(action)

        states.append(state)
        actions.append(action)
        log_probs.append(log_prob.item())
        rewards.append(reward)
        values.append(value.item())
        dones.append(terminated or truncated)

        state = next_state
        if terminated or truncated:
            state, _ = env.reset()

    # Compute returns and advantages (simplified: Monte Carlo returns, no GAE)
    returns = []
    R = 0.0
    for r, d in zip(reversed(rewards), reversed(dones)):
        if d:
            R = 0.0  # don't let returns bleed across episode boundaries
        R = r + 0.99 * R
        returns.insert(0, R)
    returns = torch.FloatTensor(returns)
    values_tensor = torch.FloatTensor(values)
    advantages = returns - values_tensor
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # normalize

    states = torch.FloatTensor(states)
    actions = torch.LongTensor(actions)
    old_log_probs = torch.FloatTensor(log_probs)

    ppo_update(model, optimizer, states, actions, old_log_probs, returns, advantages)

    if iteration % 10 == 0:
        # rewards are per-step; divide by the number of finished episodes
        # to get an (approximate) average episode return
        episodes_finished = max(1, sum(dones))
        avg_return = sum(rewards) / episodes_finished
        print(f"Iteration {iteration}, Avg Episode Return: {avg_return:.2f}")

The torch.nn.utils.clip_grad_norm_ call inside ppo_update() is another stabilizer—it prevents individual gradients from exploding, which can happen when the advantage estimate is way off. Without it, you’ll occasionally see training collapse mid-run (reward suddenly drops to near-zero and never recovers).

The advantage normalization in the training loop (just before ppo_update is called) is also crucial. Raw advantages might range from -100 to +500 depending on the game’s reward scale, which makes the clipping threshold $\epsilon = 0.2$ meaningless. Normalizing ensures the advantages are centered around zero with unit variance, so the clipping actually does its job.
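One more thing the stripped-down loop omits is the entropy bonus mentioned earlier. Adding the policy's entropy to the objective (weighted by a small coefficient) keeps the policy from collapsing into a near-deterministic distribution before it has explored enough. A sketch of how it would combine with the clipped loss; the 0.01 coefficient is a common starting point, not a value taken from the code above:

import torch

def ppo_loss_with_entropy(ratio, advantages, values, returns,
                          probs, clip_eps=0.2, ent_coef=0.01):
    """Same clipped loss as ppo_update above, plus an entropy bonus."""
    clipped_ratio = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    actor_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
    critic_loss = ((values.squeeze() - returns) ** 2).mean()
    entropy = torch.distributions.Categorical(probs=probs).entropy().mean()
    # Subtracting entropy from the loss rewards a more spread-out policy.
    return actor_loss + 0.5 * critic_loss - ent_coef * entropy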

When to Use PPO vs A3C (And When to Avoid Both)

PPO is the safe default. It’s stable, sample-efficient enough for most games, and parallelizes trivially—just run multiple gym environments and collect data from all of them. OpenAI used PPO to train Dota 2 agents that beat human pros (though they also threw 128,000 CPU cores at it for 10 months).

A3C still has a niche for memory-constrained setups. Because each worker trains independently and only sends gradient updates (not full batches of data), the central server needs less RAM. If you’re training on a cluster where network bandwidth is expensive, A3C’s smaller messages are a win. But honestly, unless you have that specific constraint, PPO is easier to debug and tune.

Neither is great for exploration-heavy games. If your game requires trying a precise sequence of 50 actions to get any reward at all (think puzzle games or strategy games with sparse objectives), policy gradients will flail. You need either:

  1. Reward shaping: Add intermediate rewards to guide the agent (“you picked up the key: +1”); see the wrapper sketch after this list.
  2. Curriculum learning: Start with easier levels, gradually increase difficulty (we’ll cover this in Part 5).
  3. Hybrid methods: Combine policy gradients with Q-learning or use intrinsic motivation (curiosity-driven exploration).
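For option 1, reward shaping usually lives in an environment wrapper rather than in the algorithm. A sketch of the idea; the info["has_key"] flag is hypothetical, standing in for whatever progress signal your game actually exposes:

import gym

class KeyPickupShaping(gym.Wrapper):
    """Adds a small intermediate reward the first time the agent grabs the key."""
    def __init__(self, env, bonus=1.0):
        super().__init__(env)
        self.bonus = bonus
        self._rewarded = False

    def reset(self, **kwargs):
        self._rewarded = False
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if info.get("has_key") and not self._rewarded:
            reward += self.bonus  # shaped reward on top of the sparse game reward
            self._rewarded = True
        return obs, reward, terminated, truncated, info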

I’ve also seen PPO struggle with very long episodes (10,000+ steps). The advantage estimates get noisy because you’re bootstrapping from a value function that itself is uncertain. Truncating episodes or using n-step returns can help, but at some point you might need model-based RL (learning a world model and planning within it).
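A common middle ground, which the minimal PPO loop above skips, is generalized advantage estimation (GAE): it blends one-step TD errors with longer-horizon returns through a decay factor lambda, trading bias against variance without committing to full Monte Carlo returns. A sketch with the usual defaults (gamma=0.99, lam=0.95):

import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one collected batch.

    rewards, values, dones are per-step lists; last_value is the critic's
    estimate for the state after the final collected step."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return torch.tensor(advantages), torch.tensor(returns)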

Continuous Control: Where Policy Gradients Dominate

The real killer app for PPO is robotics and continuous control. Your game character needs to control 18 joint angles simultaneously? DQN would need to discretize that into a combinatorially explosive action space. PPO just outputs 18 Gaussian means and variances and samples from them.

Here’s a snippet for continuous actions (adapted for MuJoCo-style physics sims):

class ContinuousActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU()
        )
        self.actor_mean = nn.Linear(256, action_dim)
        self.actor_logstd = nn.Parameter(torch.zeros(action_dim))  # learnable std
        self.critic = nn.Linear(256, 1)

    def forward(self, state):
        features = self.shared(state)
        mean = self.actor_mean(features)
        std = torch.exp(self.actor_logstd)  # ensure positive
        value = self.critic(features)
        return mean, std, value

    def get_action(self, state):
        mean, std, value = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(dim=-1)  # sum over action dims
        return action.numpy(), log_prob, value

Making actor_logstd a learnable parameter (rather than a network output) is a trick carried over from the original PPO paper's continuous-control setup. A state-independent, slowly-changing log-std makes it much harder for the policy to collapse its standard deviation to zero (which would make it deterministic and kill exploration). Tutorials rarely emphasize this, but if you make the std a network output with no constraints, training often stalls.
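If you do want a state-dependent standard deviation (sometimes useful when the right amount of exploration varies by situation), the usual workaround is to clamp the log-std head to a sane range. A sketch of that variant; the -20 and 2 bounds are conventions borrowed from SAC-style implementations, not from the code above:

import torch
import torch.nn as nn

class StateDependentStdHead(nn.Module):
    """Alternative actor head: mean and log-std both come from the network,
    with log-std clamped so the policy can't collapse or blow up."""
    def __init__(self, feature_dim, action_dim, log_std_min=-20.0, log_std_max=2.0):
        super().__init__()
        self.mean = nn.Linear(feature_dim, action_dim)
        self.log_std = nn.Linear(feature_dim, action_dim)
        self.log_std_min, self.log_std_max = log_std_min, log_std_max

    def forward(self, features):
        mean = self.mean(features)
        log_std = torch.clamp(self.log_std(features), self.log_std_min, self.log_std_max)
        return mean, torch.exp(log_std)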

You can also use Beta distributions (bounded between 0 and 1, useful for throttle/steering) or a squashed Gaussian (a tanh applied to a Normal sample, as used in SAC). The math changes slightly but the PPO loss structure stays the same.
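For the squashed Gaussian, the "slight change" is a correction term in the log-probability, because squashing through tanh changes the density. A sketch of the sampling step under that assumption:

import torch

def sample_squashed_gaussian(mean, std):
    """Sample a tanh-squashed Gaussian action in (-1, 1) and its log-prob."""
    dist = torch.distributions.Normal(mean, std)
    u = dist.rsample()       # pre-squash sample
    action = torch.tanh(u)   # bounded action
    log_prob = dist.log_prob(u).sum(dim=-1)
    # Change-of-variables correction for the tanh squashing; without it the
    # probability ratio would be computed against the wrong density.
    log_prob -= torch.log(1.0 - action.pow(2) + 1e-6).sum(dim=-1)
    return action, log_prob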

What You Actually Need to Remember

Policy gradients learn the policy directly, not the value function. This makes them the only practical choice for continuous action spaces. A3C parallelizes training with multiple workers sending gradient updates. PPO adds a clipped objective and multiple epochs per batch, making it more stable and data-efficient.

For discrete action games (most 2D games, turn-based strategy), DQN is still simpler to implement and debug. For anything with continuous actions (racing games, flight sims, robotic control), PPO is the starting point.

The real challenge isn’t the algorithm—it’s the reward function, hyperparameters, and network architecture. PPO is robust, but “robust” means “fails gracefully” not “works out of the box.” You’ll still spend hours tuning entropy coefficients, discount factors, and advantage normalization. My best guess is that 80% of RL projects fail not because the algorithm is wrong, but because the reward signal is too sparse or too dense.

Next up: multi-agent RL, where your game has multiple AI players competing or cooperating, and the optimal strategy depends on what everyone else is doing. That’s when things get properly weird.

Game AI with Reinforcement Learning Series (4/5)
