Advanced Game AI: Multi-Agent RL, Curriculum Learning, and Self-Play

Updated Feb 6, 2026

When One Agent Isn’t Enough

AlphaGo didn’t beat Lee Sedol by training against static opponents. It played millions of games against itself, evolving strategies that humans had never considered. OpenAI Five demolished Dota 2 pros using five separate agents that learned to cooperate through nothing but shared rewards. These aren’t just scaling tricks—they’re fundamentally different training paradigms that break assumptions we’ve relied on throughout this series.

Here’s a working implementation of multi-agent PPO with self-play for a simple competitive game. This is the final code, not a toy example:

import torch
import torch.nn as nn
import numpy as np
from collections import deque
import random

class SharedBackbone(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.LayerNorm(hidden),  # critical for multi-agent stability
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.Tanh()
        )

    def forward(self, x):
        return self.net(x)

class MultiAgentPPO(nn.Module):
    def __init__(self, obs_dim, act_dim, n_agents=2, lr=3e-4):
        super().__init__()  # nn.Module base gives us state_dict()/load_state_dict() for snapshotting
        self.n_agents = n_agents
        self.backbone = SharedBackbone(obs_dim)

        # Separate heads per agent (important: don't share policy heads)
        self.actors = nn.ModuleList([nn.Linear(256, act_dim) for _ in range(n_agents)])
        self.critics = nn.ModuleList([nn.Linear(256, 1) for _ in range(n_agents)])

        # Single optimizer for all parameters
        all_params = list(self.backbone.parameters())
        for i in range(n_agents):
            all_params += list(self.actors[i].parameters())
            all_params += list(self.critics[i].parameters())
        self.optimizer = torch.optim.Adam(all_params, lr=lr)

        self.clip_eps = 0.2
        self.gamma = 0.99
        self.lam = 0.95  # GAE lambda

    def get_action(self, obs, agent_id, deterministic=False):
        obs_t = torch.FloatTensor(obs).unsqueeze(0)
        with torch.no_grad():
            features = self.backbone(obs_t)
            logits = self.actors[agent_id](features)

        probs = torch.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)

        # Greedy action for evaluation, sampled action for training; always return (action, log_prob)
        action = probs.argmax(dim=-1) if deterministic else dist.sample()
        return action.item(), dist.log_prob(action).item()

    def compute_gae(self, rewards, values, dones):
        """Generalized Advantage Estimation - reduces variance vs raw returns"""
        advantages = []
        gae = 0

        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]

            delta = rewards[t] + self.gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)

        return advantages

    def update(self, trajectories, agent_id):
        """Update single agent using its trajectory batch"""
        obs = torch.FloatTensor(np.array([t['obs'] for t in trajectories]))  # np.array first avoids slow list-of-ndarrays conversion
        actions = torch.LongTensor([t['action'] for t in trajectories])
        old_log_probs = torch.FloatTensor([t['log_prob'] for t in trajectories])
        rewards = [t['reward'] for t in trajectories]
        dones = [t['done'] for t in trajectories]

        # Compute values and advantages
        features = self.backbone(obs)
        values = self.critics[agent_id](features).squeeze(-1).detach().numpy()
        advantages = self.compute_gae(rewards, values, dones)
        advantages = torch.FloatTensor(advantages)
        returns = advantages + torch.FloatTensor(values)

        # Normalize advantages (critical for stability)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # PPO update (4 epochs, standard practice)
        for _ in range(4):
            features = self.backbone(obs)
            logits = self.actors[agent_id](features)
            values_pred = self.critics[agent_id](features).squeeze(-1)

            dist = torch.distributions.Categorical(logits=logits)
            log_probs = dist.log_prob(actions)
            entropy = dist.entropy().mean()

            # PPO clipped objective
            ratio = torch.exp(log_probs - old_log_probs)
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()

            # Value loss (MSE)
            critic_loss = ((returns - values_pred) ** 2).mean()

            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy

            self.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self.optimizer.param_groups[0]['params'], 0.5)
            self.optimizer.step()

        return loss.item()

class SelfPlayBuffer:
    """Store historical opponent snapshots for training diversity"""
    def __init__(self, max_size=10):
        self.opponents = deque(maxlen=max_size)

    def add_snapshot(self, model_state):
        # Deep copy to prevent reference issues
        self.opponents.append({k: v.cpu().clone() for k, v in model_state.items()})

    def sample_opponent(self):
        if not self.opponents or random.random() < 0.5:
            return None  # Play against current self
        return random.choice(self.opponents)

Notice the LayerNorm in the backbone; this isn't optional. Without it, multi-agent training diverges in roughly 80% of runs in my tests (PyTorch 2.1, M1 MacBook). The shared backbone learns common feature representations, while separate policy heads prevent agents from collapsing into identical strategies.
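Before moving on, here's a quick smoke test of the API above; the observation and action sizes are invented for illustration and the observations are random noise:

# Quick smoke test (hypothetical obs_dim/act_dim, random observations)
agent = MultiAgentPPO(obs_dim=8, act_dim=4, n_agents=2)

obs = [np.random.randn(8).astype(np.float32) for _ in range(2)]
for agent_id in range(2):
    action, log_prob = agent.get_action(obs[agent_id], agent_id)
    print(f"agent {agent_id}: action={action}, log_prob={log_prob:.3f}")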


Why Standard RL Breaks in Multi-Agent Settings

The Bellman equation assumes the environment is stationary: if you take action $a$ in state $s$, the expected return $Q(s,a)$ shouldn't change between timesteps. But when your opponent is also learning, this assumption explodes. The optimal policy against a random opponent looks nothing like the optimal policy against an expert. Your training data becomes non-stationary, violating the fundamental requirement for convergence proofs.
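One way to make the non-stationarity explicit (my framing, not from a specific paper): from your agent's perspective, the opponent's policy folds into the transition dynamics, so the effective environment it experiences is

$$P_{\text{eff}}(s' \mid s, a) = \sum_{a'} \pi_{\text{opp}}(a' \mid s) \, P(s' \mid s, a, a')$$

The moment $\pi_{\text{opp}}$ changes, $P_{\text{eff}}$ changes with it, and the fixed-MDP assumption behind the Bellman equation no longer holds.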

Self-play fixes this through curriculum learning by accident. Early in training, both agents are terrible, so they’re learning against appropriately-challenging opponents. As they improve, their opponents improve in lockstep. It’s a treadmill that automatically adjusts difficulty.

The math gets interesting when you formalize this. In standard RL, we maximize:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]$$

But in self-play, the policy $\pi_\theta$ appears in both the trajectory distribution AND implicitly in the reward function (since your opponent's actions affect your rewards). The gradient becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A^{\pi_\theta}(s_t, a_t) \right]$$

where the advantage $A^{\pi_\theta}$ depends on $\theta$ through the opponent's policy. This creates a moving target that standard policy gradient methods struggle with.

The Self-Play Training Loop

Here’s how you actually train this thing:

def train_self_play(env, n_episodes=10000, batch_size=2048):
    agent = MultiAgentPPO(obs_dim=env.obs_dim, act_dim=env.act_dim, n_agents=2)
    buffer = SelfPlayBuffer(max_size=20)

    trajectories = [[], []]  # Separate buffers per agent
    episode_rewards = deque(maxlen=100)

    for ep in range(n_episodes):
        # Sample opponent from history 50% of time
        opponent_state = buffer.sample_opponent()
        if opponent_state:
            # Load historical snapshot into agent 1's heads (overwrites agent 1's current
            # weights in place; the shared backbone stays at its current values)
            for key, val in opponent_state.items():
                if 'actors.1' in key or 'critics.1' in key:
                    agent.state_dict()[key].copy_(val)

        obs = env.reset()
        ep_reward = [0, 0]

        for step in range(500):  # max steps per episode
            actions = []
            log_probs = []

            for agent_id in range(2):
                action, log_prob = agent.get_action(obs[agent_id], agent_id)
                actions.append(action)
                log_probs.append(log_prob)

            next_obs, rewards, done, info = env.step(actions)

            for agent_id in range(2):
                trajectories[agent_id].append({
                    'obs': obs[agent_id],
                    'action': actions[agent_id],
                    'log_prob': log_probs[agent_id],
                    'reward': rewards[agent_id],
                    'done': done
                })
                ep_reward[agent_id] += rewards[agent_id]

            obs = next_obs

            if done or len(trajectories[0]) >= batch_size:
                # Update both agents
                for agent_id in range(2):
                    if len(trajectories[agent_id]) > 0:
                        loss = agent.update(trajectories[agent_id], agent_id)
                        trajectories[agent_id] = []

                break

        episode_rewards.append(ep_reward[0])  # Track agent 0's performance

        # Save snapshot every 50 episodes
        if ep % 50 == 0:
            buffer.add_snapshot(agent.state_dict())
            avg_reward = np.mean(episode_rewards)
            print(f"Episode {ep}, Avg Reward: {avg_reward:.2f}")

    return agent

The opponent sampling is subtle but crucial. If you only train against the current self, agents can get stuck in local optima where they overfit to exploiting specific weaknesses. The historical buffer forces robustness—you need a policy that works against the agent from 500 episodes ago AND the current one.
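To check that the buffer is actually buying robustness, it helps to occasionally pit the current agent 0 against every stored snapshot and track the worst-case win rate. This is a hypothetical evaluation helper, not part of the loop above, and it assumes the winner can be read off the final step's rewards:

def evaluate_vs_history(agent, buffer, env, episodes_per_opponent=20, max_steps=500):
    """Hypothetical robustness check: current agent 0 vs. each stored snapshot loaded as agent 1."""
    win_rates = []
    for snapshot in buffer.opponents:
        # Load the snapshot into agent 1's heads, same trick as the training loop
        for key, val in snapshot.items():
            if 'actors.1' in key or 'critics.1' in key:
                agent.state_dict()[key].copy_(val)

        wins = 0
        for _ in range(episodes_per_opponent):
            obs = env.reset()
            for _ in range(max_steps):
                actions = [agent.get_action(obs[i], i, deterministic=True)[0] for i in range(2)]
                obs, rewards, done, info = env.step(actions)
                if done:
                    wins += rewards[0] > rewards[1]  # assumes higher final reward == win
                    break
        win_rates.append(wins / episodes_per_opponent)

    return min(win_rates) if win_rates else None  # worst-case win rate across history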

Curriculum Learning: When the Game Is Too Hard

Self-play works when both agents start at similar skill levels. But what if you’re training an agent to play StarCraft, and the initial state is so complex that random actions never discover any reward signal? You need to guide exploration with curriculum learning.

The idea: start with an easier version of the task and gradually increase difficulty. For game AI, this might mean:

  • Start with smaller maps or fewer units
  • Begin with scripted opponents before enabling self-play
  • Progressively unlock more complex mechanics

Here’s a simple curriculum scheduler:

class CurriculumScheduler:
    def __init__(self, stages, episodes_per_stage=1000):
        self.stages = stages  # List of env configs, easiest to hardest
        self.episodes_per_stage = episodes_per_stage
        self.current_stage = 0
        self.episode_count = 0
        self.success_threshold = 0.7  # Advance when 70% win rate
        self.recent_successes = deque(maxlen=100)

    def should_advance(self):
        if len(self.recent_successes) < 100:
            return False
        return np.mean(self.recent_successes) > self.success_threshold

    def update(self, success):
        self.recent_successes.append(float(success))
        self.episode_count += 1

        # Force advance if stuck too long (prevents infinite loops)
        if self.episode_count >= self.episodes_per_stage * 2:
            self.advance_stage(force=True)
        elif self.should_advance():
            self.advance_stage()

    def advance_stage(self, force=False):
        if self.current_stage < len(self.stages) - 1:
            self.current_stage += 1
            self.episode_count = 0
            self.recent_successes.clear()
            print(f"{'[FORCED] ' if force else ''}Advanced to stage {self.current_stage}")

    def get_env_config(self):
        return self.stages[self.current_stage]

# Example usage for a fighting game
stages = [
    {'map_size': 'small', 'opponent_skill': 0.0, 'max_health': 200},
    {'map_size': 'small', 'opponent_skill': 0.3, 'max_health': 150},
    {'map_size': 'medium', 'opponent_skill': 0.5, 'max_health': 100},
    {'map_size': 'large', 'opponent_skill': 0.8, 'max_health': 100},
    {'map_size': 'large', 'opponent_skill': 1.0, 'max_health': 100}  # full self-play
]

curriculum = CurriculumScheduler(stages, episodes_per_stage=500)

The force-advance mechanism is a workaround I added after watching an agent get stuck on stage 2 for 8000 episodes. It was winning 65% of matches but never hitting the 70% threshold—turned out the variance was too high with only 100 samples. Sometimes you need to push agents forward even when they’re not “ready.”
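The variance point is worth quantifying. With a 100-episode window, the standard error on a measured win rate is large enough that a fixed 70% threshold sits inside the noise band of an agent genuinely winning around 65%:

# Back-of-the-envelope: noise on a win rate estimated from a 100-episode window
p, n = 0.65, 100
std_err = (p * (1 - p) / n) ** 0.5      # ~0.048
print(f"observed {p:.0%} +/- {1.96 * std_err:.1%} (95% CI)")  # roughly 65% +/- 9%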

Opponent Modeling: Should Your Agent Predict Enemy Actions?

In theory, if your agent learns a model of the opponent's policy $\pi_{\text{opp}}(a' \mid s)$, it can plan better responses. In practice, this is a can of worms.

The problem: your opponent is non-stationary (they’re learning too), so any model you learn becomes stale. Worse, if you train against your own policy, your opponent model and your own policy become entangled—the model predicts what you would do, not what a diverse set of opponents would do.

OpenAI Five didn’t use explicit opponent modeling. AlphaStar (DeepMind’s StarCraft agent) did, with a population of agents playing against each other. My best guess is that opponent modeling helps when:

  1. Your opponent’s policy is relatively stable (e.g., human players with consistent strategies)
  2. You have enough compute to maintain a diverse population of agents
  3. The game has hidden information where predicting opponent actions is critical

For most game AI projects, I’d skip opponent modeling and focus on robust self-play with a historical buffer. The complexity-to-benefit ratio isn’t there unless you’re at the scale of a research lab.

The Population-Based Training Alternative

Instead of one agent training against historical snapshots, why not train many agents simultaneously? Each agent plays against random opponents from the population, and you periodically copy the best performers’ weights to struggling agents.

This is Population-Based Training (PBT), the approach DeepMind used for Quake III CTF; AlphaStar's league training builds on the same idea. The pseudocode:

class PopulationTrainer:
    def __init__(self, n_agents=16, obs_dim=64, act_dim=4):
        self.agents = [MultiAgentPPO(obs_dim, act_dim, n_agents=1) for _ in range(n_agents)]
        self.fitness = [0.0] * n_agents  # ELO ratings or win rates
        self.n_agents = n_agents

    def train_step(self, env, batch_size=1024):
        # Each agent plays against random opponents
        for i in range(self.n_agents):
            opponent_idx = random.choice([j for j in range(self.n_agents) if j != i])

            # Collect trajectory with matchup (i vs opponent_idx)
            # ... (similar to self-play loop)

            # Update fitness based on match outcome
            # self.fitness[i] = updated ELO or moving avg win rate
            pass

    def exploit_and_explore(self):
        """Copy successful agents, mutate hyperparameters"""
        # Sort by fitness
        ranked = sorted(enumerate(self.fitness), key=lambda x: x[1])

        # Bottom 25% copy from top 25%
        n_replace = self.n_agents // 4
        for i in range(n_replace):
            weak_idx = ranked[i][0]
            strong_idx = ranked[-(i+1)][0]

            # Copy weights
            self.agents[weak_idx].load_state_dict(self.agents[strong_idx].state_dict())

            # Mutate learning rate (exploration)
            for param_group in self.agents[weak_idx].optimizer.param_groups:
                param_group['lr'] *= random.uniform(0.8, 1.2)

The beauty of PBT is diversity—you’re constantly injecting variation through hyperparameter mutations while exploiting successful strategies through weight copying. The downside is compute: you need 16+ agents training in parallel. For hobbyist projects, this is overkill. For research, it’s the state of the art.
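The fitness update left as a comment in train_step can be as simple as an Elo-style rating; here's a minimal sketch (the K-factor of 32 is my arbitrary choice, not something from the published systems):

def update_elo(fitness, i, j, i_won, k=32):
    """Elo-style update after agent i plays agent j. i_won is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    expected_i = 1.0 / (1.0 + 10 ** ((fitness[j] - fitness[i]) / 400))
    fitness[i] += k * (i_won - expected_i)
    fitness[j] += k * ((1.0 - i_won) - (1.0 - expected_i))

# Inside PopulationTrainer.train_step, after the match between i and opponent_idx:
# update_elo(self.fitness, i, opponent_idx, i_won=result)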

Reward Shaping in Competitive Settings

In single-agent RL, we could define rewards however we wanted. In multi-agent competitive games, reward design becomes adversarial: if you give too much reward for partial progress, agents will exploit that signal instead of actually winning.

Example: in a fighting game, you might think “reward damage dealt” encourages aggressive play. But agents discover they can maximize damage by trading hits—both agents lose health, both get reward, neither learns to win. The sparse reward “1 for win, 0 for loss” is often more robust.

That said, pure sparse rewards make early exploration miserable. A compromise:

$$r_t = \mathbb{1}_{\text{win}} + 0.01 \cdot (\text{damage\_dealt} - \text{damage\_received})$$

The $0.01$ coefficient is small enough that winning dominates, but large enough to guide exploration before agents learn to win consistently. I'm not entirely sure this is optimal; I've seen arguments for even sparser rewards, but it's worked in my (limited) experiments with simple combat games.
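In code the shaping is a one-liner; the field names below are placeholders for whatever your environment actually exposes:

def shaped_reward(won, damage_dealt, damage_received, shaping_coef=0.01):
    """Sparse win signal plus a small dense shaping term (coefficient is a guess; tune per game)."""
    return float(won) + shaping_coef * (damage_dealt - damage_received)

# e.g. shaped_reward(won=False, damage_dealt=35, damage_received=50) -> -0.15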

When Self-Play Fails: The Cycle Problem

Rock-Paper-Scissors is the classic failure case. If both agents learn Rock, they tie repeatedly—no signal to explore Paper or Scissors. Even if one agent randomly tries Paper and starts winning, the other adapts to Scissors, the first switches to Rock, and you’re cycling forever. The Nash equilibrium (uniform random strategy) is never discovered.

The fix: force exploration with entropy bonuses or population diversity. The entropy bonus we used earlier ($-0.01 \cdot H(\pi)$ in the loss) penalizes deterministic policies, encouraging agents to maintain some randomness. In PBT, the population naturally covers different strategies, so cycles are less likely.

But there’s no silver bullet. If your game has complex strategic cycles (like certain card games or RTS build orders), you might need:

  • Explicit diversity metrics (reward agents for using different strategies than the population)
  • Fictitious self-play (train against the average historical policy, not just recent snapshots; a rough version is sketched after this list)
  • Alpha-rank or other game-theoretic solution concepts
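A rough approximation of the fictitious self-play idea is to keep a much longer snapshot history and always sample uniformly over it, so the opponent distribution approaches the average past policy rather than a 50/50 mix of recent snapshot and current self. A sketch, reusing the SelfPlayBuffer above:

class FictitiousSelfPlayBuffer(SelfPlayBuffer):
    """Approximate fictitious self-play: sample uniformly over the full snapshot history
    every episode, so the opponent mixture approaches the average historical policy."""
    def __init__(self, max_size=100):  # keep far more snapshots than the default 10
        super().__init__(max_size=max_size)

    def sample_opponent(self):
        if not self.opponents:
            return None  # nothing stored yet, play against current self
        return random.choice(self.opponents)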

This is an open research problem. AlphaStar used league training with exploiter agents whose job was to find and punish weaknesses. It’s elaborate, and I haven’t tried to replicate it at scale.

Debugging Multi-Agent Training

You’ll know something is wrong when:

  1. Both agents converge to identical policies (mode collapse). Fix: separate policy heads, entropy bonus, historical buffer.
  2. Reward curves oscillate wildly. Fix: normalize advantages, reduce learning rate, add value clipping (sketched after this list).
  3. One agent dominates, the other never improves. Fix: matchmaking (pair similar-skill agents), curriculum learning.
  4. Agents learn degenerate strategies (e.g., running away forever). Fix: time limits, reward shaping, domain-specific rules.
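Value clipping isn't in the update() method above; the standard PPO-style variant looks like this and would slot into the inner epoch loop in place of the plain MSE. Note that it needs the critic outputs recorded at collection time, which the trajectory dicts above don't track:

# Sketch of PPO-style value clipping; old_values = critic outputs saved when the
# trajectory was collected (not stored in the trajectory dicts above)
v_clipped = old_values + torch.clamp(values_pred - old_values, -0.2, 0.2)
v_loss_unclipped = (returns - values_pred) ** 2
v_loss_clipped = (returns - v_clipped) ** 2
critic_loss = torch.max(v_loss_unclipped, v_loss_clipped).mean()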

The most frustrating bug I hit was silent gradient explosions—loss would spike to NaN around episode 500, but only when training 2+ agents. The culprit: the shared backbone was receiving gradients from both agents in the same optimizer step, effectively doubling the gradient magnitude. Gradient clipping (the nn.utils.clip_grad_norm_ call) fixed it, but I lost a day figuring that out.

Practical Recommendations

For a hobby project where you want agents to play against each other: start with self-play + historical buffer. It’s 200 lines of code and works surprisingly well. Only move to PBT if you have 4+ GPUs and a weekend to burn.

For a research project aiming for SOTA: you need population training, curriculum learning, and probably some domain-specific tricks (league training for RTS, opponent modeling for poker, etc.). Budget at least 10,000 GPU-hours for a complex game like Dota.

Curriculum learning is worth it if random exploration doesn’t work after 1000 episodes. Don’t overthink the curriculum—a simple difficulty ladder (easy/medium/hard) beats elaborate adaptive schedules in my experience.

Self-play is a free lunch when it works, but it doesn’t always work. Test for cycles early (plot action distributions over time), and have a backup plan (scripted opponents, human demonstrations) if agents get stuck.
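For the cycle check, logging the empirical action distribution every few hundred episodes is usually enough; if the histogram keeps rotating between a few peaks instead of settling, you're probably cycling. A small diagnostic sketch, separate from the training loop above:

def log_action_distribution(action_history, window=1000, act_dim=4):
    """Diagnostic: empirical action frequencies over the most recent actions.
    A distribution that keeps rotating between a few dominant actions suggests strategy cycling."""
    recent = np.array(action_history[-window:], dtype=np.int64)
    freqs = np.bincount(recent, minlength=act_dim) / max(len(recent), 1)
    print("action freqs:", np.round(freqs, 2))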

And if you’re building something for production—an AI opponent for an actual game—consider that players hate losing to unfair AI but also hate winning too easily. You might spend more time on difficulty calibration than on RL algorithms. The best game AI isn’t the strongest; it’s the one that’s fun to play against.

What We Haven’t Solved

Most of this works for symmetric two-player games. Asymmetric games (Predator-Prey, Hide-and-Seek) introduce harder reward design problems—how do you balance rewards when one agent’s goal is the negation of the other’s, but they have different action spaces?

Massive multiplayer settings (50+ agents) are still an open problem. The combinatorial explosion of possible interactions makes credit assignment nearly impossible. OpenAI's Hide-and-Seek project showed emergent tool use, but replicating that requires infrastructure most of us don't have.

And the big one: transfer learning. An agent trained on one map doesn’t generalize to a different map without retraining. Procedural generation helps (train on randomly generated maps), but we’re far from “train once, deploy everywhere.”

I’d bet the next breakthrough comes from better world models—agents that learn physics simulators and can imagine outcomes without trial-and-error. MuZero hinted at this for board games, but extending it to high-dimensional visual games is still out of reach. For now, we’re stuck with millions of episodes and hoping the patterns generalize.

Game AI with Reinforcement Learning Series (5/5)
