Q-Learning for Grid Worlds: Building Your First Game AI Agent

Updated Feb 6, 2026

The Bellman Equation Is All You Need (For Now)

Here’s a claim that’ll sound absurd until you’ve implemented it yourself: you can build a functional game AI agent with nothing but a dictionary, a random number generator, and the Bellman equation. No neural networks. No gradient descent. Not even NumPy if you’re feeling masochistic.

Q-learning gets dismissed as “toy RL” because it’s what everyone teaches first. But that’s backward reasoning — it’s taught first because it actually works, and because understanding Q-learning means you already understand 80% of what makes DQN and its descendants tick. The gap between tabular Q-learning and deep RL isn’t conceptual. It’s just function approximation.

The core idea is deceptively simple. Your agent maintains a table mapping (state, action) pairs to expected future rewards — the Q-values. At each step, it picks the action with the highest Q-value (with some exploration thrown in), observes the reward and next state, then updates its estimate using the Bellman optimality equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

That’s it. The term in brackets is the temporal difference (TD) error — how wrong your current estimate was. You nudge the Q-value toward the better estimate $r + \gamma \max_{a'} Q(s', a')$ using learning rate $\alpha$. The discount factor $\gamma \in [0, 1]$ controls how much you care about future rewards versus immediate ones.
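
To make the update concrete, here is one update worked by hand with made-up numbers (the 2.0 and 5.0 are arbitrary estimates chosen for illustration, not values from the grid world below):

# One Bellman update by hand. The starting estimates (2.0 and 5.0)
# are arbitrary numbers chosen for illustration.
alpha, gamma = 0.1, 0.95
q_sa = 2.0          # current estimate of Q(s, a)
reward = -1         # observed reward
max_next_q = 5.0    # current estimate of max_a' Q(s', a')

td_target = reward + gamma * max_next_q   # -1 + 0.95 * 5.0 = 3.75
td_error = td_target - q_sa               # 3.75 - 2.0 = 1.75
q_sa += alpha * td_error                  # 2.0 + 0.1 * 1.75 = 2.175
print(q_sa)                               # 2.175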

Let’s build this for a 5×5 grid world where the agent starts at the top-left, the goal is bottom-right, and there are a few obstacles. Rewards are -1 per step (to encourage efficiency), +100 at the goal, and -50 if you hit a wall or an obstacle.


Implementation: Less Code Than You Think

I’m going to show you the entire training loop in under 100 lines because bloated code is how tutorials lose readers. This runs on Python 3.9+ with no dependencies.

import random
from collections import defaultdict

# 5x5 grid: 0 = empty, 1 = obstacle, 2 = goal
grid = [
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 2]
]

ACTIONS = ['up', 'down', 'left', 'right']
START = (0, 0)
GOAL = (4, 4)

class GridWorld:
    def __init__(self, grid):
        self.grid = grid
        self.state = START
        self.done = False

    def reset(self):
        self.state = START
        self.done = False
        return self.state

    def step(self, action):
        if self.done:
            return self.state, 0, True

        r, c = self.state
        if action == 'up': r -= 1
        elif action == 'down': r += 1
        elif action == 'left': c -= 1
        elif action == 'right': c += 1

        # Check boundaries
        if r < 0 or r >= 5 or c < 0 or c >= 5:
            return self.state, -50, False  # hit wall, stay put

        # Check obstacle
        if self.grid[r][c] == 1:
            return self.state, -50, False  # hit obstacle

        # Valid move
        self.state = (r, c)
        if self.state == GOAL:
            self.done = True
            return self.state, 100, True
        return self.state, -1, False  # step cost

# Q-learning agent
class QLearningAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q_table = defaultdict(lambda: {a: 0.0 for a in actions})
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

    def get_action(self, state):
        # Epsilon-greedy
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.q_table[state], key=self.q_table[state].get)

    def update(self, state, action, reward, next_state, done):
        current_q = self.q_table[state][action]
        if done:
            target = reward  # no future rewards
        else:
            max_next_q = max(self.q_table[next_state].values())
            target = reward + self.gamma * max_next_q

        # Bellman update
        self.q_table[state][action] += self.alpha * (target - current_q)

# Training loop
env = GridWorld(grid)
agent = QLearningAgent(ACTIONS, alpha=0.1, gamma=0.95, epsilon=0.2)

episode_rewards = []
for episode in range(5000):
    state = env.reset()
    total_reward = 0
    steps = 0

    while steps < 100:  # cap episode length; early episodes can wander for a while
        action = agent.get_action(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state, done)

        total_reward += reward
        state = next_state
        steps += 1

        if done:
            break

    episode_rewards.append(total_reward)

    # Print progress every 500 episodes
    if (episode + 1) % 500 == 0:
        avg_reward = sum(episode_rewards[-500:]) / 500
        print(f"Episode {episode + 1}: Avg reward = {avg_reward:.2f}")

Run this and you’ll see something like:

Episode 500: Avg reward = -23.45
Episode 1000: Avg reward = 12.83
Episode 1500: Avg reward = 34.67
Episode 2000: Avg reward = 51.22
Episode 2500: Avg reward = 62.89
Episode 3000: Avg reward = 71.14
Episode 3500: Avg reward = 76.33
Episode 4000: Avg reward = 79.81
Episode 4500: Avg reward = 82.05
Episode 5000: Avg reward = 83.47

The agent starts out wandering randomly (negative rewards mean it’s hitting walls and taking forever). By episode 2000 it’s consistently reaching the goal. By 5000 it’s found a near-optimal path.

Why This Works (And When It Doesn’t)

The magic here is the temporal difference update. Instead of waiting until the end of an episode to calculate total return (like Monte Carlo methods), Q-learning updates estimates after every single step. You bootstrap from your current (probably wrong) estimate of $Q(s', a')$, which creates a credit assignment chain: when the agent finally reaches the goal, that +100 reward propagates backward through the Q-table one update at a time.
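
You can watch that propagation after the fact. A rough sketch, reusing the trained agent and grid from the training code above (the print_value_map helper is mine): print the largest Q-value in each cell, and you’ll see the values peak next to the goal and taper off toward the start.

# Rough sketch: print the max-over-actions Q-value per cell. Values are
# highest near the goal and taper off toward the start, which is the
# +100 reward having propagated backward through the table.
def print_value_map(agent, grid):
    for r in range(5):
        row = []
        for c in range(5):
            if grid[r][c] == 1:
                row.append('   ###')  # obstacle
            else:
                row.append(f'{max(agent.q_table[(r, c)].values()):6.1f}')
        print(' '.join(row))

print_value_map(agent, grid)  # the goal cell prints 0.0: terminal states never get updated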

But there’s a catch — tabular Q-learning only works when your state space is small enough to enumerate. This 5×5 grid has 25 possible states. A 100×100 grid? Now you’re looking at 10,000 states, and most of them will never be visited during training. The Q-table becomes sparse and convergence slows to a crawl.

And that’s just discrete grids. What if your state includes continuous values? A robot arm with three joints, each with position and velocity — that’s a 6-dimensional continuous state space. You literally cannot build a table for that.

This is where function approximation comes in (and why we’ll move to DQN in Part 3). Instead of storing Q-values in a table, you train a neural network to predict $Q(s, a)$ from raw state features. But the core algorithm — pick action, observe reward, compute TD error, update — stays exactly the same. The Bellman equation doesn’t care whether you’re updating a table entry or neural network weights.

Hyperparameters: The Knobs You’ll Actually Tune

Three numbers control how your agent learns, and getting them wrong means it either never converges or finds a garbage policy.

Learning rate $\alpha$: How much you trust new information versus old estimates. Too high (e.g., 0.9) and Q-values thrash around, never settling. Too low (e.g., 0.01) and learning is glacially slow. I typically start at 0.1 and decay it over time — aggressive updates early, conservative refinement later. In practice, values between 0.05 and 0.3 work for most tabular problems.
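
If you want to try that decay, a minimal sketch (the constants and schedule here are arbitrary, mirroring the epsilon decay shown further down):

# Minimal alpha-decay sketch: aggressive updates early, finer refinement later.
initial_alpha = 0.3
final_alpha = 0.05
alpha_decay = 0.999

agent.alpha = initial_alpha
for episode in range(5000):
    # ... run one episode exactly as in the training loop above ...
    agent.alpha = max(final_alpha, agent.alpha * alpha_decay)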

Discount factor $\gamma$: Determines your agent’s planning horizon. $\gamma = 0$ means myopic (only immediate rewards matter). $\gamma = 0.99$ means far-sighted (rewards 100 steps away still matter). For this grid world, $\gamma = 0.95$ works well because paths are short — 8-10 steps optimal. For longer tasks (like playing a full game of chess) you’d want $\gamma \geq 0.99$. But be warned: higher $\gamma$ means slower convergence because you’re propagating information across more steps.
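
A quick way to build intuition for the horizon: a reward k steps in the future is weighted by $\gamma^k$, so you can simply print those weights for a few values of $\gamma$ and k.

# How much is a reward k steps away worth, relative to an immediate one?
for gamma in (0.9, 0.95, 0.99):
    weights = {k: round(gamma ** k, 4) for k in (10, 50, 100)}
    print(f"gamma={gamma}: {weights}")
# gamma=0.95 weights a reward 100 steps away at about 0.006, effectively invisible.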

Exploration rate $\epsilon$: The probability of taking a random action instead of the greedy one. This is the exploration-exploitation tradeoff in its rawest form. $\epsilon = 0$ means pure exploitation — your agent gets stuck in local optima because it never tries new paths. $\epsilon = 1$ means pure exploration — the agent learns nothing because it ignores its Q-values. Standard practice is $\epsilon$-decay: start at 0.5 or 1.0 (explore heavily), then decay exponentially to 0.01-0.1 (exploit learned policy while keeping some stochasticity).

Here’s a quick modification to add epsilon decay:

initial_epsilon = 1.0
final_epsilon = 0.01
decay_rate = 0.995

agent = QLearningAgent(ACTIONS, alpha=0.1, gamma=0.95, epsilon=initial_epsilon)

for episode in range(5000):
    # ... training loop ...

    # Decay epsilon after each episode
    agent.epsilon = max(final_epsilon, agent.epsilon * decay_rate)

With decay, the agent explores aggressively for the first ~1000 episodes, then gradually shifts to exploiting what it’s learned. You’ll see convergence happen faster and with less variance in the reward curve.

What You’re Actually Seeing: Policy Convergence

After 5000 episodes, let’s visualize what the agent learned. We’ll extract the greedy policy (best action for each state) and print it as a grid:

def print_policy(agent, grid):
    policy_map = {'up': '↑', 'down': '↓', 'left': '←', 'right': '→'}
    for r in range(5):
        row = []
        for c in range(5):
            if grid[r][c] == 1:
                row.append('█')  # obstacle
            elif (r, c) == GOAL:
                row.append('G')  # goal
            else:
                state = (r, c)
                best_action = max(agent.q_table[state], key=agent.q_table[state].get)
                row.append(policy_map[best_action])
        print(' '.join(row))

print_policy(agent, grid)

Output:

→ → ↓ █ ↓
↓ █ ↓ █ ↓
→ → → → ↓
█ ↓ █ → ↓
→ → → → G

Every arrow points toward the goal, navigating around obstacles (where two actions are equally good, your run may break the tie differently). This is the optimal policy for the reward structure we defined. The agent never explicitly learned “go to bottom-right” — it just learned that certain state-action pairs lead to higher cumulative reward.

But here’s something interesting: if you change the step cost from -1 to -0.1, the policy shifts. With low step costs, the agent takes longer, more exploratory paths because there’s less penalty for inefficiency. With high step costs (e.g., -10), it finds the shortest possible route. The Bellman equation encodes your reward preferences — tune the rewards and you tune the behavior.
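
If you want to experiment with this yourself, one way is a small subclass of the GridWorld above that makes the step cost a parameter (the GridWorldWithCost name and step_cost argument are my additions, not part of the original class):

# Sketch: parameterize the step cost so you can compare the learned policies.
class GridWorldWithCost(GridWorld):
    def __init__(self, grid, step_cost=-1):
        super().__init__(grid)
        self.step_cost = step_cost

    def step(self, action):
        next_state, reward, done = super().step(action)
        if reward == -1:              # ordinary move: swap in our own step cost
            reward = self.step_cost
        return next_state, reward, done

env = GridWorldWithCost(grid, step_cost=-0.1)   # try -0.1, -1, -10 and compare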

Off-Policy vs On-Policy: Why Q-Learning Is Actually Weird

Q-learning has a subtle but critical property: it’s off-policy. The agent behaves according to an $\epsilon$-greedy policy (sometimes random), but it updates Q-values as if it were following a greedy policy (always taking the max). Look at the update rule again:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

That $\max_{a'}$ term doesn’t care what action the agent actually took next — it uses the value of the best possible next action. This decoupling is powerful because it means the agent can learn the optimal policy while behaving suboptimally (exploring). Exploration doesn’t contaminate the target you’re learning toward.

The alternative is on-policy learning, like SARSA (State-Action-Reward-State-Action). SARSA updates based on the action the agent actually took:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$

No max — just the Q-value of whatever action came next (including random exploratory actions). SARSA learns the value of the policy it’s following, including exploration noise. This makes it more conservative — if exploration led to a bad outcome, SARSA learns “this state-action is risky.” Q-learning ignores that and focuses on the best-case scenario.
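
To make the difference concrete, here is a minimal SARSA-style agent sketched as a subclass of the QLearningAgent above (the training loop has to change too: you pick next_action before calling update, then take that exact action on the following step):

# Sketch: SARSA uses the Q-value of the action actually taken next,
# not the max over possible next actions.
class SarsaAgent(QLearningAgent):
    def update(self, state, action, reward, next_state, next_action, done):
        current_q = self.q_table[state][action]
        if done:
            target = reward                       # no future rewards
        else:
            target = reward + self.gamma * self.q_table[next_state][next_action]
        self.q_table[state][action] += self.alpha * (target - current_q)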

Which should you use? For most game AI, Q-learning’s off-policy nature is what you want. Games have clear win/loss conditions, and you want the agent to learn the optimal strategy, not a safe-but-suboptimal one. SARSA shines in risk-sensitive domains (robotics, financial trading) where exploration failures are costly and you need to learn a policy that accounts for uncertainty.

Scaling Beyond Grids: What Breaks First

I claimed earlier that tabular Q-learning doesn’t scale. Let me be more precise about where it fails.

State explosion: A 10×10 grid with 5 possible item types per cell? That’s $5^{100}$ states. You will never visit even 0.00001% of them. The Q-table becomes a sparse wasteland of unvisited states stuck at their initial values (usually zeros), and the agent has no way to generalize from seen states to unseen ones. This is where neural networks come in — they interpolate between similar states.

Continuous states: Even worse. A state with two continuous variables (e.g., position $x \in [0, 1]$ and velocity $v \in [-1, 1]$) has uncountably many states. You could discretize into bins, but that’s clunky and loses precision. Function approximation is the only real solution.
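
For completeness, the binning workaround looks roughly like this (a sketch with arbitrary bin counts; the discretize helper is mine):

# Sketch: bucket a continuous (position, velocity) state into a discrete
# tuple that can index a Q-table. Bin counts are arbitrary.
def discretize(x, v, n_bins=10):
    x_bin = min(int(x * n_bins), n_bins - 1)              # x in [0, 1]
    v_bin = min(int((v + 1) / 2 * n_bins), n_bins - 1)    # v in [-1, 1]
    return (x_bin, v_bin)

print(discretize(0.37, -0.2))   # (3, 4)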

Delayed rewards: Q-learning propagates rewards backward one step at a time. If the goal is 100 steps away, it takes 100+ episodes before the initial state’s Q-values reflect that distant reward. This is why discount factor $\gamma$ matters so much — higher $\gamma$ helps with credit assignment over long horizons, but slows convergence. (Eligibility traces, used in TD($\lambda$), address this by updating multiple states per step, but that’s a whole other post.)

Sample efficiency: Each episode generates one trajectory through state space. If your environment is expensive to simulate (e.g., a robotics sim taking 10 seconds per episode), tabular Q-learning needs thousands of episodes to converge. That’s hours or days of training. Deep RL adds experience replay (reuse old transitions) to improve sample efficiency, but tabular Q-learning is stuck with online learning.

Despite these limits, tabular Q-learning still has a niche. If your game actually has a small discrete state space (e.g., tic-tac-toe, small puzzle games, finite state machines), it’s faster and more interpretable than neural networks. You can print the Q-table and see exactly what the agent learned. No black box.

The Thing Nobody Tells You: Initialization Matters

I used defaultdict(lambda: {a: 0.0 for a in actions}) to initialize Q-values at zero. This is standard, but it’s not always smart.

If you initialize at zero and your environment gives negative rewards (like our -1 step cost), the agent starts with overly optimistic estimates. It thinks “maybe this action leads to 0 reward” when actually everything leads to negative rewards. This causes initial exploration to be weirdly biased — the agent avoids revisiting states because updating from zero to -1 makes them look worse than unexplored states (still at zero).

A smarter approach: optimistic initialization. Set all Q-values to a high value (e.g., +10). Now every action looks promising until proven otherwise. This encourages exploration because the agent wants to try every action to see if it’s as good as the optimistic estimate suggests. When it gets real rewards (likely lower), the Q-values drop, but at least it tried everything.

For our grid world:

self.q_table = defaultdict(lambda: {a: 10.0 for a in actions})

Rerun training and you’ll see faster convergence in early episodes. The agent explores more uniformly instead of getting stuck revisiting the same safe states.

But (and this is the uncertainty I promised) I’m not entirely sure this always helps. In environments with sparse positive rewards (e.g., the goal is the only positive reward), optimistic initialization might cause the agent to waste time exploring obviously bad states. My intuition says it helps more when rewards are dense, but I haven’t tested this rigorously across different domains. Take with salt.

Where Q-Learning Still Wins

I’ve spent a lot of words explaining Q-learning’s limitations, but here’s where it’s still the right choice:

Small discrete games: Tic-tac-toe, Connect Four (with symmetry reduction), small maze puzzles. Anything where you can enumerate all the states and fit the table in memory.

Interpretability: You can print the Q-table and audit it. “Why did the agent do X in situation Y?” Look up $Q(Y, X)$ and compare to other actions. Neural networks are black boxes; Q-tables are spreadsheets.

Provable convergence: Tabular Q-learning converges to the optimal policy under mild conditions (every state-action pair visited infinitely often, learning rate decays properly). Deep RL has no such guarantees — DQN might diverge, overfit, or catastrophically forget.

Debugging deep RL: When your DQN isn’t working, it’s smart to first implement the same task with tabular Q-learning. If that converges, you know the problem is your network architecture or hyperparameters, not your reward shaping or environment logic. Tabular Q-learning is the unit test for your RL setup.

Use Q-learning as the baseline. If it works, great — you’re done. If it doesn’t scale, then reach for function approximation. Don’t start with a neural network just because it sounds cooler.

In Part 3, we’ll cross that line. We’ll take this exact algorithm, swap the Q-table for a neural network, add experience replay and target networks, and train an agent to play Atari games from raw pixels. The conceptual leap is smaller than you think — DQN is just Q-learning with extra steps (literally). But those extra steps unlock state spaces millions of times larger than what we handled here.

Game AI with Reinforcement Learning Series (2/5)

