Part 1: The Core of RL: Markov Decision Processes (MDP) Explained

Updated Feb 6, 2026

Introduction

Reinforcement Learning (RL) has revolutionized fields ranging from robotics to game playing, from autonomous driving to financial trading. At the heart of every RL algorithm—whether it’s DQN, PPO, or SAC—lies a mathematical framework called the Markov Decision Process (MDP). Understanding MDPs is not optional; it’s the foundation upon which all modern deep RL is built.

In this first episode of our series, we’ll demystify the MDP framework from the ground up. We’ll cover states, actions, transition probabilities, rewards, and the discount factor. We’ll derive the famous Bellman equations step-by-step, explore value functions and policies, and implement classic dynamic programming algorithms in Python. By the end, you’ll understand why MDPs matter and how they connect to the deep RL methods we’ll explore in later episodes.

What is a Markov Decision Process?

The Formal Framework

An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

  • $\mathcal{S}$: State space (set of all possible states)
  • $\mathcal{A}$: Action space (set of all possible actions)
  • $P$: Transition probability function $P(s' | s, a)$, the probability of reaching state $s'$ from state $s$ after taking action $a$
  • $R$: Reward function $R(s, a, s')$ or $R(s, a)$, the immediate reward received
  • $\gamma$: Discount factor $\gamma \in [0, 1]$, controlling the importance of future rewards

The Markov Property

The “Markov” in MDP refers to the Markov property: the future depends only on the current state, not on the history of past states. Formally:

$$P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} | s_t, a_t)$$

This memoryless property is crucial—it allows us to reason about optimal decisions using only the current state, dramatically simplifying the mathematics and computation.

Example: Grid World

Consider a 4×4 grid where an agent starts at the top-left corner and tries to reach a goal at the bottom-right corner:

S · · ·
· · X ·
· · · ·
· · · G
  • States: Each cell position (16 total states)
  • Actions: {Up, Down, Left, Right}
  • Transitions: Deterministic movement (or stochastic with 80% success rate)
  • Rewards: -1 per step, +10 at goal, -10 at obstacle (X)
  • Discount: $\gamma = 0.9$

This simple environment encapsulates all MDP components and will serve as our running example.
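
To make the tuple concrete, here is a minimal sketch (separate from the implementation used later in this post) that encodes this grid world as explicit numpy tensors P[s, a, s'] and R[s, a, s']; the state and action indexing is an assumption chosen for illustration.

import numpy as np

SIZE = 4
N_S, N_A = SIZE * SIZE, 4                      # 16 states, 4 actions
GOAL, OBSTACLE = 15, 6                         # bottom-right goal, obstacle at (1, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

P = np.zeros((N_S, N_A, N_S))                  # P[s, a, s'] = probability of s' given (s, a)
R = np.full((N_S, N_A, N_S), -1.0)             # step penalty by default

for s in range(N_S):
    row, col = divmod(s, SIZE)
    for a, (dr, dc) in enumerate(ACTIONS):
        nr = min(max(row + dr, 0), SIZE - 1)   # moving off the grid keeps the agent in place
        nc = min(max(col + dc, 0), SIZE - 1)
        s_next = nr * SIZE + nc
        P[s, a, s_next] = 1.0                  # deterministic dynamics
        if s_next == GOAL:
            R[s, a, s_next] = 10.0
        elif s_next == OBSTACLE:
            R[s, a, s_next] = -10.0

assert np.allclose(P.sum(axis=2), 1.0)         # each (s, a) row is a valid distribution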

Policies: The Agent’s Strategy

Deterministic vs Stochastic Policies

A policy $\pi$ defines the agent’s behavior: how it chooses actions given states.

Deterministic policy: $\pi(s) = a$ (one action per state)

Stochastic policy: $\pi(a|s)$ (a probability distribution over actions)

For example:
– Deterministic: “Always go right in state (0,0)”
– Stochastic: “Go right with 70% probability, down with 30% in state (0,0)”

Deep RL algorithms often learn stochastic policies (e.g., policy gradient methods) or deterministic policies (e.g., DDPG).
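
As a quick illustration, and nothing more than a sketch with made-up numbers, both kinds of policies for the grid world fit in plain numpy arrays: a deterministic policy is one action index per state, a stochastic policy is a row of probabilities we can sample from (action order assumed to be up, down, left, right).

import numpy as np

n_states, n_actions = 16, 4

# Deterministic policy: one action index per state (here: always "right" = 3)
det_policy = np.full(n_states, 3, dtype=int)

# Stochastic policy: a probability distribution over actions for every state
stoch_policy = np.full((n_states, n_actions), 1.0 / n_actions)   # start uniform
stoch_policy[0] = [0.0, 0.3, 0.0, 0.7]                           # state (0,0): right 70%, down 30%

rng = np.random.default_rng(0)
a_det = det_policy[0]                                 # deterministic choice
a_stoch = rng.choice(n_actions, p=stoch_policy[0])    # sampled choice
print(a_det, a_stoch)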

Policy Evaluation

Given a policy $\pi$, we want to know: how good is this policy? This leads us to value functions.

Value Functions: Measuring Policy Quality

State-Value Function $V^\pi(s)$

The state-value function $V^\pi(s)$ is the expected cumulative discounted reward starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s \right]$$

Where:
  • $\mathbb{E}_\pi$ denotes expectation under policy $\pi$
  • $R_{t+1}$ is the reward at time step $t+1$
  • $\gamma^t$ exponentially decays future rewards

Action-Value Function $Q^\pi(s, a)$

The action-value function (Q-function) extends this concept:

$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, A_0 = a \right]$$

$Q^\pi(s, a)$ is the expected return starting from state $s$, taking action $a$ first, and then following $\pi$.

Relationship: $V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a)$ (for stochastic policies)
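
To see this relationship numerically, here is a tiny sketch with hypothetical Q-values and a hypothetical stochastic policy for a single state:

import numpy as np

# Hypothetical Q-values for one state and the four grid-world actions (up, down, left, right)
q_s = np.array([1.0, 2.5, 0.5, 3.0])
pi_s = np.array([0.1, 0.3, 0.1, 0.5])      # a stochastic policy pi(a|s)

v_s = np.dot(pi_s, q_s)                    # V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
print(v_s)                                 # 0.1*1.0 + 0.3*2.5 + 0.1*0.5 + 0.5*3.0 = 2.4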

Why the Discount Factor?

The discount factor $\gamma$ serves three purposes:

  1. Mathematical convergence: Ensures infinite-horizon sums converge
  2. Uncertainty modeling: Future rewards are less certain, so we value them less
  3. Control urgency: $\gamma$ close to 0 = myopic (greedy), close to 1 = far-sighted

Example: With $\gamma = 0.9$, a reward 10 steps away is worth only $0.9^{10} \approx 0.35$ of an immediate reward.
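
This is easy to verify numerically; a small sketch:

gamma = 0.9

# A reward of 10 received k steps in the future is worth gamma**k * 10 today
for k in (0, 1, 5, 10, 50):
    print(f"k={k:2d}  discounted value = {gamma**k * 10:.3f}")

# Discounted return of an arbitrary reward sequence
rewards = [-1, -1, -1, 10]
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(f"return = {G:.3f}")   # -1 - 0.9 - 0.81 + 7.29 = 4.58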

The Bellman Equations: Recursive Relationships

Bellman Expectation Equations

The value functions satisfy recursive relationships called Bellman equations. Let’s derive the Bellman expectation equation for $V^\pi(s)$:

Starting from the definition:

$$V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s \right]$$

Separate the immediate reward from future rewards:

$$V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} \mid S_t = s \right] + \gamma \mathbb{E}_\pi \left[ R_{t+2} + \gamma R_{t+3} + \ldots \mid S_t = s \right]$$

The future part is just $V^\pi(S_{t+1})$:

$$V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s \right]$$

Expanding the expectation over actions and next states:

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$

This is the Bellman expectation equation for $V^\pi$.

Similarly, for $Q^\pi(s,a)$:

$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right]$$
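
Because the Bellman expectation equation is linear in $V^\pi$ for a fixed policy, small MDPs can be evaluated exactly with a linear solve instead of iteration. The sketch below uses a randomly generated synthetic MDP (not the grid world) purely to illustrate the matrix form $V = (I - \gamma P_\pi)^{-1} R_\pi$:

import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9

# Synthetic model: P[s, a, s'], R[s, a], and a uniform random policy pi[s, a]
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_s, n_a))
pi = np.full((n_s, n_a), 1.0 / n_a)

# Policy-averaged dynamics and rewards: P_pi[s, s'] and R_pi[s]
P_pi = np.einsum('sa,sax->sx', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)

# Bellman expectation in matrix form: V = R_pi + gamma * P_pi @ V  =>  (I - gamma * P_pi) V = R_pi
V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
print(V)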

Bellman Optimality Equations

The optimal value function $V^*(s)$ is the maximum value achievable from state $s$ under any policy:

$$V^*(s) = \max_\pi V^\pi(s)$$

The Bellman optimality equation is:

$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$

For $Q^*(s,a)$:

$$Q^*(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]$$

These equations are fixed-point equations—the optimal value function is the solution that satisfies these recursive relationships.

Key insight: If we know $V^*$ or $Q^*$, we can extract the optimal policy:

$$\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$

Or more simply with $Q^*$:

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$
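
In code, this extraction is a single argmax per state; a sketch with a hypothetical Q*-table:

import numpy as np

# Hypothetical Q*-table: one row per state, one column per action (up, down, left, right)
Q_star = np.array([[0.2, 1.5, 0.1, 2.0],
                   [0.7, 2.2, 0.3, 1.1],
                   [1.0, 0.4, 0.9, 3.3]])

pi_star = np.argmax(Q_star, axis=1)   # greedy action index for every state
print(pi_star)                        # [3 1 3]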

Dynamic Programming: Solving MDPs

When we know the full MDP model $(P, R)$, we can solve for optimal policies using dynamic programming. Two classic algorithms: value iteration and policy iteration.

Value Iteration

Value iteration iteratively applies the Bellman optimality equation as an update rule:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V_k(s') \right]$$

Start with an arbitrary $V_0$ (e.g., all zeros) and repeat until convergence.

Python Implementation

import numpy as np

# 4x4 Grid World MDP
class GridWorld:
    def __init__(self):
        self.size = 4
        self.n_states = self.size * self.size
        self.actions = ['up', 'down', 'left', 'right']
        self.n_actions = len(self.actions)
        self.goal_state = 15  # Bottom-right corner
        self.obstacle_state = 6  # Obstacle at (1, 2)
        self.gamma = 0.9

    def get_next_state(self, state, action):
        """Deterministic transition function"""
        row, col = state // self.size, state % self.size

        if action == 'up':
            row = max(0, row - 1)
        elif action == 'down':
            row = min(self.size - 1, row + 1)
        elif action == 'left':
            col = max(0, col - 1)
        elif action == 'right':
            col = min(self.size - 1, col + 1)

        return row * self.size + col

    def get_reward(self, state, action, next_state):
        """Reward function"""
        if next_state == self.goal_state:
            return 10.0
        elif next_state == self.obstacle_state:
            return -10.0
        else:
            return -1.0  # Step penalty

def value_iteration(env, theta=1e-6, max_iterations=1000):
    """Value iteration algorithm"""
    V = np.zeros(env.n_states)

    for iteration in range(max_iterations):
        delta = 0
        V_old = V.copy()

        for s in range(env.n_states):
            if s == env.goal_state:  # Terminal state
                V[s] = 0
                continue

            # Compute max over actions
            action_values = []
            for action in env.actions:
                s_prime = env.get_next_state(s, action)
                reward = env.get_reward(s, action, s_prime)
                action_values.append(reward + env.gamma * V_old[s_prime])

            V[s] = max(action_values)
            delta = max(delta, abs(V[s] - V_old[s]))

        if delta < theta:
            print(f"Converged in {iteration + 1} iterations")
            break

    # Extract optimal policy
    policy = np.zeros(env.n_states, dtype=int)
    for s in range(env.n_states):
        if s == env.goal_state:
            continue

        action_values = []
        for action_idx, action in enumerate(env.actions):
            s_prime = env.get_next_state(s, action)
            reward = env.get_reward(s, action, s_prime)
            action_values.append(reward + env.gamma * V[s_prime])

        policy[s] = np.argmax(action_values)

    return V, policy

# Run value iteration
env = GridWorld()
V_star, pi_star = value_iteration(env)

print("\nOptimal Value Function:")
print(V_star.reshape(4, 4))

print("\nOptimal Policy (0=up, 1=down, 2=left, 3=right):")
print(pi_star.reshape(4, 4))

Output interpretation: The value function shows expected cumulative rewards from each state. States closer to the goal have higher values. The policy shows the optimal action at each state.
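
If you prefer a more readable printout, the policy indices can be rendered as arrows; this optional snippet assumes the env and pi_star objects from the value-iteration run above:

# Optional: render the policy with arrows instead of action indices
arrows = ['↑', '↓', '←', '→']                       # same order as env.actions
grid = [arrows[a] for a in pi_star]
grid[env.goal_state] = 'G'                          # mark the terminal goal cell
for row in range(env.size):
    print(' '.join(grid[row * env.size:(row + 1) * env.size]))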

Policy Iteration

Policy iteration alternates between two steps:

  1. Policy Evaluation: Compute $V^\pi$ for the current policy $\pi$ by solving the Bellman expectation equation
  2. Policy Improvement: Update the policy greedily: $\pi'(s) = \arg\max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$

Repeat until policy converges.

def policy_evaluation(env, policy, theta=1e-6):
    """Evaluate a given policy"""
    V = np.zeros(env.n_states)

    while True:
        delta = 0
        V_old = V.copy()

        for s in range(env.n_states):
            if s == env.goal_state:
                V[s] = 0
                continue

            action = env.actions[policy[s]]
            s_prime = env.get_next_state(s, action)
            reward = env.get_reward(s, action, s_prime)
            V[s] = reward + env.gamma * V_old[s_prime]

            delta = max(delta, abs(V[s] - V_old[s]))

        if delta < theta:
            break

    return V

def policy_iteration(env, max_iterations=100):
    """Policy iteration algorithm"""
    # Start with random policy
    policy = np.random.randint(0, env.n_actions, env.n_states)

    for iteration in range(max_iterations):
        # Policy Evaluation
        V = policy_evaluation(env, policy)

        # Policy Improvement
        policy_stable = True
        for s in range(env.n_states):
            if s == env.goal_state:
                continue

            old_action = policy[s]

            # Greedy policy update
            action_values = []
            for action in env.actions:
                s_prime = env.get_next_state(s, action)
                reward = env.get_reward(s, action, s_prime)
                action_values.append(reward + env.gamma * V[s_prime])

            policy[s] = np.argmax(action_values)

            if old_action != policy[s]:
                policy_stable = False

        if policy_stable:
            print(f"Policy converged in {iteration + 1} iterations")
            break

    return V, policy

# Run policy iteration
V_pi, pi_pi = policy_iteration(env)
print("\nPolicy Iteration - Optimal Value Function:")
print(V_pi.reshape(4, 4))

Comparison: Policy iteration often converges in fewer iterations than value iteration because it explicitly maintains and improves a policy. Value iteration implicitly improves the policy through value updates.

Example: Simple Trading MDP

Let’s model a simplified trading scenario:

  • States: {Cash, Stock} (binary: holding cash or holding stock)
  • Actions: {Hold, Buy, Sell}
  • Transitions: Stock price moves up/down with probabilities
  • Rewards: Profit/loss from trading

class TradingMDP:
    def __init__(self):
        self.states = ['cash', 'stock']
        self.actions = ['hold', 'buy', 'sell']
        self.gamma = 0.95

        # Price change probabilities
        self.p_up = 0.6  # Probability price goes up
        self.p_down = 0.4

        # Price change magnitude
        self.price_change = 0.02  # 2% change

    def transition_prob(self, s, a, s_prime):
        """Return P(s' | s, a)"""
        if a == 'hold':
            return 1.0 if s == s_prime else 0.0
        elif a == 'buy' and s == 'cash':
            return 1.0 if s_prime == 'stock' else 0.0
        elif a == 'sell' and s == 'stock':
            return 1.0 if s_prime == 'cash' else 0.0
        else:
            return 0.0  # Invalid action

    def reward(self, s, a):
        """Expected reward for action a in state s"""
        if s == 'stock' and a == 'hold':
            # Expected return from price movement
            return self.p_up * self.price_change - self.p_down * self.price_change
        elif s == 'stock' and a == 'sell':
            # Lock in current value
            return 0.0
        elif s == 'cash' and a == 'buy':
            # Transaction cost
            return -0.001
        else:
            return 0.0

# Build the trading MDP (a small value-iteration solver for it is sketched below)
trading_env = TradingMDP()
print("\nSimple Trading MDP - demonstrates MDP modeling in finance")

This example illustrates how real-world decision problems (trading, resource allocation) can be formalized as MDPs.

Why MDPs Matter for Deep RL

You might wonder: “Modern RL uses neural networks, not dynamic programming. Why learn MDPs?”

The Connection

Every deep RL algorithm is solving an MDP:

  • DQN (Deep Q-Network): Approximates $Q^*(s,a)$ with a neural network
  • Policy Gradient methods (PPO, A3C): Directly optimize $\pi(a|s)$ to maximize $V^\pi(s)$
  • Actor-Critic (SAC, TD3): Learns both a policy and a value function

The difference: deep RL handles unknown MDPs (we don’t know $P$ or $R$) and large/continuous state spaces where tabular methods fail.

Key Takeaways for Deep RL

  1. Bellman equations are everywhere: Temporal difference (TD) learning, Q-learning, and actor-critic methods all use Bellman-style updates (see the sketch after this list)
  2. Value functions are universal: Whether tabular or neural, $V$ and $Q$ measure policy quality
  3. Policy optimization: The goal is always to find $\pi^*$ that maximizes expected return
  4. Discount factor matters: The hyperparameter $\gamma$ critically affects learning dynamics
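
As a preview of that connection, and only a sketch rather than the full algorithm (Q-learning is covered properly in Episode 3), the tabular Q-learning update is a sampled, incremental form of the Bellman optimality backup:

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q[s, a] toward the sampled Bellman target."""
    td_target = r + gamma * np.max(Q[s_next])     # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

Q = np.zeros((16, 4))                             # grid-world sized table
Q = q_learning_update(Q, s=0, a=3, r=-1.0, s_next=1)
print(Q[0, 3])                                    # -0.1 after one update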

In upcoming episodes, we’ll see how these MDP concepts translate to custom Gym environments (Episode 2) and deep RL algorithms (Episodes 3-4).

Practical Considerations

When is the MDP Framework Applicable?

MDPs require:

  • Markov property: The current state contains all necessary information
  • Stationary dynamics: Transition probabilities don’t change over time
  • Discrete time steps: Decisions occur at fixed intervals

Many real problems violate these assumptions (partial observability, non-stationarity, continuous time), but MDP solutions still provide strong baselines.

Curse of Dimensionality

Dynamic programming becomes intractable for large state spaces:
– Grid world: 16 states ✓
– Chess: ~10^43 states ✗
– Robotic control: continuous state space ✗

This is why deep RL uses function approximation (neural networks) instead of tabular methods.

Exploration vs Exploitation

MDPs assume we know the model $(P, R)$. In practice, we must explore to learn the environment while exploiting current knowledge. This tradeoff becomes central in model-free RL (covered in Episode 3).
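
A common heuristic for balancing the two is ε-greedy action selection; a minimal sketch with a hypothetical Q-table row:

import numpy as np

def epsilon_greedy(Q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))    # explore
    return int(np.argmax(Q_row))                # exploit the current estimate

rng = np.random.default_rng(0)
Q_row = np.array([0.2, 1.5, 0.1, 2.0])          # hypothetical Q-values for one state
actions = [epsilon_greedy(Q_row, epsilon=0.1, rng=rng) for _ in range(5)]
print(actions)                                  # mostly 3 (greedy), occasionally random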

Conclusion

Markov Decision Processes are the mathematical bedrock of reinforcement learning. We’ve covered the formal framework (states, actions, transitions, rewards, discount factor), derived the Bellman equations from first principles, and implemented value iteration and policy iteration in Python.

Key concepts to remember:

  • Value functions $V^\pi(s)$ and $Q^\pi(s,a)$ quantify policy quality
  • Bellman equations provide recursive relationships that enable efficient computation
  • Dynamic programming solves MDPs exactly when the model is known
  • The optimal policy can be extracted from the optimal value functions

These foundations will carry through the entire series. In the next episode, we’ll move from theory to practice by building custom Gym environments where our agents will learn through interaction—no model required. We’ll see how the abstract MDP notation translates into concrete Python classes with step() and reset() methods.

Understanding MDPs deeply will make every deep RL algorithm you encounter—from DQN to PPO—feel less like magic and more like principled optimization. The mathematics we’ve covered today isn’t just academic; it’s the blueprint that guides how neural networks learn to play games, control robots, and optimize complex systems.


Deep Reinforcement Learning: From Theory to Custom Environments Series (1/6)
