Introduction
Reinforcement Learning (RL) has revolutionized fields ranging from robotics to game playing, from autonomous driving to financial trading. At the heart of every RL algorithm—whether it’s DQN, PPO, or SAC—lies a mathematical framework called the Markov Decision Process (MDP). Understanding MDPs is not optional; it’s the foundation upon which all modern deep RL is built.
In this first episode of our series, we’ll demystify the MDP framework from the ground up. We’ll cover states, actions, transition probabilities, rewards, and the discount factor. We’ll derive the famous Bellman equations step-by-step, explore value functions and policies, and implement classic dynamic programming algorithms in Python. By the end, you’ll understand why MDPs matter and how they connect to the deep RL methods we’ll explore in later episodes.
What is a Markov Decision Process?
The Formal Framework
An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$: State space (set of all possible states)
- $\mathcal{A}$: Action space (set of all possible actions)
- $P$: Transition probability function $P(s' \mid s, a)$, the probability of reaching state $s'$ from state $s$ after taking action $a$
- $R$: Reward function $R(s, a)$ or $R(s, a, s')$, the immediate reward received
- $\gamma$: Discount factor $\gamma \in [0, 1]$, controlling the importance of future rewards
The Markov Property
The “Markov” in MDP refers to the Markov property: the future depends only on the current state, not on the history of past states. Formally:

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0)$$
This memoryless property is crucial—it allows us to reason about optimal decisions using only the current state, dramatically simplifying the mathematics and computation.
Example: Grid World
Consider a 4×4 grid where an agent starts at the top-left corner and tries to reach a goal at the bottom-right corner:
| S | · | · | · |
|---|---|---|---|
| · | · | X | · |
| · | · | · | · |
| · | · | · | G |
- States: Each cell position (16 total states)
- Actions: {Up, Down, Left, Right}
- Transitions: Deterministic movement (or stochastic with 80% success rate)
- Rewards: -1 per step, +10 at goal, -10 at obstacle (X)
- Discount: $\gamma = 0.9$
This simple environment encapsulates all MDP components and will serve as our running example.
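To make the state encoding concrete, here is a minimal sketch (the helper names are my own, not part of the formal definition) showing how each cell maps to an integer index, matching the row-major indexing used in the implementation later in this post:

```python
# Map (row, col) grid cells to integer state indices, row-major order.
SIZE = 4

def to_state(row, col):
    """Cell (row, col) -> state index in [0, 15]."""
    return row * SIZE + col

def to_cell(state):
    """State index -> (row, col)."""
    return state // SIZE, state % SIZE

print(to_state(0, 0))   # 0  -> start S
print(to_state(1, 2))   # 6  -> obstacle X
print(to_state(3, 3))   # 15 -> goal G
```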
Policies: The Agent’s Strategy
Deterministic vs Stochastic Policies
A policy defines the agent’s behavior—how it chooses actions given states.
Deterministic policy: $a = \pi(s)$ (one action per state)
Stochastic policy: $\pi(a \mid s)$ (probability distribution over actions)
For example:
– Deterministic: “Always go right in state (0,0)”
– Stochastic: “Go right with 70% probability, down with 30% in state (0,0)”
Deep RL algorithms often learn stochastic policies (e.g., policy gradient methods) or deterministic policies (e.g., DDPG).
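As a rough illustration (the dictionaries and probabilities below are made up for this example, not taken from the grid world above), the two kinds of policy might be represented like this:

```python
import numpy as np

# Deterministic policy: exactly one action per state.
deterministic_policy = {(0, 0): 'right', (0, 1): 'right', (1, 1): 'down'}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {(0, 0): {'right': 0.7, 'down': 0.3}}

def sample_action(policy, state, rng=np.random.default_rng(0)):
    """Sample an action from a stochastic policy at the given state."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choice(actions, p=probs)

print(deterministic_policy[(0, 0)])              # 'right'
print(sample_action(stochastic_policy, (0, 0)))  # 'right' about 70% of the time
```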
Policy Evaluation
Given a policy $\pi$, we want to know: How good is this policy? This leads us to value functions.
Value Functions: Measuring Policy Quality
State-Value Function
The state-value function $V^\pi(s)$ is the expected cumulative discounted reward starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s\right]$$

Where:
– $\mathbb{E}_\pi$ denotes expectation under policy $\pi$
– $r_t$ is the reward at time step $t$
– $\gamma^t$ exponentially decays future rewards
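As a quick sanity check on the definition, here is a minimal sketch (with a made-up reward sequence) that computes a single discounted return $\sum_t \gamma^t r_t$, the quantity whose expectation over trajectories defines $V^\pi(s)$:

```python
# Discounted return for one hypothetical grid-world trajectory:
# three -1 step penalties followed by the +10 goal reward.
gamma = 0.9
rewards = [-1.0, -1.0, -1.0, 10.0]

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # -1 - 0.9 - 0.81 + 7.29 = 4.58
```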
Action-Value Function
The action-value function (Q-function) $Q^\pi(s, a)$ extends this concept:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s,\, a_0 = a\right]$$

$Q^\pi(s, a)$ is the expected return from state $s$, taking action $a$ first, then following $\pi$.
Relationship: $V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$ (for stochastic policies)
Why the Discount Factor?
The discount factor $\gamma \in [0, 1]$ serves three purposes:
- Mathematical convergence: Ensures infinite-horizon sums converge
- Uncertainty modeling: Future rewards are less certain, so we value them less
- Control urgency: $\gamma$ close to 0 = myopic (greedy), $\gamma$ close to 1 = far-sighted
Example: With $\gamma = 0.9$, a reward 10 steps away is worth only $0.9^{10} \approx 0.35$ of an immediate reward.
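A minimal sketch of this effect, comparing how much a reward 10 steps away is worth under different discount factors (values rounded):

```python
# Present value of a reward received 10 steps in the future.
for gamma in (0.5, 0.9, 0.99):
    print(f"gamma={gamma}: worth {gamma**10:.3f} of an immediate reward")
# gamma=0.5:  0.001  (myopic)
# gamma=0.9:  0.349
# gamma=0.99: 0.904  (far-sighted)
```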
The Bellman Equations: Recursive Relationships
Bellman Expectation Equations
The value functions satisfy recursive relationships called Bellman equations. Let’s derive the Bellman expectation equation for $V^\pi$:
Starting from the definition:

$$V^\pi(s) = \mathbb{E}_\pi\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\right]$$

Separate the immediate reward from future rewards:

$$V^\pi(s) = \mathbb{E}_\pi\left[r_0 + \gamma\left(r_1 + \gamma r_2 + \cdots\right) \mid s_0 = s\right]$$

The future part is just $V^\pi(s_1)$:

$$V^\pi(s) = \mathbb{E}_\pi\left[r_0 + \gamma V^\pi(s_1) \mid s_0 = s\right]$$

Expanding the expectation over actions and next states:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]$$

This is the Bellman expectation equation for $V^\pi$.
Similarly, for $Q^\pi$:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]$$
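To see the expectation equation in action, here is a small self-contained sketch. The two-state, two-action MDP below is made up purely for illustration (it is not the grid world) and uses the $R(s, a)$ reward form; it solves the Bellman expectation equation exactly as a linear system and verifies the recursion numerically:

```python
import numpy as np

# A made-up two-state, two-action MDP (illustration only).
gamma = 0.9
P = np.array([[[0.8, 0.2],   # P[a, s, s']: action 0
               [0.3, 0.7]],
              [[0.1, 0.9],   # action 1
               [0.5, 0.5]]])
R = np.array([[1.0, 0.0],    # R[a, s]: action 0
              [0.0, 2.0]])   # action 1
pi = np.array([[0.6, 0.4],   # pi[s, a]: probability of action a in state s
               [0.3, 0.7]])

# Policy-averaged dynamics: P_pi[s, s'] and r_pi[s].
P_pi = np.einsum('sa,asp->sp', pi, P)
r_pi = np.einsum('sa,as->s', pi, R)

# Bellman expectation equation in matrix form: V = r_pi + gamma * P_pi @ V,
# solved exactly as the linear system (I - gamma * P_pi) V = r_pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

print(V)
print(np.allclose(V, r_pi + gamma * P_pi @ V))  # True: the recursion holds
```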
Bellman Optimality Equations
The optimal value function $V^*(s)$ is the maximum value achievable from state $s$ under any policy:

$$V^*(s) = \max_{\pi} V^\pi(s)$$

The Bellman optimality equation is:

$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$

For $Q^*$:

$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$
These equations are fixed-point equations—the optimal value function is the solution that satisfies these recursive relationships.
Key insight: If we know $V^*$ or $Q^*$, we can extract the optimal policy:

$$\pi^*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$

Or more simply with $Q^*$:

$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
Dynamic Programming: Solving MDPs
When we know the full MDP model $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, we can solve for optimal policies using dynamic programming. Two classic algorithms do this: value iteration and policy iteration.
Value Iteration
Value iteration iteratively applies the Bellman optimality equation as an update rule:

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$

Start with an arbitrary $V_0$ (e.g., all zeros) and repeat until convergence.
Python Implementation
import numpy as np
# 4x4 Grid World MDP
class GridWorld:
def __init__(self):
self.size = 4
self.n_states = self.size * self.size
self.actions = ['up', 'down', 'left', 'right']
self.n_actions = len(self.actions)
self.goal_state = 15 # Bottom-right corner
self.obstacle_state = 6 # Obstacle at (1, 2)
self.gamma = 0.9
def get_next_state(self, state, action):
"""Deterministic transition function"""
row, col = state // self.size, state % self.size
if action == 'up':
row = max(0, row - 1)
elif action == 'down':
row = min(self.size - 1, row + 1)
elif action == 'left':
col = max(0, col - 1)
elif action == 'right':
col = min(self.size - 1, col + 1)
return row * self.size + col
def get_reward(self, state, action, next_state):
"""Reward function"""
if next_state == self.goal_state:
return 10.0
elif next_state == self.obstacle_state:
return -10.0
else:
return -1.0 # Step penalty
def value_iteration(env, theta=1e-6, max_iterations=1000):
"""Value iteration algorithm"""
V = np.zeros(env.n_states)
for iteration in range(max_iterations):
delta = 0
V_old = V.copy()
for s in range(env.n_states):
if s == env.goal_state: # Terminal state
V[s] = 0
continue
# Compute max over actions
action_values = []
for action in env.actions:
s_prime = env.get_next_state(s, action)
reward = env.get_reward(s, action, s_prime)
action_values.append(reward + env.gamma * V_old[s_prime])
V[s] = max(action_values)
delta = max(delta, abs(V[s] - V_old[s]))
if delta < theta:
print(f"Converged in {iteration + 1} iterations")
break
# Extract optimal policy
policy = np.zeros(env.n_states, dtype=int)
for s in range(env.n_states):
if s == env.goal_state:
continue
action_values = []
for action_idx, action in enumerate(env.actions):
s_prime = env.get_next_state(s, action)
reward = env.get_reward(s, action, s_prime)
action_values.append(reward + env.gamma * V[s_prime])
policy[s] = np.argmax(action_values)
return V, policy
# Run value iteration
env = GridWorld()
V_star, pi_star = value_iteration(env)
print("\nOptimal Value Function:")
print(V_star.reshape(4, 4))
print("\nOptimal Policy (0=up, 1=down, 2=left, 3=right):")
print(pi_star.reshape(4, 4))
Output interpretation: The value function shows expected cumulative rewards from each state. States closer to the goal have higher values. The policy shows the optimal action at each state.
Policy Iteration
Policy iteration alternates between two steps:
- Policy Evaluation: Compute $V^\pi$ for the current policy $\pi$ by solving the Bellman expectation equation
- Policy Improvement: Update the policy greedily: $\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]$
Repeat until policy converges.
def policy_evaluation(env, policy, theta=1e-6):
"""Evaluate a given policy"""
V = np.zeros(env.n_states)
while True:
delta = 0
V_old = V.copy()
for s in range(env.n_states):
if s == env.goal_state:
V[s] = 0
continue
action = env.actions[policy[s]]
s_prime = env.get_next_state(s, action)
reward = env.get_reward(s, action, s_prime)
V[s] = reward + env.gamma * V_old[s_prime]
delta = max(delta, abs(V[s] - V_old[s]))
if delta < theta:
break
return V
def policy_iteration(env, max_iterations=100):
"""Policy iteration algorithm"""
# Start with random policy
policy = np.random.randint(0, env.n_actions, env.n_states)
for iteration in range(max_iterations):
# Policy Evaluation
V = policy_evaluation(env, policy)
# Policy Improvement
policy_stable = True
for s in range(env.n_states):
if s == env.goal_state:
continue
old_action = policy[s]
# Greedy policy update
action_values = []
for action in env.actions:
s_prime = env.get_next_state(s, action)
reward = env.get_reward(s, action, s_prime)
action_values.append(reward + env.gamma * V[s_prime])
policy[s] = np.argmax(action_values)
if old_action != policy[s]:
policy_stable = False
if policy_stable:
print(f"Policy converged in {iteration + 1} iterations")
break
return V, policy
# Run policy iteration
V_pi, pi_pi = policy_iteration(env)
print("\nPolicy Iteration - Optimal Value Function:")
print(V_pi.reshape(4, 4))
Comparison: Policy iteration often converges in fewer iterations than value iteration because it explicitly maintains and improves a policy. Value iteration implicitly improves the policy through value updates.
Example: Simple Trading MDP
Let’s model a simplified trading scenario:
- States: {Cash, Stock} (binary: holding cash or holding stock)
- Actions: {Hold, Buy, Sell}
- Transitions: Stock price moves up or down with fixed probabilities (60% / 40% in the code below)
- Rewards: Profit/loss from trading
class TradingMDP:
def __init__(self):
self.states = ['cash', 'stock']
self.actions = ['hold', 'buy', 'sell']
self.gamma = 0.95
# Price change probabilities
self.p_up = 0.6 # Probability price goes up
self.p_down = 0.4
# Price change magnitude
self.price_change = 0.02 # 2% change
def transition_prob(self, s, a, s_prime):
"""Return P(s' | s, a)"""
if a == 'hold':
return 1.0 if s == s_prime else 0.0
elif a == 'buy' and s == 'cash':
return 1.0 if s_prime == 'stock' else 0.0
elif a == 'sell' and s == 'stock':
return 1.0 if s_prime == 'cash' else 0.0
else:
return 0.0 # Invalid action
def reward(self, s, a):
"""Expected reward for action a in state s"""
if s == 'stock' and a == 'hold':
# Expected return from price movement
return self.p_up * self.price_change - self.p_down * self.price_change
elif s == 'stock' and a == 'sell':
# Lock in current value
return 0.0
elif s == 'cash' and a == 'buy':
# Transaction cost
return -0.001
else:
return 0.0
# Instantiate the environment; a value-iteration sketch for this MDP follows below
trading_env = TradingMDP()
print("\nSimple Trading MDP - demonstrates MDP modeling in finance")
This example illustrates how real-world decision problems (trading, resource allocation) can be formalized as MDPs.
Why MDPs Matter for Deep RL
You might wonder: “Modern RL uses neural networks, not dynamic programming. Why learn MDPs?”
The Connection
Every deep RL algorithm is solving an MDP:
- DQN (Deep Q-Network): Approximates $Q^*(s, a)$ with a neural network (sketched below)
- Policy Gradient methods (PPO, A3C): Directly optimize a parameterized policy $\pi_\theta(a \mid s)$ to maximize expected return
- Actor-Critic (SAC, TD3): Learns both a policy (actor) and a value function (critic)
The difference: deep RL handles unknown MDPs (we don’t know $P$ or $R$) and large/continuous state spaces where tabular methods fail.
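For instance, the tabular Q-values from this post become the outputs of a small neural network. Here is a minimal PyTorch sketch (PyTorch and the layer sizes are my assumptions, not something defined in this post) of a DQN-style Q-network that maps a state vector to one Q-value per action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(s, .) approximator: state vector in, one Q-value per action out."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a 2-D state (e.g., normalized grid coordinates) and 4 actions.
q_net = QNetwork(state_dim=2, n_actions=4)
state = torch.tensor([[0.0, 0.0]])           # batch of one state
print(q_net(state))                          # shape (1, 4): one Q-value per action
greedy_action = q_net(state).argmax(dim=1)   # argmax_a Q(s, a)
```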
Key Takeaways for Deep RL
- Bellman equations are everywhere: Temporal difference (TD) learning, Q-learning, and actor-critic methods all use Bellman-style updates (see the sketch after this list)
- Value functions are universal: Whether tabular or neural, $V(s)$ and $Q(s, a)$ measure policy quality
- Policy optimization: The goal is always to find a policy $\pi^*$ that maximizes expected return
- Discount factor matters: The hyperparameter $\gamma$ critically affects learning dynamics
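As a preview (Q-learning proper is Episode 3 material; this is a sketch of a single update, not a full agent), the tabular Q-learning rule is a sampled, incremental version of the Bellman optimality equation:

```python
import numpy as np

# One tabular Q-learning update: Q(s,a) <- Q(s,a) + alpha * (TD target - Q(s,a)).
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * np.max(Q[s_next])   # Bellman optimality target
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target
    return Q

Q = np.zeros((16, 4))   # 16 grid states x 4 actions
Q = q_learning_update(Q, s=0, a=3, r=-1.0, s_next=1)
print(Q[0, 3])          # -0.1 after one update
```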
In upcoming episodes, we’ll see how these MDP concepts translate to custom Gym environments (Episode 2) and deep RL algorithms (Episodes 3-4).
Practical Considerations
When is the MDP Framework Applicable?
MDPs require:
– Markov property: Current state contains all necessary information
– Stationary dynamics: Transition probabilities don’t change over time
– Discrete time steps: Decisions occur at fixed intervals
Many real problems violate these assumptions (partial observability, non-stationarity, continuous time), but MDP solutions still provide strong baselines.
Curse of Dimensionality
Dynamic programming becomes intractable for large state spaces:
– Grid world: 16 states ✓
– Chess: ~10^43 states ✗
– Robotic control: continuous state space ✗
This is why deep RL uses function approximation (neural networks) instead of tabular methods.
Exploration vs Exploitation
The dynamic programming methods above assume we know the model $(P, R)$. In practice, we must explore to learn the environment while exploiting current knowledge. This tradeoff becomes central in model-free RL (covered in Episode 3).
Conclusion
Markov Decision Processes are the mathematical bedrock of reinforcement learning. We’ve covered the formal framework (states, actions, transitions, rewards, discount factor), derived the Bellman equations from first principles, and implemented value iteration and policy iteration in Python.
Key concepts to remember:
– Value functions $V^\pi(s)$ and $Q^\pi(s, a)$ quantify policy quality
– Bellman equations provide recursive relationships that enable efficient computation
– Dynamic programming solves MDPs exactly when the model is known
– The optimal policy $\pi^*$ can be extracted from the optimal value functions
These foundations will carry through the entire series. In the next episode, we’ll move from theory to practice by building custom Gym environments where our agents will learn through interaction—no model required. We’ll see how the abstract MDP notation translates into concrete Python classes with step() and reset() methods.
Understanding MDPs deeply will make every deep RL algorithm you encounter—from DQN to PPO—feel less like magic and more like principled optimization. The mathematics we’ve covered today isn’t just academic; it’s the blueprint that guides how neural networks learn to play games, control robots, and optimize complex systems.