Part 1: The Core of RL: Markov Decision Processes (MDP) Explained

Updated Feb 6, 2026

Introduction

Reinforcement Learning (RL) has revolutionized fields ranging from robotics to game playing, from autonomous driving to financial trading. At the heart of every RL algorithm—whether it’s DQN, PPO, or SAC—lies a mathematical framework called the Markov Decision Process (MDP). Understanding MDPs is not optional; it’s the foundation upon which all modern deep RL is built.

In this first episode of our series, we’ll demystify the MDP framework from the ground up. We’ll cover states, actions, transition probabilities, rewards, and the discount factor. We’ll derive the famous Bellman equations step-by-step, explore value functions and policies, and implement classic dynamic programming algorithms in Python. By the end, you’ll understand why MDPs matter and how they connect to the deep RL methods we’ll explore in later episodes.

What is a Markov Decision Process?

The Formal Framework

An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

  • $\mathcal{S}$: State space (set of all possible states)
  • $\mathcal{A}$: Action space (set of all possible actions)
  • $P$: Transition probability function $P(s' | s, a)$, the probability of reaching state $s'$ from state $s$ after taking action $a$
  • $R$: Reward function $R(s, a, s')$ or $R(s, a)$, the immediate reward received
  • $\gamma$: Discount factor $\gamma \in [0, 1]$, controlling the importance of future rewards

The Markov Property

The “Markov” in MDP refers to the Markov property: the future depends only on the current state, not on the history of past states. Formally:

$$P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} | s_t, a_t)$$

This memoryless property is crucial—it allows us to reason about optimal decisions using only the current state, dramatically simplifying the mathematics and computation.

Example: Grid World

Consider a 4×4 grid where an agent starts at the top-left corner and tries to reach a goal at the bottom-right corner:

S · · ·
· · X ·
· · · ·
· · · G
  • States: Each cell position (16 total states)
  • Actions: {Up, Down, Left, Right}
  • Transitions: Deterministic movement (or stochastic with 80% success rate)
  • Rewards: -1 per step, +10 at goal, -10 at obstacle (X)
  • Discount: $\gamma = 0.9$

This simple environment encapsulates all MDP components and will serve as our running example.
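
To make the tuple concrete, here is a minimal sketch (separate from the implementation used later in this post) that encodes this grid world as explicit numpy tensors P[s, a, s'] and R[s, a, s']; the state and action indexing is an assumption chosen for illustration.

import numpy as np

SIZE = 4
N_S, N_A = SIZE * SIZE, 4                      # 16 states, 4 actions
GOAL, OBSTACLE = 15, 6                         # bottom-right goal, obstacle at (1, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

P = np.zeros((N_S, N_A, N_S))                  # P[s, a, s'] = probability of s' given (s, a)
R = np.full((N_S, N_A, N_S), -1.0)             # step penalty by default

for s in range(N_S):
    row, col = divmod(s, SIZE)
    for a, (dr, dc) in enumerate(ACTIONS):
        nr = min(max(row + dr, 0), SIZE - 1)   # moving off the grid keeps the agent in place
        nc = min(max(col + dc, 0), SIZE - 1)
        s_next = nr * SIZE + nc
        P[s, a, s_next] = 1.0                  # deterministic dynamics
        if s_next == GOAL:
            R[s, a, s_next] = 10.0
        elif s_next == OBSTACLE:
            R[s, a, s_next] = -10.0

assert np.allclose(P.sum(axis=2), 1.0)         # each (s, a) row is a valid distribution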

Policies: The Agent’s Strategy

Deterministic vs Stochastic Policies

A policy $\pi$ defines the agent’s behavior: how it chooses actions given states.

Deterministic policy: $\pi(s) = a$ (one action per state)

Stochastic policy: $\pi(a|s)$ (a probability distribution over actions)

For example:
– Deterministic: “Always go right in state (0,0)”
– Stochastic: “Go right with 70% probability, down with 30% in state (0,0)”

Deep RL algorithms often learn stochastic policies (e.g., policy gradient methods) or deterministic policies (e.g., DDPG).
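
As a quick illustration, and nothing more than a sketch with made-up numbers, both kinds of policies for the grid world fit in plain numpy arrays: a deterministic policy is one action index per state, a stochastic policy is a row of probabilities we can sample from (action order assumed to be up, down, left, right).

import numpy as np

n_states, n_actions = 16, 4

# Deterministic policy: one action index per state (here: always "right" = 3)
det_policy = np.full(n_states, 3, dtype=int)

# Stochastic policy: a probability distribution over actions for every state
stoch_policy = np.full((n_states, n_actions), 1.0 / n_actions)   # start uniform
stoch_policy[0] = [0.0, 0.3, 0.0, 0.7]                           # state (0,0): right 70%, down 30%

rng = np.random.default_rng(0)
a_det = det_policy[0]                                 # deterministic choice
a_stoch = rng.choice(n_actions, p=stoch_policy[0])    # sampled choice
print(a_det, a_stoch)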

Policy Evaluation

Given a policy $\pi$, we want to know: how good is this policy? This leads us to value functions.

Value Functions: Measuring Policy Quality

State-Value Function $V^\pi(s)$

The state-value function $V^\pi(s)$ is the expected cumulative discounted reward starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s \right]$$

Where:
  • $\mathbb{E}_\pi$ denotes expectation under policy $\pi$
  • $R_{t+1}$ is the reward at time step $t+1$
  • $\gamma^t$ exponentially decays future rewards

Action-Value Function $Q^\pi(s, a)$

The action-value function (Q-function) extends this concept:

$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, A_0 = a \right]$$

$Q^\pi(s, a)$ is the expected return starting from state $s$, taking action $a$ first, and then following $\pi$.

Relationship: $V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a)$ (for stochastic policies)
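
To see this relationship numerically, here is a tiny sketch with hypothetical Q-values and a hypothetical stochastic policy for a single state:

import numpy as np

# Hypothetical Q-values for one state and the four grid-world actions (up, down, left, right)
q_s = np.array([1.0, 2.5, 0.5, 3.0])
pi_s = np.array([0.1, 0.3, 0.1, 0.5])      # a stochastic policy pi(a|s)

v_s = np.dot(pi_s, q_s)                    # V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
print(v_s)                                 # 0.1*1.0 + 0.3*2.5 + 0.1*0.5 + 0.5*3.0 = 2.4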

Why the Discount Factor?

The discount factor $\gamma$ serves three purposes:

  1. Mathematical convergence: Ensures infinite-horizon sums converge
  2. Uncertainty modeling: Future rewards are less certain, so we value them less
  3. Control urgency: $\gamma$ close to 0 = myopic (greedy), close to 1 = far-sighted

Example: With $\gamma = 0.9$, a reward 10 steps away is worth only $0.9^{10} \approx 0.35$ of an immediate reward.
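
This is easy to verify numerically; a small sketch:

gamma = 0.9

# A reward of 10 received k steps in the future is worth gamma**k * 10 today
for k in (0, 1, 5, 10, 50):
    print(f"k={k:2d}  discounted value = {gamma**k * 10:.3f}")

# Discounted return of an arbitrary reward sequence
rewards = [-1, -1, -1, 10]
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(f"return = {G:.3f}")   # -1 - 0.9 - 0.81 + 7.29 = 4.58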

The Bellman Equations: Recursive Relationships

Bellman Expectation Equations

The value functions satisfy recursive relationships called Bellman equations. Let’s derive the Bellman expectation equation for $V^\pi(s)$:

Starting from the definition:

$$V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s \right]$$

Separate the immediate reward from future rewards:

$$V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} \mid S_t = s \right] + \gamma \mathbb{E}_\pi \left[ R_{t+2} + \gamma R_{t+3} + \ldots \mid S_t = s \right]$$

The future part is just $V^\pi(S_{t+1})$:

$$V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s \right]$$

Expanding the expectation over actions and next states:

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$

This is the Bellman expectation equation for $V^\pi$.

Similarly, for $Q^\pi(s,a)$:

$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right]$$
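
Because the Bellman expectation equation is linear in $V^\pi$ for a fixed policy, small MDPs can be evaluated exactly with a linear solve instead of iteration. The sketch below uses a randomly generated synthetic MDP (not the grid world) purely to illustrate the matrix form $V = (I - \gamma P_\pi)^{-1} R_\pi$:

import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9

# Synthetic model: P[s, a, s'], R[s, a], and a uniform random policy pi[s, a]
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_s, n_a))
pi = np.full((n_s, n_a), 1.0 / n_a)

# Policy-averaged dynamics and rewards: P_pi[s, s'] and R_pi[s]
P_pi = np.einsum('sa,sax->sx', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)

# Bellman expectation in matrix form: V = R_pi + gamma * P_pi @ V  =>  (I - gamma * P_pi) V = R_pi
V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
print(V)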

Bellman Optimality Equations

The optimal value function $V^*(s)$ is the maximum value achievable from state $s$ under any policy:

$$V^*(s) = \max_\pi V^\pi(s)$$

The Bellman optimality equation is:

$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$

For $Q^*(s,a)$:

$$Q^*(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]$$

These equations are fixed-point equations—the optimal value function is the solution that satisfies these recursive relationships.

Key insight: If we know $V^*$ or $Q^*$, we can extract the optimal policy:

$$\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$

Or more simply with $Q^*$:

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$
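
In code, this extraction is a single argmax per state; a sketch with a hypothetical Q*-table:

import numpy as np

# Hypothetical Q*-table: one row per state, one column per action (up, down, left, right)
Q_star = np.array([[0.2, 1.5, 0.1, 2.0],
                   [0.7, 2.2, 0.3, 1.1],
                   [1.0, 0.4, 0.9, 3.3]])

pi_star = np.argmax(Q_star, axis=1)   # greedy action index for every state
print(pi_star)                        # [3 1 3]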

Dynamic Programming: Solving MDPs

When we know the full MDP model $(P, R)$, we can solve for optimal policies using dynamic programming. Two classic algorithms: value iteration and policy iteration.

Value Iteration

Value iteration iteratively applies the Bellman optimality equation as an update rule:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V_k(s') \right]$$

Start with an arbitrary $V_0$ (e.g., all zeros) and repeat until convergence.

Python Implementation

import numpy as np

# 4x4 Grid World MDP
class GridWorld:
    def __init__(self):
        self.size = 4
        self.n_states = self.size * self.size
        self.actions = ['up', 'down', 'left', 'right']
        self.n_actions = len(self.actions)
        self.goal_state = 15  # Bottom-right corner
        self.obstacle_state = 6  # Obstacle at (1, 2)
        self.gamma = 0.9

    def get_next_state(self, state, action):
        """Deterministic transition function"""
        row, col = state // self.size, state % self.size

        if action == 'up':
            row = max(0, row - 1)
        elif action == 'down':
            row = min(self.size - 1, row + 1)
        elif action == 'left':
            col = max(0, col - 1)
        elif action == 'right':
            col = min(self.size - 1, col + 1)

        return row * self.size + col

    def get_reward(self, state, action, next_state):
        """Reward function"""
        if next_state == self.goal_state:
            return 10.0
        elif next_state == self.obstacle_state:
            return -10.0
        else:
            return -1.0  # Step penalty

def value_iteration(env, theta=1e-6, max_iterations=1000):
    """Value iteration algorithm"""
    V = np.zeros(env.n_states)

    for iteration in range(max_iterations):
        delta = 0
        V_old = V.copy()

        for s in range(env.n_states):
            if s == env.goal_state:  # Terminal state
                V[s] = 0
                continue

            # Compute max over actions
            action_values = []
            for action in env.actions:
                s_prime = env.get_next_state(s, action)
                reward = env.get_reward(s, action, s_prime)
                action_values.append(reward + env.gamma * V_old[s_prime])

            V[s] = max(action_values)
            delta = max(delta, abs(V[s] - V_old[s]))

        if delta < theta:
            print(f"Converged in {iteration + 1} iterations")
            break

    # Extract optimal policy
    policy = np.zeros(env.n_states, dtype=int)
    for s in range(env.n_states):
        if s == env.goal_state:
            continue

        action_values = []
        for action_idx, action in enumerate(env.actions):
            s_prime = env.get_next_state(s, action)
            reward = env.get_reward(s, action, s_prime)
            action_values.append(reward + env.gamma * V[s_prime])

        policy[s] = np.argmax(action_values)

    return V, policy

# Run value iteration
env = GridWorld()
V_star, pi_star = value_iteration(env)

print("\nOptimal Value Function:")
print(V_star.reshape(4, 4))

print("\nOptimal Policy (0=up, 1=down, 2=left, 3=right):")
print(pi_star.reshape(4, 4))

Output interpretation: The value function shows expected cumulative rewards from each state. States closer to the goal have higher values. The policy shows the optimal action at each state.
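
If you prefer a more readable printout, the policy indices can be rendered as arrows; this optional snippet assumes the env and pi_star objects from the value-iteration run above:

# Optional: render the policy with arrows instead of action indices
arrows = ['↑', '↓', '←', '→']                       # same order as env.actions
grid = [arrows[a] for a in pi_star]
grid[env.goal_state] = 'G'                          # mark the terminal goal cell
for row in range(env.size):
    print(' '.join(grid[row * env.size:(row + 1) * env.size]))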

Policy Iteration

Policy iteration alternates between two steps:

  1. Policy Evaluation: Compute $V^\pi$ for the current policy $\pi$ by solving the Bellman expectation equation
  2. Policy Improvement: Update the policy greedily: $\pi'(s) = \arg\max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$

Repeat until policy converges.

def policy_evaluation(env, policy, theta=1e-6):
    """Evaluate a given policy"""
    V = np.zeros(env.n_states)

    while True:
        delta = 0
        V_old = V.copy()

        for s in range(env.n_states):
            if s == env.goal_state:
                V[s] = 0
                continue

            action = env.actions[policy[s]]
            s_prime = env.get_next_state(s, action)
            reward = env.get_reward(s, action, s_prime)
            V[s] = reward + env.gamma * V_old[s_prime]

            delta = max(delta, abs(V[s] - V_old[s]))

        if delta < theta:
            break

    return V

def policy_iteration(env, max_iterations=100):
    """Policy iteration algorithm"""
    # Start with random policy
    policy = np.random.randint(0, env.n_actions, env.n_states)

    for iteration in range(max_iterations):
        # Policy Evaluation
        V = policy_evaluation(env, policy)

        # Policy Improvement
        policy_stable = True
        for s in range(env.n_states):
            if s == env.goal_state:
                continue

            old_action = policy[s]

            # Greedy policy update
            action_values = []
            for action in env.actions:
                s_prime = env.get_next_state(s, action)
                reward = env.get_reward(s, action, s_prime)
                action_values.append(reward + env.gamma * V[s_prime])

            policy[s] = np.argmax(action_values)

            if old_action != policy[s]:
                policy_stable = False

        if policy_stable:
            print(f"Policy converged in {iteration + 1} iterations")
            break

    return V, policy

# Run policy iteration
V_pi, pi_pi = policy_iteration(env)
print("\nPolicy Iteration - Optimal Value Function:")
print(V_pi.reshape(4, 4))

Comparison: Policy iteration often converges in fewer iterations than value iteration because it explicitly maintains and improves a policy. Value iteration implicitly improves the policy through value updates.

Example: Simple Trading MDP

Let’s model a simplified trading scenario:

  • States: {Cash, Stock} (binary: holding cash or holding stock)
  • Actions: {Hold, Buy, Sell}
  • Transitions: Stock price moves up/down with probabilities
  • Rewards: Profit/loss from trading

class TradingMDP:
    def __init__(self):
        self.states = ['cash', 'stock']
        self.actions = ['hold', 'buy', 'sell']
        self.gamma = 0.95

        # Price change probabilities
        self.p_up = 0.6  # Probability price goes up
        self.p_down = 0.4

        # Price change magnitude
        self.price_change = 0.02  # 2% change

    def transition_prob(self, s, a, s_prime):
        """Return P(s' | s, a)"""
        if a == 'hold':
            return 1.0 if s == s_prime else 0.0
        elif a == 'buy' and s == 'cash':
            return 1.0 if s_prime == 'stock' else 0.0
        elif a == 'sell' and s == 'stock':
            return 1.0 if s_prime == 'cash' else 0.0
        else:
            return 0.0  # Invalid action

    def reward(self, s, a):
        """Expected reward for action a in state s"""
        if s == 'stock' and a == 'hold':
            # Expected return from price movement
            return self.p_up * self.price_change - self.p_down * self.price_change
        elif s == 'stock' and a == 'sell':
            # Lock in current value
            return 0.0
        elif s == 'cash' and a == 'buy':
            # Transaction cost
            return -0.001
        else:
            return 0.0

# Build the trading MDP (a small value-iteration solver for it is sketched below)
trading_env = TradingMDP()
print("\nSimple Trading MDP - demonstrates MDP modeling in finance")

This example illustrates how real-world decision problems (trading, resource allocation) can be formalized as MDPs.

Why MDPs Matter for Deep RL

You might wonder: “Modern RL uses neural networks, not dynamic programming. Why learn MDPs?”

The Connection

Every deep RL algorithm is solving an MDP:

  • DQN (Deep Q-Network): Approximates $Q^*(s,a)$ with a neural network
  • Policy Gradient methods (PPO, A3C): Directly optimize $\pi(a|s)$ to maximize $V^\pi(s)$
  • Actor-Critic (SAC, TD3): Learns both a policy and a value function

The difference: deep RL handles unknown MDPs (we don’t know $P$ or $R$) and large/continuous state spaces where tabular methods fail.

Key Takeaways for Deep RL

  1. Bellman equations are everywhere: Temporal difference (TD) learning, Q-learning, and actor-critic methods all use Bellman-style updates (see the sketch after this list)
  2. Value functions are universal: Whether tabular or neural, $V$ and $Q$ measure policy quality
  3. Policy optimization: The goal is always to find $\pi^*$ that maximizes expected return
  4. Discount factor matters: The hyperparameter $\gamma$ critically affects learning dynamics
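
As a preview of that connection, and only a sketch rather than the full algorithm (Q-learning is covered properly in Episode 3), the tabular Q-learning update is a sampled, incremental form of the Bellman optimality backup:

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q[s, a] toward the sampled Bellman target."""
    td_target = r + gamma * np.max(Q[s_next])     # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

Q = np.zeros((16, 4))                             # grid-world sized table
Q = q_learning_update(Q, s=0, a=3, r=-1.0, s_next=1)
print(Q[0, 3])                                    # -0.1 after one update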

In upcoming episodes, we’ll see how these MDP concepts translate to custom Gym environments (Episode 2) and deep RL algorithms (Episodes 3-4).

Practical Considerations

When is the MDP Framework Applicable?

MDPs require:

  • Markov property: The current state contains all necessary information
  • Stationary dynamics: Transition probabilities don’t change over time
  • Discrete time steps: Decisions occur at fixed intervals

Many real problems violate these assumptions (partial observability, non-stationarity, continuous time), but MDP solutions still provide strong baselines.

Curse of Dimensionality

Dynamic programming becomes intractable for large state spaces:
– Grid world: 16 states ✓
– Chess: ~10^43 states ✗
– Robotic control: continuous state space ✗

This is why deep RL uses function approximation (neural networks) instead of tabular methods.

Exploration vs Exploitation

MDPs assume we know the model $(P, R)$. In practice, we must explore to learn the environment while exploiting current knowledge. This tradeoff becomes central in model-free RL (covered in Episode 3).
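
A common heuristic for balancing the two is ε-greedy action selection; a minimal sketch with a hypothetical Q-table row:

import numpy as np

def epsilon_greedy(Q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))    # explore
    return int(np.argmax(Q_row))                # exploit the current estimate

rng = np.random.default_rng(0)
Q_row = np.array([0.2, 1.5, 0.1, 2.0])          # hypothetical Q-values for one state
actions = [epsilon_greedy(Q_row, epsilon=0.1, rng=rng) for _ in range(5)]
print(actions)                                  # mostly 3 (greedy), occasionally random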

Conclusion

Markov Decision Processes are the mathematical bedrock of reinforcement learning. We’ve covered the formal framework (states, actions, transitions, rewards, discount factor), derived the Bellman equations from first principles, and implemented value iteration and policy iteration in Python.

Key concepts to remember:

  • Value functions $V^\pi(s)$ and $Q^\pi(s,a)$ quantify policy quality
  • Bellman equations provide recursive relationships that enable efficient computation
  • Dynamic programming solves MDPs exactly when the model is known
  • The optimal policy can be extracted from the optimal value functions

These foundations will carry through the entire series. In the next episode, we’ll move from theory to practice by building custom Gym environments where our agents will learn through interaction—no model required. We’ll see how the abstract MDP notation translates into concrete Python classes with step() and reset() methods.

Understanding MDPs deeply will make every deep RL algorithm you encounter—from DQN to PPO—feel less like magic and more like principled optimization. The mathematics we’ve covered today isn’t just academic; it’s the blueprint that guides how neural networks learn to play games, control robots, and optimize complex systems.


Deep Reinforcement Learning: From Theory to Custom Environments Series (1/6)
