The $10K Mistake: Why Classic Scheduling Heuristics Don’t Scale
Most production schedulers still use heuristics like Shortest Processing Time (SPT) or Earliest Due Date (EDD). They’re fast, predictable, and completely oblivious to downstream bottlenecks. I watched a factory floor run SPT for six months, hitting on-time delivery targets but racking up $10K/month in overtime because the schedule kept starving the bottleneck station. The heuristic optimized local cycle time without understanding the system-level constraint.
Reinforcement Learning (RL) fixes this by learning a policy that maximizes cumulative reward — makespan, tardiness penalty, machine utilization, all rolled into one objective. But here’s the thing: most RL papers test on toy job-shop problems with 10 machines and perfect state observability. Real factories have 50+ stations, stochastic breakdowns, operator shift changes, and partial information. The gap between academic benchmarks and production reality is enormous.
This post walks through what actually worked when we deployed RL scheduling on a semiconductor packaging line. Spoiler: DQN was a disaster, PPO barely worked, and the final solution involved a hybrid approach that still uses heuristics 30% of the time.

Problem Setup: Job Shop Scheduling as an MDP
The classic flexible job shop scheduling problem (FJSP): a set of jobs and a set of machines, where each job has a sequence of operations with machine eligibility and processing times. The goal is to assign operations to machines and sequence them to minimize some objective — usually makespan $C_{\max} = \max_j C_j$, where $C_j$ is the completion time of job $j$.
Formulating this as a Markov Decision Process:
– State $s_t$: current machine status (idle/busy), job queue, operation dependencies, time elapsed. In practice, we encoded this as a graph: nodes = operations, edges = precedence constraints, node features = $(p_{ij}, \text{remaining\_ops}, \text{due\_date})$, where $p_{ij}$ is the processing time of operation $i$ on machine $j$. Flattening this into a fixed-size vector was a nightmare — we ended up using a Graph Neural Network (GNN) encoder, which I’ll skip here because it’s orthogonal to the RL part.
– Action $a_t$: select a (job, machine) pair from the feasible set. The action space size is $\sum_j |M_j|$, where $M_j$ is the set of eligible machines for job $j$’s next operation. This is variable-sized and sparse — a big problem for standard RL algorithms.
– Reward $r_t$: we used a shaped reward combining immediate and terminal components:
$r_t = -\alpha\,\Delta t \;-\; \beta\,\text{tard}_t \;+\; \gamma\,u_t$
where $\Delta t$ is the time step increment (penalizes makespan), $\text{tard}_t$ is a tardiness penalty, and $u_t$ is machine utilization (bonus for keeping machines busy). The terminal reward at episode end was $R_T = -\sum_j \max(0,\, C_j - d_j)$, where $d_j$ is the due date of job $j$. Tuning the $\alpha$, $\beta$, $\gamma$ coefficients took two weeks of trial and error.
– Transition: deterministic in the toy version, stochastic in reality (machine breakdowns follow a Weibull time-to-failure distribution, and operator no-shows happen on ~5% of shifts).
The episode terminates when all jobs are completed. Horizon length varied from 200 to 800 time steps depending on job mix.
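Before getting to the algorithms, here’s a stripped-down sketch of how that state and action-mask encoding might look as plain arrays. The field names and the tiny 3-job, 2-machine instance are illustrative — this is not the production JobShopEnv, just the shape of the data the policy consumes.

import numpy as np

# Illustrative encoding of the state and action mask for a tiny 3-job, 2-machine
# instance. Field names are placeholders, not the production JobShopEnv schema.
NUM_JOBS, NUM_MACHINES = 3, 2

def build_observation(machine_busy, next_op_proc_time, remaining_ops, slack):
    # Per-machine status plus per-job scalar features, as the policy would see them.
    return {
        "machine_status": np.asarray(machine_busy, dtype=np.int8),   # (M,) 0=idle, 1=busy
        "job_features": np.stack(
            [next_op_proc_time, remaining_ops, slack], axis=1
        ).astype(np.float32),                                         # (J, 3)
    }

def build_action_mask(eligible):
    # eligible[j] = machines that can run job j's next operation *right now*.
    # Actions are the flattened (job, machine) grid; infeasible pairs stay 0.
    mask = np.zeros(NUM_JOBS * NUM_MACHINES, dtype=np.int8)
    for job, machines in eligible.items():
        for m in machines:
            mask[job * NUM_MACHINES + m] = 1
    return mask

obs = build_observation(
    machine_busy=[0, 1],                        # machine 1 is mid-operation
    next_op_proc_time=[4.0, 2.5, 6.0],
    remaining_ops=[3, 1, 2],
    slack=[12.0, 3.0, -1.0],                    # negative slack = already late
)
mask = build_action_mask({0: {0}, 1: {0}})      # only machine 0 is free; job 2 not ready
print(mask.reshape(NUM_JOBS, NUM_MACHINES))     # rows = jobs, cols = machines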
Attempt 1: DQN with Naive Action Masking
We started with Deep Q-Network (DQN) because it’s the RL equivalent of “hello world.” The Q-function $Q_\theta(s, a)$ estimates the expected return from taking action $a$ in state $s$, and the policy greedily picks $a^* = \arg\max_a Q_\theta(s, a)$. Training minimizes the Bellman error:
$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s')}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\right)^2\right]$
where $Q_{\theta^-}$ is a target network updated every $N$ steps.
The problem: action spaces are variable-sized. Machine 3 might have 5 eligible jobs at time $t$ and 0 at time $t+1$. We handled this with action masking — the Q-network outputs a fixed-size vector of size $|J| \times |M|$ (one entry per (job, machine) pair), and we mask out invalid pairs by setting $Q(s, a) = -\infty$ before the argmax. This is a standard trick and it works… in theory.
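For concreteness, here’s a minimal sketch of that masking step, assuming the Q-network already returns a flat vector over the (job, machine) grid (NumPy stands in for the actual network output):

import numpy as np

def masked_greedy_action(q_values, action_mask):
    # q_values: raw Q-network outputs over the flattened (job, machine) grid.
    # action_mask: 1 for currently feasible pairs, 0 otherwise.
    masked_q = np.where(action_mask.astype(bool), q_values, -np.inf)
    return int(np.argmax(masked_q))

# Action 2 has the highest raw Q-value but is infeasible, so action 0 is chosen.
q = np.array([1.2, -0.3, 5.0, 0.4])
mask = np.array([1, 1, 0, 0])
print(masked_greedy_action(q, mask))   # -> 0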
In practice, DQN failed spectacularly. After 100K episodes, the policy learned to idle machines indefinitely. Why? The Q-values for all valid actions collapsed to nearly identical values, so the argmax was effectively random. My best guess is that the sparse reward signal ($r_t \approx 0$ at most steps, a big penalty only at the end) combined with the high-dimensional action space created a credit assignment nightmare. The agent couldn’t figure out which early decisions led to the bad terminal outcome.
We tried prioritized experience replay, dueling architecture, and multi-step returns. None helped. DQN is just a poor fit for combinatorial action spaces with delayed rewards.
Attempt 2: PPO with Action Masking and Curriculum Learning
Proximal Policy Optimization (PPO) is an on-policy actor-critic method. The actor $\pi_\theta(a \mid s)$ outputs a probability distribution over actions, and the critic estimates the value function $V_\phi(s)$. The clipped surrogate objective is:
$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right], \quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$
where $\hat{A}_t$ is the Generalized Advantage Estimate (GAE) with $\lambda = 0.95$ and $\epsilon = 0.2$ is the clip range (the same values as gae_lambda and clip_range in the training config below). The clipping prevents the policy from changing too drastically in one update, which stabilizes training.
PPO handles variable action spaces more gracefully than DQN because the actor outputs a softmax over valid actions (after masking), not a Q-value for every possible action. We used a two-headed network: the GNN encoder produced a 256-dim embedding, which fed into both the actor (outputs logits for each action) and critic (outputs a scalar value).
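Here’s a rough PyTorch sketch of that masked two-headed policy. The GNN encoder is elided (a random tensor stands in for its output), and the class name and dimensions are illustrative rather than our exact architecture:

import torch
import torch.nn as nn

class MaskedActorCritic(nn.Module):
    # Two heads on top of a shared state embedding (the GNN encoder's output,
    # replaced by a plain tensor below). Dimensions are illustrative.
    def __init__(self, embed_dim=256, num_actions=360):   # e.g. 30 jobs x 12 machines
        super().__init__()
        self.actor = nn.Linear(embed_dim, num_actions)    # logits per (job, machine) pair
        self.critic = nn.Linear(embed_dim, 1)             # scalar state value

    def forward(self, embedding, action_mask):
        logits = self.actor(embedding)
        # Invalid actions get a huge negative logit, so the softmax assigns them ~0 probability.
        logits = logits.masked_fill(~action_mask.bool(), -1e9)
        dist = torch.distributions.Categorical(logits=logits)
        return dist, self.critic(embedding).squeeze(-1)

net = MaskedActorCritic()
embedding = torch.randn(4, 256)                 # batch of 4 fake state embeddings
action_mask = torch.ones(4, 360)                # pretend every pair is feasible
dist, value = net(embedding, action_mask)
action = dist.sample()                          # used during the rollout
log_prob = dist.log_prob(action)                # used in the PPO ratio r_t(theta)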
But PPO still struggled on the full problem. Convergence was painfully slow — after 500K episodes, the policy was only 10% better than SPT. The breakthrough came from curriculum learning: start with small instances (5 jobs, 3 machines), gradually increase complexity. We used this schedule:
# Curriculum schedule (episode_count -> problem_size)
curriculum = [
    (0, (5, 3)),          # 0-50K episodes: 5 jobs, 3 machines
    (50000, (10, 5)),
    (150000, (20, 8)),
    (300000, (30, 12)),   # Full problem size
]

def get_problem_size(episode):
    for threshold, size in reversed(curriculum):
        if episode >= threshold:
            return size
    return curriculum[0][1]
This helped the agent learn basic scheduling logic (don’t idle when work is available, prioritize bottleneck machines) before facing the full combinatorial explosion. After 800K episodes (about 60 hours on 4x RTX 3090), the PPO policy beat SPT by 18% on average makespan and reduced tardiness by 35%.
Here’s a simplified training loop (using Stable-Baselines3):
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

from job_shop_env import JobShopEnv  # custom gym env

# Vectorized environment (8 parallel workers)
env = SubprocVecEnv([lambda: JobShopEnv(curriculum=curriculum) for _ in range(8)])

model = PPO(
    "MultiInputPolicy",   # handles dict observations (graph + scalars)
    env,
    learning_rate=3e-4,
    n_steps=2048,         # rollout length per worker
    batch_size=512,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,        # entropy bonus to encourage exploration
    verbose=1,
    tensorboard_log="./ppo_jobshop/",
)
model.learn(total_timesteps=10_000_000)
model.save("ppo_jobshop_final")
One gotcha: the default entropy coefficient (ent_coef=0.0) caused premature convergence. The policy would latch onto a mediocre strategy and stop exploring. Setting ent_coef=0.01 added just enough noise to keep it searching for better solutions. But too much entropy (0.05+) made it explore randomly even after finding a good policy. This is the kind of hyperparameter that you can’t determine from first principles — you just have to sweep and watch the learning curves.
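If you do end up sweeping, it doesn’t need to be fancy. Something like the following — reusing the vectorized env from the training script above, with a deliberately short budget per run — is usually enough to see which entropy setting keeps the learning curves moving:

from stable_baselines3 import PPO

# Short-budget sweep: one run per entropy coefficient, each logging to its own
# TensorBoard directory so the learning curves can be compared side by side.
for ent_coef in (0.0, 0.005, 0.01, 0.02, 0.05):
    model = PPO(
        "MultiInputPolicy",
        env,                                   # the SubprocVecEnv built in the script above
        ent_coef=ent_coef,
        verbose=0,
        tensorboard_log=f"./ppo_jobshop_sweep/ent_{ent_coef}/",
    )
    model.learn(total_timesteps=500_000)       # enough to see the trend, not to converge
    model.save(f"ppo_ent_{ent_coef}")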
The Hybrid Solution: RL + Heuristic Fallback
Even after all that tuning, PPO had a critical flaw: inference latency. The GNN encoder + two-headed policy network took 50-80ms per decision on a CPU. In a real-time scheduling system dispatching every 10 seconds, that’s acceptable. But when a rush order arrives and you need to re-schedule 200 operations in under 1 second, it’s too slow.
Our compromise: use PPO for the first 70% of the schedule (when the action space is largest and decisions have the most impact), then switch to SPT for the tail. This cut inference time by 60% with only a 3% makespan penalty. The handoff logic:
def dispatch_next_job(state, model, threshold=0.7):
    progress = state['completed_ops'] / state['total_ops']
    if progress < threshold:
        # RL policy
        action, _states = model.predict(state, deterministic=True)
        return action
    else:
        # Fallback to SPT heuristic
        ready_jobs = state['ready_queue']
        return min(ready_jobs, key=lambda j: j.remaining_time)
We also added a “panic mode” — if the RL policy idles a machine for more than 3 consecutive decisions while work is available, override it with SPT. This happened in ~5% of episodes during deployment, usually when the state drifted outside the training distribution (e.g., three machines broke down simultaneously). The RL policy would freeze, outputting nearly uniform probabilities over all actions. The heuristic fallback prevented catastrophic failures.
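The panic-mode check itself is tiny. Here’s roughly what it looks like, assuming the dispatcher can tell it whether the chosen action leaves a machine idle while that machine’s queue is non-empty (the function and variable names are illustrative):

from collections import defaultdict

PANIC_THRESHOLD = 3                      # consecutive idle decisions before the override
idle_streak = defaultdict(int)           # machine_id -> current streak

def should_override(machine_id, rl_left_machine_idle, work_available):
    # Count only decisions where the RL policy idled a machine that had work waiting.
    if rl_left_machine_idle and work_available:
        idle_streak[machine_id] += 1
    else:
        idle_streak[machine_id] = 0

    if idle_streak[machine_id] >= PANIC_THRESHOLD:
        idle_streak[machine_id] = 0      # reset once the SPT override kicks in
        return True
    return False

# Three consecutive idle decisions on machine 7 trigger the fallback.
for _ in range(3):
    override = should_override(machine_id=7, rl_left_machine_idle=True, work_available=True)
print(override)   # -> True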
What We Learned (and What Still Doesn’t Work)
After six months in production, here’s what the data showed:
- Makespan: PPO hybrid averaged 12% better than pure SPT, 18% better than EDD.
- Tardiness: 40% reduction in late jobs (the shaped reward actually worked).
- Machine utilization: Up 6 percentage points (from 73% to 79%).
- Robustness: The policy handled single-machine breakdowns gracefully, but catastrophic failures (2+ machines down) still required human override.
What I’d do differently next time:
- Start with imitation learning. We had six months of historical scheduling data from human dispatchers. Pretraining the policy with behavioral cloning on that data (minimizing the negative log-likelihood of the dispatchers’ actions, $-\log \pi_\theta(a_t^{\text{human}} \mid s_t)$) would’ve given a much better initialization than random weights. We tried this post-hoc and saw 30% faster convergence — see the sketch after this list.
- Use a simpler state representation. The GNN encoder was overkill. A flat vector of (machine_status, job_priority, bottleneck_indicator) worked almost as well in offline tests and was 3x faster. We got obsessed with fancy architectures when a dumb MLP would’ve sufficed.
- Model the stochasticity explicitly. We treated machine breakdowns as exogenous shocks, but they’re predictable from maintenance logs and sensor data (vibration, temperature). Feeding those features into the state would let the agent anticipate failures and schedule proactively. This is the direction I’m most curious about — combining RL with predictive maintenance signals from Part 3 of this series.
- Multi-objective RL. Our scalar reward forced us to hand-tune the tradeoff between makespan and tardiness. Techniques like Pareto-optimal policy learning (e.g., using envelope Q-learning) would let the agent learn a family of policies, and the factory manager could pick the operating point. I haven’t tested this at scale, so take it with a grain of salt.
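To make the behavioral-cloning idea from the first point concrete, here’s a minimal sketch of the pretraining step. Everything below (PolicyNet, the random tensors standing in for encoded historical states, the 360-action grid) is a placeholder under the assumption that the dispatcher logs have already been encoded into fixed-size state vectors — it is not our production pipeline.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder behavioral-cloning pretraining: cross-entropy (= negative log-likelihood)
# of the human dispatcher's logged action, given an already-encoded state vector.
class PolicyNet(nn.Module):
    def __init__(self, state_dim=64, num_actions=360):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, num_actions)
        )

    def forward(self, s):
        return self.actor(s)                       # logits over (job, machine) pairs

states = torch.randn(10_000, 64)                   # stand-in for encoded historical states
actions = torch.randint(0, 360, (10_000,))         # stand-in for the dispatchers' choices
loader = DataLoader(TensorDataset(states, actions), batch_size=256, shuffle=True)

policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                    # NLL of the logged human action

for epoch in range(5):
    for s, a in loader:
        loss = loss_fn(policy(s), a)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# The pretrained actor weights then initialize the PPO policy instead of random weights.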
One more thing: don’t expect RL to “just work” out of the box. Stable-Baselines3 is fantastic, but it’s a toolkit, not a magic solver. We burned two months chasing bad hyperparameters and poor reward shaping. If you’re trying this yourself, start simple (small instances, dense rewards, sanity-check with a heuristic baseline), and scale up only after you’ve got something that learns at all.
In the next part, we’ll tackle the infrastructure question that everyone asks but nobody answers honestly: when should you run your models on edge devices in the factory vs. in the cloud? Latency isn’t the only consideration — cost, maintenance, and data governance matter way more than the tutorials let on.