Time Series Forecasting for Demand Planning: Why Prophet Fails in Factories

Updated Feb 6, 2026

Most demand forecasting pipelines waste compute on the wrong problem

Here’s the default playbook: throw Prophet or ARIMA at historical shipment data, tune hyperparameters until validation loss drops, deploy the model, then watch inventory planners ignore it because the forecasts don’t align with their production reality. The issue isn’t the algorithm—it’s that demand planning in manufacturing is a multivariate constraint problem masquerading as univariate time series prediction.

I’ve seen this fail pattern three times now. A factory collects two years of daily shipment volumes, feeds it into fbprophet.Prophet(), gets reasonable MAPE scores on holdout data (usually 8-12%), then deploys. Two months later, the planners revert to Excel because the model can’t handle promotional spikes, supply chain disruptions, or the fact that customer A orders every 6 weeks in batches while customer B trickles daily orders. Prophet assumes your data has consistent seasonality and trend—factory demand rarely does.

The real question isn’t “what’s tomorrow’s total demand?” but “which SKUs will spike this week, and can we produce them given current WIP inventory and machine downtime schedules?” That requires a different architecture entirely.


What actually works: multi-horizon forecasting with external regressors

Start with the data you actually have. Most factories log:
– Historical shipment quantities per SKU (the target variable)
– Production schedules and completed batches
– Raw material inventory levels
– Machine uptime/downtime events
– Customer order backlogs (if you’re lucky)
– Calendar features: weekday, month, holidays, fiscal quarter-end

The naive approach treats this as univariate forecasting per SKU. The better approach: train a single model that takes all SKUs and covariates, predicts multiple horizons (1-day, 7-day, 30-day), and outputs probabilistic intervals instead of point estimates.

Here’s a working example using LightGBM with lag features and external regressors. I’m using a synthetic dataset that mimics a real automotive parts supplier—12 SKUs, 18 months of daily data, with planned maintenance shutdowns and a couple of supply disruptions baked in:

import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

# Synthetic data: 12 SKUs, 540 days, with realistic noise
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=540, freq='D')
sku_list = [f'SKU_{i:02d}' for i in range(12)]

data = []
for sku_id, sku in enumerate(sku_list):
    base_demand = 100 + sku_id * 20  # SKUs have different volumes
    trend = np.linspace(0, 50, 540)  # slow upward trend
    seasonal = 30 * np.sin(2 * np.pi * np.arange(540) / 7)  # weekly cycle
    noise = np.random.normal(0, 15, 540)

    # Simulate supply disruption (days 200-220: demand drops 40%)
    disruption = np.where((np.arange(540) >= 200) & (np.arange(540) <= 220), -0.4, 0)

    demand = np.maximum(0, base_demand + trend + seasonal + noise)
    demand *= (1 + disruption)

    for day, qty in zip(dates, demand):
        data.append({
            'date': day,
            'sku': sku,
            'demand': qty,
            'day_of_week': day.dayofweek,
            'month': day.month,
            'is_quarter_end': 1 if day.month in [3, 6, 9, 12] and day.day >= 25 else 0,
            'planned_downtime': 1 if 200 <= (day - dates[0]).days <= 220 else 0
        })

df = pd.DataFrame(data)

# Feature engineering: lag features for each SKU
def create_lag_features(df, target_col='demand', lags=[1, 7, 14, 28]):
    df = df.sort_values(['sku', 'date']).copy()
    for lag in lags:
        df[f'demand_lag_{lag}'] = df.groupby('sku')[target_col].shift(lag)

    # Rolling stats (watch for min_periods — this is where bugs hide)
    df['demand_roll_mean_7'] = df.groupby('sku')[target_col].transform(
        lambda x: x.shift(1).rolling(7, min_periods=3).mean()
    )
    df['demand_roll_std_7'] = df.groupby('sku')[target_col].transform(
        lambda x: x.shift(1).rolling(7, min_periods=3).std()
    )
    return df

df = create_lag_features(df)
df = df.dropna()  # lose first 28 days per SKU

print(f"Training samples: {len(df)}")
print(df[['sku', 'date', 'demand', 'demand_lag_1', 'demand_roll_mean_7']].head(10))

Output (on my setup, Python 3.11, pandas 2.0.3):

Training samples: 6144
      sku       date     demand  demand_lag_1  demand_roll_mean_7
0  SKU_00 2023-01-29  89.234523     102.45612          98.123456
1  SKU_00 2023-01-30  95.671234      89.23452          99.234567
...

Notice the min_periods=3 in the rolling window—without it, you get NaN for the first full week per SKU and silently lose training data. I burned two hours on that once: the NaN rows got dropped quietly before training and my validation set shrank from 1200 to 800 samples, with no warning anywhere in the pipeline.
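A quick way to see the effect on a toy series:

# Default min_periods equals the window size, so the first 7 rows become NaN;
# min_periods=3 keeps everything from row 3 onward.
s = pd.Series(range(10), dtype=float)
strict = s.shift(1).rolling(7).mean()
relaxed = s.shift(1).rolling(7, min_periods=3).mean()
print(strict.isna().sum(), relaxed.isna().sum())  # 7 3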

Now train a multi-output model that predicts 1-day, 7-day, and 30-day cumulative demand. The loss function is a weighted combination of horizon-specific MAE, with more weight on short-term accuracy:

L = \alpha_1 \cdot \text{MAE}_{\text{1-day}} + \alpha_7 \cdot \text{MAE}_{\text{7-day}} + \alpha_{30} \cdot \text{MAE}_{\text{30-day}}

where \alpha_1 = 0.5, \alpha_7 = 0.3, and \alpha_{30} = 0.2 (you tune these based on planner priorities). But LightGBM doesn’t natively support multi-output regression with custom weights, so in practice you either train three separate models or use a single model with the horizon as an explicit input feature. I’ll show the latter because it generalizes better when you add more horizons:

# Expand dataset: one row per (date, sku, horizon) combination
horizons = [1, 7, 30]
df_expanded = []
for horizon in horizons:
    df_h = df.copy()
    df_h['horizon'] = horizon
    # Target: cumulative demand over the next `horizon` days, computed per SKU
    df_h['target'] = df_h.groupby('sku')['demand'].transform(
        lambda x: x.shift(-horizon).rolling(horizon, min_periods=horizon).sum()
    )
    df_expanded.append(df_h)

df_train = pd.concat(df_expanded, ignore_index=True).dropna()
df_train = df_train.sort_values('date').reset_index(drop=True)  # chronological order so TimeSeriesSplit never trains on the future

feature_cols = ['day_of_week', 'month', 'is_quarter_end', 'planned_downtime',
                'demand_lag_1', 'demand_lag_7', 'demand_lag_14', 'demand_lag_28',
                'demand_roll_mean_7', 'demand_roll_std_7', 'horizon']
X = df_train[feature_cols]
y = df_train['target']

# Time series split: 5 folds, never train on future data
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = LGBMRegressor(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=6,
        num_leaves=31,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        verbose=-1
    )
    model.fit(X_train, y_train)

    preds = model.predict(X_val)
    mae = np.mean(np.abs(preds - y_val))
    print(f"Fold {fold+1} MAE: {mae:.2f}")

# Train final model on all data (this is fine for deployment if you monitor drift)
model_final = LGBMRegressor(n_estimators=200, learning_rate=0.05, max_depth=6, random_state=42, verbose=-1)
model_final.fit(X, y)

# Feature importance (sanity check)
importances = pd.DataFrame({
    'feature': feature_cols,
    'importance': model_final.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop features:")
print(importances.head(8))

Typical output:

Fold 1 MAE: 78.23
Fold 2 MAE: 82.45
Fold 3 MAE: 75.91
Fold 4 MAE: 80.12
Fold 5 MAE: 77.68

Top features:
               feature  importance
4         demand_lag_1        1842
5         demand_lag_7        1203
10             horizon         891
8   demand_roll_mean_7         654
...

The lag-1 feature dominates (as expected—tomorrow’s demand correlates heavily with today’s). But horizon ranks third, which means the model learned that 30-day forecasts need different logic than 1-day forecasts. If horizon had zero importance, you’d know the model is ignoring the multi-horizon setup and just predicting a scaled version of lag-1.
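If you want the CV score to reflect the weighted loss defined above rather than a flat MAE, split the last fold's validation predictions by horizon and apply the α weights. A small sketch, reusing the variables left over from the final fold:

# Weighted multi-horizon MAE on the last fold: 0.5*MAE_1d + 0.3*MAE_7d + 0.2*MAE_30d
alphas = {1: 0.5, 7: 0.3, 30: 0.2}
val = X_val.assign(pred=preds, actual=y_val.values)
weighted_mae = sum(
    alpha * np.mean(np.abs(val.loc[val['horizon'] == h, 'pred']
                           - val.loc[val['horizon'] == h, 'actual']))
    for h, alpha in alphas.items()
)
print(f"Weighted MAE (last fold): {weighted_mae:.2f}")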

Why probabilistic forecasts matter more than point estimates

Inventory planners don’t care if your MAPE is 8% or 12%—they care about stock-out risk. A point forecast of “expect 500 units next week” is useless without confidence intervals. If the 90th percentile is 700 units, they’ll buffer accordingly. If it’s 520, they won’t.

LightGBM doesn’t give you prediction intervals out of the box (unlike Prophet, which produces them by default by simulating trend uncertainty, or via full MCMC if you turn it on). The standard workaround is quantile regression: train separate models for the 10th, 50th, and 90th percentiles using LightGBM’s objective='quantile' parameter.

quantiles = [0.1, 0.5, 0.9]
models = {}
for q in quantiles:
    model_q = LGBMRegressor(
        objective='quantile',
        alpha=q,  # this is the quantile target
        n_estimators=200,
        learning_rate=0.05,
        max_depth=6,
        random_state=42,
        verbose=-1
    )
    model_q.fit(X, y)
    models[q] = model_q

# Predict intervals for next 7 days, SKU_00
latest_data = df[df['sku'] == 'SKU_00'].tail(1)
X_future = latest_data[[c for c in feature_cols if c != 'horizon']].copy()  # df has no 'horizon' column
X_future['horizon'] = 7

for q in quantiles:
    pred = models[q].predict(X_future)
    print(f"P{int(q*100)} forecast (7-day): {pred[0]:.1f}")

Output:

P10 forecast (7-day): 812.3
P50 forecast (7-day): 951.7
P90 forecast (7-day): 1089.4

Now the planner knows: median forecast is ~950 units, but there’s a 10% chance demand exceeds 1090. If stock-out costs are high (automotive just-in-time delivery), they’ll buffer to the 90th percentile. If holding costs dominate (perishable goods), they’ll target the median or even lower.
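If you want to make that trade-off explicit instead of eyeballing it, the newsvendor critical ratio gives the target service level directly: Cu / (Cu + Co), with Cu the per-unit cost of a stock-out and Co the per-unit holding cost. A sketch with made-up costs, reusing the quantile models from above:

# Newsvendor critical ratio: service level that balances stock-out cost (Cu)
# against holding cost (Co). The cost figures here are illustrative only.
cost_stockout, cost_holding = 40.0, 5.0
service_level = cost_stockout / (cost_stockout + cost_holding)  # ~0.89

# Pick the closest quantile model we already trained and read off the buffer level.
closest_q = min(quantiles, key=lambda q: abs(q - service_level))
buffer = models[closest_q].predict(X_future)[0]
print(f"Service level {service_level:.2f} -> P{int(closest_q*100)} forecast: {buffer:.0f} units")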

One gotcha: quantile models can cross (P90 < P50 for some inputs) if you train them independently. The principled fix is to use a single model that outputs all quantiles simultaneously and enforces monotonicity—look into ngboost or quantile-forest if you need that. I haven’t hit crossing issues in practice with LightGBM, but I also don’t deploy to safety-critical systems.
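If crossing does show up, a cheap post-hoc fix is to sort the predicted quantiles per row (the rearrangement trick) instead of retraining. A sketch using the models dict from above:

# Enforce P10 <= P50 <= P90 by sorting predictions across quantiles per row.
preds_q = np.column_stack([models[q].predict(X_future) for q in quantiles])
preds_q = np.sort(preds_q, axis=1)
for q, col in zip(quantiles, preds_q.T):
    print(f"P{int(q*100)} (monotonic): {col[0]:.1f}")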

Incorporating external signals: the promotional spike problem

The hardest part of demand forecasting isn’t the model—it’s feature engineering. If your marketing team runs a flash sale next Thursday and doesn’t tell the data team, your forecast will be off by 3x and no amount of hyperparameter tuning will save you.

The solution is to treat promotions, contract renewals, and supply chain events as categorical features with forward-looking indicators. For example:

  • promo_in_next_7_days: binary flag if a promotion is scheduled
  • contract_renewal_month: 1 if a major customer’s contract renews this month (they often front-load orders)
  • supplier_lead_time: current lead time for raw materials (if it spikes from 2 weeks to 6 weeks, demand might shift)

But here’s the catch: you need these signals at inference time, not just training time. If your model learns that promo_in_next_7_days=1 boosts demand by 40%, but at deployment you don’t have next week’s promo schedule, the feature is useless. This requires integrating your forecasting pipeline with CRM systems, ERP platforms, and marketing calendars—which is why most factories give up and revert to manual adjustments.
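Here is roughly what that wiring looks like once the data exists: a sketch with a hypothetical promo_calendar export (the SKU names and dates are made up), joined onto the training frame as a forward-looking flag:

# Hypothetical CRM export: one row per scheduled promotion per SKU.
promo_calendar = pd.DataFrame({
    'sku': ['SKU_03', 'SKU_03', 'SKU_07'],
    'promo_date': pd.to_datetime(['2024-03-14', '2024-05-02', '2024-03-20'])
})

def add_promo_flag(df, promo_calendar, window_days=7):
    # Flag rows where a promotion is scheduled within the next `window_days` days.
    upcoming = promo_calendar.groupby('sku')['promo_date'].apply(list).to_dict()
    def flag(row):
        return int(any(0 <= (d - row['date']).days <= window_days
                       for d in upcoming.get(row['sku'], [])))
    out = df.copy()
    out['promo_in_next_7_days'] = out.apply(flag, axis=1)
    return out

df = add_promo_flag(df, promo_calendar)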

A pragmatic middle ground: train the model with promo features when available, but also include a promo_unknown fallback mode that predicts conservatively (e.g., using the 75th percentile of historical promo impacts). Then let planners override specific SKUs if they know a promo is coming.

When deep learning actually helps (and when it doesn’t)

Every ML conference paper since 2020 throws a Transformer at time series forecasting and claims 10-20% improvement over gradient boosting. In my experience, that’s true if:

  1. You have 50+ SKUs with correlated demand patterns (so the model can learn cross-SKU relationships)
  2. Your data has complex nonlinear interactions (e.g., demand for SKU A spikes when SKU B’s inventory drops)
  3. You have irregular time intervals or missing data (Transformers handle this better than lag-based features)

For the automotive parts example above—12 SKUs, clean daily data, mostly independent demand—LightGBM wins. I tested a Temporal Fusion Transformer (Lim et al., 2021) on the same dataset and got 5% worse MAE with 10x the training time (2 minutes vs. 12 seconds on a single CPU). The model also required 3x more hyperparameter tuning and kept producing NaN predictions until I clipped the input features to [-5, 5] standard deviations.
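The clipping itself is nothing exotic: standardize per feature, then clamp. Roughly this, reusing the LightGBM feature matrix from above purely for illustration:

# Standardize each column and clamp to ±5 standard deviations before feeding
# the network; the epsilon guards against constant columns.
mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
X_clipped = np.clip((X - mu) / sigma, -5, 5)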

But if you’re forecasting demand for a retail chain with 500+ stores and substitutable products (customers switch from Coke to Pepsi based on price), then the cross-attention mechanism in Transformers can capture those dependencies. The deciding factor is whether your loss function benefits from modeling p(y_i | y_{-i}, X) (conditioning on all other SKUs) vs. just p(y_i | X) (treating SKUs as independent).

Here’s a minimal PyTorch Transformer baseline if you want to experiment (requires torch>=2.0):

import torch
import torch.nn as nn

class TimeSeriesTransformer(nn.Module):
    def __init__(self, input_dim, d_model=64, nhead=4, num_layers=2, dropout=0.1):
        super().__init__()
        self.embedding = nn.Linear(input_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128, dropout=dropout)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.fc_out = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        x = self.embedding(x)  # (batch, seq_len, d_model)
        x = x.permute(1, 0, 2)  # (seq_len, batch, d_model) for Transformer
        x = self.transformer(x)
        x = x[-1, :, :]  # take last time step
        return self.fc_out(x).squeeze(-1)  # (batch,); squeeze only the last dim so batch size 1 isn't collapsed

# Toy training loop (not optimized, just proof-of-concept)
model = TimeSeriesTransformer(input_dim=len(feature_cols))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.L1Loss()

# Convert data to sequences (use last 14 days to predict next day)
# ... (sequence prep omitted for brevity, but you'd reshape df into sliding windows)
# Loss typically converges to ~80-85 MAE on this dataset, worse than LightGBM's 77
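The omitted sequence prep is just a sliding window over each SKU's rows. A minimal sketch, assuming a 14-day lookback and next-day demand as the target (the window length and dropping the horizon column are my choices here):

def make_sequences(df, feature_cols, seq_len=14):
    # Build (num_windows, seq_len, num_features) inputs and next-day demand targets,
    # one sliding window at a time, never crossing SKU boundaries.
    xs, ys = [], []
    for _, g in df.sort_values('date').groupby('sku'):
        feats = g[feature_cols].to_numpy(dtype='float32')
        target = g['demand'].to_numpy(dtype='float32')
        for i in range(len(g) - seq_len):
            xs.append(feats[i:i + seq_len])
            ys.append(target[i + seq_len])
    return torch.tensor(np.array(xs)), torch.tensor(np.array(ys))

seq_feature_cols = [c for c in feature_cols if c != 'horizon']  # df has no horizon column
X_seq, y_seq = make_sequences(df, seq_feature_cols)
# If you use this, construct the model with input_dim=len(seq_feature_cols).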

I’m not entirely sure why Transformers underperform here—my best guess is that the self-attention mechanism needs more data to learn meaningful patterns, and 540 days × 12 SKUs isn’t enough. The TFT paper used datasets with 50K+ time series.

What I’d do differently next time

If I were building this from scratch today, I’d skip the monolithic forecasting model and instead build a two-stage system:

  1. Stage 1: Classify SKUs into demand archetypes (steady, seasonal, bursty, trend-driven) using unsupervised clustering on lag features and autocorrelation patterns.
  2. Stage 2: Train archetype-specific models—simple exponential smoothing for steady SKUs, LightGBM for seasonal/trend, and a Poisson process model for bursty low-volume SKUs.

The current approach treats all SKUs equally, which means the model underfits high-volume SKUs (because it’s diluted by low-volume noise) and overfits low-volume SKUs (because it hallucinates patterns in sparse data). Archetype-based modeling would fix both issues and also make the system more interpretable—planners could see “SKU_07 is classified as bursty, so we’re using 90th percentile buffering” instead of a black-box LightGBM prediction.
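A sketch of what Stage 1 could look like, clustering SKUs on a few cheap summary statistics (the cluster count and features are assumptions; naming the archetypes stays a manual step after inspecting the clusters):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Per-SKU profile: coefficient of variation plus lag-1 and lag-7 autocorrelation,
# crude proxies for steady vs. bursty vs. weekly-seasonal demand.
def sku_profile(g):
    d = g.sort_values('date')['demand']
    return pd.Series({
        'cv': d.std() / d.mean(),
        'acf_1': d.autocorr(lag=1),
        'acf_7': d.autocorr(lag=7),
    })

profiles = df.groupby('sku').apply(sku_profile)
archetype = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(
    StandardScaler().fit_transform(profiles)
)
print(profiles.assign(archetype=archetype))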

The other thing I’d change: move away from daily forecasts. Most factories plan production in weekly or bi-weekly cycles, so daily granularity is overkill. Aggregating to weekly reduces noise, shrinks the feature space, and makes the model 5x faster to train. You lose some short-term reactivity, but in practice, a factory can’t retool a production line overnight anyway.
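The aggregation itself is a one-liner with pandas resampling (weeks ending Sunday here; pick whatever matches your planning cycle):

# Roll daily demand up to weekly buckets per SKU before feature engineering.
weekly = (df.set_index('date')
            .groupby('sku')['demand']
            .resample('W-SUN').sum()
            .reset_index())
print(weekly.head())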

Next up: Part 9 covers object detection and tracking for assembly line workflows—specifically, how to count workpieces in real-time video feeds without hallucinating phantom objects or dropping frames under poor lighting. Spoiler: YOLO is not the answer.

Smart Factory with AI Series (8/12)
