Machine Learning Models for Stock Price Prediction: Why Most Fail and What Actually Works

Updated Feb 6, 2026

The Uncomfortable Truth About ML in Trading

Most machine learning models for stock prediction fail in production. Not because the models are bad—they’re often sophisticated, well-tuned, and validated on historical data. They fail because the market doesn’t care about your validation accuracy.

The promise is seductive: feed historical prices into an LSTM, watch it learn patterns, deploy it, and profit. Reality is messier. Your model might achieve 85% directional accuracy on test data, then bleed money the moment real trades execute. The gap between backtest performance and live trading isn’t a minor detail—it’s the entire problem.

This isn’t an argument against using ML in quant trading. It’s a warning about how to use it correctly. The models that work in production treat prediction as one component in a system that accounts for transaction costs, regime changes, and the fact that everyone else is also trying to predict the same thing.


The Wrong Way (That Everyone Tries First)

The standard approach goes like this: collect daily close prices, create features (maybe RSI, MACD, moving averages from Part 3), split into train/test, fit a random forest or gradient boosting model, evaluate accuracy, deploy. The notebook looks great. The confusion matrix is impressive. Then you run it forward and watch it lose money.

Here’s why this fails:

Data leakage is everywhere. Using technical indicators calculated on the full dataset means your training data contains information from the future. Even if you’re careful with time-series splits, a feature like a 20-day moving average computed at time $t$ uses data from $t-19$ to $t$, which is fine for predicting the close at $t+1$; the leak creeps in when features (or their normalization) are computed over the full sample, so you need to ensure every feature uses only data available before $t+1$ opens. Most tutorials skip this.
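To make that concrete, here’s a minimal sketch of the usual failure mode, normalizing a feature with full-sample statistics, versus a leak-free version (df here is a price DataFrame like the one fetched below):

import pandas as pd

# The 20-day MA itself only uses closes from t-19..t, fine for predicting t+1
ma20 = df['Close'].rolling(20).mean()

# LEAKY: z-scoring with the mean/std of the ENTIRE sample bakes in future data
ma20_z_leaky = (ma20 - ma20.mean()) / ma20.std()

# Leak-free: expanding statistics use only data available at each timestamp
ma20_z = (ma20 - ma20.expanding().mean()) / ma20.expanding().std()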

The target is wrong. Predicting next-day return direction (up/down) ignores magnitude. A model that correctly predicts 100 small moves (+0.1% each) but misses 10 large moves (-2% each) has 91% accuracy and still loses money: $100 \times 0.1\% - 10 \times 2\% = -10\%$ before costs. The loss function $L = -\sum_{i=1}^{N} \mathbb{1}(y_i = \hat{y}_i)$ (binary accuracy) doesn’t align with profit.

Transaction costs kill edge. If your model predicts a 0.3% expected return but trading costs 0.2% round-trip (realistic for retail, higher for frequent rebalancing), your edge is 0.1%. A small prediction error or slight miscalibration wipes this out entirely.

Non-stationarity. Market regimes shift. A model trained on 2015-2020 data learned relationships that may not hold in 2021+. The correlation structure changes, volatility regimes shift, and your beautiful validation curve from a stationary train/test split becomes meaningless.

But let’s build it anyway, because seeing it fail is instructive.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import yfinance as yf
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

# Fetch data (using SPY as example)
ticker = yf.Ticker("SPY")
df = ticker.history(period="5y", interval="1d")

# Create features: simple returns and lagged features
df['return'] = df['Close'].pct_change()
df['return_lag1'] = df['return'].shift(1)
df['return_lag2'] = df['return'].shift(2)
df['return_lag5'] = df['return'].shift(5)
df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()

# Target: next day return direction
df['target'] = (df['return'].shift(-1) > 0).astype(int)  # 1 if next day up, 0 otherwise

df.dropna(inplace=True)

# Time-series split: train on first 80%, test on last 20%
split_idx = int(len(df) * 0.8)
train = df.iloc[:split_idx].copy()
test = df.iloc[split_idx:].copy()  # .copy() so later column assignments don't warn

feature_cols = ['return_lag1', 'return_lag2', 'return_lag5', 'volume_ratio']
X_train, y_train = train[feature_cols], train['target']
X_test, y_test = test[feature_cols], test['target']

# Train random forest
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))

On SPY data from 2019-2024 (tested locally, Python 3.11), this typically gives 52-54% accuracy. Barely better than random. But here’s the thing: even if you got 60%, it doesn’t tell you if the strategy makes money after costs.

Let’s simulate actual trading:

# Simulate trading: long if predict up, flat otherwise (ignoring short for simplicity)
test['predicted'] = y_pred
test['strategy_return'] = test['predicted'] * test['return'].shift(-1)  # next day return if we go long
# Charge the 10bps round-trip cost only on days we actually hold a position
test['strategy_return_after_cost'] = test['strategy_return'] - test['predicted'] * 0.001

# dropna: the last row has no next-day return; .iloc[-1] avoids deprecated positional indexing
cumulative_return = (1 + test['strategy_return_after_cost'].dropna()).cumprod().iloc[-1] - 1
buy_hold_return = (1 + test['return']).cumprod().iloc[-1] - 1

print(f"Strategy Return: {cumulative_return:.2%}")
print(f"Buy & Hold Return: {buy_hold_return:.2%}")

In my tests, the strategy often underperforms buy-and-hold after transaction costs. The model isn’t predicting well enough to overcome friction.

What Actually Matters: The Target and the Loss

The shift from “predict direction” to “predict expected return” is subtle but critical. Instead of classification, treat this as regression with a custom loss that penalizes errors proportional to their trading impact.

Define the target as the next-period log return: $r_{t+1} = \log(P_{t+1} / P_t)$. The model predicts $\hat{r}_{t+1}$. The loss should reflect trading P&L, not just squared error. One approach: a directional loss that penalizes sign errors more heavily.

$$L = \sum_{i=1}^{N} \begin{cases} (r_i - \hat{r}_i)^2 & \text{if } \operatorname{sign}(r_i) = \operatorname{sign}(\hat{r}_i) \\ \alpha\,(r_i - \hat{r}_i)^2 & \text{if } \operatorname{sign}(r_i) \neq \operatorname{sign}(\hat{r}_i) \end{cases}$$

where $\alpha > 1$ penalizes wrong-direction predictions more heavily. In practice, I’d use $\alpha = 5$ or $10$. Implementing this requires a custom loss in XGBoost or LightGBM (not straightforward in sklearn).
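Here’s a minimal sketch of what that custom objective could look like with XGBoost’s native API (ALPHA and the hyperparameters are assumptions; treating the direction weight as locally constant in the gradient and Hessian is the usual simplification):

import numpy as np
import xgboost as xgb

ALPHA = 5.0  # extra penalty on wrong-direction predictions (assumption, as above)

def directional_squared_loss(preds, dtrain):
    """Gradient and Hessian of w*(r - r_hat)^2, with w = ALPHA when signs disagree."""
    y = dtrain.get_label()
    w = np.where(np.sign(preds) == np.sign(y), 1.0, ALPHA)
    grad = 2.0 * w * (preds - y)  # dL/d(pred)
    hess = 2.0 * w                # d2L/d(pred)^2, piecewise constant
    return grad, hess

# Usage with the regression features/targets defined below:
# dtrain = xgb.DMatrix(X_train, label=y_train)
# booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
#                     num_boost_round=100, obj=directional_squared_loss)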

Alternatively, frame it as a classification problem with asymmetric costs: false positives (predict up, market down) cost more than false negatives (predict down, miss upside). Sklearn’s class_weight parameter helps, but it’s not as flexible as a custom loss.
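For the class-weight route, a minimal sketch (the 3:1 weighting is an illustrative assumption, not a tuned value): upweighting down days makes “predict up, market down” errors cost more during training.

# Upweight class 0 (down days) so false "up" calls are penalized more heavily
# (the 3.0 is an illustrative assumption; tune it out-of-sample)
rf_weighted = RandomForestClassifier(
    n_estimators=100, max_depth=5,
    class_weight={0: 3.0, 1: 1.0}, random_state=42,
)
rf_weighted.fit(X_train, y_train)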

Here’s a cleaner approach: predict log return, then threshold on expected profit after transaction costs.

from sklearn.ensemble import GradientBoostingRegressor

# Target: next day log return
df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))
df['target_log_return'] = df['log_return'].shift(-1)

df.dropna(inplace=True)
split_idx = int(len(df) * 0.8)
train = df.iloc[:split_idx].copy()
test = df.iloc[split_idx:].copy()  # .copy() so the prediction columns below don't warn

feature_cols = ['return_lag1', 'return_lag2', 'return_lag5', 'volume_ratio']
X_train, y_train = train[feature_cols], train['target_log_return']
X_test, y_test = test[feature_cols], test['target_log_return']

gbr = GradientBoostingRegressor(n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42)
gbr.fit(X_train, y_train)

test['predicted_return'] = gbr.predict(X_test)

# Only trade if predicted return exceeds transaction cost threshold
transaction_cost = 0.001  # 10bps
test['position'] = np.where(test['predicted_return'] > transaction_cost, 1, 0)

test['strategy_return'] = test['position'] * test['target_log_return'] - (test['position'] * transaction_cost)
cumulative_return = np.exp(test['strategy_return'].sum()) - 1
buy_hold_return = np.exp(test['target_log_return'].sum()) - 1

print(f"Strategy Return: {cumulative_return:.2%}")
print(f"Buy & Hold Return: {buy_hold_return:.2%}")
print(f"Number of trades: {test['position'].sum()}")

This thresholding is crucial. If the model predicts +0.05% return but cost is 0.10%, don’t trade. The position logic enforces this, reducing trade frequency and focusing on higher-conviction signals.

In my tests, this regression approach with thresholding performs closer to buy-and-hold (sometimes slightly better, sometimes slightly worse), but critically, it doesn’t bleed money via overtrading.

Features That Might Actually Help

Lagged returns alone are weak predictors. The market is mostly efficient on short timescales for liquid assets. What can add signal?

Volatility regime: Realized volatility over the past 20 days, normalized by its 6-month rolling mean. Markets behave differently in low-vol vs high-vol regimes. The ratio $\sigma_{20d} / \bar{\sigma}_{120d}$ can signal regime shifts.

Order flow imbalance (if you have tick data): The net buy-sell pressure. Most retail traders don’t have access to this, but if you do, it’s one of the few features with genuine alpha.

Sentiment from alternative data: News sentiment, social media mention volume, etc. Tricky to source and integrate, but less crowded than pure price-based features.

Cross-asset signals: VIX for equities, credit spreads, yield curve shape. These capture macro conditions that single-stock price history misses.

Residual features from factor models: If you’re trading stocks, compute the residual after regressing returns on Fama-French factors. Predict the residual, not raw return. This isolates idiosyncratic moves, which might be more predictable than beta-driven moves.
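Here’s a rough sketch of that residual construction (assuming the daily Fama-French factors via pandas_datareader, a dependency this series hasn’t used; column names follow the Ken French data library, and the date alignment details will vary with your data):

import pandas as pd
import pandas_datareader.data as web
from sklearn.linear_model import LinearRegression

# Daily Fama-French 3 factors; the source reports returns in percent
ff = web.DataReader('F-F_Research_Data_Factors_daily', 'famafrench')[0] / 100.0

# Align dates (yfinance's index is tz-aware; Fama-French dates are naive)
stock_ret = df['log_return'].copy()
stock_ret.index = stock_ret.index.tz_localize(None)
merged = pd.concat([stock_ret, ff], axis=1, join='inner').dropna()

# Regress excess returns on the factors; the residual is the idiosyncratic move
X_factors = merged[['Mkt-RF', 'SMB', 'HML']]
y_excess = merged['log_return'] - merged['RF']
factor_model = LinearRegression().fit(X_factors, y_excess)
merged['residual'] = y_excess - factor_model.predict(X_factors)
# Predict (or build features from) 'residual' instead of the raw return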

Here’s an example adding volatility and VIX:

# Add realized volatility feature
df['realized_vol'] = df['log_return'].rolling(20).std()
df['vol_ratio'] = df['realized_vol'] / df['realized_vol'].rolling(120).mean()

# Fetch VIX (if trading SPY, VIX is relevant)
vix = yf.Ticker("^VIX").history(period="5y", interval="1d")['Close']
df = df.join(vix.rename('VIX'), how='left')
df['VIX'] = df['VIX'].ffill()  # forward fill missing VIX data
df['vix_change'] = df['VIX'].pct_change()

feature_cols = ['return_lag1', 'return_lag2', 'return_lag5', 'volume_ratio', 'vol_ratio', 'vix_change']
# ... rest of the training code same as above

The VIX term often shows negative correlation with SPY returns (VIX spikes when market drops), so including it can improve predictions during volatile periods.

Does this suddenly make the model profitable? Not necessarily. But it moves the needle from “slightly worse than random” to “sometimes useful as one input among many.”

Why LSTMs and Transformers Probably Won’t Save You

Deep learning for time-series prediction is tempting. LSTMs and GRUs can model sequential dependencies, and Transformers with attention mechanisms are state-of-the-art in NLP and other domains. Surely they can learn market patterns?

The problem: financial time series are extremely noisy with low signal-to-noise ratio. The Sharpe ratio of the underlying predictable component (if it exists) is often <0.5 annual. Training deep networks on such data leads to overfitting unless you have massive amounts of data and careful regularization.

LSTMs require more data than tree-based models to generalize. For daily stock data, 5 years = ~1250 samples. That’s not enough for a network with thousands of parameters to learn stable patterns without overfitting. You can try it:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Reshape data for LSTM: (batch, seq_len, features)
seq_len = 20  # use 20-day windows
X_train_seq, y_train_seq = [], []
for i in range(seq_len, len(train)):
    X_train_seq.append(train[feature_cols].iloc[i-seq_len:i].values)
    y_train_seq.append(train['target_log_return'].iloc[i])

X_train_seq = torch.tensor(np.array(X_train_seq), dtype=torch.float32)
y_train_seq = torch.tensor(np.array(y_train_seq), dtype=torch.float32).unsqueeze(1)

train_dataset = TensorDataset(X_train_seq, y_train_seq)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=False)  # no shuffle for time-series

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size=50, num_layers=1):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])  # use last timestep output
        return out

model = LSTMModel(input_size=len(feature_cols), hidden_size=50)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(50):
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.6f}")

In my experience (tested on M1 MacBook with the setup above), the LSTM often converges to predicting near-zero for all inputs—essentially learning the mean. The loss decreases, but predictions are useless. Adding dropout, L2 regularization, or early stopping helps slightly, but rarely outperforms gradient boosting.

Transformers are even more parameter-hungry. Unless you’re training on tick-level data across thousands of assets (which some quant funds do), they’re overkill.

That said, there are niches where deep learning shines: alternative data (processing text, images) or multi-modal inputs. If you’re combining price data with news headlines, a Transformer encoder for text + MLP for price features, fused in a late layer, can work. But for pure price prediction on single-asset daily data, tree-based models are simpler and more robust.
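As a sketch of that late-fusion idea (the dimensions are assumptions, e.g. 384-dim sentence embeddings from some pretrained text encoder):

import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """Text embeddings + price features, fused in a late layer (illustrative)."""
    def __init__(self, text_dim=384, price_dim=6, hidden=64):
        super().__init__()
        self.text_head = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.price_head = nn.Sequential(nn.Linear(price_dim, hidden), nn.ReLU())
        self.fuse = nn.Linear(2 * hidden, 1)  # predict next-period log return

    def forward(self, text_emb, price_feats):
        h = torch.cat([self.text_head(text_emb), self.price_head(price_feats)], dim=-1)
        return self.fuse(h)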

Ensemble and Meta-Labeling: Combining Models Correctly

One model is never enough. The idea: train multiple models with different feature sets, timeframes, or algorithms, then combine their predictions. Simple averaging often works, but weighting by recent performance is better.

Here’s a practical approach:

  1. Train 3-5 models: Random Forest, Gradient Boosting, Linear Regression (yes, simple linear can be useful as a baseline), maybe an LSTM if you’re feeling ambitious.
  2. Compute rolling out-of-sample predictions: For each model, use walk-forward validation (train on past data, predict the next step, slide the window forward; see the sketch after this list).
  3. Weight by recent Sharpe ratio: Over the past 60 predictions, compute each model’s Sharpe ratio. Weight predictions proportional to $\max(0, \text{Sharpe})$.
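A minimal expanding-window walk-forward loop might look like this (the 21-day retrain cadence and 500-day minimum window are assumptions to keep runtime sane):

from sklearn.ensemble import RandomForestRegressor

# Expanding-window walk-forward: refit every `step` days, predict the next block
step, min_train = 21, 500
wf_preds = pd.Series(index=df.index, dtype=float)

for start in range(min_train, len(df) - 1, step):
    model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
    model.fit(df[feature_cols].iloc[:start], df['target_log_return'].iloc[:start])
    stop = min(start + step, len(df) - 1)
    wf_preds.iloc[start:stop] = model.predict(df[feature_cols].iloc[start:stop])

With out-of-sample predictions from each model converted to long/flat positions, the Sharpe-weighting step looks like this: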
# Assume you have out-of-sample long/flat positions (0 or 1) from 3 models,
# e.g. 1 when each model's predicted return clears the cost threshold,
# stored in test as 'rf_pred', 'gbr_pred', 'lr_pred'

def rolling_sharpe(returns, window=60):
    """Compute rolling Sharpe ratio (annualized, assuming daily returns)"""
    roll_mean = returns.rolling(window).mean()
    roll_std = returns.rolling(window).std()
    return (roll_mean / roll_std) * np.sqrt(252)  # annualize

# Compute returns for each model's predictions
test['rf_return'] = test['rf_pred'] * test['target_log_return']
test['gbr_return'] = test['gbr_pred'] * test['target_log_return']
test['lr_return'] = test['lr_pred'] * test['target_log_return']

test['rf_sharpe'] = rolling_sharpe(test['rf_return'])
test['gbr_sharpe'] = rolling_sharpe(test['gbr_return'])
test['lr_sharpe'] = rolling_sharpe(test['lr_return'])

# Weight by Sharpe (zero out negative Sharpe)
test['rf_weight'] = test['rf_sharpe'].clip(lower=0)
test['gbr_weight'] = test['gbr_sharpe'].clip(lower=0)
test['lr_weight'] = test['lr_sharpe'].clip(lower=0)

weight_sum = test[['rf_weight', 'gbr_weight', 'lr_weight']].sum(axis=1)
# If every weight is zero, the ensemble prediction collapses to 0 (no position);
# replacing the zero sum just avoids division by zero
weight_sum = weight_sum.replace(0, 1)

test['ensemble_pred'] = (
    test['rf_pred'] * test['rf_weight'] +
    test['gbr_pred'] * test['gbr_weight'] +
    test['lr_pred'] * test['lr_weight']
) / weight_sum

This dynamic weighting adapts to regime changes. If one model starts underperforming, its weight drops automatically.

Another technique: meta-labeling (popularized by Marcos López de Prado). Instead of predicting return direction, predict the size of the bet. A primary model generates signals (long/short), and a secondary ML model predicts the probability of the primary model being correct. You size positions by this probability. This combines the primary model’s directional intuition with a meta-model’s confidence calibration.
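A bare-bones sketch of the mechanics (illustrative only: fitting and scoring the meta-model on the same window, as below, leaks; in practice you’d fit it walk-forward like everything else):

# Primary signal: long when the regression model's prediction clears the cost threshold
primary_signal = (test['predicted_return'] > transaction_cost).astype(int)

# Meta-label: 1 if acting on the primary signal would have made money
meta_y = ((primary_signal * test['target_log_return']) > 0).astype(int)

# Secondary model estimates P(primary signal is correct)
meta_model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
meta_model.fit(test[feature_cols], meta_y)  # leaks here; use a separate window in practice
p_correct = meta_model.predict_proba(test[feature_cols])[:, 1]

# Bet sizing: scale the position by the meta-model's confidence
sized_position = primary_signal * p_correct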

I’m not entirely sure meta-labeling provides a huge edge in practice—it’s theoretically appealing but requires very careful implementation to avoid overfitting on the meta-model. My best guess is it helps when your primary model has high variance (sometimes very right, sometimes very wrong), and the meta-model can learn when to trust it.

Handling Regime Shifts with Online Learning

Markets aren’t stationary. A model trained on 2019 data will degrade by 2023. The usual approach: retrain periodically (e.g., monthly). But you can do better with online learning—updating the model incrementally as new data arrives.

For tree-based models, this is tricky (you can’t easily update a trained Random Forest). But for linear models or neural networks, stochastic gradient descent naturally supports online updates.

Here’s a simple online linear regression:

from sklearn.linear_model import SGDRegressor

# Initialize model
online_model = SGDRegressor(loss='squared_error', learning_rate='constant', eta0=0.01)

# Simulate online updates: partial-fit on the trailing 100-day window each day,
# then predict the next. NOTE: SGD is scale-sensitive; standardize features in practice.
predictions = []
for i in range(100, len(df) - 1):  # start after 100 days for feature stability
    X_train_online = df[feature_cols].iloc[i-100:i].values
    y_train_online = df['target_log_return'].iloc[i-100:i].values

    # Partial fit (online update)
    online_model.partial_fit(X_train_online, y_train_online)

    # Predict next day
    X_next = df[feature_cols].iloc[i:i+1].values
    pred = online_model.predict(X_next)[0]
    predictions.append(pred)

df_online_test = df.iloc[100:len(df)-1].copy()
df_online_test['predicted_return'] = predictions
df_online_test['position'] = np.where(df_online_test['predicted_return'] > transaction_cost, 1, 0)
df_online_test['strategy_return'] = df_online_test['position'] * df_online_test['target_log_return'] - (df_online_test['position'] * transaction_cost)

print(f"Online Learning Return: {(np.exp(df_online_test['strategy_return'].sum()) - 1):.2%}")

Online learning adapts to distribution shift, but it’s also more prone to overfitting recent noise. You need to tune the learning rate carefully. Too high, and the model chases noise; too low, and it doesn’t adapt.

A hybrid approach: retrain the full model quarterly, and apply online updates in between. This balances stability and adaptability.
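A sketch of that hybrid schedule (the 63-day cadence, roughly one trading quarter, is an assumption):

from sklearn.linear_model import SGDRegressor

retrain_every = 63  # ~one quarter of trading days (assumption)
model = SGDRegressor(loss='squared_error', learning_rate='constant', eta0=0.01)

X_all = df[feature_cols].values
y_all = df['target_log_return'].values

hybrid_preds = []
for i in range(100, len(df) - 1):
    if (i - 100) % retrain_every == 0:
        # Stability: full refit from scratch on all history up to day i
        model = SGDRegressor(loss='squared_error', learning_rate='constant', eta0=0.01)
        model.fit(X_all[:i], y_all[:i])
    else:
        # Adaptability: incremental update on the single newest observation
        model.partial_fit(X_all[i-1:i], y_all[i-1:i])
    hybrid_preds.append(model.predict(X_all[i:i+1])[0])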

When to Actually Use These Models (and When Not To)

Here’s the honest take: ML models for stock price prediction work best as one input in a broader trading system, not as the sole decision-maker. Use them to generate signals, then layer on risk management (from Part 5), portfolio optimization, and regime filters.

Use ML when:
– You have rich feature sets (alternative data, cross-asset signals, not just lagged prices)
– You’re trading less liquid assets where inefficiencies persist longer
– You’re willing to iterate constantly—models degrade, and maintenance is part of the job
– You have infrastructure for walk-forward validation and online retraining

Don’t use ML when:
– You’re trading highly liquid, efficient markets (large-cap US equities) on short timeframes (intraday)
– Transaction costs are high relative to expected edge
– You can’t monitor and retrain regularly
– You’re treating it as a black box without understanding why it might work

For most retail traders, a simple momentum + mean reversion factor combination (no ML) is often more robust than an overfitted gradient boosting model. ML adds complexity, and complexity without corresponding edge is just risk.
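For reference, that kind of non-ML baseline fits in a few lines (the lookbacks and the 50/50 blend are illustrative assumptions, not recommendations):

# Simple momentum + mean-reversion blend; illustrative, not advice
momentum = np.sign(df['Close'].pct_change(126))            # ~6-month trend: +1/-1
zscore = (df['Close'] - df['Close'].rolling(20).mean()) / df['Close'].rolling(20).std()
reversion = -zscore.clip(-1, 1)                            # fade short-term stretches
position = (0.5 * momentum + 0.5 * reversion).clip(0, 1)   # long/flat blend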

That said, if you’re building a multi-strategy system, having an ML-based component alongside factor models and technical strategies diversifies your approach. Just don’t bet the farm on it.

What I’m Still Figuring Out

Reinforcement learning for trading is an area I haven’t fully explored but find intriguing. Framing trading as an RL problem (state = market features, action = position sizing, reward = P&L) lets the agent learn policies that account for transaction costs and risk directly in the objective. The challenge: reward sparsity and non-stationarity make training unstable. Proximal Policy Optimization (PPO) or Soft Actor-Critic (SAC) seem like reasonable starting points, but I haven’t seen compelling evidence that RL outperforms well-tuned supervised models in practice—at least not without significant infrastructure and data.

Another open question: how much does feature engineering vs. model choice matter? My intuition says features are 70% of the battle, and the model (Random Forest vs XGBoost vs neural net) is 30%. But I’d love to see systematic ablation studies on this. Most research focuses on fancy architectures and ignores the feature side.

In Part 7, we’ll shift to pairs trading and statistical arbitrage—strategies that rely less on predicting absolute direction and more on relative value. These are some of the few quant strategies where retail traders can still find edge, especially in less-covered markets.

Quant Investment with Python Series (6/8)
