Time Series Stock Forecasting: Why ARIMA Failed and LSTM Barely Worked

Updated Feb 6, 2026
⚡ Key Takeaways
  • ARIMA models on stock prices typically produce forecasts that lag one day behind, making them useless for trading despite low error metrics.
  • LSTM networks with technical indicators achieve around 52-55% directional accuracy, barely better than random chance due to market efficiency.
  • Classification-based regime prediction and proper risk management matter more than raw price forecasting accuracy for profitable trading.
  • Transaction costs, slippage, and model decay across market regimes eliminate most edges found through basic forecasting on public daily data.

The textbook models don’t work

Most tutorials will tell you ARIMA is the gold standard for time series forecasting. I tried it on Apple stock data and got predictions that lagged one day behind the actual price—perfectly useless for trading. The model just learned to output yesterday’s value.

LSTM networks are supposed to solve this. They can learn long-term dependencies, handle non-linear patterns, all that good stuff. But after training for 100 epochs on five years of daily close prices, my model’s validation loss plateaued at a point where predictions were still just smoothed versions of recent history. Not terrible, but not actionable either.

The problem isn’t the models—it’s that stock prices are really hard to predict. The efficient market hypothesis isn’t just theory; it shows up in your loss curves.


What I actually tested

I used the Apple stock data from Part 1 and the technical indicators from Part 2. The dataset spans 2019-2024, daily OHLCV data pulled from Yahoo Finance. For ARIMA, I worked with the close price only. For LSTM, I fed in close, volume, RSI, MACD, and Bollinger Band percentile (all normalized).

Here’s the ARIMA setup:

from statsmodels.tsa.arima.model import ARIMA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data (from Part 1)
df = pd.read_csv('AAPL_stock_data.csv', parse_dates=['Date'], index_col='Date')
prices = df['Close'].values

# Train/test split (80/20)
split_idx = int(len(prices) * 0.8)
train, test = prices[:split_idx], prices[split_idx:]

# Fit ARIMA(5,1,0) — I tried a dozen combinations, this one had lowest AIC.
# Walk-forward one-step forecasts: refit on the growing history each day,
# so every prediction is a true next-day forecast. (A single multi-step
# forecast over the whole test set would just flatten toward a drift line.)
history = list(train)
forecasts = []
for t in range(len(test)):
    model_fit = ARIMA(history, order=(5, 1, 0)).fit()
    forecasts.append(model_fit.forecast(steps=1)[0])
    history.append(test[t])

forecasts = np.array(forecasts)
mae = np.mean(np.abs(forecasts - test))
print(f"ARIMA MAE: ${mae:.2f}")  # Got ~$3.50 on my run

The MAE looks reasonable until you plot it. The forecast line is just a shifted copy of the actual prices, offset by one day. This happens because ARIMA with differencing ($d = 1$) learns that the best predictor of tomorrow’s price is today’s price plus a tiny adjustment. Mathematically sound, practically worthless.
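A sanity check that makes the lag obvious: benchmark against a naive forecast that just repeats yesterday’s close. If ARIMA is really echoing the last observation, the two MAEs should come out nearly identical. (This snippet reuses train and test from above.)

# Naive baseline: predict tomorrow's close = today's close
naive_forecast = np.concatenate(([train[-1]], test[:-1]))
naive_mae = np.mean(np.abs(naive_forecast - test))
print(f"Naive MAE: ${naive_mae:.2f}")
# If this lands within pennies of the ARIMA MAE, the model adds nothing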

LSTM with multiple features

LSTMs need sequences, not individual points. I used a sliding window of 60 days to predict the next day’s close. The input shape is (batch_size, 60, 5) for five features. I normalized everything with MinMaxScaler to keep gradients stable (learned this the hard way after watching loss explode on raw prices).

import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler

# Prepare features (using indicators from Part 2)
features = df[['Close', 'Volume', 'RSI', 'MACD', 'BB_pct']].dropna()
scaler = MinMaxScaler()
scaled = scaler.fit_transform(features)

# Create sequences
seq_len = 60
X, y = [], []
for i in range(seq_len, len(scaled)):
    X.append(scaled[i-seq_len:i])
    y.append(scaled[i, 0])  # predict close price

X = np.array(X)
y = np.array(y)

# Train/test split
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Convert to tensors
X_train = torch.FloatTensor(X_train)
y_train = torch.FloatTensor(y_train)
X_test = torch.FloatTensor(X_test)
y_test = torch.FloatTensor(y_test)

class StockLSTM(nn.Module):
    def __init__(self, input_size=5, hidden_size=50, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, 
                           batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x shape: (batch, seq_len, features)
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])  # use last timestep
        return out

model = StockLSTM()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop (100 epochs, batch_size=32)
for epoch in range(100):
    model.train()
    for i in range(0, len(X_train), 32):
        batch_X = X_train[i:i+32]
        batch_y = y_train[i:i+32]

        outputs = model(batch_X)
        loss = criterion(outputs.squeeze(), batch_y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if (epoch+1) % 20 == 0:
        model.eval()
        with torch.no_grad():
            val_pred = model(X_test)
            val_loss = criterion(val_pred.squeeze(), y_test)
        print(f"Epoch {epoch+1}, Val Loss: {val_loss.item():.6f}")

After 100 epochs, validation loss stabilized around 0.0008 (on the normalized scale). Not great, not terrible. The loss $L$ is just MSE:

$$L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

where $\hat{y}_i$ is the predicted normalized price and $y_i$ is the actual. But MSE doesn’t tell you if the model is actually useful—it could be minimizing loss by outputting the mean price every time.
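A quick way to rule out that failure mode is to score a constant predictor on the same data; if the LSTM’s validation loss isn’t clearly below it, the network hasn’t learned much beyond the average price level. This check wasn’t part of my original run:

# Baseline: always predict the mean of the training targets
baseline_pred = torch.full_like(y_test, y_train.mean().item())
baseline_mse = criterion(baseline_pred, y_test)
print(f"Mean-predictor MSE: {baseline_mse.item():.6f}")
# The LSTM's 0.0008 only means something if it beats this number comfortably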

Evaluating predictions properly

I denormalized the predictions and computed directional accuracy: did the model predict the price would go up when it actually went up? This matters more than MAE for trading.

model.eval()
with torch.no_grad():
    predictions = model(X_test).squeeze().numpy()

# Denormalize
pred_full = np.zeros((len(predictions), features.shape[1]))
pred_full[:, 0] = predictions
pred_prices = scaler.inverse_transform(pred_full)[:, 0]

actual_full = np.zeros((len(y_test), features.shape[1]))
actual_full[:, 0] = y_test.numpy()
actual_prices = scaler.inverse_transform(actual_full)[:, 0]

# Directional accuracy
pred_direction = np.diff(pred_prices) > 0
actual_direction = np.diff(actual_prices) > 0
dir_accuracy = np.mean(pred_direction == actual_direction)

print(f"Directional Accuracy: {dir_accuracy:.2%}")  # Got 52.3%

52.3%. Slightly better than a coin flip. The model learned something, but not enough to trade on.
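Before reading anything into 52.3%, it’s worth asking whether it’s even statistically distinguishable from 50% on a test set this small. A quick binomial test (again, a sanity check I’d add, not part of the original run):

from scipy.stats import binomtest

n = len(pred_direction)  # number of test-set transitions
hits = int((pred_direction == actual_direction).sum())
result = binomtest(hits, n, p=0.5, alternative='greater')
print(f"{hits}/{n} correct, one-sided p-value: {result.pvalue:.3f}")
# With only ~250 test days, 52.3% typically fails to clear p < 0.05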

I’m not entirely sure why the LSTM didn’t do better—maybe 60-day sequences are too short to capture meaningful patterns, or maybe stock prices genuinely are that noisy. The efficient market hypothesis says all public information is already priced in, so unless you have insider info or absurdly low latency, you’re fighting randomness.

Attention mechanisms and transformers

Transformers are hot right now, and a few papers claim they beat LSTMs on time series tasks. The key idea is self-attention: instead of processing sequences left-to-right, the model learns which past timesteps are relevant. For stocks, this could mean weighting earnings dates or Fed announcements more heavily than random Tuesdays.

The attention weight $\alpha_{ij}$ for timestep $i$ attending to timestep $j$ is:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

where $e_{ij} = \frac{Q_i K_j^\top}{\sqrt{d_k}}$ is the scaled dot-product score. $Q$, $K$, $V$ are the query, key, and value matrices derived from the input. The output for timestep $i$ is:

$$\text{Attention}(Q, K, V)_i = \sum_{j=1}^{T} \alpha_{ij} V_j$$
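To make those equations concrete, here’s a bare-bones single-head version in PyTorch. This is a sketch of the mechanism, not the encoder I’d actually train:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """Minimal scaled dot-product self-attention over a price sequence."""
    def __init__(self, input_size=5, d_k=32):
        super().__init__()
        self.d_k = d_k
        self.W_q = nn.Linear(input_size, d_k, bias=False)
        self.W_k = nn.Linear(input_size, d_k, bias=False)
        self.W_v = nn.Linear(input_size, d_k, bias=False)

    def forward(self, x):
        # x: (batch, seq_len, features), same layout as the LSTM input
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5  # e_ij
        alpha = F.softmax(scores, dim=-1)                   # alpha_ij
        return alpha @ V, alpha  # context vectors and attention weights

attn = SingleHeadAttention()
out, alpha = attn(torch.randn(1, 60, 5))  # one 60-day window, 5 features
print(out.shape, alpha.shape)  # (1, 60, 32), (1, 60, 60)

Inspecting alpha directly is the appeal here: each row shows which past days the model weighted when producing that timestep’s output.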

I sketched out a simple transformer encoder but didn’t train it—my laptop was already struggling with the LSTM. If I were doing this on a serious scale, I’d try the Temporal Fusion Transformer (Lim et al., 2021), which adds interpretability by exposing which features the model focuses on.

Hybrid approach: indicators + momentum

Pure price forecasting is hard, but predicting regimes might be easier. Instead of forecasting exact prices, classify the next day as “strong up” / “neutral” / “strong down” based on whether the return exceeds some threshold (say, ±1%). Then you’re doing classification, not regression, and you can use precision/recall to evaluate.

Here’s a quick sketch:

# Create labels based on next-day return
returns = df['Close'].pct_change().shift(-1)  # tomorrow's return
labels = pd.cut(returns, bins=[-np.inf, -0.01, 0.01, np.inf], 
                labels=[0, 1, 2])  # down, neutral, up

# Now train a classifier (RandomForest, XGBoost, whatever)
from sklearn.ensemble import RandomForestClassifier

feature_cols = ['RSI', 'MACD', 'BB_pct', 'Volume']
X = df[feature_cols].dropna()
y = labels[X.index]

# Drop rows without a label (the last row has no next-day return)
valid = y.notna()
X, y = X[valid], y[valid]

# Train/test split (shuffle=False keeps time order, no lookahead)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf.fit(X_train, y_train)

from sklearn.metrics import classification_report
preds = clf.predict(X_test)
print(classification_report(y_test, preds))

I ran this and got ~45% accuracy across three classes, which is better than random (33%) but still not tradeable. The neutral class dominated, which makes sense—most days are boring. If you collapse it to binary (up vs down, ignoring neutral), you get back to ~52%, same as the LSTM.

But here’s the thing: even 55% directional accuracy can be profitable if you manage risk properly. A 55% win rate with a 1.5:1 reward-risk ratio (average win is 1.5x average loss) gives positive expected value. The hard part is finding that extra 3%.
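The arithmetic behind that claim, with the win rate and payoff ratio as assumed inputs rather than measured results:

# Expected value per trade in units of risk (R); illustrative numbers only
win_rate, reward_risk = 0.55, 1.5

ev = win_rate * reward_risk - (1 - win_rate) * 1.0
print(f"EV: {ev:+.3f}R per trade")                         # +0.375R
print(f"Breakeven win rate: {1 / (1 + reward_risk):.0%}")  # 40%, before costs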

What actually matters: risk management and costs

Forecasting is only one piece. Transaction costs (commissions, slippage, bid-ask spread) eat into returns. If your model predicts a 0.5% move but trading costs 0.3% round-trip, you’re barely break-even.
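Here’s how much costs move the goalposts, using the 1.5:1 payoff from above and the 0.3% round-trip cost from this paragraph (the absolute win/loss sizes are assumptions, not numbers from my backtest):

# Breakeven win rate with and without round-trip costs (illustrative sizes)
avg_win, avg_loss = 0.0075, 0.005   # 0.75% avg win vs 0.5% avg loss (1.5:1)
cost = 0.003                        # 0.3% round-trip trading cost

breakeven = avg_loss / (avg_win + avg_loss)
breakeven_costed = (avg_loss + cost) / (avg_win + avg_loss)
print(f"Breakeven win rate: {breakeven:.0%} -> {breakeven_costed:.0%}")
# 40% -> 64%: costs alone can push breakeven past the 52-55% accuracy
# these models actually hit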

And then there’s overfitting. My LSTM was trained on 2019-2024 Apple data. Does it generalize to Tesla? Microsoft? Probably not. Market regimes shift—COVID crash, Fed rate hikes, meme stock mania—and models trained on old data decay fast.

I haven’t tested this at scale, but my guess is that any edge you find from forecasting alone will be tiny and fragile. The real alpha comes from things like:
– Faster data (tick-level, not daily)
– Alternative data (satellite images of parking lots, credit card transactions)
– Microstructure modeling (order book dynamics, liquidity)
– Systematic risk management (position sizing, stop-losses)

Pure price prediction with public daily data? That’s a solved problem, and the solution is “barely possible.”

Where I’d go next

If I were serious about this, I’d drop daily close prices and work with intraday data. A 1-minute forecast on SPY with a 30-second hold period is a different game—patterns exist at that timescale because not everyone has HFT infrastructure. But you’d need a broker API with sub-second latency and a serious risk engine.

For educational purposes, I’m more interested in ensemble methods: train separate models on different regimes (low volatility vs high volatility, bull market vs bear market) and switch between them dynamically. This is essentially what hedge funds do with multi-strategy portfolios.
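A minimal sketch of that idea, reusing the classifier split from earlier and splitting on realized volatility. The 20-day window and median cutoff are arbitrary choices, not tuned:

# Regime-switching ensemble: one model per volatility regime (sketch)
vol = df['Close'].pct_change().rolling(20).std()
regime = (vol > vol.median()).astype(int)  # 0 = calm, 1 = volatile
# Caveat: the full-sample median leaks a little future info; fine for a sketch

models = {}
for r in (0, 1):
    in_regime = regime.loc[X_train.index] == r
    m = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    m.fit(X_train[in_regime], y_train[in_regime])
    models[r] = m

# Route each test day to its regime's model
test_regime = regime.loc[X_test.index]
preds = pd.Series(index=X_test.index, dtype=object)
for r in (0, 1):
    sel = test_regime == r
    if sel.any():
        preds[sel] = models[r].predict(X_test[sel])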

Another angle: reinforcement learning. Instead of forecasting prices, train an agent to maximize cumulative returns directly. The reward function is portfolio value, and the agent learns when to hold, buy, or sell. This avoids the predict-then-trade pipeline and optimizes for what you actually care about. But RL is sample-inefficient and needs a ton of data, so you’d probably train in simulation first (using historical data as the environment) and hope it transfers to live markets. Spoiler: it usually doesn’t.

Use indicators, not raw prices

If you take one thing from this series, make it this: raw OHLC data is almost useless for prediction. Technical indicators (RSI, MACD, Bollinger Bands) and volume patterns give you more signal. They’re still not enough on their own, but they’re a better starting point than close prices.

For ARIMA or any classical method, skip it unless you’re forecasting something slow-moving like monthly sales. Stock prices violate every assumption these models make (stationarity, normality, constant variance).
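You can confirm the stationarity point directly with an augmented Dickey-Fuller test, where the null hypothesis is a unit root (non-stationarity):

from statsmodels.tsa.stattools import adfuller

for name, series in [('Prices', df['Close']),
                     ('Returns', df['Close'].pct_change().dropna())]:
    pvalue = adfuller(series)[1]
    print(f"{name}: ADF p-value = {pvalue:.4f}")
# Expected pattern: prices fail to reject (p >> 0.05), returns reject easily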

For LSTMs or transformers, they can work if you have enough data and the right features, but expect directional accuracy in the 52-55% range at best. That’s not nothing, but it’s not a money printer either.

And if you’re building this for fun or learning, great—just don’t risk real money until you’ve accounted for transaction costs, slippage, and the fact that your backtest is always more optimistic than live trading. The market is an adversarial environment, and it’s very good at punishing overconfidence.

Stock Price Data Analysis Series (3/3)
