Classical vs Neural Time Series: ARIMA and LSTM Head-to-Head on Gold Prices

⚡ Key Takeaways
  • ARIMA achieves $18.42 RMSE versus LSTM's $31.87 on one-step-ahead gold price predictions, contrary to deep learning hype.
  • At 30-day horizons the relationship inverts: LSTM's $52.17 RMSE significantly outperforms ARIMA's $89.23 due to ARIMA's exploding forecast variance.
  • Stationarity testing via ADF is mandatory before applying either model—raw gold prices fail stationarity with p=0.66 while differenced prices pass with p≈0.
  • MinMaxScaler boundaries create hidden extrapolation problems when test prices exceed training range, contributing to LSTM's higher short-term error.
  • Use ARIMA for 1-7 day forecasts and LSTM for 2+ week horizons, but neither captures external macroeconomic factors driving gold prices.

The Benchmark That Changed My Mind

ARIMA beats LSTM on gold price forecasting.

That’s not a typo. When I first ran these experiments on the 10-year gold dataset from Part 1, the classical statistical model outperformed the neural network by a comfortable margin on short-term predictions. The RMSE difference wasn’t even close—ARIMA hit $18.42 versus LSTM’s $31.87 on one-step-ahead forecasts across the test set. I’d expected the deep learning approach to dominate given all the hype, but the numbers tell a different story.

But here’s where it gets interesting: extend the forecast horizon to 30 days and the relationship flips completely.


Stationarity: The Elephant in Every Time Series Room

Before comparing models, we need to address the single biggest mistake people make with financial time series. Raw gold prices aren’t stationary—they wander around with trends and changing variance, which violates the core assumptions of most forecasting methods. ARIMA explicitly requires stationarity. LSTM doesn’t technically require it, but training on non-stationary data leads to models that memorize recent levels instead of learning actual patterns.

The standard fix is differencing: instead of predicting tomorrow’s price $P_{t+1}$, you predict the change $\Delta P_t = P_{t+1} - P_t$. First-order differencing usually does the trick for prices, though you’ll occasionally need second-order differencing for particularly stubborn trends.

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Load the gold price data (assuming daily OHLC format)
gold_df = pd.read_csv('gold_prices_10yr.csv', parse_dates=['Date'])
gold_df = gold_df.sort_values('Date').set_index('Date')

close_prices = gold_df['Close'].dropna()

def check_stationarity(series, name='series'):
    result = adfuller(series, autolag='AIC')
    print(f"{name}: ADF stat={result[0]:.4f}, p-value={result[1]:.6f}")
    return result[1] < 0.05  # stationary if p < 0.05

# Raw prices: almost certainly non-stationary
check_stationarity(close_prices, 'Raw prices')
# Output: Raw prices: ADF stat=-1.2341, p-value=0.659823

# First difference: should be stationary
price_diff = close_prices.diff().dropna()
check_stationarity(price_diff, 'First difference')
# Output: First difference: ADF stat=-32.1876, p-value=0.000000

The Augmented Dickey-Fuller test quantifies what your eyes already suspect when looking at a price chart. A p-value of 0.66 on raw prices means we can’t reject the null hypothesis of a unit root (non-stationarity). After differencing, that p-value drops to essentially zero. The test statistic $\tau$ comes from regressing the differenced series on its lagged value: $\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta y_{t-i} + \epsilon_t$, where we’re testing whether $\gamma = 0$.

ARIMA: The Classical Workhorse

ARIMA(p, d, q) combines three components: autoregression on p lagged values, d orders of differencing, and a moving average of q lagged forecast errors. The model for the differenced series looks like:

$$\Delta^d y_t = c + \sum_{i=1}^{p} \phi_i \Delta^d y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t$$

Finding the right (p, d, q) combination used to be an art involving ACF and PACF plots. These days, pmdarima's auto_arima handles the grunt work through information criterion optimization (AIC/BIC).
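If you want to sanity-check whatever auto_arima picks, the old-school route is still available. Here is a minimal sketch of the ACF/PACF inspection, assuming the price_diff series from the stationarity check above is still in scope:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Inspect the differenced series: a sharp PACF cutoff hints at the AR order p,
# a sharp ACF cutoff hints at the MA order q
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(price_diff, lags=40, ax=axes[0])
plot_pacf(price_diff, lags=40, ax=axes[1])
plt.tight_layout()
plt.show()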

import pmdarima as pm
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')  # pmdarima throws a lot of ConvergenceWarnings

# Split data: 80% train, 20% test
train_size = int(len(close_prices) * 0.8)
train_prices = close_prices[:train_size]
test_prices = close_prices[train_size:]

print(f"Training on {len(train_prices)} days, testing on {len(test_prices)} days")
# Output: Training on 2012 days, testing on 503 days

# Auto ARIMA - this takes a while on 2000+ observations
arima_model = pm.auto_arima(
    train_prices,
    start_p=0, start_q=0,
    max_p=5, max_q=5,
    d=1,  # we know we need one difference
    seasonal=False,  # daily gold prices don't have strong seasonality
    stepwise=True,
    suppress_warnings=True,
    error_action='ignore',
    trace=False
)

print(f"Selected model: ARIMA{arima_model.order}")
print(f"AIC: {arima_model.aic():.2f}")
# Output: Selected model: ARIMA(2, 1, 2)
# Output: AIC: 14892.31

The auto_arima function settled on ARIMA(2, 1, 2) for my dataset. Your mileage will vary depending on the exact date range—I’ve seen it pick anywhere from (1, 1, 1) to (3, 1, 3) on different gold price windows. This sensitivity to the training period is one of ARIMA’s underappreciated weaknesses.
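To see that sensitivity for yourself, one quick-and-dirty check is to rerun auto_arima on a few trailing windows of the training set and watch the selected order move around. A rough sketch (252 trading days per year is the usual approximation):

# Rough sensitivity check: refit on different trailing windows
for years in (3, 5, 8):
    window = train_prices.iloc[-252 * years:]  # ~252 trading days per year
    m = pm.auto_arima(window, d=1, seasonal=False,
                      suppress_warnings=True, error_action='ignore')
    print(f"Last ~{years} years -> ARIMA{m.order}")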

# Generate forecasts for the test period
# Note: ARIMA forecasts degrade quickly, so we'll use rolling predictions
arima_predictions = []
history = list(train_prices)

for i in range(len(test_prices)):
    # Fit on history, predict one step
    model = pm.ARIMA(order=arima_model.order)
    model.fit(history)
    pred = model.predict(n_periods=1)[0]
    arima_predictions.append(pred)

    # Add actual value to history for next iteration
    history.append(test_prices.iloc[i])

    if i % 100 == 0:
        print(f"Progress: {i}/{len(test_prices)}")

arima_predictions = np.array(arima_predictions)

# Calculate metrics
arima_rmse = np.sqrt(mean_squared_error(test_prices, arima_predictions))
arima_mae = mean_absolute_error(test_prices, arima_predictions)
print(f"ARIMA RMSE: ${arima_rmse:.2f}, MAE: ${arima_mae:.2f}")
# Output: ARIMA RMSE: $18.42, MAE: $13.21

That rolling forecast loop is computationally expensive—refitting ARIMA at each step took about 4 minutes on my M2 MacBook for 500 test points. There’s a way to update ARIMA incrementally without full refitting, but statsmodels’ implementation has some quirks that make it unreliable in my experience.
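For completeness, the incremental route with pmdarima looks roughly like the sketch below: the update() method folds new observations into the fitted model without a full refit. Given the caveat above, validate its predictions against the full rolling refit before trusting the numbers.

# Sketch: fit once, then update() with each new observation instead of refitting.
# Much faster, but parameter estimates can drift from what a full refit would give.
rolling_model = pm.ARIMA(order=arima_model.order).fit(train_prices.values)
fast_preds = []
for i in range(len(test_prices)):
    fast_preds.append(rolling_model.predict(n_periods=1)[0])
    rolling_model.update([test_prices.iloc[i]])  # fold in the actual observed price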

LSTM: The Neural Alternative

Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997) handle sequences through a gating mechanism that controls information flow. The core update equations involve forget, input, and output gates:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

The forget gate $f_t$ decides what to discard from the cell state, while the input gate $i_t$ controls what new information to store. This architecture theoretically lets the network learn long-range dependencies that ARIMA’s linear structure can’t capture.
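To make the gate equations concrete, here is a bare-bones NumPy walkthrough of a single cell step with toy dimensions and random weights. It is purely illustrative, not how Keras implements the layer internally:

# One LSTM cell step in plain NumPy (toy sizes, random weights)
rng = np.random.default_rng(0)
hidden, n_in = 4, 1
Wf, Wi, Wc, Wo = (rng.normal(size=(hidden, hidden + n_in)) for _ in range(4))
bf = bi = bc = bo = np.zeros(hidden)
h_prev, C_prev, x_t = np.zeros(hidden), np.zeros(hidden), np.array([0.5])

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
f_t = sigmoid(Wf @ z + bf)                 # forget gate: what to drop from C_{t-1}
i_t = sigmoid(Wi @ z + bi)                 # input gate: what new info to store
C_tilde = np.tanh(Wc @ z + bc)             # candidate cell state
C_t = f_t * C_prev + i_t * C_tilde         # updated cell state
h_t = sigmoid(Wo @ z + bo) * np.tanh(C_t)  # output gate -> new hidden state
print(h_t.round(3))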

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler

# Prepare data for LSTM - needs to be scaled and sequenced
scaler = MinMaxScaler(feature_range=(0, 1))

# Scale on training data only to avoid data leakage
train_scaled = scaler.fit_transform(train_prices.values.reshape(-1, 1))
test_scaled = scaler.transform(test_prices.values.reshape(-1, 1))

# Create sequences: use past 60 days to predict next day
def create_sequences(data, lookback=60):
    X, y = [], []
    for i in range(lookback, len(data)):
        X.append(data[i-lookback:i, 0])
        y.append(data[i, 0])
    return np.array(X), np.array(y)

# Combine train and test for sequence creation, then split
all_scaled = np.vstack([train_scaled, test_scaled])
X_all, y_all = create_sequences(all_scaled, lookback=60)

# Split back - account for the lookback offset
split_idx = len(train_scaled) - 60
X_train, y_train = X_all[:split_idx], y_all[:split_idx]
X_test, y_test = X_all[split_idx:], y_all[split_idx:]

# Reshape for LSTM: [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

print(f"Training shape: {X_train.shape}, Test shape: {X_test.shape}")
# Output: Training shape: (1952, 60, 1), Test shape: (503, 60, 1)

The 60-day lookback window is somewhat arbitrary. I’ve tried values from 20 to 120, and 60 consistently performs well without excessive computational cost. My best guess is that it captures roughly one market quarter’s worth of patterns, though I haven’t done a rigorous hyperparameter search to confirm this.
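If you want to probe that choice yourself, a crude sweep looks something like this. It is a sketch with a deliberately small network and few epochs, so treat the resulting numbers as directional only:

# Crude lookback sweep on the training data only (small model, few epochs)
for lookback in (20, 40, 60, 90, 120):
    X_lb, y_lb = create_sequences(train_scaled, lookback=lookback)
    X_lb = X_lb.reshape((X_lb.shape[0], lookback, 1))
    sweep_model = Sequential([LSTM(32, input_shape=(lookback, 1)), Dense(1)])
    sweep_model.compile(optimizer='adam', loss='mse')
    hist = sweep_model.fit(X_lb, y_lb, epochs=10, batch_size=32,
                           validation_split=0.1, verbose=0)
    print(f"lookback={lookback:3d}: best val_loss={min(hist.history['val_loss']):.5f}")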

# Build a fairly standard LSTM architecture
tf.random.set_seed(42)

model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(60, 1)),
    Dropout(0.2),
    LSTM(50, return_sequences=False),
    Dropout(0.2),
    Dense(25, activation='relu'),
    Dense(1)
])

model.compile(optimizer='adam', loss='mse')
model.summary()

# Train with early stopping
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=[early_stop],
    verbose=1
)
# Training stopped at epoch 47 on my run

A warning about TensorFlow versions: this code runs on TF 2.15, but if you’re on 2.12 or earlier, you might see different convergence behavior due to changes in the Adam optimizer defaults. The restore_best_weights=True flag is crucial—without it, early stopping gives you the weights from the final epoch, not the best one.

# Generate predictions
lstm_predictions_scaled = model.predict(X_test, verbose=0)
lstm_predictions = scaler.inverse_transform(lstm_predictions_scaled).flatten()

# With this split, X_test yields one prediction per test day; the defensive
# slice below just keeps the comparison aligned if the lookback setup changes
test_actual = test_prices.iloc[:len(lstm_predictions)].values

lstm_rmse = np.sqrt(mean_squared_error(test_actual, lstm_predictions))
lstm_mae = mean_absolute_error(test_actual, lstm_predictions)
print(f"LSTM RMSE: ${lstm_rmse:.2f}, MAE: ${lstm_mae:.2f}")
# Output: LSTM RMSE: $31.87, MAE: $24.53

Wait, $31.87 RMSE versus ARIMA's $18.42? That’s almost 75% worse.

Where Each Model Breaks Down

The one-step-ahead results favor ARIMA, but that’s not the full picture. Financial forecasting cares about multi-step horizons—can you predict where prices will be in a week? A month?

# Multi-step forecast comparison
def arima_multistep_forecast(train_data, n_steps):
    model = pm.auto_arima(train_data, d=1, seasonal=False, 
                          suppress_warnings=True, error_action='ignore')
    return model.predict(n_periods=n_steps)

def lstm_multistep_forecast(model, last_sequence, scaler, n_steps):
    predictions = []
    current_seq = last_sequence.copy()

    for _ in range(n_steps):
        pred = model.predict(current_seq.reshape(1, -1, 1), verbose=0)[0, 0]
        predictions.append(pred)
        # Roll the sequence forward
        current_seq = np.roll(current_seq, -1)
        current_seq[-1] = pred

    return scaler.inverse_transform(np.array(predictions).reshape(-1, 1)).flatten()

# Test on 7-day and 30-day horizons
test_start_idx = train_size
last_train_seq = train_scaled[-60:, 0]

# 7-day forecast
arima_7d = arima_multistep_forecast(train_prices, 7)
lstm_7d = lstm_multistep_forecast(model, last_train_seq, scaler, 7)
actual_7d = close_prices.iloc[test_start_idx:test_start_idx+7].values

print("7-day forecast RMSE:")
print(f"  ARIMA: ${np.sqrt(mean_squared_error(actual_7d, arima_7d)):.2f}")
print(f"  LSTM:  ${np.sqrt(mean_squared_error(actual_7d, lstm_7d)):.2f}")
# Output:
# 7-day forecast RMSE:
#   ARIMA: $21.34
#   LSTM:  $28.91

# 30-day forecast
arima_30d = arima_multistep_forecast(train_prices, 30)
lstm_30d = lstm_multistep_forecast(model, last_train_seq, scaler, 30)
actual_30d = close_prices.iloc[test_start_idx:test_start_idx+30].values

print("30-day forecast RMSE:")
print(f"  ARIMA: ${np.sqrt(mean_squared_error(actual_30d, arima_30d)):.2f}")
print(f"  LSTM:  ${np.sqrt(mean_squared_error(actual_30d, lstm_30d)):.2f}")
# Output:
# 30-day forecast RMSE:
#   ARIMA: $89.23
#   LSTM:  $52.17

Now that’s interesting. ARIMA’s error explodes at 30 days while LSTM degrades more gracefully. The gap between $89 and $52 matters: ARIMA’s 30-day error works out to roughly 4-5% of the gold price at these levels—meaningful in trading terms.

Why does ARIMA fall apart on longer horizons? The forecast variance for an ARIMA(p, d, q) model grows without bound as the horizon increases. For a random walk with drift (which gold prices roughly follow), the forecast standard deviation grows proportionally to $\sqrt{h}$, where $h$ is the horizon. After 30 steps, uncertainty compounds dramatically, and ARIMA’s point forecasts converge toward a flat line at the last observed value.
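You can put a rough number on that growth. For a pure random walk, the h-step forecast standard deviation is the one-step standard deviation times $\sqrt{h}$, so using the training set's daily price changes as a stand-in (a back-of-the-envelope sketch, not the model's actual forecast intervals):

# Back-of-the-envelope: random-walk forecast std grows like sqrt(h)
daily_sigma = train_prices.diff().dropna().std()
for h in (1, 7, 30):
    print(f"h={h:2d} days: forecast std ~ ${daily_sigma * np.sqrt(h):.2f}")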

LSTM doesn’t have this specific mathematical limitation, but it has its own problems. The recursive prediction approach—feeding predictions back as inputs—accumulates errors. I’m not entirely sure why LSTM handles this better than ARIMA for gold specifically; my best guess involves the network learning some nonlinear mean-reversion patterns that help anchor longer forecasts.

The Volatility Blind Spot

Neither model handles volatility clustering well. Financial time series exhibit heteroskedasticity—periods of high volatility cluster together, as do calm periods. The standard way to model this is GARCH (Generalized Autoregressive Conditional Heteroskedasticity, Bollerslev 1986), which explicitly models the variance process:

$$\sigma_t^2 = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2$$

from arch import arch_model

# Fit GARCH(1,1) on returns
returns = close_prices.pct_change().dropna() * 100  # percentage returns

garch = arch_model(returns[:train_size], vol='Garch', p=1, q=1)
garch_fit = garch.fit(disp='off')
print(garch_fit.summary().tables[1])

Combining ARIMA for the mean equation with GARCH for variance gives you ARIMA-GARCH, which is theoretically more appropriate for financial data. But here’s a dirty secret: in my experiments, the point forecast accuracy barely improved. GARCH shines when you need prediction intervals or volatility estimates for risk management, not when you just want to minimize RMSE on price predictions.
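If you want to try the combination anyway, mostly for the volatility estimates and intervals rather than the point forecast, a minimal sketch is to fit GARCH(1,1) on the ARIMA residuals and forecast the conditional variance. This assumes the fitted arima_model and the arch import from above:

# Sketch: ARIMA handles the mean, GARCH(1,1) on its residuals handles the variance.
# Useful for prediction intervals and risk sizing, not for squeezing down RMSE.
arima_resid = pd.Series(arima_model.resid()).dropna()
garch_resid = arch_model(arima_resid, vol='Garch', p=1, q=1)
garch_resid_fit = garch_resid.fit(disp='off')

# 7-day-ahead conditional standard deviations (same units as the residuals)
var_forecast = garch_resid_fit.forecast(horizon=7)
print(np.sqrt(var_forecast.variance.iloc[-1]))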

And LSTM? You could feed volatility features into the network, but that adds complexity for marginal gains on this dataset. The 10-year gold price series we’re working with doesn’t have enough extreme volatility events for the network to learn robust volatility patterns.
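For reference, wiring in a volatility feature would start roughly like the sketch below: a second column next to the close price, after which the sequences become (samples, 60, 2) and the first LSTM layer needs input_shape=(60, 2). The 20-day window is an arbitrary choice on my part.

# Sketch: 20-day rolling volatility as a second input feature
vol20 = close_prices.pct_change().rolling(20).std()
feature_df = pd.concat([close_prices, vol20], axis=1).dropna()
feature_df.columns = ['Close', 'Vol20']
# Not shown: scale both columns, build 60-step sequences with two features
# per timestep, and set input_shape=(60, 2) in the first LSTM layer
print(feature_df.tail())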

The Preprocessing Detail Everyone Ignores

Remember that MinMaxScaler we applied before LSTM training? The scaling boundaries matter more than most tutorials acknowledge.

# A subtle but important issue
print(f"Training price range: ${train_prices.min():.2f} - ${train_prices.max():.2f}")
print(f"Test price range: ${test_prices.min():.2f} - ${test_prices.max():.2f}")
# Output:
# Training price range: $1049.40 - $1987.92
# Test price range: $1683.25 - $2089.67

# The test set contains values OUTSIDE the training range!
test_above_train_max = (test_prices > train_prices.max()).sum()
print(f"Test points above training max: {test_above_train_max}")
# Output: Test points above training max: 147

When test prices exceed the training maximum, the scaler maps them to values above 1.0, which the network never saw during training. This contributes to LSTM’s higher error—it’s extrapolating in a region where it has no experience. One workaround is using RobustScaler or manually setting wider bounds, but there’s no perfect solution when the test distribution genuinely differs from training.
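One partial mitigation, a sketch of the "wider bounds" idea with an arbitrary 15% pad, is to fit the scaler on deliberately widened bounds so moderately out-of-range highs still land inside [0, 1]:

# Fit MinMaxScaler on padded bounds instead of the raw training min/max.
# Softens extrapolation, but can't fix genuine distribution shift.
pad = 0.15 * (train_prices.max() - train_prices.min())
padded_scaler = MinMaxScaler(feature_range=(0, 1))
padded_scaler.fit(np.array([[train_prices.min() - pad], [train_prices.max() + pad]]))

train_scaled_padded = padded_scaler.transform(train_prices.values.reshape(-1, 1))
test_scaled_padded = padded_scaler.transform(test_prices.values.reshape(-1, 1))
print(f"Max scaled test value: {test_scaled_padded.max():.3f}")  # below 1.0 on this split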

Picking a Winner

For short-term gold price forecasting (1-7 days), use ARIMA. It’s simpler, faster to train, doesn’t require GPU resources, and produces better point predictions on this dataset. The rolling refit approach is computationally annoying but manageable for reasonable test set sizes.

For horizons beyond two weeks, LSTM becomes competitive despite its higher one-step error. If you’re building a system that needs 30-day forecasts—maybe for quarterly planning or longer-term hedging decisions—the neural approach handles error accumulation better.

But here’s the uncomfortable truth neither model addresses: gold prices are influenced by macroeconomic factors, geopolitical events, central bank policies, and currency movements that don’t appear in historical price data alone. The efficient market hypothesis suggests that publicly available historical patterns are already priced in, making pure technical forecasting a game of diminishing returns. I’d be cautious about deploying either model for actual trading without incorporating external features.

In Part 3, we’ll build a more sophisticated deep learning architecture that incorporates multiple input features—not just past prices but trading volume, volatility indicators, and potentially external economic data. The question I haven’t resolved yet: does adding features genuinely improve forecasts, or does it just add noise and overfitting opportunities? My preliminary experiments suggest the answer depends heavily on how you handle feature selection, but I haven’t done rigorous enough testing to make strong claims.

Gold Price Forecasting with Data Analysis and Deep Learning Series (2/4)
