Production-Ready Gold Price Prediction: From Jupyter Notebook to REST API

⚡ Key Takeaways
  • Production ML requires saving not just the model but also scalers, sequence lengths, and feature metadata to avoid silent failures during inference.
  • Drift monitoring tracks both prediction error (MAE) and input feature statistics to detect when models decay or data pipelines break.
  • FastAPI enables rapid API development with automatic validation, async support, and built-in docs, while A/B testing lets you safely deploy new model versions in parallel.
  • Real production systems need health checks, logging with inference time tracking, model versioning, and a plan for automated retraining as data evolves.

The Model Works in Your Notebook — Now What?

You’ve trained an LSTM that beats naive baselines on gold price forecasting (we covered that in Part 3). The validation loss looks good, the walk-forward test is passable, and you’ve convinced yourself the model isn’t just memorizing noise. But here’s the uncomfortable truth: nobody uses models from .ipynb files. Production means wrapping that TensorFlow graph in something that answers HTTP requests, handles bad input gracefully, and doesn’t crash when your data pipeline hiccups at 3 AM.

This part covers the messy middle ground between “it works on my laptop” and “it’s running in production.” We’ll build a REST API for gold price prediction, add model versioning, handle drift detection, and package everything so you can actually deploy it. No Docker-Kubernetes-CI/CD sermon here — just the scaffolding that makes a model usable outside your notebook.


Exporting the Model (and Everything Else You Forgot)

The obvious first step: save the trained model. TensorFlow makes this easy with model.save(), but here’s what breaks if you only save the model file.

Your LSTM expects input shaped like (batch_size, sequence_length, num_features). During training, you probably hardcoded sequence_length=30 (30 days of lookback) somewhere in your notebook. When you load the model six months later to make a prediction, do you remember that number? What about the feature scaling parameters — the mean and standard deviation you used to normalize prices? If you lose those, your predictions will be garbage because the model expects inputs in the range it was trained on, not raw dollar values.

Save everything together:

import joblib
import numpy as np
from pathlib import Path

class GoldPriceModel:
    def __init__(self, model, scaler, sequence_length=30, feature_cols=None):
        self.model = model
        self.scaler = scaler
        self.sequence_length = sequence_length
        self.feature_cols = feature_cols or ['Open', 'High', 'Low', 'Close', 'Volume']

    def save(self, model_dir):
        model_dir = Path(model_dir)
        model_dir.mkdir(parents=True, exist_ok=True)

        # Save TensorFlow model
        self.model.save(model_dir / 'tf_model')

        # Save metadata as joblib (scaler is sklearn, not TF)
        metadata = {
            'scaler': self.scaler,
            'sequence_length': self.sequence_length,
            'feature_cols': self.feature_cols,
            'saved_at': np.datetime64('now')
        }
        joblib.dump(metadata, model_dir / 'metadata.pkl')

    @classmethod
    def load(cls, model_dir):
        from tensorflow import keras
        model_dir = Path(model_dir)

        model = keras.models.load_model(model_dir / 'tf_model')
        metadata = joblib.load(model_dir / 'metadata.pkl')

        return cls(
            model=model,
            scaler=metadata['scaler'],
            sequence_length=metadata['sequence_length'],
            feature_cols=metadata['feature_cols']
        )

Nothing fancy — just bundling the pieces you’ll need later. The saved_at timestamp is a small touch that saved me once when I had three model directories and couldn’t remember which was which.
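Using it is then symmetric on both sides of the deployment boundary. A quick usage sketch: trained_lstm and fitted_scaler are assumed to come out of the training code from Part 3, and the path is just an example.

# After training in the notebook (trained_lstm and fitted_scaler from Part 3):
bundle = GoldPriceModel(model=trained_lstm, scaler=fitted_scaler, sequence_length=30)
bundle.save('models/production_v1')

# Later, inside the API process:
bundle = GoldPriceModel.load('models/production_v1')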

Prediction Pipeline: Handling Messy Inputs

Here’s what the prediction function needs to do: take the last N days of gold price data (probably from a database or CSV), transform it into the format your LSTM expects, run inference, and return a forecast. Sounds simple until you encounter real data.

import pandas as pd

class GoldPriceModel:
    # ... (previous methods)

    def predict_next_day(self, recent_data):
        """
        recent_data: DataFrame with columns matching self.feature_cols,
                     sorted by date (oldest to newest)
        Returns: predicted closing price for next day
        """
        if len(recent_data) < self.sequence_length:
            raise ValueError(
                f"Need at least {self.sequence_length} days of data, "
                f"got {len(recent_data)}"
            )

        # Take only the most recent sequence_length rows
        recent_data = recent_data.tail(self.sequence_length)

        # Extract features in the correct order
        X = recent_data[self.feature_cols].values

        # Edge case: if there's a NaN (missing data), this will propagate
        # through scaler and model. Better to fail loudly here.
        if np.isnan(X).any():
            raise ValueError("Input data contains NaN values")

        # Scale using the same scaler from training
        X_scaled = self.scaler.transform(X)

        # Reshape for LSTM: (1, sequence_length, num_features)
        X_scaled = X_scaled.reshape(1, self.sequence_length, len(self.feature_cols))

        # Predict (returns scaled value)
        pred_scaled = self.model.predict(X_scaled, verbose=0)[0, 0]

        # Inverse transform to get the actual price. The scaler was fit on all
        # feature columns, so build a dummy row, drop the scaled prediction into
        # the 'Close' position, inverse-transform, and pull that column back out.
        dummy = np.zeros((1, len(self.feature_cols)))
        close_idx = self.feature_cols.index('Close')
        dummy[0, close_idx] = pred_scaled
        pred_price = self.scaler.inverse_transform(dummy)[0, close_idx]

        return float(pred_price)

That inverse transform is ugly. My best guess is there’s a cleaner way using sklearn.compose.ColumnTransformer, but I haven’t tested it thoroughly. The dummy array hack works: you create a fake row with the scaled prediction in the right column position, inverse-transform the whole thing, then extract just the price. It’s the kind of code you write at 11 PM and promise to refactor later (you won’t).
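One alternative I’d reach for, though I haven’t wired it into the class above: fit a second scaler on the Close column alone during training and store it in the metadata next to the main one. Then the inverse transform collapses to a one-liner. Everything below (target_scaler, the stand-in training prices) is illustrative, not code from this project:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical: fit a target-only scaler alongside the full-feature scaler
train_close = np.array([[1800.0], [1825.0], [1810.0], [1850.0]])  # stand-in prices
target_scaler = StandardScaler().fit(train_close)

# At inference, no dummy-row hack needed:
pred_scaled = 0.73  # whatever the model returned
pred_price = target_scaler.inverse_transform([[pred_scaled]])[0, 0]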

The NaN check is critical. In production, you’ll eventually get incomplete data — a missing volume field, a date with no trades, whatever. Better to return a 400 error than silently predict nonsense.

REST API with FastAPI

FastAPI has become the default choice for Python ML APIs, and for good reason: automatic request validation, async support, and built-in OpenAPI docs. Here’s a minimal implementation:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, validator
from typing import List
from datetime import date
import uvicorn

app = FastAPI(title="Gold Price Prediction API")

# Load model at startup (not per-request)
model = GoldPriceModel.load('models/production_v1')

class PriceData(BaseModel):
    date: date
    open: float
    high: float
    low: float
    close: float
    volume: int

    @validator('high')
    def high_above_low(cls, v, values):
        if 'low' in values and v < values['low']:
            raise ValueError('high must be >= low')
        return v

class PredictionRequest(BaseModel):
    recent_prices: List[PriceData]

class PredictionResponse(BaseModel):
    predicted_close: float
    model_version: str
    prediction_date: date

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    try:
        # Convert to DataFrame
        df = pd.DataFrame([p.dict() for p in request.recent_prices])
        df = df.rename(columns={
            'open': 'Open', 'high': 'High', 'low': 'Low',
            'close': 'Close', 'volume': 'Volume'
        })
        df = df.sort_values('date')

        pred = model.predict_next_day(df)

        return PredictionResponse(
            predicted_close=round(pred, 2),
            model_version='v1',
            prediction_date=df['date'].max() + pd.Timedelta(days=1)
        )
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        # Log this in production, don't expose internals
        raise HTTPException(status_code=500, detail="Prediction failed")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run this with python api.py, then hit http://localhost:8000/docs to see the auto-generated Swagger UI. You can test predictions directly from the browser — it’s shockingly polished for zero effort.

The @validator decorator (Pydantic v1 syntax; Pydantic v2 renames it @field_validator) catches obviously broken data (high < low), which happens more often than you’d think when people manually enter prices or scrape them from sketchy sources. Model version tracking is just a hardcoded string here, but in a real system you’d tie it to Git commits or a model registry like MLflow.
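For scripted testing outside the browser, a request from Python looks roughly like this (the 30 days of flat prices are dummy data, just enough to satisfy the sequence length):

import requests
from datetime import date, timedelta

payload = {
    "recent_prices": [
        {
            "date": str(date(2024, 1, 1) + timedelta(days=i)),
            "open": 1800.0, "high": 1810.0, "low": 1790.0,
            "close": 1805.0, "volume": 100000
        }
        for i in range(30)
    ]
}

resp = requests.post("http://localhost:8000/predict", json=payload)
print(resp.status_code, resp.json())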

Model Versioning and A/B Testing

Six months from now, you’ll retrain your model on fresh data. How do you safely deploy the new version without breaking things if it’s worse?

The simplest approach: run both models in parallel and return predictions from both. Let the client decide which to use, or log both for comparison.

models = {
    'v1': GoldPriceModel.load('models/production_v1'),
    'v2': GoldPriceModel.load('models/production_v2')
}

class PredictionResponse(BaseModel):
    predictions: dict  # {'v1': 2145.32, 'v2': 2148.10}
    default_version: str

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    df = pd.DataFrame([p.dict() for p in request.recent_prices])
    # ... (same preprocessing)

    preds = {}
    for version, model in models.items():
        try:
            preds[version] = round(model.predict_next_day(df), 2)
        except Exception as e:
            # If one model fails, still return the other
            preds[version] = None

    return PredictionResponse(
        predictions=preds,
        default_version='v1'  # Or choose based on recent performance
    )

This lets you gradually shift traffic to v2 by changing the default_version field, or let users opt-in with a query parameter. If v2 starts producing crazy predictions, you haven’t burned the bridge back to v1.
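The opt-in route is a one-parameter change. A minimal sketch, reusing the models dict from above; the /predict_versioned path is just a hypothetical name to avoid clobbering the combined endpoint:

from fastapi import Query

@app.post("/predict_versioned", response_model=PredictionResponse)
def predict_versioned(request: PredictionRequest, version: str = Query('v1')):
    if version not in models:
        raise HTTPException(status_code=404, detail=f"Unknown model version: {version}")

    df = pd.DataFrame([p.dict() for p in request.recent_prices])
    # ... (same preprocessing as before)

    pred = models[version].predict_next_day(df)
    return PredictionResponse(
        predictions={version: round(pred, 2)},
        default_version=version
    )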

A more sophisticated setup would track which version was used for each prediction and compare against actual prices the next day. That’s the start of a monitoring system, which brings us to…

Drift Detection: When Your Model Stops Working

Time series models decay. The relationships your LSTM learned in 2020-2023 data might not hold in 2026 if market dynamics shift (Fed policy changes, crypto correlation weirdness, geopolitical shocks). How do you know when to retrain?

The metric you care about is prediction error on recent data. Track the mean absolute error (MAE) over a rolling window:

\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|

where $y_i$ is the actual closing price and $\hat{y}_i$ is your prediction from the previous day. If MAE starts climbing, your model is drifting.

from collections import deque
import numpy as np

class DriftMonitor:
    def __init__(self, window_size=30, threshold_multiplier=2.0):
        self.window_size = window_size
        self.errors = deque(maxlen=window_size)
        self.threshold_multiplier = threshold_multiplier
        self.baseline_mae = None  # Set this from validation set

    def add_error(self, actual, predicted):
        error = abs(actual - predicted)
        self.errors.append(error)

    def current_mae(self):
        if len(self.errors) == 0:
            return None
        return np.mean(self.errors)

    def is_drifting(self):
        if self.baseline_mae is None or len(self.errors) < self.window_size:
            return False

        current = self.current_mae()
        # Alert if current MAE is 2x the baseline
        return current > self.baseline_mae * self.threshold_multiplier

# In production, after each prediction:
monitor = DriftMonitor()
monitor.baseline_mae = 8.5  # from your validation set

# Next day, when actual price is known:
monitor.add_error(actual=2150.0, predicted=2145.32)
if monitor.is_drifting():
    # Send alert, trigger retraining pipeline, etc.
    print(f"DRIFT ALERT: MAE={monitor.current_mae():.2f}")

The threshold multiplier is a guess. 2.0 seems reasonable (“error doubled”) but you might need to tune it based on how volatile gold prices are. Too sensitive and you’ll retrain constantly; too loose and you’ll serve stale predictions for weeks.

What’s missing here: statistical tests for distribution shift (Kolmogorov-Smirnov test on prediction residuals, Population Stability Index on input features). I haven’t implemented those because they’re overkill for a simple gold price model, but if you’re predicting something that affects real money, look into them.
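For reference, the residual check is only a few lines with scipy. This is a sketch, not something wired into this project, and the numbers are made up just to show the call:

from scipy import stats
import numpy as np

def residuals_shifted(baseline_residuals, recent_residuals, alpha=0.05):
    """Two-sample KS test: do recent prediction residuals look like they were
    drawn from the same distribution as the validation-set residuals?"""
    result = stats.ks_2samp(baseline_residuals, recent_residuals)
    return result.pvalue < alpha  # True suggests the distribution has shifted

rng = np.random.default_rng(0)
baseline = rng.normal(0, 8.5, size=500)   # residuals from the validation set
recent = rng.normal(10, 15.0, size=30)    # deliberately shifted recent residuals
print(residuals_shifted(baseline, recent))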

Input Feature Drift

Prediction error drift catches when your model is wrong. But sometimes the problem is upstream: your data source changes format, a field starts returning nulls, or someone “fixes” the volume calculation and now all your volumes are 10x what they used to be.

Check basic statistics on incoming data:

class FeatureDriftMonitor:
    def __init__(self, expected_stats):
        """
        expected_stats: dict like {'Close': {'mean': 1850, 'std': 120}, ...}
        """
        self.expected = expected_stats

    def check(self, df):
        alerts = []
        for col, stats in self.expected.items():
            actual_mean = df[col].mean()
            actual_std = df[col].std()

            # Flag if mean shifted by >3 std deviations
            z_score = abs(actual_mean - stats['mean']) / stats['std']
            if z_score > 3:
                alerts.append(f"{col} mean shifted: {actual_mean:.2f} vs expected {stats['mean']:.2f}")

            # Flag if std changed drastically
            if actual_std > 2 * stats['std'] or actual_std < 0.5 * stats['std']:
                alerts.append(f"{col} std changed: {actual_std:.2f} vs expected {stats['std']:.2f}")

        return alerts

# Usage in API:
feature_monitor = FeatureDriftMonitor({
    'Close': {'mean': 1850, 'std': 120},
    'Volume': {'mean': 150000, 'std': 50000}
})

@app.post("/predict")
def predict(request: PredictionRequest):
    df = pd.DataFrame([p.dict() for p in request.recent_prices])

    # Check for feature drift before predicting
    alerts = feature_monitor.check(df)
    if alerts:
        # Log alerts, maybe return them in response
        print("Feature drift detected:", alerts)

    # ... (rest of prediction logic)

This won’t catch subtle shifts, but it’ll scream if your data pipeline suddenly breaks. I’ve seen this save hours of debugging when a scraper started returning prices in cents instead of dollars (mean jumped from 1850 to 185000 — pretty obvious).

Logging and Observability

You need to log every prediction with enough context to debug later. At minimum: timestamp, input data hash, model version, prediction value, and how long inference took.

import logging
import time
import hashlib
import json

logging.basicConfig(
    filename='predictions.log',
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

@app.post("/predict")
def predict(request: PredictionRequest):
    start_time = time.time()

    df = pd.DataFrame([p.dict() for p in request.recent_prices])
    # ... (preprocessing)

    pred = model.predict_next_day(df)

    # Log everything
    input_hash = hashlib.md5(
        # default=str so the date fields serialize instead of raising a TypeError
        json.dumps([p.dict() for p in request.recent_prices],
                   sort_keys=True, default=str).encode()
    ).hexdigest()

    logging.info(json.dumps({
        'input_hash': input_hash,
        'model_version': 'v1',
        'prediction': pred,
        'inference_time_ms': (time.time() - start_time) * 1000,
        'input_length': len(request.recent_prices)
    }))

    return PredictionResponse(
        predicted_close=round(pred, 2),
        model_version='v1',
        prediction_date=df['date'].max() + pd.Timedelta(days=1)
    )

Inference time matters. If your API suddenly takes 5 seconds instead of 50ms, something’s wrong (model loaded from disk instead of RAM, TensorFlow defaulting to CPU, etc.). On my M1 MacBook, inference for this LSTM is ~20ms with TensorFlow 2.15. In production on a cheap AWS instance without GPU, it’s more like 100-150ms, which is still fine for gold price prediction (nobody needs sub-second forecasts).

Deployment Checklist

Before you call this “production-ready”:

1. Environment pinning: requirements.txt with exact versions (tensorflow==2.15.0, not tensorflow>=2.0). I’ve been burned by minor version bumps changing model behavior.

2. Health checks: Add a /health endpoint that loads the model and runs a dummy prediction. If that fails, your orchestrator can restart the container.

@app.get("/health")
def health_check():
    try:
        dummy_data = pd.DataFrame({
            'Open': [1800] * 30,
            'High': [1810] * 30,
            'Low': [1790] * 30,
            'Close': [1805] * 30,
            'Volume': [100000] * 30,
            'date': pd.date_range('2024-01-01', periods=30)
        })
        _ = model.predict_next_day(dummy_data)
        return {"status": "healthy"}
    except Exception as e:
        raise HTTPException(status_code=503, detail="Model not ready")

3. Rate limiting: Unless you want someone to hammer your API 1000 times/sec. slowapi is a simple middleware for this.
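Wiring it up follows slowapi’s documented pattern, roughly like this (double-check against the current slowapi docs; the limit string is arbitrary):

from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)   # rate-limit per client IP
app.state.limiter = limiter                      # `app` is the FastAPI instance from earlier
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/predict")
@limiter.limit("60/minute")   # slowapi requires the raw Request as a parameter
def predict(request: Request, body: PredictionRequest):
    ...  # same prediction logic as before, reading from `body`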

4. HTTPS: Don’t serve predictions over plain HTTP if you care about integrity. Let’s Encrypt makes this free and easy.

5. Model artifact storage: Don’t bake the model into your Docker image (it bloats the image and forces rebuild every time you retrain). Store it in S3 or equivalent, download on container startup.
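A startup download with boto3 might look like this; the bucket and prefix names are placeholders:

import boto3
from pathlib import Path

def fetch_model_artifacts(bucket: str, prefix: str, local_dir: str = 'models/production_v1'):
    """Mirror everything under an S3 prefix into a local directory at startup."""
    s3 = boto3.client('s3')
    local = Path(local_dir)
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('/'):   # skip directory markers
                continue
            target = local / Path(obj['Key']).relative_to(prefix)
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj['Key'], str(target))

# Before GoldPriceModel.load() at startup (placeholder names):
# fetch_model_artifacts('my-model-bucket', 'gold-lstm/v1/')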

6. Secrets management: If your API talks to a database or external APIs, don’t hardcode credentials. Use environment variables at minimum, or something like AWS Secrets Manager.

What About Batch Predictions?

The API above handles one prediction at a time. If you need to forecast the next 7 days, or generate predictions for 100 different assets, you’ll want batch support.

For multi-day forecasting, the naive approach: predict day 1, append it to your input sequence, predict day 2, repeat. This compounds error — your day 7 forecast is based on 6 layers of previous predictions, all of which might be wrong.
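A minimal version of that loop, using the model bundle from earlier and one crude simplification: the predicted Close is appended, while Open/High/Low/Volume are just copied from the last known day.

import pandas as pd

def predict_n_days(model_bundle, recent_data, horizon=7):
    """Naive recursive multi-day forecast. Error compounds with each step."""
    df = recent_data.copy()
    forecasts = []
    for _ in range(horizon):
        next_close = model_bundle.predict_next_day(df)
        forecasts.append(next_close)
        new_row = df.iloc[[-1]].copy()      # carry the other features forward
        new_row['Close'] = next_close
        df = pd.concat([df, new_row], ignore_index=True)
    return forecasts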

A better (but more complex) approach: train a model that directly outputs a sequence. Your LSTM returns a vector of length 7 instead of a scalar. The loss function becomes:

L = \frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2

where $T = 7$ is your forecast horizon. This forces the model to learn the joint distribution of the next 7 days, not just the next-day marginal. I haven’t tested this on gold prices specifically, so take it with a grain of salt, but it’s the standard approach in demand forecasting and weather prediction.
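A sketch of what that head looks like in Keras; the layer sizes are placeholders, not tuned for gold:

from tensorflow import keras

def build_direct_forecaster(sequence_length=30, num_features=5, horizon=7):
    """LSTM that emits the next `horizon` scaled closing prices in one shot."""
    model = keras.Sequential([
        keras.layers.Input(shape=(sequence_length, num_features)),
        keras.layers.LSTM(64),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(horizon)  # one output per forecast day
    ])
    model.compile(optimizer='adam', loss='mse')  # the MSE loss above
    return model

# Training targets become shape (num_samples, 7): the next 7 scaled closes per window.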

For batch predictions across assets (gold, silver, platinum, etc.), just parallelize. FastAPI supports async, so you can run multiple models concurrently:

import asyncio

@app.post("/batch_predict")
async def batch_predict(requests: List[PredictionRequest]):
    async def predict_one(req):
        # Convert to sync call (TensorFlow isn't async-native)
        return await asyncio.to_thread(model.predict_next_day, req)

    predictions = await asyncio.gather(*[predict_one(req) for req in requests])
    return predictions

This won’t speed up inference if you’re CPU-bound (which you probably are), but it’ll help if you’re waiting on I/O (database fetches, external APIs).

When This Approach Breaks Down

Everything here assumes your model is small enough to fit in RAM and fast enough to run on-demand. For gold prices with a simple LSTM, that’s true. But if you’re training a massive Transformer with 100M parameters, or doing ensemble predictions with 50 models, you’ll need a different architecture.

Options:
– Precompute predictions overnight, serve from cache
– Use TensorFlow Serving or TorchServe (optimized inference servers)
– Quantize your model (TensorFlow Lite, ONNX with int8; see the sketch after this list)
– Offload to GPU (but now you need GPU instances, cost goes up)
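For the quantization option, the dynamic-range path in TensorFlow Lite is short. Note that LSTM layers sometimes need extra converter settings (the SELECT_TF_OPS fallback), so treat this as a starting point rather than a drop-in:

import tensorflow as tf

# Convert the SavedModel exported earlier; the path matches the save() call above
converter = tf.lite.TFLiteConverter.from_saved_model('models/production_v1/tf_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic-range quantization
tflite_model = converter.convert()

with open('models/production_v1/model.tflite', 'wb') as f:
    f.write(tflite_model)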

For financial forecasting specifically, there’s also the latency vs. accuracy tradeoff. A 100ms API response is fine for daily gold price prediction, but if you’re doing high-frequency trading, you need sub-millisecond. At that point you’re writing C++ and co-locating servers with exchanges, which is a different universe.

The Part Nobody Talks About: Maintaining This

You deploy your model, it works great for two months, then predictions start drifting. You retrain, deploy v2, and now you have two models to maintain. Six months later you have v1 through v5, none of which you remember the difference between, and the monitoring system is sending drift alerts every week because gold prices got weirdly volatile.

This is the real challenge. The code above gets you to production, but staying in production means:
– Automated retraining pipeline (new data arrives, model retrains, performance is validated against holdout set, if better than current version it auto-deploys)
– Experiment tracking (MLflow, Weights & Biases, or just a spreadsheet tracking which hyperparameters produced which validation MAE)
– Data versioning (DVC or similar, so you can reproduce a model trained 6 months ago)
– Incident response (your model predicts gold will hit $5000 tomorrow — is that a bug or did something crazy happen? You need a process to check.)

I don’t have all of this set up for the gold price model because it’s a demo, not a product. But if you’re putting real money behind these predictions, you need at least half of it.

Where to Go From Here

Use FastAPI for the API layer — it’s fast, well-documented, and the auto-generated docs are genuinely useful. Add drift monitoring from day one, even if it’s just logging MAE. Version your models explicitly (Git commit hash, training date, whatever) so you can debug later. And don’t skip the health check endpoint; it’s 5 lines of code that will save you during an outage.

If you’re forecasting something that actually matters (not just gold prices for a blog post), invest in experiment tracking and automated retraining. The first version of your model will not be the last, and manual retraining is a time sink you’ll grow to hate. What I’m curious about but haven’t tried yet: online learning for time series — updating the model incrementally as new data arrives instead of full retraining. The theory says it should work (just backprop on the new samples), but I’m not sure how well it handles distribution shift compared to a fresh retrain on the full recent window. If you’ve done this at scale, I’d love to hear how it went.

Gold Price Forecasting with Data Analysis and Deep Learning Series (4/4)
