MLOps Cheat Sheet: 12 Production Patterns You’ll Use

⚡ Key Takeaways
  • Feature stores work best as caches with fallbacks, not single sources of truth — triple fallback (memory → Redis → compute) cuts P99 latency by 75%.
  • Traffic shadowing beats canary deploys: send 100% to old model, log new model outputs without serving them to catch failures before users see them.
  • Training/serving skew is the silent killer — CI tests that verify identical preprocessing outputs between training and serving pipelines prevent 15-point accuracy drops.

The MLOps Pattern You Actually Need

Most MLOps tutorials show you how to build a perfect pipeline. Then production happens.

Your model works in the notebook. It crashes in Docker. The preprocessing that took 2 seconds on your laptop takes 40 seconds on the API server. The monitoring dashboard you spent a week building never caught the bug that cost you three days of bad predictions.

Here are 12 patterns I’ve used across five production ML systems — not the comprehensive best practices, just the ones that solved actual problems. Some are obvious in hindsight. A few contradict what the documentation recommends. All of them have saved me from 2am debugging sessions.

Pattern 1: Feature Store as a Cache, Not a Database

The first feature store I built tried to be the single source of truth. Every prediction fetched features from Redis, every training job read from S3, everything stayed in sync.

It was architecturally beautiful and operationally fragile.

The breaking point: a network hiccup between the API server and Redis added 200ms to every prediction. Our SLA was 100ms total. The feature store became the bottleneck.

What actually works: treat the feature store as a cache with a fallback. Serve predictions from in-memory features when possible, fall back to the store only when you need fresh data. For batch predictions, bypass it entirely.

import json
import time

import redis

class FeatureServer:
    def __init__(self, redis_client, fallback_fn):
        self.cache = redis_client
        self.fallback = fallback_fn
        self._local_cache = {}  # In-memory for sub-millisecond access

    def get_features(self, user_id, max_age_seconds=60):
        # Try in-memory first
        if user_id in self._local_cache:
            features, timestamp = self._local_cache[user_id]
            if time.time() - timestamp < max_age_seconds:
                return features

        # Try Redis
        try:
            cached = self.cache.get(f"features:{user_id}")
            if cached:
                features = json.loads(cached)
                self._local_cache[user_id] = (features, time.time())
                return features
        except redis.ConnectionError:
            # Don't fail the prediction over a Redis outage
            pass

        # Compute and cache
        features = self.fallback(user_id)
        self._local_cache[user_id] = (features, time.time())

        # Best-effort cache write (don't block on it)
        try:
            self.cache.setex(f"features:{user_id}", 300, json.dumps(features))
        except Exception:
            pass  # Log this, but don't crash

        return features

The triple fallback (memory → Redis → compute) cut P99 latency from 180ms to 45ms. And the predictions kept working when Redis went down.

Pattern 2: Model Versioning in the Path, Not the Code

Here’s how model versioning breaks in production: you add a version parameter to your API, update your code to route v1 vs v2, deploy, and realize you need to redeploy every time you want to change which model serves which version.

What works: bake the version into the model artifact path, use a symlink or config file to control routing.

# models/
#   production -> v2/  (symlink)
#   v1/
#     model.onnx
#     preprocessor.pkl
#   v2/
#     model.onnx
#     preprocessor.pkl

from pathlib import Path

import onnxruntime

class ModelRegistry:
    def __init__(self, base_path="models"):
        self.base_path = Path(base_path)

    def load_model(self, alias="production"):
        model_path = self.base_path / alias
        # Follows symlink automatically
        return onnxruntime.InferenceSession(
            str(model_path / "model.onnx")
        )

    def promote(self, version, alias="production"):
        """Switch production to a different version — no code deploy needed"""
        symlink = self.base_path / alias
        target = self.base_path / version

        # exists() follows the link, so also check is_symlink() to catch dangling links
        if symlink.is_symlink() or symlink.exists():
            symlink.unlink()
        symlink.symlink_to(target, target_is_directory=True)

Now you can A/B test by changing a symlink, not by shipping code. Your CI/CD can validate model versions independently of deployment.
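
Promotion can be exercised directly from CI. A minimal sketch, assuming the ModelRegistry above and an ONNX artifact (the smoke check here is just “the session loads and exposes inputs”):

registry = ModelRegistry("models")

# Validate the candidate before it ever serves traffic
candidate = registry.load_model(alias="v3")
assert candidate.get_inputs()  # onnxruntime session loaded and has input metadata

registry.promote("v3")  # production -> v3, no redeploy
# Rollback is the same call: registry.promote("v2")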

Pattern 3: Preprocessing Once, Serialize It

The silent killer in ML serving: you intend to preprocess identically in training and inference, but you implement it twice. Training uses pandas. Inference uses a hand-rolled function. They drift.

I debugged this for two days once. The model’s accuracy in production was 8 points lower than validation. Turned out the inference code normalized features with mean=0, std=1 hardcoded, while training had computed mean=0.03, std=1.12 from the data.

Fix: serialize the preprocessing pipeline, not just the model.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
import joblib

# Training
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income', 'score']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['country', 'device'])
])

full_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', XGBClassifier())
])

full_pipeline.fit(X_train, y_train)
joblib.dump(full_pipeline, 'pipeline.pkl')  # Save the WHOLE thing

# Inference (guaranteed identical preprocessing)
pipeline = joblib.load('pipeline.pkl')
prediction = pipeline.predict(raw_input)  # No manual preprocessing

If you can’t use sklearn pipelines (e.g., you’re doing custom text preprocessing), write a Preprocessor class and pickle it. The key: train and serve from the same serialized artifact.
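
A minimal sketch of that custom-Preprocessor route; the class, its vocabulary logic, and the variable names are illustrative:

import joblib

class TextPreprocessor:
    """Fit on training data once, serialize, and load the same object at serving time."""

    def fit(self, texts):
        # Vocabulary is learned from training data only
        self.vocab = {tok for text in texts for tok in text.lower().split()}
        return self

    def transform(self, texts):
        # Identical tokenization and filtering at train and serve time
        return [[tok for tok in text.lower().split() if tok in self.vocab]
                for text in texts]

# Training
prep = TextPreprocessor().fit(train_texts)  # train_texts: your raw training corpus
joblib.dump(prep, 'preprocessor.pkl')

# Serving
prep = joblib.load('preprocessor.pkl')
tokens = prep.transform([raw_text])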

Pattern 4: Log Predictions, Not Just Errors

Most ML monitoring watches for exceptions and 500s. The silent failure: your model starts predicting garbage, but the API returns 200 OK.

I’ve seen this happen three ways:

  1. Upstream data schema change: a feature that used to be float became string. The model accepted it (coercion happened silently), but predictions were random.
  2. Label leakage fixed: training data had a bug that leaked the target into features. When we fixed it, accuracy dropped 20 points. Monitoring never fired because the API worked fine.
  3. Gradual drift: user behavior shifted over 6 months. The model’s precision slowly decayed from 0.85 to 0.62. No single deploy caused it.

The pattern: log every prediction with input features and model output to a data warehouse. Run daily jobs to catch drift.

import json
import logging
from datetime import datetime

class PredictionLogger:
    def __init__(self, logger):
        self.logger = logger

    def log(self, model_version, features, prediction, metadata=None):
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'model_version': model_version,
            'features': features,
            'prediction': prediction,
            'metadata': metadata or {}
        }
        # Structured logging — ingested by ELK, BigQuery, or S3 + Athena
        self.logger.info(json.dumps(log_entry))

# In your inference endpoint
pred_logger = PredictionLogger(logging.getLogger('predictions'))

@app.post("/predict")
def predict(request: PredictRequest):
    features = extract_features(request)
    prediction = model.predict([features])[0]

    pred_logger.log(
        model_version="v2",
        features=features,
        prediction=float(prediction),
        metadata={'user_id': request.user_id, 'latency_ms': ...}
    )

    return {"prediction": prediction}

Now you can query: “Show me the distribution of predictions for model v2 in the last 24 hours.” You can detect drift, e.g. $\text{KL}(P_{\text{train}} \,\|\, P_{\text{prod}})$ between the training and production prediction distributions, before users complain.
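
The daily drift job can be small. A sketch, assuming predictions are logged to Parquet (paths, column names, and the 0.1 threshold are illustrative):

import numpy as np
import pandas as pd

def kl_divergence(p_train, p_prod, bins=20):
    """KL(P_train || P_prod) over a shared histogram of prediction scores."""
    edges = np.histogram_bin_edges(np.concatenate([p_train, p_prod]), bins=bins)
    p, _ = np.histogram(p_train, bins=edges)
    q, _ = np.histogram(p_prod, bins=edges)
    p = (p + 1e-9) / (p + 1e-9).sum()  # smooth so empty bins don't divide by zero
    q = (q + 1e-9) / (q + 1e-9).sum()
    return float(np.sum(p * np.log(p / q)))

# Daily job: compare training-time scores against yesterday's production logs
train_scores = pd.read_parquet('training_predictions.parquet')['prediction'].values
prod_scores = pd.read_parquet('prod_predictions_yesterday.parquet')['prediction'].values
drift = kl_divergence(train_scores, prod_scores)
if drift > 0.1:
    print(f"Prediction drift detected: KL = {drift:.3f}")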

Pattern 5: Canary with Traffic Shadowing, Not Load Balancing

Standard canary: route 5% of traffic to the new model, monitor metrics, roll out if it looks good.

The problem: if the new model is bad, 5% of your users get bad predictions. And if the failure mode is subtle (slightly worse precision, not crashes), you might not catch it until you’ve shipped to 50%.

Shadowing: route 100% of traffic to the old model for actual predictions, but also send requests to the new model and log its outputs without returning them.

import asyncio
import json

class ShadowPredictor:
    def __init__(self, primary_model, shadow_model, logger):
        self.primary = primary_model
        self.shadow = shadow_model
        self.logger = logger

    async def predict(self, features):
        # Primary model (blocks request)
        primary_pred = self.primary.predict([features])[0]

        # Shadow model (non-blocking)
        asyncio.create_task(self._shadow_predict(features, primary_pred))

        return primary_pred

    async def _shadow_predict(self, features, primary_pred):
        try:
            shadow_pred = self.shadow.predict([features])[0]
            self.logger.info(json.dumps({
                'primary': float(primary_pred),
                'shadow': float(shadow_pred),
                'diff': abs(primary_pred - shadow_pred)
            }))
        except Exception as e:
            # Shadow failures don't affect users
            self.logger.error(f"Shadow prediction failed: {e}")

After a few hours, query the logs: “What % of shadow predictions differ from primary by more than 0.1?” If the new model crashes on 5% of inputs, you see it in the logs, not in user complaints.
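
Assuming those comparison lines land in a JSON-lines file, the analysis is a few lines of pandas (path and threshold are illustrative):

import json
import pandas as pd

# Load the primary-vs-shadow comparison lines logged above
with open('shadow_predictions.jsonl') as f:
    df = pd.DataFrame([json.loads(line) for line in f])

disagreement = (df['diff'] > 0.1).mean()
print(f"{disagreement:.1%} of shadow predictions differ from primary by more than 0.1")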

Pattern 6: Batch Predict with async/await, Not ThreadPoolExecutor

When you need to score a million rows, the naive approach is:

for row in rows:
    prediction = model.predict(row)
    save(prediction)

Slow. The better approach used to be ThreadPoolExecutor for parallelism. But I’ve found async/await simpler and often faster, especially when I/O (database writes, API calls) is part of the loop.

import asyncio
import aiohttp

async def predict_and_save(session, row):
    # Model inference (CPU-bound, but fast)
    prediction = model.predict([row['features']])[0]

    # Save to API (I/O-bound)
    async with session.post('https://api.example.com/save', json={
        'id': row['id'],
        'prediction': float(prediction)
    }) as resp:
        return await resp.json()

async def batch_predict(rows):
    async with aiohttp.ClientSession() as session:
        tasks = [predict_and_save(session, row) for row in rows]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

# Run
results = asyncio.run(batch_predict(rows))

On a batch of 10,000 predictions with network writes, this cut runtime from 12 minutes (sequential) to 90 seconds (async). The key: the I/O wait time overlaps instead of stacking.

If your inference is CPU-heavy (e.g., large transformer models), you still need multiprocessing. But for lightweight models + I/O, async wins.
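
One more caveat on the async path: asyncio.gather on a million tasks at once can exhaust sockets and memory. A semaphore bounds concurrency; this sketch reuses predict_and_save from above (the limit of 100 is illustrative):

import asyncio
import aiohttp

async def batch_predict_bounded(rows, max_concurrency=100):
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(session, row):
        async with sem:  # at most max_concurrency requests in flight
            return await predict_and_save(session, row)

    async with aiohttp.ClientSession() as session:
        tasks = [bounded(session, row) for row in rows]
        return await asyncio.gather(*tasks, return_exceptions=True)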

Pattern 7: Model Warmup in Health Checks

Kubernetes kills your pod. A new one starts. The health check passes (the process is running). Traffic arrives. The first request takes 8 seconds because the model isn’t loaded yet.

What works: lazy-load the model on startup, but include a warmup step in the readiness probe.

from fastapi import FastAPI
from fastapi.responses import JSONResponse
import numpy as np

app = FastAPI()
model = None  # Lazy load

@app.on_event("startup")
async def load_model():
    global model
    model = load_my_model()  # This might take 2-3 seconds

    # Warmup: run a dummy prediction to JIT-compile, load into cache, etc.
    dummy_input = np.zeros((1, 10), dtype=np.float32)
    _ = model.predict(dummy_input)

@app.get("/health/ready")
def readiness():
    if model is None:
        return JSONResponse(status_code=503, content={"status": "not ready"})

    # Bonus: verify the model actually works
    try:
        test_input = np.ones((1, 10), dtype=np.float32)
        _ = model.predict(test_input)
        return {"status": "ready"}
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "error", "detail": str(e)})

Now Kubernetes won’t route traffic until the model is loaded AND verified. First-request latency drops from 8s to 50ms.

Pattern 8: Exponential Backoff for Upstream Failures

Your ML service calls an upstream API for features. The API is rate-limited: 100 requests/second. You send 150. It returns 429.

The naive retry: wrap the call in a loop that retries three times, back to back. That hammers the API three more times immediately, burning through your quota and probably getting you banned.

What works: exponential backoff with jitter.

import time
import random

import requests

def fetch_features_with_retry(user_id, max_retries=5):
    for attempt in range(max_retries):
        try:
            resp = requests.get(f"https://api.example.com/features/{user_id}")
            resp.raise_for_status()
            return resp.json()
        except requests.HTTPError as e:
            if e.response.status_code == 429:  # Rate limit
                if attempt == max_retries - 1:
                    raise

                # Exponential backoff: 2^attempt seconds, plus jitter
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
            else:
                raise

    raise Exception("Max retries exceeded")

The formula $t = 2^n + \epsilon$, where $\epsilon \sim U(0, 1)$, spreads out retries and avoids the thundering herd problem (multiple services retrying at the exact same time).

Pattern 9: Schema Validation at the API Boundary

Your FastAPI endpoint accepts PredictRequest, which has 10 features. One day, a client sends a request with 9 features. If the field has a default, Pydantic silently fills the missing one with None. Your model crashes with a cryptic error about array shapes.

Or worse: the client sends the features in the wrong order. The model accepts it (still 10 floats), but predictions are garbage.

The fix: strict schema validation with explicit feature names.

from pydantic import BaseModel, validator
from typing import Dict

class PredictRequest(BaseModel):
    features: Dict[str, float]  # Explicit keys, not a list

    @validator('features')
    def check_required_features(cls, v):
        required = {'age', 'income', 'score', 'days_since_signup', ...}
        missing = required - v.keys()
        if missing:
            raise ValueError(f"Missing features: {missing}")

        extra = v.keys() - required
        if extra:
            raise ValueError(f"Unexpected features: {extra}")

        return v

@app.post("/predict")
def predict(request: PredictRequest):
    # Convert to array in the CORRECT order
    feature_order = ['age', 'income', 'score', ...]  # Matches training
    features = [request.features[k] for k in feature_order]

    prediction = model.predict([features])[0]
    return {"prediction": float(prediction)}

Now if the client messes up, they get a 422 Unprocessable Entity with a clear error message. Not a 500 with a stack trace.
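
A quick check with FastAPI’s TestClient (assuming the app and PredictRequest above, with the required-feature set filled in) confirms the failure mode:

from fastapi.testclient import TestClient

client = TestClient(app)

# Missing features are rejected at the boundary, before the model sees them
resp = client.post("/predict", json={"features": {"age": 35.0, "income": 75000.0}})
assert resp.status_code == 422
print(resp.json()["detail"])  # names the missing features instead of a stack trace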

Pattern 10: Model Metrics in Prometheus, Business Metrics in BigQuery

Monitoring split: infrastructure metrics (latency, throughput, error rate) go to Prometheus for real-time alerting. Prediction-level metrics (accuracy, drift, feature distributions) go to a data warehouse for offline analysis.

Don’t try to do both in one system. Prometheus is built for low-cardinality, real-time series: perfect for a P95 latency alert, a poor fit for slicing a week of per-prediction drift. BigQuery handles those analytical queries, but its seconds-long query latency is too slow for real-time alerting.

from prometheus_client import Counter, Histogram
from datetime import datetime
import time

# Prometheus metrics (real-time)
prediction_count = Counter('predictions_total', 'Total predictions', ['model_version'])
prediction_latency = Histogram('prediction_latency_seconds', 'Prediction latency')

@app.post("/predict")
def predict(request: PredictRequest):
    start = time.time()

    prediction = model.predict(...)

    # Update Prometheus
    prediction_count.labels(model_version="v2").inc()
    prediction_latency.observe(time.time() - start)

    # Log to BigQuery (async, buffered)
    log_to_bigquery({
        'timestamp': datetime.utcnow(),
        'features': request.features,
        'prediction': prediction,
        'model_version': 'v2'
    })

    return {"prediction": prediction}

Prometheus: “Alert me if P95 latency > 200ms for 5 minutes.”

BigQuery: “Show me the distribution of predicted probabilities for model v2 vs v1, grouped by user country, for the last 30 days.”
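
The log_to_bigquery helper above is assumed; a minimal buffered sketch with the google-cloud-bigquery client might look like this (table name and flush size are illustrative, and rows must be JSON-serializable, e.g. isoformat timestamps):

import logging

from google.cloud import bigquery

_bq_client = bigquery.Client()
_buffer = []

def log_to_bigquery(row, table_id="my-project.ml_logs.predictions", flush_at=500):
    """Buffer rows in memory and flush in batches so the request path stays fast."""
    _buffer.append(row)
    if len(_buffer) >= flush_at:
        errors = _bq_client.insert_rows_json(table_id, _buffer)  # streaming insert
        if errors:
            logging.error("BigQuery insert errors: %s", errors)
        _buffer.clear()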

Pattern 11: Feature Flags for Model Logic, Not Model Weights

Feature flags are great for toggling new models on/off. But I’ve seen teams use them wrong: they gate the model artifact path.

if feature_flags.is_enabled('use_v2_model'):
    model = load('models/v2.onnx')
else:
    model = load('models/v1.onnx')

The problem: you can’t A/B test this. The flag is global — either everyone gets v2 or everyone gets v1.

Better: use the flag to control rollout logic, not model selection.

def get_model_version(user_id):
    if feature_flags.is_enabled_for_user('use_v2_model', user_id):
        return 'v2'
    return 'v1'

model_version = get_model_version(request.user_id)
model = model_registry.load(model_version)

Now you can A/B test: 10% of users get v2, 90% get v1. You can target by user cohort, geography, or a hash of the user ID (for reproducibility).
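
If your flag system can’t do per-user targeting, a deterministic hash bucket is a minimal substitute (the 10% split is illustrative):

import hashlib

def get_model_version(user_id, v2_percent=10):
    # Hash the user ID so assignment is stable across requests and restarts
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return 'v2' if bucket < v2_percent else 'v1'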

Pattern 12: Training/Serving Skew Check in CI

The most insidious bug: your training code and serving code look like they do the same thing, but they don’t. Different rounding. Different handling of nulls. Different feature ordering.

Fix: unit test that the training and serving pipelines produce identical outputs.

import pytest
import numpy as np
from train import preprocess_for_training
from serve import preprocess_for_serving

def test_training_serving_consistency():
    """Training and serving must produce identical preprocessed features"""
    raw_input = {
        'age': 35,
        'income': 75000.5,
        'country': 'US',
        'score': None  # Edge case: missing value
    }

    # Preprocess with training pipeline
    train_features = preprocess_for_training(raw_input)

    # Preprocess with serving pipeline
    serve_features = preprocess_for_serving(raw_input)

    # Must match exactly (not just "close enough")
    np.testing.assert_array_equal(train_features, serve_features)

If this test breaks, your deploy is blocked. You don’t ship the skew to production.

I run this on every commit. Saved me twice: once when someone changed the serving code to handle nulls differently, once when a dependency update changed pandas’ default dtype inference.

When These Patterns Break Down

A few caveats:

  • Pattern 1 (feature store as cache): Doesn’t work if your features are large (e.g., embeddings). You might actually need the database approach.
  • Pattern 5 (traffic shadowing): Doubles your infrastructure cost. Only worth it if the cost of a bad deploy is higher.
  • Pattern 6 (async for batching): Assumes I/O-bound workload. If you’re running BERT inference on CPU, you need multiprocessing or a GPU.
  • Pattern 10 (separate metric systems): Adds operational complexity. For a small team, putting everything in one place (even if suboptimal) might be the right call.

I’d guess these patterns apply to ~80% of production ML systems. The other 20% are either too simple (a cron job that runs a model once a day) or too complex (multi-model ensemble serving with microsecond SLAs).

FAQ

Q: Should I use a managed MLOps platform (SageMaker, Vertex AI) or build this myself?

Depends on team size and ML maturity. If you have <3 models in production and a small team, use a managed platform — you’ll ship faster. If you have 10+ models and specific performance needs (e.g., sub-50ms latency), you’ll probably outgrow the platform and need custom infra. I’ve done both. The middle ground is the hardest: too complex for managed, too small to justify a dedicated ML platform team.

Q: How do I decide what to log for drift detection?

Start with feature distributions (mean, std, quantiles) and prediction distributions. If you have ground truth labels (even delayed), log those too. Don’t log every single prediction to BigQuery on day one — it’s expensive and you won’t use 90% of it. Add more logging when you hit a specific question you can’t answer.
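
A sketch of that starting point, assuming yesterday’s prediction logs are already in a Parquet file (path and column names are illustrative):

import pandas as pd

logs = pd.read_parquet('predictions_yesterday.parquet')
summary = logs[['age', 'income', 'score', 'prediction']].describe(
    percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])
print(summary)  # diff against the same summary computed on the training set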

Q: What’s the most common production failure mode you’ve seen?

Silent training/serving skew. The model works great offline, accuracy drops 15 points in production, and nobody notices for a week because the API returns 200 OK. The fix: log predictions, run offline evaluation on production traffic, and add the CI test from Pattern 12.

What I Haven’t Figured Out Yet

Model retraining cadence still feels arbitrary to me. “Retrain weekly” works until it doesn’t. I’ve seen drift detection trigger retraining too aggressively (every minor shift isn’t worth the deploy cost) and too conservatively (accuracy decayed 10 points before the alert fired). The ideal trigger probably varies by model and domain, but I don’t have a principled way to set it yet.

If you’ve solved this, I’d be curious to hear how you’re doing it.
