- Feature stores work best as caches with fallbacks, not single sources of truth — triple fallback (memory → Redis → compute) cuts P99 latency by 75%.
- Traffic shadowing beats canary deploys: send 100% to old model, log new model outputs without serving them to catch failures before users see them.
- Training/serving skew is the silent killer — CI tests that verify identical preprocessing outputs between training and serving pipelines prevent 15-point accuracy drops.
The MLOps Pattern You Actually Need
Most MLOps tutorials show you how to build a perfect pipeline. Then production happens.
Your model works in the notebook. It crashes in Docker. The preprocessing that took 2 seconds on your laptop takes 40 seconds on the API server. The monitoring dashboard you spent a week building never caught the bug that cost you three days of bad predictions.
Here are 12 patterns I’ve used across five production ML systems — not a comprehensive list of best practices, just the ones that solved actual problems. Some are obvious in hindsight. A few contradict what the documentation recommends. All of them have saved me from 2am debugging sessions.

Pattern 1: Feature Store as a Cache, Not a Database
The first feature store I built tried to be the single source of truth. Every prediction fetched features from Redis, every training job read from S3, everything stayed in sync.
It was architecturally beautiful and operationally fragile.
The breaking point: a network hiccup between the API server and Redis added 200ms to every prediction. Our SLA was 100ms total. The feature store became the bottleneck.
What actually works: treat the feature store as a cache with a fallback. Serve predictions from in-memory features when possible, fall back to the store only when you need fresh data. For batch predictions, bypass it entirely.
import json
import time

import redis

class FeatureServer:
    def __init__(self, redis_client, fallback_fn):
        self.cache = redis_client
        self.fallback = fallback_fn
        self._local_cache = {}  # In-memory for sub-millisecond access

    def get_features(self, user_id, max_age_seconds=60):
        # Try in-memory first
        if user_id in self._local_cache:
            features, timestamp = self._local_cache[user_id]
            if time.time() - timestamp < max_age_seconds:
                return features

        # Try Redis
        try:
            cached = self.cache.get(f"features:{user_id}")
            if cached:
                features = json.loads(cached)
                self._local_cache[user_id] = (features, time.time())
                return features
        except redis.ConnectionError:
            # Don't fail the prediction over a cache miss
            pass

        # Compute and cache
        features = self.fallback(user_id)
        self._local_cache[user_id] = (features, time.time())

        # Best-effort cache write (don't block on it)
        try:
            self.cache.setex(f"features:{user_id}", 300, json.dumps(features))
        except redis.RedisError:
            pass  # Log this, but don't crash

        return features
The triple fallback (memory → Redis → compute) cut P99 latency from 180ms to 45ms. And the predictions kept working when Redis went down.
Pattern 2: Model Versioning in the Path, Not the Code
Here’s how model versioning breaks in production: you add a version parameter to your API, update your code to route v1 vs v2, deploy, and realize you need to redeploy every time you want to change which model serves which version.
What works: bake the version into the model artifact path, use a symlink or config file to control routing.
# models/
#   production -> v2/   (symlink)
#   v1/
#     model.onnx
#     preprocessor.pkl
#   v2/
#     model.onnx
#     preprocessor.pkl
from pathlib import Path

import onnxruntime

class ModelRegistry:
    def __init__(self, base_path="models"):
        self.base_path = Path(base_path)

    def load_model(self, alias="production"):
        model_path = self.base_path / alias
        # Follows the symlink automatically
        return onnxruntime.InferenceSession(str(model_path / "model.onnx"))

    def promote(self, version, alias="production"):
        """Switch production to a different version — no code deploy needed"""
        symlink = self.base_path / alias
        if symlink.is_symlink() or symlink.exists():
            symlink.unlink()
        # Relative target, so the link reads production -> v2/ inside models/
        symlink.symlink_to(version, target_is_directory=True)
Now you can A/B test by changing a symlink, not by shipping code. Your CI/CD can validate model versions independently of deployment.
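Here's roughly what that validation step can look like before a promote, as a sketch against the registry above. The validate_and_promote helper and the smoke input are hypothetical, and the smoke input has to match your model's expected shape and dtype:
import numpy as np

def validate_and_promote(registry, version, smoke_input):
    """Promote `version` only if it loads and produces finite outputs."""
    session = registry.load_model(alias=version)  # loads models/<version>/model.onnx directly
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: smoke_input})
    if not np.all(np.isfinite(outputs[0])):
        raise ValueError(f"Smoke test failed for {version}: non-finite outputs")
    registry.promote(version)  # re-points the production symlink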
Pattern 3: Preprocessing Once, Serialize It
The silent killer in ML serving: you preprocess the same way in training and inference, but you reimplement it twice. Training uses pandas. Inference uses a hand-rolled function. They drift.
I debugged this for two days once. The model’s accuracy in production was 8 points lower than validation. Turned out the inference code normalized features with mean=0, std=1 hardcoded, while training had computed mean=0.03, std=1.12 from the data.
Fix: serialize the preprocessing pipeline, not just the model.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
import joblib

# Training
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income', 'score']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['country', 'device'])
])

full_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', XGBClassifier())
])

full_pipeline.fit(X_train, y_train)
joblib.dump(full_pipeline, 'pipeline.pkl')  # Save the WHOLE thing

# Inference (guaranteed identical preprocessing)
pipeline = joblib.load('pipeline.pkl')
prediction = pipeline.predict(raw_input)  # No manual preprocessing
If you can’t use sklearn pipelines (e.g., you’re doing custom text preprocessing), write a Preprocessor class and pickle it. The key: train and serve from the same serialized artifact.
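For illustration, here's a minimal sketch of such a Preprocessor for a toy bag-of-words case. The class and its vocabulary logic are made up for the example; the point is that whatever fit() learns travels with the pickle, so training and serving can never drift apart.
import pickle

import numpy as np

class Preprocessor:
    """Fit once on training data, pickle it, load the same object at serving time."""

    def fit(self, texts):
        # Learn state during training (here: a vocabulary)
        tokens = sorted({tok for text in texts for tok in text.split()})
        self.vocab = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, texts):
        # Bag-of-words counts using the fitted vocab, identical at train and serve time
        out = np.zeros((len(texts), len(self.vocab)), dtype=np.float32)
        for row, text in enumerate(texts):
            for tok in text.split():
                if tok in self.vocab:
                    out[row, self.vocab[tok]] += 1
        return out

# Training
prep = Preprocessor().fit(["great product", "terrible support"])
with open('preprocessor.pkl', 'wb') as f:
    pickle.dump(prep, f)

# Serving: load the exact same artifact, never re-implement transform()
with open('preprocessor.pkl', 'rb') as f:
    prep = pickle.load(f)
features = prep.transform(["great support"])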
Pattern 4: Log Predictions, Not Just Errors
Most ML monitoring watches for exceptions and 500s. The silent failure: your model starts predicting garbage, but the API returns 200 OK.
I’ve seen this happen three ways:
- Upstream data schema change: a feature that used to be float became string. The model accepted it (coercion happened silently), but predictions were random.
- Label leakage fixed: training data had a bug that leaked the target into features. When we fixed it, accuracy dropped 20 points. Monitoring never fired because the API worked fine.
- Gradual drift: user behavior shifted over 6 months. The model’s precision slowly decayed from 0.85 to 0.62. No single deploy caused it.
The pattern: log every prediction with input features and model output to a data warehouse. Run daily jobs to catch drift.
import json
import logging
from datetime import datetime

class PredictionLogger:
    def __init__(self, logger):
        self.logger = logger

    def log(self, model_version, features, prediction, metadata=None):
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'model_version': model_version,
            'features': features,
            'prediction': prediction,
            'metadata': metadata or {}
        }
        # Structured logging — ingested by ELK, BigQuery, or S3 + Athena
        self.logger.info(json.dumps(log_entry))

# In your inference endpoint
pred_logger = PredictionLogger(logging.getLogger('predictions'))

@app.post("/predict")
def predict(request: PredictRequest):
    features = extract_features(request)
    prediction = model.predict([features])[0]
    pred_logger.log(
        model_version="v2",
        features=features,
        prediction=float(prediction),
        metadata={'user_id': request.user_id, 'latency_ms': ...}
    )
    return {"prediction": prediction}
Now you can query: “Show me the distribution of predictions for model v2 in the last 24 hours.” You can detect drift before users complain.
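For the daily job itself, here's a minimal sketch of one way to flag drift, assuming you can pull a reference window and today's predictions out of the warehouse as arrays. The function name and threshold are illustrative, not part of any library:
import numpy as np
from scipy.stats import ks_2samp

def check_prediction_drift(reference_preds, todays_preds, alpha=0.01):
    """Flag drift when today's predictions look drawn from a different distribution."""
    stat, p_value = ks_2samp(reference_preds, todays_preds)
    return {
        'ks_statistic': float(stat),
        'p_value': float(p_value),
        'drifted': bool(p_value < alpha),
        'reference_mean': float(np.mean(reference_preds)),
        'today_mean': float(np.mean(todays_preds)),
    }
A two-sample KS test on raw predictions is crude, but it's usually enough to page someone before a week of quiet decay.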
Pattern 5: Canary with Traffic Shadowing, Not Load Balancing
Standard canary: route 5% of traffic to the new model, monitor metrics, roll out if it looks good.
The problem: if the new model is bad, 5% of your users get bad predictions. And if the failure mode is subtle (slightly worse precision, not crashes), you might not catch it until you’ve shipped to 50%.
Shadowing: route 100% of traffic to the old model for actual predictions, but also send requests to the new model and log its outputs without returning them.
import asyncio
import json

class ShadowPredictor:
    def __init__(self, primary_model, shadow_model, logger):
        self.primary = primary_model
        self.shadow = shadow_model
        self.logger = logger

    async def predict(self, features):
        # Primary model (blocks the request)
        primary_pred = self.primary.predict([features])[0]
        # Shadow model (non-blocking)
        asyncio.create_task(self._shadow_predict(features, primary_pred))
        return primary_pred

    async def _shadow_predict(self, features, primary_pred):
        try:
            shadow_pred = self.shadow.predict([features])[0]
            self.logger.info(json.dumps({
                'primary': float(primary_pred),
                'shadow': float(shadow_pred),
                'diff': abs(float(primary_pred) - float(shadow_pred))
            }))
        except Exception as e:
            # Shadow failures don't affect users
            self.logger.error(f"Shadow prediction failed: {e}")
After a few hours, query the logs: “What % of shadow predictions differ from primary by more than 0.1?” If the new model crashes on 5% of inputs, you see it in the logs, not in user complaints.
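One way to answer that question with pandas, assuming the shadow logger writes one bare JSON object per line and that the file is called shadow_predictions.log (both are assumptions about your logging setup):
import json

import pandas as pd

with open('shadow_predictions.log') as f:
    records = [json.loads(line) for line in f]
df = pd.DataFrame(records)

# Share of requests where the new model disagrees meaningfully with the old one
disagreement_rate = (df['diff'] > 0.1).mean()
print(f"{disagreement_rate:.1%} of shadow predictions differ from primary by > 0.1")
print(df['diff'].describe(percentiles=[0.5, 0.95, 0.99]))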

Pattern 6: Batch Predict with async/await, Not ThreadPoolExecutor
When you need to score a million rows, the naive approach is:
for row in rows:
    prediction = model.predict(row)
    save(prediction)
Slow. The better approach used to be ThreadPoolExecutor for parallelism. But I’ve found async/await simpler and often faster, especially when I/O (database writes, API calls) is part of the loop.
import asyncio

import aiohttp

async def predict_and_save(session, row):
    # Model inference (CPU-bound, but fast)
    prediction = model.predict([row['features']])[0]
    # Save to API (I/O-bound)
    async with session.post('https://api.example.com/save', json={
        'id': row['id'],
        'prediction': float(prediction)
    }) as resp:
        return await resp.json()

async def batch_predict(rows):
    async with aiohttp.ClientSession() as session:
        tasks = [predict_and_save(session, row) for row in rows]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

# Run
results = asyncio.run(batch_predict(rows))
On a batch of 10,000 predictions with network writes, this cut runtime from 12 minutes (sequential) to 90 seconds (async). The key: the I/O wait time overlaps instead of stacking.
If your inference is CPU-heavy (e.g., large transformer models), you still need multiprocessing. But for lightweight models + I/O, async wins.
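For that CPU-heavy case, here's a rough sketch with ProcessPoolExecutor. It assumes a load_my_model() helper that returns your model object; each worker process loads its own copy, so the model never has to be pickled across processes.
from concurrent.futures import ProcessPoolExecutor

_worker_model = None

def _init_worker():
    # Runs once per worker process
    global _worker_model
    _worker_model = load_my_model()

def _predict_chunk(chunk):
    return [float(_worker_model.predict([row['features']])[0]) for row in chunk]

def batch_predict_cpu(rows, workers=4, chunk_size=1000):
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    results = []
    with ProcessPoolExecutor(max_workers=workers, initializer=_init_worker) as pool:
        for chunk_preds in pool.map(_predict_chunk, chunks):
            results.extend(chunk_preds)
    return results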
Pattern 7: Model Warmup in Health Checks
Kubernetes kills your pod. A new one starts. The health check passes (the process is running). Traffic arrives. The first request takes 8 seconds because the model isn’t loaded yet.
What works: lazy-load the model on startup, but include a warmup step in the readiness probe.
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import numpy as np

app = FastAPI()
model = None  # Lazy load

@app.on_event("startup")
async def load_model():
    global model
    model = load_my_model()  # This might take 2-3 seconds
    # Warmup: run a dummy prediction to JIT-compile, load into cache, etc.
    dummy_input = np.zeros((1, 10), dtype=np.float32)
    _ = model.predict(dummy_input)

@app.get("/health/ready")
def readiness():
    if model is None:
        return JSONResponse(status_code=503, content={"status": "not ready"})
    # Bonus: verify the model actually works
    try:
        test_input = np.ones((1, 10), dtype=np.float32)
        _ = model.predict(test_input)
        return {"status": "ready"}
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "error", "detail": str(e)})
Now Kubernetes won’t route traffic until the model is loaded AND verified. First-request latency drops from 8s to 50ms.
Pattern 8: Exponential Backoff for Upstream Failures
Your ML service calls an upstream API for features. The API is rate-limited: 100 requests/second. You send 150. It returns 429.
The naive retry: a bare loop that fires the request again three times with no delay. This hammers the API three more times immediately, burning through your quota and probably getting you banned.
What works: exponential backoff with jitter.
import time
import random

import requests

def fetch_features_with_retry(user_id, max_retries=5):
    for attempt in range(max_retries):
        try:
            resp = requests.get(f"https://api.example.com/features/{user_id}")
            resp.raise_for_status()
            return resp.json()
        except requests.HTTPError as e:
            if e.response.status_code == 429:  # Rate limited
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff: 2^attempt seconds, plus jitter
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
            else:
                raise
    raise Exception("Max retries exceeded")
The exponential wait of 2^attempt seconds spreads out retries, and the random jitter avoids the thundering herd problem (multiple services retrying at the exact same time).
Pattern 9: Schema Validation at the API Boundary
Your FastAPI endpoint accepts PredictRequest, which has 10 features. One day, a client sends a request with 9 features. Pydantic coerces the missing one to None. Your model crashes with a cryptic error about array shapes.
Or worse: the client sends the features in the wrong order. The model accepts it (still 10 floats), but predictions are garbage.
The fix: strict schema validation with explicit feature names.
from pydantic import BaseModel, validator
from typing import Dict

class PredictRequest(BaseModel):
    features: Dict[str, float]  # Explicit keys, not a list

    @validator('features')
    def check_required_features(cls, v):
        required = {'age', 'income', 'score', 'days_since_signup', ...}
        missing = required - v.keys()
        if missing:
            raise ValueError(f"Missing features: {missing}")
        extra = v.keys() - required
        if extra:
            raise ValueError(f"Unexpected features: {extra}")
        return v

@app.post("/predict")
def predict(request: PredictRequest):
    # Convert to an array in the CORRECT order
    feature_order = ['age', 'income', 'score', ...]  # Matches training
    features = [request.features[k] for k in feature_order]
    prediction = model.predict([features])[0]
    return {"prediction": float(prediction)}
Now if the client messes up, they get a 422 Unprocessable Entity with a clear error message. Not a 500 with a stack trace.
Pattern 10: Model Metrics in Prometheus, Business Metrics in BigQuery
Monitoring split: infrastructure metrics (latency, throughput, error rate) go to Prometheus for real-time alerting. Prediction-level metrics (accuracy, drift, feature distributions) go to a data warehouse for offline analysis.
Don’t try to do both in one system. Prometheus scrapes every 15 seconds — too frequent for analyzing a week of prediction drift. BigQuery has 10-second query latency — too slow for a P95 latency alert.
from prometheus_client import Counter, Histogram
import time
from datetime import datetime

# Prometheus metrics (real-time)
prediction_count = Counter('predictions_total', 'Total predictions', ['model_version'])
prediction_latency = Histogram('prediction_latency_seconds', 'Prediction latency')

@app.post("/predict")
def predict(request: PredictRequest):
    start = time.time()
    prediction = model.predict(...)

    # Update Prometheus
    prediction_count.labels(model_version="v2").inc()
    prediction_latency.observe(time.time() - start)

    # Log to BigQuery (async, buffered writer, not shown here)
    log_to_bigquery({
        'timestamp': datetime.utcnow(),
        'features': request.features,
        'prediction': prediction,
        'model_version': 'v2'
    })
    return {"prediction": prediction}
Prometheus: “Alert me if P95 latency > 200ms for 5 minutes.”
BigQuery: “Show me the distribution of predicted probabilities for model v2 vs v1, grouped by user country, for the last 30 days.”
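For the warehouse side, this is roughly the query I mean, via the BigQuery Python client. The table name and columns follow the fields logged above but are assumptions about your setup, and I've dropped the country grouping, which you'd only have if you log it:
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT
  model_version,
  APPROX_QUANTILES(prediction, 100)[OFFSET(50)] AS p50,
  APPROX_QUANTILES(prediction, 100)[OFFSET(95)] AS p95,
  AVG(prediction) AS mean_prediction,
  COUNT(*) AS n
FROM `my_project.ml_logs.predictions`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY model_version
"""
df = client.query(sql).to_dataframe()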
Pattern 11: Feature Flags for Model Logic, Not Model Weights
Feature flags are great for toggling new models on/off. But I’ve seen teams use them wrong: they gate the model artifact path.
if feature_flags.is_enabled('use_v2_model'):
    model = load('models/v2.onnx')
else:
    model = load('models/v1.onnx')
The problem: you can’t A/B test this. The flag is global — either everyone gets v2 or everyone gets v1.
Better: use the flag to control rollout logic, not model selection.
def get_model_version(user_id):
    if feature_flags.is_enabled_for_user('use_v2_model', user_id):
        return 'v2'
    return 'v1'

model_version = get_model_version(request.user_id)
model = model_registry.load_model(model_version)
Now you can A/B test: 10% of users get v2, 90% get v1. You can target by user cohort, geography, or a hash of the user ID (for reproducibility).
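If your flag system can't target individual users, hashing the user ID gives you the same reproducible split. A small sketch; the function name and rollout percentage are illustrative:
import hashlib

def get_model_version_by_hash(user_id, v2_rollout_pct=10):
    """Stable assignment: the same user always lands in the same bucket."""
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return 'v2' if bucket < v2_rollout_pct else 'v1'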
Pattern 12: Training/Serving Skew Check in CI
The most insidious bug: your training code and serving code look like they do the same thing, but they don’t. Different rounding. Different handling of nulls. Different feature ordering.
Fix: unit test that the training and serving pipelines produce identical outputs.
import pytest
import numpy as np
from train import preprocess_for_training
from serve import preprocess_for_serving

def test_training_serving_consistency():
    """Training and serving must produce identical preprocessed features"""
    raw_input = {
        'age': 35,
        'income': 75000.5,
        'country': 'US',
        'score': None  # Edge case: missing value
    }
    # Preprocess with the training pipeline
    train_features = preprocess_for_training(raw_input)
    # Preprocess with the serving pipeline
    serve_features = preprocess_for_serving(raw_input)
    # Must match exactly (not just "close enough")
    np.testing.assert_array_equal(train_features, serve_features)
If this test breaks, your deploy is blocked. You don’t ship the skew to production.
I run this on every commit. Saved me twice: once when someone changed the serving code to handle nulls differently, once when a dependency update changed pandas’ default dtype inference.
When These Patterns Break Down
A few caveats:
- Pattern 1 (feature store as cache): Doesn’t work if your features are large (e.g., embeddings). You might actually need the database approach.
- Pattern 5 (traffic shadowing): Doubles your infrastructure cost. Only worth it if the cost of a bad deploy is higher.
- Pattern 6 (async for batching): Assumes I/O-bound workload. If you’re running BERT inference on CPU, you need multiprocessing or a GPU.
- Pattern 10 (separate metric systems): Adds operational complexity. For a small team, putting everything in one place (even if suboptimal) might be the right call.
I’d guess these patterns apply to ~80% of production ML systems. The other 20% are either too simple (a cron job that runs a model once a day) or too complex (multi-model ensemble serving with microsecond SLAs).
FAQ
Q: Should I use a managed MLOps platform (SageMaker, Vertex AI) or build this myself?
Depends on team size and ML maturity. If you have <3 models in production and a small team, use a managed platform — you’ll ship faster. If you have 10+ models and specific performance needs (e.g., sub-50ms latency), you’ll probably outgrow the platform and need custom infra. I’ve done both. The middle ground is the hardest: too complex for managed, too small to justify a dedicated ML platform team.
Q: How do I decide what to log for drift detection?
Start with feature distributions (mean, std, quantiles) and prediction distributions. If you have ground truth labels (even delayed), log those too. Don’t log every single prediction to BigQuery on day one — it’s expensive and you won’t use 90% of it. Add more logging when you hit a specific question you can’t answer.
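As a starting point, here's a tiny sketch of that daily feature summary, assuming JSON-lines prediction logs in the shape of Pattern 4; the file name is illustrative:
import pandas as pd

# One day of prediction logs, one JSON object per line
df = pd.read_json('predictions_2024-06-01.jsonl', lines=True)

# Expand the nested feature dict into columns and summarize the distributions
features = pd.json_normalize(df['features'])
summary = features.describe(percentiles=[0.05, 0.5, 0.95]).T
print(summary[['mean', 'std', '5%', '50%', '95%']])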
Q: What’s the most common production failure mode you’ve seen?
Silent training/serving skew. The model works great offline, accuracy drops 15 points in production, and nobody notices for a week because the API returns 200 OK. The fix: log predictions, run offline evaluation on production traffic, and add the CI test from Pattern 12.
What I Haven’t Figured Out Yet
Model retraining cadence still feels arbitrary to me. “Retrain weekly” works until it doesn’t. I’ve seen drift detection trigger retraining too aggressively (every minor shift isn’t worth the deploy cost) and too conservatively (accuracy decayed 10 points before the alert fired). The ideal trigger probably varies by model and domain, but I don’t have a principled way to set it yet.
If you’ve solved this, I’d be curious to hear how you’re doing it.