MLflow + FastAPI: $2K/Month Model Serving Side Project

⚡ Key Takeaways
  • MLflow + FastAPI + $12 DigitalOcean droplet serves 70k predictions/month at $2,100 revenue with 18ms median latency
  • Gradient boosting (AUC 0.83) beats deep learning for tabular churn prediction on 50k samples with 2-minute training time
  • Feature engineering (MRR per tenure month, tickets per month) improved AUC from 0.72 to 0.83—more impactful than model architecture
  • Stripe webhooks + tiered pricing ($29-$199/month) is simpler than AWS usage metering for side project billing
  • Skip Kubernetes, dashboards, and free tiers early—boring infrastructure and paid-only access made the project profitable

The Stack That Actually Works

I’ll cut to the chase: MLflow tracking + FastAPI serving + DigitalOcean droplet + Stripe webhooks. That’s the blueprint. The model is a gradient boosting classifier predicting customer churn for small SaaS companies at $0.03 per prediction. Monthly revenue hovers around $2,100 with 70k API calls.

Why this stack? Because it’s boring, well-documented, and doesn’t break at 3am. I tried the “modern” approach first—containerized everything with Kubernetes, set up auto-scaling, added Prometheus metrics. Burned through my first month’s revenue on infrastructure costs before a single paying customer.

Here’s the actual production setup running right now.

# app/main.py
from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel, Field
import mlflow
import numpy as np
from typing import Optional
import hashlib
import time
from functools import lru_cache

app = FastAPI(title="Churn Prediction API", version="1.2.3")

# Load model once at startup - not on every request
@lru_cache(maxsize=1)
def get_model():
    model_uri = "models:/churn-predictor/production"
    model = mlflow.pyfunc.load_model(model_uri)
    return model

class ChurnRequest(BaseModel):
    customer_id: str
    monthly_charges: float = Field(..., gt=0, description="MRR in USD")
    tenure_days: int = Field(..., ge=0)
    support_tickets: int = Field(default=0, ge=0)
    feature_usage_pct: float = Field(..., ge=0, le=100)
    payment_failures: int = Field(default=0, ge=0)

    class Config:
        json_schema_extra = {
            "example": {
                "customer_id": "cus_ABC123",
                "monthly_charges": 49.99,
                "tenure_days": 180,
                "support_tickets": 3,
                "feature_usage_pct": 42.5,
                "payment_failures": 1
            }
        }

class ChurnResponse(BaseModel):
    customer_id: str
    churn_probability: float
    risk_category: str  # low, medium, high
    model_version: str
    inference_time_ms: float

# Dead simple API key check - good enough for now
async def verify_api_key(x_api_key: str = Header(...)):
    # In production this hits Redis with hashed keys
    # For the first 3 months I just had a hardcoded set in an env var
    if not hashlib.sha256(x_api_key.encode()).hexdigest().startswith("a7b3c"):
        raise HTTPException(status_code=401, detail="Invalid API key")
    return x_api_key

@app.post("/predict", response_model=ChurnResponse)
async def predict_churn(
    request: ChurnRequest,
    api_key: str = Depends(verify_api_key)
):
    start = time.perf_counter()

    model = get_model()

    # Feature engineering - same transformations as training
    features = np.array([[
        request.monthly_charges,
        request.tenure_days,
        request.support_tickets,
        request.feature_usage_pct / 100.0,  # normalize
        request.payment_failures,
        request.monthly_charges / max(request.tenure_days / 30, 1),  # MRR per month tenure
        1 if request.payment_failures > 0 else 0,  # had_payment_issue flag
        request.support_tickets / max(request.tenure_days / 30, 1)  # tickets per month
    ]])

    # pyfunc output shape varies by flavor: predict_proba gives (n, 2),
    # plain predict gives a 1-D array - take the positive-class probability either way
    pred = np.asarray(model.predict(features))
    if pred.ndim > 1:
        churn_prob = float(pred[0][-1])  # probability of the churn class
    else:
        churn_prob = float(pred[0])

    # Business logic thresholds - tuned based on client feedback
    if churn_prob < 0.3:
        risk = "low"
    elif churn_prob < 0.6:
        risk = "medium"
    else:
        risk = "high"

    elapsed_ms = (time.perf_counter() - start) * 1000

    return ChurnResponse(
        customer_id=request.customer_id,
        churn_probability=round(churn_prob, 4),
        risk_category=risk,
        model_version=model.metadata.run_id[:8],  # shortened for cleanliness
        inference_time_ms=round(elapsed_ms, 2)
    )

@app.get("/health")
async def health_check():
    # Actually test model loading - not just return 200
    try:
        model = get_model()
        return {"status": "healthy", "model_loaded": True}
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Model load failed: {str(e)}")

The model loads once and stays cached via lru_cache instead of being reloaded per request. The first version reloaded it on every request, and median latency was 340ms. Now it’s 18ms.
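
If you want that ~800ms MLflow load to happen before the first billable request instead of on it, a startup hook does the trick. A minimal sketch, added to the same app/main.py:

# Warm the cache at process start so request #1 doesn't pay the model-load cost
@app.on_event("startup")
async def warm_model_cache():
    get_model()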


Why MLflow Instead of Just Pickling Models

Because you will retrain. And when you do, you need to know exactly which features, hyperparameters, and data version produced the current production model.

MLflow gives you that automatically. Here’s the training script:

# training/train.py
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve
import pandas as pd
import numpy as np

mlflow.set_tracking_uri("http://localhost:5000")  # MLflow server
mlflow.set_experiment("churn-prediction")

def engineer_features(df):
    """Same transformations as API - keep this DRY in real projects"""
    df['mrr_per_tenure_month'] = df['monthly_charges'] / np.maximum(df['tenure_days'] / 30, 1)
    df['had_payment_issue'] = (df['payment_failures'] > 0).astype(int)
    df['tickets_per_month'] = df['support_tickets'] / np.maximum(df['tenure_days'] / 30, 1)
    df['usage_normalized'] = df['feature_usage_pct'] / 100.0
    return df

def train_model(data_path, n_estimators=200, learning_rate=0.1, max_depth=5):
    with mlflow.start_run(run_name=f"gbm_n{n_estimators}_lr{learning_rate}"):
        # Log parameters
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("learning_rate", learning_rate)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("data_source", data_path)

        # Load and split data
        df = pd.read_csv(data_path)
        mlflow.log_param("n_samples", len(df))
        mlflow.log_param("churn_rate", df['churned'].mean())

        df = engineer_features(df)

        feature_cols = [
            'monthly_charges', 'tenure_days', 'support_tickets',
            'usage_normalized', 'payment_failures', 'mrr_per_tenure_month',
            'had_payment_issue', 'tickets_per_month'
        ]

        X = df[feature_cols]
        y = df['churned']

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )

        # Train model
        model = GradientBoostingClassifier(
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            max_depth=max_depth,
            random_state=42
        )
        model.fit(X_train, y_train)

        # Evaluate
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_pred_proba)

        # Find optimal threshold for F1
        precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
        f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
        best_threshold = thresholds[np.argmax(f1_scores)]

        mlflow.log_metric("test_auc", auc)
        mlflow.log_metric("best_f1_threshold", best_threshold)
        mlflow.log_metric("max_f1", np.max(f1_scores))

        # Log feature importances
        feature_importance = pd.DataFrame({
            'feature': feature_cols,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)

        mlflow.log_text(feature_importance.to_string(), "feature_importances.txt")

        # Log model - pyfunc_predict_fn="predict_proba" makes the pyfunc flavor
        # return probabilities instead of 0/1 labels when the API calls predict()
        mlflow.sklearn.log_model(
            model,
            "model",
            registered_model_name="churn-predictor",
            pyfunc_predict_fn="predict_proba"
        )

        print(f"Run ID: {mlflow.active_run().info.run_id}")
        print(f"Test AUC: {auc:.4f}")
        print(f"Best F1: {np.max(f1_scores):.4f} at threshold {best_threshold:.3f}")

        return mlflow.active_run().info.run_id

if __name__ == "__main__":
    run_id = train_model(
        "data/churn_training_2026_01.csv",
        n_estimators=300,
        learning_rate=0.05,
        max_depth=6
    )

Every training run logs hyperparameters, metrics, and the model artifact. When you want to promote a model to production:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote specific run to production
run_id = "a7f3c9d2e1b8"
model_uri = f"runs:/{run_id}/model"

result = mlflow.register_model(model_uri, "churn-predictor")
version = result.version

# Transition to production
client.transition_model_version_stage(
    name="churn-predictor",
    version=version,
    stage="Production"
)

The FastAPI app automatically picks up the new production model on next startup. No manual file copying, no “which pickle was production again?”
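
If you’d rather not wait for a restart, one option is an admin hook that clears the cache so the next request pulls whatever is currently in the Production stage. A rough sketch (the /admin/reload-model route and the ADMIN_KEY env var are hypothetical, not part of the app above):

# Sketch: reload the Production model without restarting the process
import os
from fastapi import Header, HTTPException

@app.post("/admin/reload-model")
async def reload_model(x_admin_key: str = Header(...)):
    if x_admin_key != os.getenv("ADMIN_KEY"):
        raise HTTPException(status_code=401, detail="Invalid admin key")
    get_model.cache_clear()   # drop the cached pyfunc model
    model = get_model()       # load whatever is in the Production stage now
    return {"reloaded": True, "run_id": model.metadata.run_id[:8]}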

The Math That Actually Matters

Churn prediction is binary classification. The model outputs a probability $P(\text{churn} \mid X)$, where $X$ is the feature vector. Gradient boosting builds an ensemble of weak learners (decision trees) by iteratively minimizing the loss:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

That’s binary cross-entropy loss. Each new tree $h_m(X)$ is fit to the negative gradient of this loss, and the final prediction is:

$$F_M(X) = \sum_{m=1}^{M} \nu \cdot h_m(X)$$

where $\nu$ is the learning rate (I use 0.05) and $M$ is the number of trees (300 in production).

But here’s what actually moves the needle: feature engineering. The raw features have AUC ~0.72. Adding mrr_per_tenure_month (monthly revenue normalized by tenure) bumps it to 0.79. Adding tickets_per_month gets to 0.83. The model architecture is less important than the features you feed it.

ROC-AUC is fine for model comparison, but clients care about precision at high probability thresholds. If you predict 80% churn probability, you better be right 85%+ of the time, or they’ll stop trusting the API. I track precision-at-90 as the key metric:

$$\text{Precision}_{@0.9} = \frac{\text{TP where } p \geq 0.9}{\text{TP where } p \geq 0.9 \;+\; \text{FP where } p \geq 0.9}$$

Current production model: 0.87 precision at 90% threshold. Good enough that clients act on those predictions.
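
For reference, the metric itself is a few lines of numpy. A sketch, using the test split from train.py:

# Sketch: precision among the predictions the model is most confident about
import numpy as np

def precision_at_threshold(y_true, y_prob, threshold=0.9):
    """Precision over rows where predicted churn probability >= threshold."""
    confident = y_prob >= threshold
    if confident.sum() == 0:
        return float("nan")   # nothing predicted above the threshold
    return float((y_true[confident] == 1).mean())

# inside train.py, after y_pred_proba is computed:
# mlflow.log_metric("precision_at_90", precision_at_threshold(y_test.values, y_pred_proba))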


Deployment: DigitalOcean Over AWS

$12/month droplet. 2 vCPUs, 2GB RAM. Ubuntu 22.04, nginx reverse proxy, systemd service for FastAPI.

Why not AWS Lambda? Cold starts. The MLflow model takes 800ms to load. Lambda’s cold start + model load = 3-5 second response times. Clients won’t wait.

Why not ECS/Fargate? Overkill. Traffic is predictable—mostly 9am-5pm EST weekdays. A single always-on server handles 70k monthly requests without breaking a sweat.

Deployment script:

#!/bin/bash
# deploy.sh

set -e

SSH_HOST="churn-api.example.com"
SSH_USER="deploy"

echo "Building Docker image..."
docker build -t churn-api:latest .

echo "Saving image..."
docker save churn-api:latest | gzip > churn-api.tar.gz

echo "Uploading to server..."
scp churn-api.tar.gz $SSH_USER@$SSH_HOST:/tmp/

echo "Loading and restarting on server..."
ssh $SSH_USER@$SSH_HOST << 'EOF'
  cd /tmp
  docker load < churn-api.tar.gz
  docker stop churn-api || true
  docker rm churn-api || true
  # host.docker.internal resolves to the droplet itself (Docker 20.10+), so the app
  # container can reach the MLflow container published on host port 5000 -
  # plain localhost would point back at the app container
  docker run -d \
    --name churn-api \
    --restart unless-stopped \
    -p 8000:8000 \
    -v /opt/mlflow-models:/models:ro \
    --add-host=host.docker.internal:host-gateway \
    -e MLFLOW_TRACKING_URI=http://host.docker.internal:5000 \
    churn-api:latest
  rm churn-api.tar.gz
EOF

echo "Deployed successfully"

MLflow tracking server runs in a separate container on the same droplet. Models are stored in a volume mount. Not fancy, but it works.

Monetization: Stripe Webhooks Over Usage Metering

I tried AWS API Gateway usage plans first. The billing integration was a nightmare—had to poll CloudWatch metrics, reconcile with customer records, manually generate invoices.

Stripe is stupid simple:

# billing/stripe_handler.py
from fastapi import FastAPI, Request, HTTPException
import stripe
import os

stripe.api_key = os.getenv("STRIPE_SECRET_KEY")
app = FastAPI()

@app.post("/stripe-webhook")
async def stripe_webhook(request: Request):
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature")
    endpoint_secret = os.getenv("STRIPE_WEBHOOK_SECRET")

    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, endpoint_secret
        )
    except ValueError:
        raise HTTPException(status_code=400, detail="Invalid payload")
    except stripe.error.SignatureVerificationError:
        raise HTTPException(status_code=400, detail="Invalid signature")

    if event["type"] == "invoice.payment_succeeded":
        customer_id = event["data"]["object"]["customer"]
        # Reset rate limit quota for customer
        # (simplified - real version updates Redis)
        print(f"Payment succeeded for {customer_id}")

    elif event["type"] == "invoice.payment_failed":
        customer_id = event["data"]["object"]["customer"]
        # Suspend API access
        print(f"Payment failed for {customer_id} - suspending access")

    return {"status": "success"}

Pricing tiers:
– $29/month: 1,000 predictions
– $79/month: 3,000 predictions
– $199/month: 10,000 predictions
– Enterprise: custom (largest client is 25k/month at $450)

Most customers are on the $79 tier. The economics work because inference is cheap—compute cost is ~$0.0003 per prediction (the model is only 14MB, inference is 18ms). Stripe takes $0.30 + 2.9% per charge, so a $79 subscription nets roughly $76.40. Hosting is a flat $12/month across every customer, which puts gross margin per customer well above 90%.

40 paying customers = ~$2,100/month revenue. Not life-changing, but it covers rent.

What Doesn’t Scale (And Why I Don’t Care Yet)

The API key verification is hilariously insecure by enterprise standards. It’s a sha256 hash prefix check against a hardcoded value. A proper system would use Redis with short-lived JWTs, key rotation, and rate limiting per customer.

But building that would’ve taken two weeks. Current system took 45 minutes and has had zero security incidents in 8 months. When I hit 100 customers, I’ll refactor.
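
When that refactor happens, the shape of it is straightforward. A sketch of the Redis-backed version (the apikey: key schema is an assumption):

# Sketch: hashed API keys stored in Redis, looked up per request
import hashlib
import redis
from fastapi import Header, HTTPException

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

async def verify_api_key(x_api_key: str = Header(...)) -> str:
    """Returns the customer_id the key belongs to, or raises 401."""
    key_hash = hashlib.sha256(x_api_key.encode()).hexdigest()
    customer_id = r.get(f"apikey:{key_hash}")  # keys stored hashed, never in plaintext
    if customer_id is None:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return customer_id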

Monitoring is just nginx access logs + a daily cron that greps for 5xx errors and emails me. No Datadog, no Grafana dashboards, no on-call rotation. If the server goes down, I get a Pingdom alert. Uptime is 99.4% over the last 6 months—good enough.
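
The cron job itself is nothing clever. Roughly this, as a sketch (the log path, the address, and the assumption that logrotate keeps the access log to about a day are all placeholders):

# monitoring/check_5xx.py (sketch) - run daily from cron, emails if nginx logged any 5xx
import re
import smtplib
from email.message import EmailMessage

LOG_PATH = "/var/log/nginx/access.log"   # placeholder path
ALERT_TO = "me@example.com"              # placeholder address

def count_5xx(path: str) -> int:
    status_5xx = re.compile(r'" 5\d\d ')  # status field in the default nginx log format
    with open(path) as f:
        return sum(1 for line in f if status_5xx.search(line))

if __name__ == "__main__":
    n = count_5xx(LOG_PATH)
    if n > 0:
        msg = EmailMessage()
        msg["Subject"] = f"churn-api: {n} 5xx responses in the access log"
        msg["From"] = ALERT_TO
        msg["To"] = ALERT_TO
        msg.set_content("Check the nginx logs on the droplet.")
        with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA is configured
            smtp.send_message(msg)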

The model retraining is manual. I download fresh data monthly, run train.py, eyeball the metrics, promote to production if AUC improves. A real MLOps setup would have automated retraining pipelines, A/B testing, gradual rollouts. That’s a future problem.
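
Even the manual flow benefits from one guardrail: don't promote unless the new run actually beats the current Production model. A sketch of that check (it assumes the test_auc metric name from train.py):

# promote_if_better.py (sketch) - compare a candidate run's AUC against Production
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

def promote_if_better(candidate_run_id: str, model_name: str = "churn-predictor") -> bool:
    candidate_auc = client.get_run(candidate_run_id).data.metrics["test_auc"]

    prod_versions = client.get_latest_versions(model_name, stages=["Production"])
    if prod_versions:
        prod_auc = client.get_run(prod_versions[0].run_id).data.metrics["test_auc"]
        if candidate_auc <= prod_auc:
            print(f"Keeping current model: {prod_auc:.4f} >= {candidate_auc:.4f}")
            return False

    result = mlflow.register_model(f"runs:/{candidate_run_id}/model", model_name)
    client.transition_model_version_stage(model_name, result.version, stage="Production")
    print(f"Promoted version {result.version} (AUC {candidate_auc:.4f})")
    return True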

Mistakes I Made So You Don’t Have To

Month 1: Built a beautiful React dashboard for customers to visualize churn trends. Zero customers used it. They just wanted the raw API. Wasted 3 weeks.

Month 2: Tried to add a “why is this customer at risk?” explainability feature using SHAP values. Inference latency went from 18ms to 340ms. Customers complained. Rolled back.

Month 3: Offered a free tier (100 predictions/month). Got hammered by bot traffic and students using it for school projects. Killed the free tier, implemented API key verification. Revenue actually went up because serious users were happy to pay for reliability.

Month 5: Tried to expand to e-commerce churn prediction. Different feature set, different data distribution, model performed terribly (AUC 0.61). Stuck with SaaS churn—domain specificity matters more than I thought.

When This Approach Breaks Down

If you hit 1M+ requests/month, a single droplet won’t cut it. You’ll need horizontal scaling, which means:
– Load balancer (another $10-20/month)
– Multiple API servers (2-3 droplets minimum for redundancy)
– Shared model storage (S3 or similar)
– Proper API key database (Redis or PostgreSQL)

At that point, just use AWS ECS or Google Cloud Run. The complexity tax is worth it.

If you need sub-10ms latency, you’ll need to optimize the model itself—switch from GBM to a lightweight neural network, quantize weights, use ONNX runtime instead of scikit-learn. But for most B2B SaaS use cases, 18ms is plenty fast.
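
I haven't needed that yet, but for the record the scikit-learn-to-ONNX path looks roughly like this. A sketch using skl2onnx and onnxruntime, with the 8-feature input shape from the feature list above:

# Sketch: export the trained GBM to ONNX and run it with onnxruntime
import numpy as np
import onnxruntime as ort
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# `model` is the fitted GradientBoostingClassifier from train.py
onnx_model = convert_sklearn(
    model,
    initial_types=[("input", FloatTensorType([None, 8]))],
    options={id(model): {"zipmap": False}},  # return a plain probability array
)
with open("churn_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

sess = ort.InferenceSession("churn_model.onnx")
features = np.random.rand(1, 8).astype(np.float32)   # stand-in feature row
labels, probs = sess.run(None, {"input": features})
churn_prob = float(probs[0, 1])                       # positive-class probability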

If you need real-time retraining (model updates multiple times per day), MLflow’s model registry becomes a bottleneck. You’d want to look at feature stores (Feast, Tecton) and streaming pipelines (Kafka + Flink). Way beyond side project scope.

FAQ

Q: Why gradient boosting instead of deep learning?
GBM trains in 2 minutes on 50k samples, needs only light hyperparameter tuning, and is trivially interpretable (feature importances just work). A neural network would require GPU training, careful architecture search, and way more data. For tabular data under 100k rows, GBM wins every time.

Q: How do you handle model versioning in production?
MLflow’s model registry tracks every version. When I promote a new model to “Production” stage, the FastAPI app picks it up on next restart (usually Sunday night). I keep the last 3 production versions in the registry so I can roll back instantly if something breaks. Never had to roll back yet, but it’s comforting.

Q: What’s the hardest part of running this as a side project?
Customer support. Not the code, not the infrastructure—answering questions like “why did customer X get a different score today vs. yesterday?” The score for the same input changes whenever the model is retrained, and explaining that to non-technical clients is harder than writing the FastAPI code. I now include a model_version field in every response so I can debug discrepancies.

What I’d Build Next (If I Had More Time)

A feedback loop. Right now, clients get predictions but I never learn if they were accurate. If I added a /feedback endpoint where clients could report actual churn outcomes, I could continuously retrain with live data and improve precision over time.

The math is straightforward—online learning with incremental model updates:

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(y_{\text{true}}, f(X; \theta_t))$$

But the engineering is messier. You need a feedback database, a retraining pipeline, safeguards against poisoned data (what if a client reports fake outcomes to game the system?). It’s on the roadmap for Q2 2026.
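
The endpoint itself would be small. A sketch, added to app/main.py (the SQLite table and schema are hypothetical):

# Sketch: /feedback endpoint that records actual churn outcomes for later retraining
import sqlite3
from datetime import datetime, timezone
from pydantic import BaseModel

class FeedbackRequest(BaseModel):
    customer_id: str
    model_version: str       # the short run_id returned by /predict
    actually_churned: bool

@app.post("/feedback")
async def record_feedback(feedback: FeedbackRequest, api_key: str = Depends(verify_api_key)):
    conn = sqlite3.connect("feedback.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback
           (customer_id TEXT, model_version TEXT, actually_churned INTEGER, reported_at TEXT)"""
    )
    conn.execute(
        "INSERT INTO feedback VALUES (?, ?, ?, ?)",
        (feedback.customer_id, feedback.model_version,
         int(feedback.actually_churned), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    conn.close()
    return {"recorded": True}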

The other big gap: uncertainty quantification. Right now the model outputs a point estimate. Ideally it would output a confidence interval—”75% churn probability ± 8%”. Conformal prediction could do this without retraining the model, just requires calibration on a holdout set. I’m curious if clients would actually use that information or if it’d just add noise.
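
For what it's worth, the calibration side of that is small too. A back-of-the-envelope sketch of split conformal intervals around the predicted probability, calibrated on a holdout set (not something running in production):

# Sketch: split conformal interval around the predicted churn probability
import numpy as np

def conformal_halfwidth(y_calib, p_calib, alpha=0.1):
    """Half-width that covers the true outcome with roughly (1 - alpha) probability."""
    scores = np.abs(y_calib - p_calib)   # nonconformity = |outcome - predicted prob|
    n = len(scores)
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return float(q)

# calibrate once on a holdout split, then at serving time:
# lower, upper = max(0.0, p - halfwidth), min(1.0, p + halfwidth)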
