Model Drift Detection Failed Silently: Evidently AI Fix

⚡ Key Takeaways
  • Evidently AI can fail silently when production feature schemas don't match training data, returning empty reports instead of raising errors.
  • The fix requires explicit schema normalization before running Evidently, upfront validation of feature presence, and post-run checks to catch empty metric reports.
  • Wasserstein distance is more interpretable than Kolmogorov-Smirnov tests for numerical drift — it gives concrete units (e.g., a 5-day shift in account age) rather than just p-values.
  • Drift detection should be paired with performance monitoring; statistical drift without accuracy drop often doesn't warrant retraining, while performance degradation without detected drift suggests label shift or interaction effects.

The Silent Failure

Evidently AI’s drift detection returned an empty report instead of raising an error. The dashboard showed green checkmarks. The model was degrading in production.

This one hits different because the monitoring tool itself gave false confidence. When Report.run() completes without exceptions but the drift metrics are missing, you assume everything’s fine. It’s not. The feature schema mismatch between training data and production inference was silently ignored, and I only caught it when manual accuracy checks showed a 12% drop over three weeks.

Here’s what actually happens when Evidently’s column mapping breaks.


What Evidently Expects (and Doesn’t Tell You)

Evidently AI builds drift reports by comparing reference data (training set) against current data (production inference). The core assumption: both datasets share the same feature schema.

But production rarely matches training exactly. You add logging fields. You rename columns for clarity. You compute derived features differently. Evidently doesn’t validate this upfront — it just fails gracefully, which is the worst kind of failure for monitoring.

The library uses ColumnMapping to reconcile schema differences:

from evidently.pipeline.column_mapping import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

column_mapping = ColumnMapping(
    target='churn_label',
    prediction='model_output',
    numerical_features=['account_age_days', 'total_spent_usd'],
    categorical_features=['subscription_tier', 'region']
)

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=prod_df, column_mapping=column_mapping)

If prod_df has total_spent_dollars instead of total_spent_usd, Evidently skips that feature. No warning. No exception. The drift calculation runs on the subset of matching columns, and you get a partial report that looks complete.

The Schema Mismatch That Cost Me Three Weeks

My production pipeline logged predictions with versioned feature names: feature_v2_account_age, feature_v2_total_spent. Training data had account_age_days, total_spent_usd. The drift report ran daily via Airflow, completed successfully, and showed “No drift detected” because it was comparing exactly zero overlapping features.

The model was actually drifting on account_age_days — a demographic shift in new user signups. But Evidently couldn’t see it.

Here’s the broken monitoring code that passed CI:

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

def check_drift(train_path, prod_path):
    train_df = pd.read_parquet(train_path)
    prod_df = pd.read_parquet(prod_path)  # Schema: feature_v2_*

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=train_df, current_data=prod_df)

    # Bug: assumes as_dict() returns None when no features match (it doesn't)
    drift_results = report.as_dict()
    if drift_results is None:
        print("No drift detected")  # False confidence
        return False
    return True

The as_dict() method doesn’t fail when the report is empty. It just returns a structure with zero metrics. If you check drift_results['metrics'], it’s an empty list, but the code above never gets there.

The Fix: Explicit Column Mapping with Validation

The solution has two parts: explicit column renaming before Evidently, and post-run validation to catch silent failures.

First, standardize the schema. Don’t rely on Evidently’s ColumnMapping to handle renaming — it’s designed for metadata tagging, not ETL.

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset
from evidently.pipeline.column_mapping import ColumnMapping

def normalize_schema(df, version='v2'):
    """Rename production features to match training schema."""
    rename_map = {
        f'feature_{version}_account_age': 'account_age_days',
        f'feature_{version}_total_spent': 'total_spent_usd',
        f'feature_{version}_sessions': 'session_count',
        # Add all features here
    }
    return df.rename(columns=rename_map)

def check_drift_with_validation(train_path, prod_path):
    train_df = pd.read_parquet(train_path)
    prod_df = pd.read_parquet(prod_path)

    # Explicit schema normalization
    prod_df = normalize_schema(prod_df, version='v2')

    # Define expected features explicitly
    numerical_features = ['account_age_days', 'total_spent_usd', 'session_count']
    categorical_features = ['subscription_tier', 'region', 'device_type']

    # Validate schema match BEFORE running Evidently
    missing_in_prod = set(numerical_features + categorical_features) - set(prod_df.columns)
    if missing_in_prod:
        raise ValueError(f"Production data missing features: {missing_in_prod}")

    column_mapping = ColumnMapping(
        target='churn_label',
        prediction='model_output',
        numerical_features=numerical_features,
        categorical_features=categorical_features
    )

    report = Report(metrics=[DataDriftPreset(), DataQualityPreset()])
    report.run(
        reference_data=train_df[numerical_features + categorical_features + ['churn_label']],
        current_data=prod_df[numerical_features + categorical_features + ['model_output']],
        column_mapping=column_mapping
    )

    # Validate report actually contains metrics
    results = report.as_dict()
    if not results or 'metrics' not in results or len(results['metrics']) == 0:
        raise RuntimeError("Evidently report is empty — check column mapping")

    # Check for actual drift
    drift_detected = any(
        m.get('result', {}).get('drift_detected', False)
        for m in results['metrics']
        if 'drift_detected' in m.get('result', {})
    )

    return drift_detected, results

The key changes:

  1. Explicit normalization before Evidently sees the data
  2. Schema validation that fails fast if features are missing
  3. Post-run validation to catch empty reports
  4. Scoped dataframes — only pass the columns Evidently should analyze, not the entire production log

This catches the schema mismatch immediately instead of silently producing a useless report.

Why DataQualityPreset Matters

I added DataQualityPreset() to the report metrics. This wasn’t in the original code, and it’s worth the extra compute.

DataQualityPreset checks for:

  • Missing values (ratio, count)
  • Duplicate rows
  • Constant features (zero variance)
  • Feature value ranges

These catch a different class of silent failures. If your production pipeline starts logging NaN for a feature (maybe an upstream service timeout), drift detection won’t flag it — it’s not distribution shift, it’s data corruption. But data quality metrics will.

Here’s what a quality issue looks like in the report:

quality_metrics = [
    m for m in results['metrics']
    if m['metric'] == 'DatasetMissingValuesMetric'
]

if quality_metrics:
    missing_stats = quality_metrics[0]['result']
    print(f"Missing values: {missing_stats['current']['number_of_missing_values']}")
    # Output: Missing values: 4720  (out of 10000 rows)

This saved me once when a third-party data provider changed their API response format, and a feature extraction function started returning None instead of 0.0 for missing fields. Drift detection wouldn’t catch that — it’s not a shift, it’s a schema break.


Drift Metrics That Actually Matter

Evidently computes drift per feature using different statistical tests depending on the data type:

  • Numerical features: Kolmogorov-Smirnov test or Wasserstein distance
  • Categorical features: Chi-squared test or Jensen-Shannon divergence

The default threshold is p < 0.05 for hypothesis tests. You can override this:

from evidently.options import DataDriftOptions

options = DataDriftOptions(threshold=0.01)  # Stricter: only flag drift when p < 0.01
report = Report(metrics=[DataDriftPreset()], options=options)

But here’s the thing: statistical significance doesn’t mean business impact. A feature can drift significantly (p = 0.001) without affecting model accuracy if it’s not predictive. And a subtle drift (p = 0.08) on a high-weight feature can tank performance.

I track both:

  1. Statistical drift from Evidently (binary flag per feature)
  2. Model performance drift from separate accuracy monitoring (F1 score, precision/recall on labeled prod data)

If performance drops without statistical drift, the issue is usually:

  • Label shift (class distribution changed, not features)
  • Interaction effects (two features shift in correlated ways that cancel out marginal drift)
  • Model staleness (the relationship between features and target changed, not the features themselves)

If statistical drift occurs without performance drop, I usually ignore it unless it’s in a known high-importance feature.
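
As a rough sketch (not anything Evidently ships), here’s how those two signals might combine into one decision; the per-feature drift flags, F1 inputs, and thresholds are all illustrative:

def retrain_decision(drift_by_feature, important_features,
                     baseline_f1, current_f1, max_f1_drop=0.05):
    """Combine per-feature drift flags with separate performance monitoring.

    drift_by_feature: {feature_name: bool}, e.g. pulled from the drift report
    baseline_f1 / current_f1: from labeled production data
    max_f1_drop: illustrative tolerance, not a recommendation
    """
    drifted_important = {f for f, flag in drift_by_feature.items() if flag} & set(important_features)
    perf_degraded = (baseline_f1 - current_f1) > max_f1_drop

    if drifted_important and perf_degraded:
        return 'retrain'
    if perf_degraded:
        return 'investigate: label shift or interaction effects'
    if drifted_important:
        return 'watch: statistical drift without accuracy impact'
    return 'ok'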

The Wasserstein Distance Insight

For numerical features, Evidently defaults to Kolmogorov-Smirnov (K-S) test. This checks if two distributions differ, but it’s sensitive to sample size and not very interpretable.

I switched to Wasserstein distance (also called Earth Mover’s Distance):

W_1(P, Q) = \int_{-\infty}^{\infty} |F_P(x) - F_Q(x)| \, dx

where F_P and F_Q are the cumulative distribution functions of the reference and current data.

Intuitively, it measures how much “work” is needed to reshape one distribution into the other. A Wasserstein distance of 5.2 means the distributions are, on average, 5.2 units apart in feature space. If the feature is account_age_days, that’s about 5 days. That’s interpretable.

To use it in Evidently:

from scipy.stats import wasserstein_distance

def custom_drift_check(train_df, prod_df, feature):
    train_vals = train_df[feature].dropna().values
    prod_vals = prod_df[feature].dropna().values

    wd = wasserstein_distance(train_vals, prod_vals)

    # Define threshold as percentage of feature range
    feature_range = train_vals.max() - train_vals.min()
    threshold = 0.1 * feature_range  # 10% of range

    drift_detected = wd > threshold

    return {
        'feature': feature,
        'wasserstein_distance': wd,
        'threshold': threshold,
        'drift_detected': drift_detected
    }

# Example output:
# {'feature': 'account_age_days', 'wasserstein_distance': 12.3, 'threshold': 18.0, 'drift_detected': False}

This gives me a concrete number I can track over time. If Wasserstein distance on account_age_days goes from 3.2 to 12.3 over two weeks, I know the user demographic is shifting, even if it hasn’t crossed the drift threshold yet.
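
To make that trend visible, a small wrapper can snapshot the distances every day. This is just a sketch layered on custom_drift_check above; where the snapshots get stored is up to your existing logging:

from datetime import date

def drift_snapshot(train_df, prod_df, features):
    """Run the Wasserstein check for each feature and stamp it with today's date."""
    today = date.today().isoformat()
    return [
        {**custom_drift_check(train_df, prod_df, feat), 'date': today}
        for feat in features
    ]

# Append each day's snapshot to a CSV, a metrics table, or Prometheus, then plot
# wasserstein_distance per feature to watch drift building before it crosses the threshold.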

Handling Categorical Drift

Categorical features need different treatment. Evidently uses chi-squared test by default:

\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}

where O_i is the observed count in category i for the current data, and E_i is the expected count based on the reference data.

But this breaks when new categories appear in production. If training data has subscription_tier = ['free', 'pro', 'enterprise'] and production logs show a new 'trial' tier, the chi-squared test throws an error or ignores it.

The fix: explicitly handle unknown categories before running drift detection.

def align_categorical_values(train_df, prod_df, cat_features):
    """Ensure production categorical features only contain training-time values."""
    aligned_prod = prod_df.copy()

    for feat in cat_features:
        train_categories = set(train_df[feat].dropna().unique())
        prod_categories = set(prod_df[feat].dropna().unique())

        new_categories = prod_categories - train_categories
        if new_categories:
            print(f"Warning: New categories in {feat}: {new_categories}")
            # Map unknown categories to a special '<unknown>' value
            aligned_prod[feat] = aligned_prod[feat].apply(
                lambda x: x if x in train_categories else '<unknown>'
            )

    return aligned_prod

This logs the new categories (which might indicate a pipeline change or new product tier) but allows drift detection to continue. The <unknown> category will show up in drift metrics if it’s becoming a significant fraction of production data.
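
One way to wire this in is right after normalize_schema in check_drift_with_validation, before the report runs. A short usage sketch with the feature names from the earlier examples:

# Inside check_drift_with_validation, after normalize_schema:
categorical_features = ['subscription_tier', 'region', 'device_type']
prod_df = align_categorical_values(train_df, prod_df, categorical_features)
# Production-only values are now collapsed into '<unknown>' before drift detection runs.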

Airflow Integration (The Part That Actually Runs)

Here’s how this runs in production:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
import json

def drift_check_task():
    drift_detected, results = check_drift_with_validation(
        train_path='s3://ml-artifacts/train_2024_01.parquet',
        prod_path='s3://ml-logs/prod_inference_last_7d.parquet'
    )

    if drift_detected:
        # Slack alert, PagerDuty, whatever
        alert_ops_team(results)

    # Log to Prometheus for time-series tracking
    log_drift_metrics(results)

with DAG(
    dag_id='model_drift_monitoring',
    schedule_interval='0 2 * * *',  # Daily at 2am
    start_date=days_ago(1),
    catchup=False,
    default_args={'retries': 2, 'retry_delay': timedelta(minutes=5)}
) as dag:

    drift_check = PythonOperator(
        task_id='run_drift_check',
        python_callable=drift_check_task
    )

The key detail: schedule_interval='0 2 * * *' runs this daily. Why 2am? Because production inference logs are batched hourly, and the last batch of the day finishes around 1:30am. Running at 2am ensures a full 24 hours of data.

I’ve seen setups that run drift checks every hour. That’s overkill unless you’re in high-frequency trading or fraud detection. Distribution shift happens over days/weeks, not hours.

When Drift Detection Lies

Drift detection can fail in both directions:

  1. False positives: Drift detected, but model performance is fine
  2. False negatives: No drift detected, but model is degrading

False positives happen when:

  • Seasonal patterns (holidays, weekends) shift feature distributions temporarily
  • A/B tests introduce synthetic distribution changes (new user cohorts)
  • Sample size differences between training and production windows

False negatives happen when:

  • Drift is localized to a small subpopulation (say, 5% of users)
  • Features drift in correlated ways that preserve marginal distributions
  • The drift is in feature interactions, not individual features

I don’t have a perfect solution for this. My current approach: track drift AND performance. If drift is detected, I check model accuracy on a labeled holdout set from the production window. If accuracy is stable, I investigate but don’t retrain immediately. If accuracy drops without drift, I look for label shift or interaction effects.
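
For the subpopulation case specifically, one partial mitigation is to run the same Wasserstein check per segment instead of globally. A rough sketch, reusing column names from the earlier examples (the minimum row count is arbitrary):

from scipy.stats import wasserstein_distance

def segment_drift(train_df, prod_df, feature, segment_col, min_rows=50):
    """Per-segment Wasserstein distance, to catch drift hidden inside one subpopulation."""
    distances = {}
    for seg in prod_df[segment_col].dropna().unique():
        ref = train_df.loc[train_df[segment_col] == seg, feature].dropna()
        cur = prod_df.loc[prod_df[segment_col] == seg, feature].dropna()
        if len(ref) >= min_rows and len(cur) >= min_rows:  # skip segments too small to compare
            distances[seg] = wasserstein_distance(ref, cur)
    return distances

# segment_drift(train_df, prod_df, 'account_age_days', 'region')
# A large distance in one region alongside a small global distance points to localized drift.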

The Cost of Monitoring

Evidently’s drift calculation is O(n log n) for numerical features (sorting for the K-S test or Wasserstein) and O(n) for categorical features (counting). For 10,000 production samples and 50 features, this takes about 2 seconds on a single core.

But the real cost is data storage. I’m logging 7 days of production inference (about 70,000 rows) to S3 for drift checks. That’s 200MB compressed per week. Over a year, that’s 10GB per model. With 12 models in production, that’s 120GB/year, which costs about $3/month on S3 Standard.

Not huge, but it adds up. I switched to S3 Intelligent-Tiering, which moves old data to cheaper tiers automatically. Now it’s about $1.20/month.

The Airflow DAG runs for ~30 seconds daily (read from S3, run Evidently, write metrics to Prometheus). That’s negligible compute cost.

FAQ

Q: Should I run drift detection on all features or just the important ones?

Start with the top 10-15 features by model importance (from SHAP or feature importance scores). Monitoring 50+ features generates alert fatigue — you’ll get drift warnings constantly, and most won’t matter. I track the top 10 features for drift, plus data quality metrics on all features (to catch pipeline breaks).
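
A quick way to build that top-N list is straight from the trained model’s feature importances (or mean absolute SHAP values). A sketch that assumes a scikit-learn-style model object:

import numpy as np

def top_n_features(model, feature_names, n=10):
    """Pick the n most important features to monitor for drift."""
    importances = np.asarray(model.feature_importances_)  # assumes a tree-based sklearn model
    order = np.argsort(importances)[::-1][:n]
    return [feature_names[i] for i in order]

# Feed the result into the numerical_features / categorical_features lists in
# check_drift_with_validation instead of hardcoding them.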

Q: What’s a reasonable drift threshold?

Depends entirely on your model and risk tolerance. For a churn prediction model, I use p < 0.01 (strict) on the top 3 features and p < 0.05 (default) on the rest. For a recommendation system, I use Wasserstein distance > 10% of feature range, because small shifts don’t hurt relevance much. If you’re not sure, start with defaults and tune based on false positive rate over a month.
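
If you go the tuning route, the bookkeeping is simple: record each day’s drift flag next to whether accuracy actually moved, then check how often the flag fired for nothing. A toy sketch (the input format is made up):

def false_positive_rate(history):
    """history: list of (drift_flagged, accuracy_dropped) booleans, one pair per day."""
    flagged = [(drift, dropped) for drift, dropped in history if drift]
    if not flagged:
        return 0.0
    return sum(1 for _, dropped in flagged if not dropped) / len(flagged)

# Example: false_positive_rate([(True, False), (True, True), (False, False)]) -> 0.5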

Q: How often should I retrain when drift is detected?

Drift detection is an early warning, not a retrain trigger. I retrain when BOTH (a) drift is detected on multiple important features, AND (b) model performance drops below threshold on labeled production data. If only drift is detected, I wait a week to see if it’s temporary. If only performance drops, I investigate for data quality issues or label shift before retraining on drifted features.

Use Evidently, But Validate Everything

Evidently AI is solid for drift detection once you fix the schema validation gap. But don’t trust it blindly — silent failures are worse than loud crashes.

The pattern that works: explicit schema normalization before Evidently, post-run validation to catch empty reports, and performance monitoring alongside drift detection. Track Wasserstein distance for interpretability, not just binary drift flags.

I’d still pick Evidently over building custom drift detection from scratch. The statistical tests are solid, the HTML reports are actually readable by non-ML stakeholders, and the integration with MLflow/Airflow is straightforward once you get past the schema quirks.

What I haven’t solved yet: detecting drift in feature interactions (e.g., age * income shifts even though age and income individually don’t). My current approach just monitors the top interactions manually, but there’s probably a better way. If you’ve cracked this, I’d love to hear it.
