DVC vs MLflow for Data Science Project Versioning: Git-Based Pipelines and Model Registry Compared

⚡ Key Takeaways
  • DVC excels at Git-native data versioning and reproducible pipelines using content-addressable storage and DAG-based stage caching.
  • MLflow wins for experiment tracking and model lifecycle management with searchable run databases and alias-based model promotion.
  • The most effective setup for teams uses both: DVC for data/pipeline versioning and MLflow for experiment comparison and model registry.
  • DVC experiments (dvc exp) work for quick parameter sweeps but lack MLflow's mid-run logging and query flexibility across hundreds of runs.
  • Infrastructure cost differs sharply: DVC needs only remote storage and Git, while MLflow requires a tracking server with database and authentication.

The Setup That Actually Works

Most data science teams start versioning their models the same way: a folder called models_final_v2_REAL_final/ and a spreadsheet tracking which hyperparameters produced which accuracy. It works until it doesn’t, which is usually around the third team member or the second production deployment.

Here’s the dvc.yaml pipeline I ended up with after trying both DVC and MLflow on a tabular churn prediction project:

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/customers.csv
    params:
      - prepare.test_ratio
      - prepare.seed
    outs:
      - data/prepared/

  featurize:
    cmd: python src/featurize.py
    deps:
      - src/featurize.py
      - data/prepared/
    params:
      - featurize.n_quantiles
      - featurize.handle_unknown
    outs:
      - data/features/

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features/
    params:
      - train.n_estimators
      - train.max_depth
      - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics/scores.json:
          cache: false
    plots:
      - metrics/roc.json:
          x: fpr
          y: tpr

And the corresponding MLflow tracking code that does roughly the same job:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score
import pandas as pd

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("churn-prediction")

X_train = pd.read_parquet("data/features/X_train.parquet")
y_train = pd.read_parquet("data/features/y_train.parquet").values.ravel()
X_test = pd.read_parquet("data/features/X_test.parquet")
y_test = pd.read_parquet("data/features/y_test.parquet").values.ravel()

params = {
    "n_estimators": 200,
    "max_depth": 5,
    "learning_rate": 0.1,
    "subsample": 0.8,
}

with mlflow.start_run(run_name="gbm-baseline"):
    mlflow.log_params(params)

    clf = GradientBoostingClassifier(**params, random_state=42)
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]

    auc = roc_auc_score(y_test, y_prob)
    f1 = f1_score(y_test, y_pred)

    mlflow.log_metric("auc", auc)
    mlflow.log_metric("f1", f1)
    mlflow.log_metric("train_samples", len(X_train))

    # log the model with input example for schema inference
    mlflow.sklearn.log_model(
        clf,
        "model",
        input_example=X_train.iloc[:3],
        registered_model_name="churn-classifier",
    )

    print(f"AUC: {auc:.4f}, F1: {f1:.4f}")
    # AUC: 0.8734, F1: 0.6821

Both get the job done. But they solve fundamentally different problems, and picking the wrong one costs you weeks of migration later.


What DVC Actually Is (And Isn’t)

DVC (Data Version Control, dvc.org) sits on top of Git. That’s not a marketing phrase — it literally stores .dvc metafiles in your Git repo while pushing the actual large files (datasets, model weights) to remote storage like S3, GCS, or even a local NAS.

The mental model: Git tracks code and small config files. DVC tracks everything else, using the same branching and tagging semantics.

# initialize DVC in an existing git repo
git init && dvc init

# track a large dataset
dvc add data/raw/customers.csv
# creates data/raw/customers.csv.dvc (small pointer file)
# adds data/raw/customers.csv to .gitignore automatically

git add data/raw/customers.csv.dvc data/raw/.gitignore
git commit -m "track raw customer data"

# push data to remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push

What surprised me: the .dvc file is just a YAML file with an MD5 hash. The hash $h = \text{MD5}(\text{file\_contents})$ serves as a content-addressable key in your remote storage. When you dvc checkout, it reads the hash from the .dvc file and pulls the corresponding blob. Simple, deterministic, no magic.

# data/raw/customers.csv.dvc
outs:
- md5: a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4
  size: 48372615
  hash: md5
  path: customers.csv
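
If you want to convince yourself there is no hidden transformation here, you can recompute that hash by hand. A minimal sketch, assuming DVC 3.x, which hashes file contents directly (DVC 2.x normalized line endings for text files first):

import hashlib

def dvc_style_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Plain MD5 of the file contents, computed in chunks so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# should match the md5 field in data/raw/customers.csv.dvc
print(dvc_style_md5("data/raw/customers.csv"))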

But DVC’s pipeline feature (dvc.yaml + dvc repro) is where it gets interesting. Each stage declares its dependencies, parameters, and outputs. DVC builds a DAG and only reruns stages whose inputs have changed. The cache key for each stage is essentially:

$\text{cache\_key} = \text{hash}(\text{deps} \,\|\, \text{params} \,\|\, \text{cmd})$

Change a hyperparameter in params.yaml, run dvc repro, and only train reruns. Change the preprocessing code, and featurize + train both rerun. It’s Make for ML pipelines, and it works remarkably well for solo or small-team projects.

Where MLflow Fills a Different Gap

MLflow (mlflow.org) isn’t trying to be Git for data. It’s an experiment tracker and model registry that happens to store artifacts. The distinction matters.

With DVC, your experiment history is your Git history. Want to compare last Tuesday’s model with today’s? You checkout that Git commit, dvc checkout, and you’re back in time. Clean, but slow if you ran 200 experiments.
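
If you only need one file from an old revision, DVC's Python API can pull it without touching your working tree, though it is still one fetch per revision, so it doesn't change the 200-experiment math. A minimal sketch using dvc.api; the tag name is hypothetical:

import io
import pandas as pd
import dvc.api

# read a DVC-tracked file as it existed at a given Git revision
# ("v1.2-baseline" is a placeholder tag, not from the project above)
raw = dvc.api.read(
    "data/features/X_test.parquet",
    rev="v1.2-baseline",
    mode="rb",
)
X_test_old = pd.read_parquet(io.BytesIO(raw))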

MLflow stores every run in a database (SQLite by default, PostgreSQL in production). Comparing 200 runs is a query, not 200 git checkouts.

import mlflow

# find top 5 runs by AUC
runs = mlflow.search_runs(
    experiment_names=["churn-prediction"],
    filter_string="metrics.auc > 0.85",
    order_by=["metrics.auc DESC"],
    max_results=5,
)
print(runs[["run_id", "params.n_estimators", "params.max_depth", "metrics.auc", "metrics.f1"]])

Output:

                             run_id params.n_estimators params.max_depth  metrics.auc  metrics.f1
0  a3f2e1d0c9b8a7f6e5d4c3b2a1f0e9d8                 300                7       0.8912      0.7103
1  b4c3d2e1f0a9b8c7d6e5f4a3b2c1d0e9                 200                5       0.8734      0.6821
2  c5d4e3f2a1b0c9d8e7f6a5b4c3d2e1f0                 500                4       0.8698      0.6944
3  d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1                 150                6       0.8655      0.6712
4  e7f6a5b4c3d2e1f0a9b8c7d6e5f4a3b2                 250                5       0.8601      0.6589

That searchability is the killer feature. When you’re doing hyperparameter sweeps, nobody wants to grep through Git logs.

The Model Registry

MLflow’s model registry adds lifecycle management that DVC doesn’t attempt. You can transition models through stages:

from mlflow import MlflowClient

client = MlflowClient()

# register a model version from a run
result = client.create_model_version(
    name="churn-classifier",
    source="runs:/a3f2e1d0c9b8a7f6e5d4c3b2a1f0e9d8/model",
    run_id="a3f2e1d0c9b8a7f6e5d4c3b2a1f0e9d8",
)

# promote to production (MLflow 2.x uses aliases instead of stages)
client.set_registered_model_alias(
    name="churn-classifier",
    alias="production",
    version=result.version,
)

# load production model anywhere
import mlflow.pyfunc
model = mlflow.pyfunc.load_model("models:/churn-classifier@production")
predictions = model.predict(new_data)

Note: MLflow 2.x deprecated the old Stage concept (“Staging”, “Production”, “Archived”) in favor of aliases. The migration caught a lot of people off guard — if you’re reading older tutorials that use transition_model_version_stage, those calls still work but raise a FutureWarning on MLflow 2.9+.
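
For reference, the legacy call looks like this; a sketch reusing the client and version number from the registry snippet above:

# pre-2.9 style from older tutorials -- still functional, but MLflow nudges
# you toward set_registered_model_alias instead
client.transition_model_version_stage(
    name="churn-classifier",
    version=result.version,  # version created by create_model_version above
    stage="Production",
)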

The Comparison That Matters

Forget feature matrices with 30 rows. Here’s what actually differs in practice:

Data versioning: DVC wins outright. MLflow can log artifacts, but it treats them as opaque blobs attached to runs. DVC understands data lineage — which processing step produced which file, and whether anything changed since last run. If your dataset is 50GB of Parquet files and you change one column’s encoding, DVC’s cache means you only reprocess downstream stages. MLflow would just re-upload the whole thing.

Experiment comparison: MLflow wins outright. DVC added dvc exp commands (experiment tracking within branches), but the UX is clunky compared to MLflow’s UI. Running dvc exp show gives you a table, but filtering, sorting, and visualizing across hundreds of runs requires either dvc plots (which generates static HTML) or the paid DVC Studio.

# DVC experiment tracking
dvc exp run -S train.n_estimators=300 -S train.max_depth=7
dvc exp run -S train.n_estimators=500 -S train.max_depth=4
dvc exp show --md

This outputs a markdown table, which is fine for 5 experiments. For 50, you want MLflow’s dashboard.

Pipeline reproducibility: DVC. The DAG in dvc.yaml is declarative and deterministic. MLflow Projects (the MLproject file) exist, but they’re really just a way to package conda environments with entry points. They don’t track intermediate artifacts or build dependency graphs.

Team collaboration: Depends on your infrastructure. DVC needs shared remote storage + Git — your team already knows Git, so the learning curve is manageable. MLflow needs a tracking server, which means someone has to maintain a service. On the flip side, MLflow’s server gives you a web UI that non-engineers can browse.


Running Both Together

Here’s the thing nobody mentions in the “DVC vs MLflow” articles: they’re not mutually exclusive. I’d argue the best setup for teams larger than 2-3 people uses both.

DVC handles data versioning and pipeline orchestration. MLflow handles experiment tracking and model registry. The integration point is your training script:

# src/train.py — DVC runs this, MLflow tracks it
import mlflow
import yaml
import json
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score, precision_recall_curve, auc
import warnings
import os
import pickle

# DVC provides the data through pipeline deps
X_train = pd.read_parquet("data/features/X_train.parquet")
y_train = pd.read_parquet("data/features/y_train.parquet").values.ravel()
X_test = pd.read_parquet("data/features/X_test.parquet")
y_test = pd.read_parquet("data/features/y_test.parquet").values.ravel()

# DVC manages params.yaml
with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

# sanity check — caught a nasty bug where y_train was all zeros
# after a bad merge in the prepare stage
assert y_train.mean() > 0.01, f"Label rate suspiciously low: {y_train.mean():.4f}"

with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_param("train_samples", len(X_train))
    mlflow.log_param("feature_count", X_train.shape[1])

    # suppress the convergence warning that GBM throws
    # when n_estimators is too low for the data
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=UserWarning)
        clf = GradientBoostingClassifier(**params, random_state=42)
        clf.fit(X_train, y_train)

    y_prob = clf.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= 0.5).astype(int)

    roc_auc = roc_auc_score(y_test, y_prob)
    f1 = f1_score(y_test, y_pred)

    # PR-AUC matters more for imbalanced churn data
    precision, recall, _ = precision_recall_curve(y_test, y_prob)
    pr_auc = auc(recall, precision)

    mlflow.log_metrics({"roc_auc": roc_auc, "f1": f1, "pr_auc": pr_auc})
    mlflow.sklearn.log_model(clf, "model")

    # DVC tracks the trained model as a stage output (models/model.pkl in dvc.yaml)
    os.makedirs("models", exist_ok=True)
    with open("models/model.pkl", "wb") as f:
        pickle.dump(clf, f)

    # DVC tracks metrics via this file
    os.makedirs("metrics", exist_ok=True)
    scores = {"roc_auc": round(roc_auc, 4), "f1": round(f1, 4), "pr_auc": round(pr_auc, 4)}
    with open("metrics/scores.json", "w") as f:
        json.dump(scores, f, indent=2)

    print(f"ROC-AUC: {roc_auc:.4f} | F1: {f1:.4f} | PR-AUC: {pr_auc:.4f}")
    # ROC-AUC: 0.8734 | F1: 0.6821 | PR-AUC: 0.7456

The dual-write to both metrics/scores.json (for DVC) and MLflow feels redundant, and it is. But dvc metrics diff gives you quick command-line comparison between Git branches, while MLflow gives you the full experiment history with a web UI. Different audiences, different workflows.

The Data Gotcha Nobody Warns You About

When I first set up DVC with S3 remote storage, the .dvc files tracked fine in Git. But running dvc pull on a teammate’s machine produced a dataset with different row ordering. The underlying Parquet files were identical (same MD5), but the pandas DataFrame came out shuffled because we were reading from a partitioned dataset and the partition order wasn’t guaranteed.

The fix was embarrassingly simple — add an explicit sort in the prepare stage:

# src/prepare.py
df = pd.read_parquet("data/raw/customers/")
# partitioned parquet doesn't guarantee row order
df = df.sort_values("customer_id").reset_index(drop=True)
print(f"Loaded {len(df)} rows, {df.shape[1]} columns")
print(f"First 3 IDs: {df['customer_id'].head(3).tolist()}")
# Loaded 48291 rows, 23 columns
# First 3 IDs: [10001, 10002, 10003]

This kind of non-determinism doesn’t show up in unit tests. The model trains fine either way. But if you’re computing metrics to 4 decimal places and comparing across runs, a shuffled train/test split produces slightly different numbers every time. Enough to make you question whether your code change actually improved anything or if it’s just noise.

The broader point: DVC guarantees bit-identical data files. It does not guarantee identical DataFrame state after loading. That’s on you.
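
One way to take row order out of the equation entirely is to derive the train/test split from a stable key rather than from row position. A minimal sketch, assuming customer_id is stable across reloads; the function name and bucket scheme are illustrative, not part of the project above:

import zlib
import pandas as pd

def hash_split(df: pd.DataFrame, id_col: str = "customer_id", test_ratio: float = 0.2):
    """Assign each row to train or test from a hash of its id, independent of row order."""
    buckets = df[id_col].astype(str).map(lambda s: zlib.crc32(s.encode()) % 100)
    test_mask = buckets < int(test_ratio * 100)
    return df[~test_mask].reset_index(drop=True), df[test_mask].reset_index(drop=True)

train_df, test_df = hash_split(df)  # df loaded and sorted as in prepare.py

The same customer always lands in the same bucket, so rereading a reshuffled partitioned dataset produces a bit-identical split.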

Storage and Cost Math

Let’s talk numbers. For a moderately sized project with 10GB of versioned data and 50 experiment runs:

DVC remote storage (S3): each unique version of a file is stored once, deduplicated by content hash. Modify even 1% of a file and the entire new version is stored alongside the old one. Total storage after 20 versions of a 10GB dataset where each version changes ~5% of rows:

$S_{\text{DVC}} \approx 10\,\text{GB} + 19 \times 10\,\text{GB} = 200\,\text{GB}$

DVC doesn’t do delta compression at the storage level — each .dvc tracked file is stored in full. There’s an open issue about this. My best guess is they’re prioritizing simplicity over storage efficiency, which makes sense for most teams where S3 costs are negligible.

MLflow artifact storage for 50 runs, each logging a ~500MB model + 100MB of plots and data samples:

$S_{\text{MLflow}} \approx 50 \times 600\,\text{MB} = 30\,\text{GB}$

Plus the tracking database, which is tiny (a few hundred MB even with thousands of runs).

Where it gets expensive is when teams log large artifacts to MLflow per run without thinking. I’ve seen MLflow artifact stores balloon to 500GB+ because someone logged the entire training dataset as an artifact on every run “for reproducibility.” Don’t do that — that’s what DVC is for.
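
A cheaper habit is to log a pointer to the data rather than the data itself. A minimal sketch, assuming the dataset is already DVC-tracked, so the Git commit plus the dvc.lock hash pins the exact version:

import hashlib
import subprocess
import mlflow

def file_md5(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# current Git commit, which fixes both the code and the .dvc pointer files
git_commit = subprocess.run(
    ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
).stdout.strip()

with mlflow.start_run():
    # a few bytes of tags instead of gigabytes of copied artifacts
    mlflow.set_tag("git_commit", git_commit)
    mlflow.set_tag("dvc_lock_md5", file_md5("dvc.lock"))

Anyone who needs the exact training data checks out that commit and runs dvc pull.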

The dvc exp vs MLflow Tracking Decision

DVC added experiment tracking in version 2.0, and it’s gotten better since. The core idea: experiments are lightweight Git commits that don’t pollute your branch history. You run dvc exp run with modified parameters, and DVC creates a hidden ref under .git/refs/exps/.

# run 3 experiments with different hyperparameters
dvc exp run -S train.learning_rate=0.05
dvc exp run -S train.learning_rate=0.1
dvc exp run -S train.learning_rate=0.2

# compare them
dvc exp show --only-changed
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment     ┃ roc_auc       ┃ f1    ┃ pr_auc ┃ train.learning_rate┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ workspace      │ 0.8734        │ 0.6821│ 0.7456 │ 0.1                │
│ ├── exp-a1b2c  │ 0.8612        │ 0.6654│ 0.7289 │ 0.05               │
│ ├── exp-d3e4f  │ 0.8734        │ 0.6821│ 0.7456 │ 0.1                │
│ └── exp-g5h6i  │ 0.8503        │ 0.6401│ 0.7102 │ 0.2                │
└───────────────┴───────────────┴───────┴────────┴────────────────────┘

This is fine for quick sweeps. But why does MLflow still win for heavy experimentation? Two reasons.

First, MLflow logs arbitrary key-value pairs mid-run. You can log per-epoch loss curves, custom artifacts at any point, and system metrics. DVC experiments capture a snapshot at completion — you get final metrics, not the training trajectory.
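
Per-epoch logging is one call per step; a quick sketch (the loss values are made up for illustration):

import mlflow

with mlflow.start_run(run_name="trajectory-demo"):
    # illustrative numbers; in practice these come out of your training loop
    for epoch, loss in enumerate([0.69, 0.52, 0.41, 0.37, 0.35]):
        mlflow.log_metric("train_loss", loss, step=epoch)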

Second, querying. Finding “all runs where learning_rate was 0.1 and AUC exceeded 0.87” is a one-liner with mlflow.search_runs(). With DVC, you’d parse dvc exp show --json output and filter it yourself.
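
Here is that one-liner against the experiment from earlier; MLflow stores params as strings, so the parameter comparison is quoted:

import mlflow

runs = mlflow.search_runs(
    experiment_names=["churn-prediction"],
    filter_string="params.learning_rate = '0.1' and metrics.auc > 0.87",
    order_by=["metrics.auc DESC"],
)
print(len(runs), "matching runs")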

When to Use What

Solo data scientist, dataset under 5GB, fewer than 20 experiments per project: DVC alone. The Git-native workflow is clean, there’s no server to maintain, and dvc repro gives you reproducible pipelines with zero infrastructure.

Team of 3+, active experimentation with 50+ runs per week, need a model registry: MLflow tracking server + DVC for data versioning. Use DVC to version datasets and define pipelines. Use MLflow to track experiments and manage model deployment lifecycle.

ML platform team serving models in production: MLflow’s model registry with aliases, plus whatever CI/CD you prefer. DVC pipelines can trigger training, but the model artifacts flow through MLflow’s registry into your serving infrastructure.

And if you’re just starting a project and aren’t sure? Start with DVC. It’s pip install dvc and three commands to get going. You can add MLflow tracking later without changing your pipeline structure — just add mlflow.log_* calls to your training script.

The Infrastructure Tax

One thing I haven’t fully figured out: the ideal MLflow deployment for small teams. SQLite works for one person. PostgreSQL + S3 backend works for larger teams but requires ops work. MLflow’s built-in mlflow server command is meant for development, not production — it’s single-process, no auth, no TLS.

Databricks offers managed MLflow, which eliminates the ops burden but locks you into their ecosystem. There are also open-source deployment guides using Docker + Nginx + Let’s Encrypt, but maintaining that is real work on a small team.

DVC’s infrastructure story is simpler: you need a remote storage backend (S3 bucket, GCS bucket, or even an SSH server) and Git. That’s it. No server process to keep alive, no database migrations, no authentication layer to configure.

What I’m curious about going forward is whether DVC Studio (Iterative’s hosted platform) will close the experiment tracking gap enough to make MLflow unnecessary for mid-size teams. The latest versions added live experiment tracking and model registry features, but I haven’t tested them in a team setting. The evaluation metric that would convince me to switch is whether cross-run comparison — filtering by $\text{AUC} > \tau$ across hundreds of experiments with different feature sets — feels as fluid as MLflow’s search API. Until then, the DVC-for-data-plus-MLflow-for-experiments combination remains the setup I’d recommend for any team that’s outgrown Jupyter notebooks and shared drives.
