W&B Artifacts and Team Collaboration: Model Version Control

⚡ Key Takeaways
  • W&B Artifacts provide Git-like version control for models, datasets, and preprocessors with automatic lineage tracking.
  • Reports auto-update with live experiment data, eliminating the need for manually maintained documentation.
  • Team features like run grouping, job types, and artifact aliases prevent coordination chaos in multi-user projects.
  • Service accounts enable secure CI/CD logging, and the Table API supports custom dashboards via programmatic run queries.
  • Artifact metadata (user, timestamp, Git commit) creates an audit trail even without Enterprise features.

Artifacts Saved Me From a 3AM Model Rollback

Most teams version their code with Git but track models in a Slack thread titled “final_model_v2_actually_final.pth”. W&B Artifacts fixes this — not by reinventing version control, but by making it seamless for datasets, models, and preprocessors. After watching a production model mysteriously degrade (spoiler: someone swapped the tokenizer without logging it), I started treating artifacts like commits. The difference? You can artifact.download() the exact model from three weeks ago in one line, no S3 bucket archaeology required.

This is Part 3 of the W&B series. Part 1 covered experiment tracking basics — logging metrics, hyperparameters, and system stats. Part 2 dove into sweeps for hyperparameter optimization. Now we’re looking at the production-grade features: versioned artifacts, shareable reports, and team workflows that don’t require a PhD in MLOps.


Artifacts: Git for Everything That Isn’t Code

Artifacts let you version datasets, models, and any other file with the same rigor as Git, but with metadata baked in. The API is simple:

import wandb

run = wandb.init(project="text-classifier", job_type="train")

# Log a trained model as an artifact
model_artifact = wandb.Artifact(
    name="bert-classifier",
    type="model",
    description="BERT finetuned on customer support tickets",
    metadata={"epochs": 10, "f1_score": 0.89, "framework": "transformers==4.36.0"}
)
model_artifact.add_file("model.pth")
run.log_artifact(model_artifact)

run.finish()

Every time you log an artifact with the same name, W&B creates a new version (v0, v1, v2…). The metadata dictionary is key — I stuff it with everything I’ll forget in two weeks: preprocessing steps, library versions, validation metrics, even the Git commit SHA of the training script.
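
I've stopped typing those fields by hand. Here's a small sketch of how I build the metadata dict automatically; the build_metadata helper is mine, not part of the W&B API, and it assumes the script runs inside a Git checkout:

import subprocess

import torch
import transformers
import wandb

def build_metadata(**metrics) -> dict:
    """Collect everything future-me will want: commit SHA, library versions, final metrics."""
    return {
        "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
        "framework": f"transformers=={transformers.__version__}",
        "torch": torch.__version__,
        **metrics,
    }

model_artifact = wandb.Artifact(
    name="bert-classifier",
    type="model",
    metadata=build_metadata(epochs=10, f1_score=0.89),
)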

Downloading Artifacts in Production

The real power is retrieval. Instead of hardcoding S3 paths or hunting through experiment logs, you fetch artifacts by name and version:

run = wandb.init(project="text-classifier", job_type="inference")

# Download the latest version of the model
artifact = run.use_artifact("bert-classifier:latest", type="model")
artifact_dir = artifact.download()

# Or pin a specific version for reproducibility
artifact_v3 = run.use_artifact("bert-classifier:v3", type="model")
model_path = artifact_v3.download()

print(f"Model metadata: {artifact_v3.metadata}")
# Output: {'epochs': 10, 'f1_score': 0.89, 'framework': 'transformers==4.36.0'}

The :latest alias always points to the most recent version. For production, I pin a specific version (v3, v5) after validation — this prevents “it worked yesterday” bugs when someone uploads a new model.

Dataset Versioning That Survives Preprocessing Changes

Datasets are messier than models. You add samples, fix labels, rebalance classes, and suddenly your validation split isn’t comparable to last month’s. Artifacts track all of it:

# Log a raw dataset with lineage tracking
run = wandb.init(project="text-classifier", job_type="upload")

raw_data_artifact = wandb.Artifact(
    name="support-tickets",
    type="dataset",
    description="Customer support tickets (raw)",
    metadata={"samples": 50000, "date_range": "2024-01-01 to 2024-12-31"}
)
raw_data_artifact.add_dir("data/raw/")
run.log_artifact(raw_data_artifact)
run.finish()

# Later: log the preprocessed version
run_preprocess = wandb.init(project="text-classifier", job_type="preprocess")
raw_artifact = run_preprocess.use_artifact("support-tickets:v0")  # declares the input so lineage is recorded; call .download() if you need the files locally

processed_artifact = wandb.Artifact(
    name="support-tickets-clean",
    type="dataset",
    metadata={"samples": 48500, "removed_duplicates": 1500}
)
processed_artifact.add_dir("data/processed/")
run_preprocess.log_artifact(processed_artifact)

W&B automatically tracks the lineage: support-tickets-clean:v0 shows it was derived from support-tickets:v0. This is crucial when debugging — if your model tanks, you can trace backward through the artifact graph to see if someone quietly changed the tokenization.
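
The same trace works from Python, which is handy when the UI graph gets crowded. A rough sketch using the public API; the artifact path and version number are from the examples in this post and purely illustrative:

import wandb

api = wandb.Api()

# Start from the suspect model version and walk one hop upstream
model = api.artifact("username/text-classifier/bert-classifier:v12", type="model")

producer = model.logged_by()                # the run that created this version
print(producer.name, producer.config.get("learning_rate"))

for upstream in producer.used_artifacts():  # datasets, tokenizers, etc. that run consumed
    print(upstream.name, upstream.version, upstream.metadata)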

Artifact Aliases for Staging Pipelines

Aliases let you tag versions semantically instead of remembering “v12 was the good one”:

# Mark a model as production-ready
artifact = run.use_artifact("bert-classifier:v5")
artifact.aliases.append("production")
artifact.save()

# Now inference scripts can use the alias
prod_model = run.use_artifact("bert-classifier:production")

We use three aliases: dev, staging, production. Promoting a model is just moving the alias — no config file edits, no S3 bucket renaming. If staging breaks, you revert the alias, not the code.

Reports: The Documentation You’ll Actually Maintain

Reports are W&B’s answer to “how do I share this experiment with the team without writing a 10-page doc?” They’re interactive dashboards you build by dragging in runs, charts, and Markdown. The killer feature: they auto-update as new runs complete.

Building a Report Programmatically

You can create reports in the W&B UI (drag-and-drop panels), but I prefer the Python Report API for reproducibility. It lives in wandb.apis.reports and is still marked beta, so the exact import may shift between releases:

import wandb
import wandb.apis.reports as wr  # Report API (beta)

api = wandb.Api()
# Optional: sanity-check which runs have finished before publishing
runs = api.runs("username/text-classifier", filters={"state": "finished"})

# Create a report
report = wr.Report(
    project="text-classifier",
    title="BERT Finetuning Results: Week 6",
    description="Comparison of learning rates and batch sizes"
)

# Add a panel showing loss curves for all runs
report.blocks = [
    wr.PanelGrid(
        panels=[
            wr.LinePlot(
                x="_step",
                y=["train/loss", "val/loss"],
                title="Training Loss by Learning Rate",
                groupby="config.learning_rate"
            )
        ]
    )
]

report.save()
print(f"Report URL: {report.url}")

The groupby parameter is where it gets interesting — you can split plots by any hyperparameter or tag. I often group by config.architecture to compare ResNet50 vs EfficientNet runs side-by-side.

Embedding Custom Visualizations

Reports support Markdown blocks, so you can mix prose with Plotly charts or even wandb.Html() for custom JavaScript:

import plotly.graph_objects as go

# Generate a custom confusion matrix plot
fig = go.Figure(data=go.Heatmap(
    z=[[50, 10], [5, 35]],
    x=["Spam", "Not Spam"],
    y=["Spam", "Not Spam"],
    colorscale="Blues"
))

# Log it to a run, then embed in a report
run.log({"confusion_matrix": wandb.Plotly(fig)})

# In the report UI, add a "Run" panel and select the confusion_matrix metric

The W&B UI doesn’t natively render all plot types (e.g., 3D scatter plots), but Plotly integration covers 90% of use cases. For the remaining 10%, I render to PNG and log as wandb.Image().
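
The PNG fallback is tiny, continuing from the snippet above (Plotly's static export needs the kaleido package installed):

# Export to PNG, then log it as a static image instead of an interactive panel
fig.write_image("confusion_matrix.png")  # requires kaleido
run.log({"confusion_matrix_png": wandb.Image("confusion_matrix.png")})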

Why Reports Beat Google Docs

Every team has a “Model Training Log” doc that’s three months out of date. Reports fix this by pulling live data from W&B runs. If you rerun an experiment, the report updates automatically — no copy-pasting screenshots. The shareable URL also handles permissions (viewers see runs they have access to), so you’re not maintaining two versions for internal/external stakeholders.

Team Collaboration: Multi-User Projects Without the Chaos

Once you have multiple people logging to the same project, coordination becomes a problem. W&B handles this with organizations, teams, and run grouping.

Organizations and Teams

Organizations are the top-level entity (usually your company). Within an org, you create teams with shared projects. Team creation and member invites happen in the W&B UI (your organization's settings page); once a team exists, you log runs to it by passing the team name as the entity:

run = wandb.init(entity="ml-research", project="text-classifier", job_type="train")

Team members see all runs in shared projects, but each run still records the individual who started it. This matters for billing and access control — if someone leaves, their runs don't vanish.

Run Grouping and Tags

With a dozen people logging runs, the project page becomes a firehose. Groups and tags provide structure:

run = wandb.init(
    project="text-classifier",
    group="ablation-study",  # Group related runs
    tags=["bert", "lr-sweep", "bug-fix"],  # Searchable tags
    notes="Testing if dropout=0.3 fixes overfitting"
)

The group parameter collects related runs into a single collapsible entry on the project page — I use it for experiments that span multiple runs (e.g., a hyperparameter sweep). Tags are freeform and searchable in the UI. The notes field is a lifesaver when you return to a run six months later and can’t remember why you set warmup_steps=500.

Preventing Run Collisions with Job Types

If two people train on the same dataset simultaneously, their runs will interleave in the UI. The job_type parameter separates workflows:

# Person A: training
run = wandb.init(project="text-classifier", job_type="train")

# Person B: evaluating on a test set
run = wandb.init(project="text-classifier", job_type="eval")

The UI filters runs by job type, so training runs don’t clutter the eval dashboard. We use five job types: preprocess, train, eval, deploy, debug. The last one is for “I’m testing logging and will delete this in 10 minutes” runs.

Shared Artifacts Across Teams

Artifacts are scoped to projects, but you can reference artifacts from other projects:

# Team A logs a pretrained model
run_a = wandb.init(entity="team-a", project="pretraining", job_type="train")
artifact = wandb.Artifact("bert-base", type="model")
artifact.add_file("bert.pth")
run_a.log_artifact(artifact)

# Team B downloads it for finetuning
run_b = wandb.init(entity="team-b", project="finetuning", job_type="train")
base_model = run_b.use_artifact("team-a/pretraining/bert-base:latest")
model_path = base_model.download()

The full path format is entity/project/artifact-name:version. This lets you build pipelines where one team’s output is another team’s input, without duplicating 2GB model files.

Service Accounts for CI/CD Pipelines

Human users shouldn’t share API keys, but CI jobs need to log runs. W&B supports service accounts — bot users with limited permissions:

# Create a service account (W&B UI → Settings → Service Accounts)
# Generate an API key for the bot

# In your CI environment (e.g., GitHub Actions)
export WANDB_API_KEY="<service-account-key>"
python train.py  # Runs will appear under the service account's name

Service accounts can’t access the UI (they’re API-only), which prevents accidental manual edits. We use one service account per deployment environment (ci-dev, ci-staging, ci-prod) to track which pipeline produced each artifact.
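
To make the "which pipeline produced this?" question answerable later, I also stamp the environment onto the run itself. A small sketch; CI_ENVIRONMENT is an environment variable I set in the pipeline config, not something W&B provides:

import os
import wandb

# Set CI_ENVIRONMENT to ci-dev / ci-staging / ci-prod in the pipeline definition
env_name = os.environ.get("CI_ENVIRONMENT", "local")

run = wandb.init(
    project="text-classifier",
    job_type="train",
    tags=[env_name],
    config={"ci_environment": env_name},
)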

Table API for Custom Dashboards

The W&B UI is great for exploratory analysis, but sometimes you need a custom dashboard. The Table API lets you query runs and build your own:

import wandb
import pandas as pd

api = wandb.Api()
runs = api.runs("username/text-classifier")

# Extract hyperparameters and metrics into a DataFrame
data = []
for run in runs:
    if run.state == "finished":
        data.append({
            "name": run.name,
            "learning_rate": run.config.get("learning_rate"),
            "batch_size": run.config.get("batch_size"),
            "val_f1": run.summary.get("val/f1"),
            "runtime_minutes": (run.summary.get("_runtime") or 0) / 60
        })

df = pd.DataFrame(data)
print(df.sort_values("val_f1", ascending=False).head(10))

The run.summary dict contains the final logged value for each metric. For time-series data (e.g., loss per epoch), use run.history():

history = run.history(keys=["train/loss", "val/loss"])
print(history.head())  # Returns a Pandas DataFrame

I use this for weekly reports — pull the top 10 runs by validation accuracy, generate a Markdown table, and Slack it to the team. Beats manually screenshotting the W&B UI.
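
The Markdown step is a couple of lines on top of the DataFrame above (pandas' to_markdown needs the tabulate package installed):

top10 = df.sort_values("val_f1", ascending=False).head(10)

# Render a Markdown table ready to paste into Slack (or post via a webhook)
summary = top10[["name", "learning_rate", "batch_size", "val_f1"]].to_markdown(index=False)
print(summary)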

Audit Logs and Compliance

For regulated industries (healthcare, finance), you need an audit trail of who logged what. W&B Enterprise includes audit logs, but the free tier has a workaround: log metadata in every artifact:

import getpass
import socket
import subprocess
from datetime import datetime

artifact = wandb.Artifact(
    name="patient-classifier",
    type="model",
    metadata={
        "user": getpass.getuser(),
        "hostname": socket.gethostname(),
        "timestamp": datetime.utcnow().isoformat(),
        "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
        "val_auc": 0.94
    }
)

This doesn’t prevent deletion (Enterprise has immutable artifacts for that), but it creates a paper trail. If someone questions a model’s provenance, you can trace it back to the exact user, machine, and Git commit.

What I’d Change About W&B

Artifact storage costs add up fast. A single model checkpoint is 500MB; log 100 versions across a team, and you’re at 50GB. W&B charges $0.10/GB/month after the free tier (100GB), which is reasonable but sneaks up on you. I’ve started using artifact TTLs (time-to-live) to auto-delete old versions:

from datetime import timedelta

artifact = wandb.Artifact("temp-model", type="model")
artifact.ttl = timedelta(days=7)  # this version is deleted 7 days after logging

The API is also inconsistent in places — run.config is a dict-like object but doesn’t support .get() with a default in some versions (raises KeyError instead). I wrap all config access in try-except or use run.config.as_dict().get("key", default).
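
The workaround is small enough to keep in a shared utils module. A sketch of such a helper (the name config_get is mine):

def config_get(run, key, default=None):
    """Read a config value defensively, whether run.config is a plain dict or a wandb Config object."""
    cfg = run.config
    if hasattr(cfg, "as_dict"):  # live wandb runs wrap config in a Config object
        cfg = cfg.as_dict()
    return cfg.get(key, default)

# lr = config_get(run, "learning_rate", 3e-5)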

Reports are powerful but slow to render with 500+ runs. The UI times out after ~30 seconds, so I filter runs by date range or tag before building reports.

When to Use Artifacts vs MLflow Model Registry

If you’re already using MLflow, its model registry overlaps with W&B Artifacts. The difference:

  • MLflow: Designed for model deployment (staging/production transitions, webhooks for CI/CD).
  • W&B Artifacts: Designed for versioning anything (datasets, preprocessors, models) with lineage tracking.

I use both — MLflow for the final deployment step (it integrates better with SageMaker/Azure ML), W&B for everything upstream. The artifact lineage graph is unmatched for debugging “why did accuracy drop in v12?” questions.

For small teams without complex deployment pipelines, W&B alone is sufficient. You’d alias models as production and have your serving script fetch model:production on startup.

The Workflow That Stuck

After a year of iteration, here’s what works for my team:

  1. Preprocess: Log raw dataset as artifact → run preprocessing → log clean dataset with lineage.
  2. Train: Log model checkpoints as artifacts every 5 epochs (with metadata: val loss, learning rate, Git SHA); see the sketch after this list.
  3. Eval: Download model:staging artifact → evaluate on test set → log metrics to a new run with job_type="eval".
  4. Deploy: Move staging alias to production if test metrics pass threshold.
  5. Report: Auto-generate weekly report showing top 5 runs by val accuracy, grouped by architecture.
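
Step 2 is just a few lines inside the training loop. A minimal sketch with a stand-in nn.Linear model and a dummy validation loss so it actually runs; swap in your real model, data, and metrics:

import torch
import torch.nn as nn
import wandb

run = wandb.init(project="text-classifier", job_type="train")

model = nn.Linear(10, 2)        # placeholder for the real classifier
for epoch in range(1, 11):
    val_loss = 1.0 / epoch      # placeholder for a real validation pass

    if epoch % 5 == 0:
        torch.save(model.state_dict(), "checkpoint.pth")
        ckpt = wandb.Artifact(
            name="bert-classifier",
            type="model",
            metadata={"epoch": epoch, "val_loss": val_loss},
        )
        ckpt.add_file("checkpoint.pth")
        run.log_artifact(ckpt)  # each call becomes the next version: v0, v1, ...

run.finish()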

The key insight: treat artifacts like Git commits. You wouldn’t commit code without a message — don’t log artifacts without metadata. Future you will thank present you.

I’m still figuring out multi-modal artifact tracking (e.g., image dataset + text captions + trained model as a single logical unit). W&B supports this via artifact references, but the UX is clunky. If you’ve solved this elegantly, I’d love to hear how — the lineage graph gets messy fast when three artifacts feed into one training run.

Weights & Biases Tutorial Series (3/3)

Did you find this helpful?

☕ Buy me a coffee
