ML Experiment Tracking with W&B: Lessons from 50 Failed Runs

Updated Feb 13, 2026
⚡ Key Takeaways
  • W&B integrates with PyTorch in 3 lines: wandb.init(), wandb.log(), and wandb.finish() — setup takes under 3 minutes.
  • Log training loss every N batches, validation metrics per epoch, and gradient norms to catch silent failures like frozen layers.
  • W&B's parallel coordinates plot reveals which hyperparameter combinations work, but only if you also log epoch count to spot late convergence.
  • Always store random seeds, library versions, and hardware info in wandb.config to ensure reproducibility months later.
  • Use W&B for fast iteration and cloud dashboards; switch to MLflow for self-hosting or production model serving.

The Problem Nobody Talks About

You’re training a model, tweaking hyperparameters, and two hours later you can’t remember which learning rate gave you that 0.87 validation accuracy. You’ve got a folder full of model_v3_final_ACTUAL.pt files and a spreadsheet you stopped updating around run 12.

Weights & Biases (W&B) fixes this by automatically logging metrics, hyperparameters, and system stats to a centralized dashboard. The setup takes about 3 minutes. The hard part is figuring out what to log and when to trust the numbers.

Here’s what happens when you run 50 experiments without tracking versus with W&B — and the specific logging patterns that separate useful dashboards from cluttered noise.

Installing W&B and the 3-Line Integration

First, install the library and authenticate:

pip install wandb
wandb login

The wandb login command opens your browser to grab an API key. Paste it into the terminal. If you’re on a remote server, use wandb login --relogin and copy the key manually.
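
On a headless box you can also skip the browser flow entirely: W&B reads the WANDB_API_KEY environment variable, so wandb.login() won't open a browser or prompt for anything. A minimal sketch (the key string is a placeholder):

import os
import wandb

# Non-interactive auth for remote servers and CI: W&B reads WANDB_API_KEY
# from the environment, so wandb.login() won't open a browser or prompt.
os.environ["WANDB_API_KEY"] = "your-api-key-here"  # placeholder, not a real key
wandb.login()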

Here’s a minimal PyTorch training loop with W&B:

import torch
import torch.nn as nn
import wandb
from torch.utils.data import DataLoader, TensorDataset

# Initialize W&B run
wandb.init(
    project="image-classifier",
    config={
        "learning_rate": 0.001,
        "epochs": 10,
        "batch_size": 32,
        "architecture": "resnet18"
    }
)

config = wandb.config

# Dummy dataset for demo
X = torch.randn(1000, 3, 64, 64)
y = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=config.batch_size, shuffle=True)

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=False)
model.fc = nn.Linear(model.fc.in_features, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
criterion = nn.CrossEntropyLoss()

for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Log every 10 batches to avoid overwhelming the dashboard
        if batch_idx % 10 == 0:
            wandb.log({"loss": loss.item(), "epoch": epoch})

wandb.finish()

That’s it. Three lines: wandb.init(), wandb.log(), and wandb.finish(). The dashboard appears at wandb.ai/<username>/<project> and updates in real-time.

What to Log (and What Not to Log)

The mistake I see most often: logging everything at every step. Your dashboard becomes a wall of 47 metrics and you can’t tell which ones matter.

Here’s what actually helps:

Always log:
– Training loss (every N batches, not every single step)
– Validation loss and metrics (after each epoch)
– Learning rate (if using a scheduler; see the sketch after these lists)
– Epoch number or global step

Log when debugging:
– Gradient norms (catches exploding gradients: the L2 norm ||∇θ||_2 of all parameter gradients)
– Weight norms (catches dead neurons)
– Layer-wise statistics (mean, std of activations)

Don’t log:
– Raw predictions (use wandb.Table for samples, not continuous logging)
– Per-sample losses (aggregate them first)
– Validation metrics at training-step frequency (it's redundant noise; log them once per epoch)
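
One item from the first list is worth a concrete example: the learning rate when a scheduler is in play. A minimal sketch, assuming a StepLR schedule with made-up step_size and gamma; get_last_lr() returns one value per parameter group:

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(config.epochs):
    # ... training loop for this epoch ...
    scheduler.step()
    # get_last_lr() returns a list with one entry per parameter group
    wandb.log({"epoch": epoch, "learning_rate": scheduler.get_last_lr()[0]})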

Here’s a more realistic training loop with gradient monitoring:

import torch.nn.utils as nn_utils

for epoch in range(config.epochs):
    model.train()
    train_loss = 0.0

    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()

        # Gradient clipping to prevent exploding gradients
        grad_norm = nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        train_loss += loss.item()

        if batch_idx % 50 == 0:
            wandb.log({
                "batch_loss": loss.item(),
                "grad_norm": grad_norm.item(),
                "step": epoch * len(train_loader) + batch_idx
            })

    # Validation after each epoch
    model.eval()
    val_loss = 0.0
    correct = 0

    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            val_loss += criterion(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    val_loss /= len(val_loader)
    val_acc = correct / len(val_loader.dataset)

    wandb.log({
        "epoch": epoch,
        "train_loss_epoch": train_loss / len(train_loader),
        "val_loss": val_loss,
        "val_accuracy": val_acc
    })

Notice the distinction: batch-level metrics go into wandb.log() inside the training loop, epoch-level metrics get logged once per epoch. This keeps the charts readable.

The Metric That Saved Me 6 Hours of Debugging

I was training a transformer model and the loss kept plateauing around 2.3. No amount of learning rate tuning helped. Then I added this:

# After loss.backward(), before optimizer.step()
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5

wandb.log({"gradient_norm": total_norm})

The gradient norm was stuck at 0.0001. Turns out I had accidentally set requires_grad=False on the embedding layer during a refactor. The model was training, but only the final linear layer was updating.

This is why I always log gradient norms now — it catches silent failures that don’t throw errors.
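
These days I also do a one-line audit of trainable versus frozen parameter counts before kicking off a long run. That's a habit of mine, not a W&B feature; recording the counts in the run config makes an accidentally frozen layer obvious later:

# Count trainable vs. frozen parameters and record both in the run config
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
wandb.config.update({"trainable_params": trainable, "frozen_params": frozen})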

Comparing Runs: The Real Value of W&B

The dashboard’s parallel coordinates plot is where W&B becomes indispensable. You can filter runs by hyperparameters and see which combinations work.

Let’s say you ran 20 experiments with different learning rates and batch sizes:

for lr in [1e-4, 5e-4, 1e-3, 5e-3]:
    for bs in [16, 32, 64]:
        wandb.init(
            project="image-classifier",
            config={"learning_rate": lr, "batch_size": bs},
            reinit=True  # Start a new run each iteration
        )

        # ... train model ...

        wandb.finish()

In the W&B dashboard, go to the “Parallel Coordinates” view. Drag val_accuracy to the far right. You’ll see lines connecting hyperparameter values to final accuracy. Runs with accuracy > 0.85 will cluster around specific learning rates.

In my case, lr=5e-4 with bs=32 consistently outperformed everything else. But here’s the catch: lr=1e-3 with bs=64 also worked — it just needed 5 more epochs to converge. The dashboard doesn’t tell you that unless you also plot epoch as a coordinate.
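
One way to surface that is to record the epoch at which the best validation accuracy occurred and write it to the run summary, so it shows up as a column in the runs table and as an axis in the plot. A sketch of the pattern I use; run_validation is a hypothetical helper that returns accuracy:

best_val_acc, best_epoch = 0.0, -1

for epoch in range(config.epochs):
    val_acc = run_validation(model, val_loader)  # hypothetical helper returning accuracy
    if val_acc > best_val_acc:
        best_val_acc, best_epoch = val_acc, epoch

# Summary values are per-run scalars, separate from the step-by-step history
wandb.run.summary["best_val_acc"] = best_val_acc
wandb.run.summary["best_epoch"] = best_epoch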

When W&B Lied to Me (and How to Catch It)

I once ran a sweep where every single run reported val_accuracy=0.99. Impossible. Turned out I was logging training accuracy instead of validation accuracy because of a copy-paste error:

# WRONG: this logs training accuracy as val_accuracy
model.train()  # Still in training mode during validation
with torch.no_grad():
    for data, target in val_loader:
        output = model(data)
        # ... compute accuracy ...
        wandb.log({"val_accuracy": acc})  # But model.train() is still on!

The fix:

model.eval()  # Disable dropout and batchnorm updates
with torch.no_grad():
    # ... validation loop ...

Another silent failure: logging validation metrics inside the batch loop instead of after the full validation pass. You end up with 50 logged values per epoch, and the chart looks like noise.

W&B vs MLflow: Where Each One Wins

I’ve used both extensively. Here’s when I pick each:

Use W&B if:
– You want zero-config cloud dashboards (no server setup)
– You’re doing hyperparameter sweeps (W&B Sweeps is cleaner than MLflow’s)
– You need real-time collaboration (share a dashboard link with your team)
– You’re iterating fast and don’t care about self-hosting

Use MLflow if:
– You need to self-host everything (compliance, airgapped environments)
– You’re deploying models to production (MLflow’s model registry is more mature)
– You want a single tool for tracking + serving (as I covered in MLflow + FastAPI: $2K/Month Model Serving Side Project)

W&B’s free tier gives you unlimited runs but caps storage at 100GB. MLflow is free and unlimited if you host it yourself, but then you’re managing a server.

Logging Custom Metrics and Media

W&B supports more than scalar metrics. Here’s how to log images, histograms, and tables.

Images (useful for generative models or segmentation):

import matplotlib.pyplot as plt

# Log a matplotlib figure
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [0.5, 0.7, 0.9])
wandb.log({"learning_curve": wandb.Image(fig)})
plt.close(fig)

# Log a PIL image or numpy array
import numpy as np
fake_img = np.random.rand(128, 128, 3)
wandb.log({"sample_output": wandb.Image(fake_img)})

Histograms (useful for weight distributions):

# Build one dict and log it in a single call, so the global step only advances once
hist = {f"weights/{name}": wandb.Histogram(param.data.cpu().numpy())
        for name, param in model.named_parameters()}
wandb.log(hist)

Tables (useful for per-class metrics):

class_names = ["cat", "dog", "bird"]
precisions = [0.91, 0.87, 0.83]
recalls = [0.89, 0.85, 0.80]

table = wandb.Table(columns=["class", "precision", "recall"])
for cls, prec, rec in zip(class_names, precisions, recalls):
    table.add_data(cls, prec, rec)

wandb.log({"per_class_metrics": table})

I rarely log histograms in every run (they’re expensive), but I’ll turn them on when debugging weight initialization issues.

The One Thing I Wish I’d Known Earlier

Use wandb.config to store everything that could affect your results — not just hyperparameters. This includes:

  • Random seeds (torch.manual_seed(42))
  • Data augmentation settings (crop size, flip probability)
  • Hardware (GPU type, if you’re comparing runs across machines)
  • Library versions (torch.__version__)

Here’s a config that saved me when I tried to reproduce a result 3 months later:

import torch
import torchvision

wandb.init(
    project="image-classifier",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "seed": 42,
        "torch_version": torch.__version__,
        "torchvision_version": torchvision.__version__,
        "cuda_version": torch.version.cuda,
        "augmentation": "random_crop_flip",
        "model_checkpoint": "resnet18_imagenet"
    }
)

Without this, I would’ve spent hours figuring out why my “identical” rerun gave different results (spoiler: PyTorch 1.12 changed the default behavior of torch.Generator).

Where W&B Falls Short

Two pain points I hit regularly:

  1. Slow dashboard loading when you have 500+ runs. The parallel coordinates plot takes 10+ seconds to render. MLflow’s UI is faster here.
  2. No native support for multi-objective optimization. If you're optimizing for both accuracy and inference speed, you'll need to filter runs manually; one way using the public API is sketched below. (This is where hyperparameter sweeps come in — more on that in Part 2.)
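
For the multi-objective case, the workaround I use is the public API: pull the finished runs, then filter on both metrics in Python. A sketch, assuming the runs logged val_accuracy plus a hypothetical latency_ms summary metric (swap in your own entity/project path):

import wandb

api = wandb.Api()
runs = api.runs("your-entity/image-classifier")  # placeholder entity/project path

# Keep runs that clear both objectives; metric names are whatever you logged
good = [r for r in runs
        if r.summary.get("val_accuracy", 0.0) > 0.85
        and r.summary.get("latency_ms", float("inf")) < 20.0]

for r in good:
    print(r.name, r.config.get("learning_rate"), r.summary.get("val_accuracy"))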

Also, the free tier’s 100GB storage cap is tighter than it sounds if you’re logging images or model checkpoints. I hit it after about 80 runs of a generative model. The workaround: log fewer images, or pay for the Team plan ($50/month).
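
"Log fewer images" in practice means gating the uploads, for example only sending samples every few epochs. A minimal sketch; sample_batch is a hypothetical batch of images and the interval is arbitrary:

IMAGE_LOG_INTERVAL = 5  # epochs between image uploads; tune to your storage budget

for epoch in range(config.epochs):
    # ... train and validate, producing sample_batch ...
    if epoch % IMAGE_LOG_INTERVAL == 0:
        samples = [wandb.Image(img) for img in sample_batch[:8]]
        wandb.log({"samples": samples, "epoch": epoch})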

Wrapping Up: Start Logging Before You Think You Need It

The pattern I follow now: I add W&B to every training script from day one, even if it’s just a toy experiment. The time I save not having to re-run experiments because I forgot to save the config always pays off.

Use W&B for fast iteration and cloud dashboards. Switch to MLflow if you need self-hosting or production model serving.

In Part 2, we’ll cover W&B Sweeps — the automated hyperparameter search that turns 50 manual runs into a single command. I’m still not entirely sure how the Bayesian optimizer decides which hyperparameters to try next, but the results speak for themselves.

Weights & Biases Tutorial Series (1/3)
