- W&B integrates with PyTorch in 3 lines: wandb.init(), wandb.log(), and wandb.finish() — setup takes under 3 minutes.
- Log training loss every N batches, validation metrics per epoch, and gradient norms to catch silent failures like frozen layers.
- W&B's parallel coordinates plot reveals which hyperparameter combinations work, but only if you also log epoch count to spot late convergence.
- Always store random seeds, library versions, and hardware info in wandb.config to ensure reproducibility months later.
- Use W&B for fast iteration and cloud dashboards; switch to MLflow for self-hosting or production model serving.
The Problem Nobody Talks About
You’re training a model, tweaking hyperparameters, and two hours later you can’t remember which learning rate gave you that 0.87 validation accuracy. You’ve got a folder full of model_v3_final_ACTUAL.pt files and a spreadsheet you stopped updating around run 12.
Weights & Biases (W&B) fixes this by automatically logging metrics, hyperparameters, and system stats to a centralized dashboard. The setup takes about 3 minutes. The hard part is figuring out what to log and when to trust the numbers.
Here’s what happens when you run 50 experiments without tracking versus with W&B — and the specific logging patterns that separate useful dashboards from cluttered noise.

Installing W&B and the 3-Line Integration
First, install the library and authenticate:
pip install wandb
wandb login
The wandb login command opens your browser to grab an API key. Paste it into the terminal. If you’re on a remote server, use wandb login --relogin and copy the key manually.
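If even that is awkward (a headless box, a CI job), setting the WANDB_API_KEY environment variable before the run works too; wandb picks it up without an interactive login. For example:
export WANDB_API_KEY=<your-api-key>
python train.py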
Here’s a minimal PyTorch training loop with W&B:
import torch
import torch.nn as nn
import wandb
from torch.utils.data import DataLoader, TensorDataset

# Initialize W&B run
wandb.init(
    project="image-classifier",
    config={
        "learning_rate": 0.001,
        "epochs": 10,
        "batch_size": 32,
        "architecture": "resnet18"
    }
)
config = wandb.config

# Dummy dataset for demo
X = torch.randn(1000, 3, 64, 64)
y = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=config.batch_size, shuffle=True)

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=False)
model.fc = nn.Linear(model.fc.in_features, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
criterion = nn.CrossEntropyLoss()

for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Log every 10 batches to avoid overwhelming the dashboard
        if batch_idx % 10 == 0:
            wandb.log({"loss": loss.item(), "epoch": epoch})

wandb.finish()
That’s it. Three lines: wandb.init(), wandb.log(), and wandb.finish(). The dashboard appears at wandb.ai/<username>/<project> and updates in real time.
What to Log (and What Not to Log)
The mistake I see most often: logging everything at every step. Your dashboard becomes a wall of 47 metrics and you can’t tell which ones matter.
Here’s what actually helps:
Always log:
– Training loss (every N batches, not every single step)
– Validation loss and metrics (after each epoch)
– Learning rate (if using a scheduler)
– Epoch number or global step
Log when debugging:
– Gradient norms (catches exploding gradients)
– Weight norms (catches dead neurons)
– Layer-wise statistics (mean and std of activations; see the sketch after these lists)
Don’t log:
– Raw predictions (use wandb.Table for samples, not continuous logging)
– Per-sample losses (aggregate them first)
– Metrics at training-batch frequency when you’re already validating each epoch (it’s redundant noise)
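For the debugging bucket, here’s a minimal sketch of logging weight norms and layer-wise activation statistics. It assumes the model and active run from the snippet above, and it hooks nn.Linear layers purely as an example:
import torch.nn as nn
import wandb

def log_weight_norms(model, step):
    # L2 norm per parameter tensor; a norm that never moves hints at a frozen layer
    wandb.log({f"weight_norm/{name}": p.detach().norm(2).item()
               for name, p in model.named_parameters()}, step=step)

activation_stats = {}

def make_hook(name):
    # Forward hook that records the mean and std of a layer's output
    def hook(module, inputs, output):
        activation_stats[f"act_mean/{name}"] = output.detach().mean().item()
        activation_stats[f"act_std/{name}"] = output.detach().std().item()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

# Then log activation_stats alongside your other batch metrics:
# wandb.log(activation_stats)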
Here’s a more realistic training loop with gradient monitoring:
import torch.nn.utils as nn_utils

for epoch in range(config.epochs):
    model.train()
    train_loss = 0.0

    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()

        # Gradient clipping to prevent exploding gradients
        grad_norm = nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        train_loss += loss.item()

        if batch_idx % 50 == 0:
            wandb.log({
                "batch_loss": loss.item(),
                "grad_norm": grad_norm.item(),
                "step": epoch * len(train_loader) + batch_idx
            })

    # Validation after each epoch
    model.eval()
    val_loss = 0.0
    correct = 0

    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            val_loss += criterion(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    val_loss /= len(val_loader)
    val_acc = correct / len(val_loader.dataset)

    wandb.log({
        "epoch": epoch,
        "train_loss_epoch": train_loss / len(train_loader),
        "val_loss": val_loss,
        "val_accuracy": val_acc
    })
Notice the distinction: batch-level metrics go into wandb.log() inside the training loop, epoch-level metrics get logged once per epoch. This keeps the charts readable.
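If you want the epoch-level metrics plotted against epochs instead of the global step, W&B’s define_metric lets you bind them to a custom x-axis. A sketch, using the metric names from the loop above; call these once right after wandb.init():
# Declare "epoch" as a step metric and attach the epoch-level metrics to it
wandb.define_metric("epoch")
wandb.define_metric("val_loss", step_metric="epoch")
wandb.define_metric("val_accuracy", step_metric="epoch")
wandb.define_metric("train_loss_epoch", step_metric="epoch")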
The Metric That Saved Me 6 Hours of Debugging
I was training a transformer model and the loss kept plateauing around 2.3. No amount of learning rate tuning helped. Then I added this:
# After loss.backward(), before optimizer.step()
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
wandb.log({"gradient_norm": total_norm})
The gradient norm was stuck at 0.0001. Turns out I had accidentally set requires_grad=False on the embedding layer during a refactor. The model was training, but only the final linear layer was updating.
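A quick sanity check for this class of bug is a one-liner that lists every parameter tensor that will never receive gradients, run right after building the model:
# Any accidentally frozen layers show up here before training starts
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]
if frozen:
    print(f"Frozen parameters ({len(frozen)}): {frozen[:5]} ...")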
This is why I always log gradient norms now — it catches silent failures that don’t throw errors.
Comparing Runs: The Real Value of W&B
The dashboard’s parallel coordinates plot is where W&B becomes indispensable. You can filter runs by hyperparameters and see which combinations work.
Let’s say you sweep a grid of learning rates and batch sizes (12 combinations here):
for lr in [1e-4, 5e-4, 1e-3, 5e-3]:
    for bs in [16, 32, 64]:
        wandb.init(
            project="image-classifier",
            config={"learning_rate": lr, "batch_size": bs},
            reinit=True  # Start a new run each iteration
        )
        # ... train model ...
        wandb.finish()
In the W&B dashboard, go to the “Parallel Coordinates” view. Drag val_accuracy to the far right. You’ll see lines connecting hyperparameter values to final accuracy. Runs with accuracy > 0.85 will cluster around specific learning rates.
In my case, lr=5e-4 with bs=32 consistently outperformed everything else. But here’s the catch: lr=1e-3 with bs=64 also worked — it just needed 5 more epochs to converge. The dashboard doesn’t tell you that unless you also plot epoch as a coordinate.
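One way to surface that without eyeballing curves is to write the best epoch into the run summary, which shows up as a column you can add to the parallel coordinates plot. A sketch, assuming you track best_val_acc and best_epoch yourself during training:
# Summary values appear in the runs table and can be used as plot columns
wandb.run.summary["best_val_accuracy"] = best_val_acc
wandb.run.summary["best_epoch"] = best_epoch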
When W&B Lied to Me (and How to Catch It)
I once ran a sweep where every single run reported val_accuracy=0.99. Impossible. Turned out I was logging training accuracy instead of validation accuracy because of a copy-paste error:
# WRONG: this logs training accuracy as val_accuracy
model.train()  # Still in training mode during validation
with torch.no_grad():
    for data, target in val_loader:
        output = model(data)
        # ... compute accuracy ...

wandb.log({"val_accuracy": acc})  # But model.train() is still on!
The fix:
model.eval() # Disable dropout and batchnorm updates
with torch.no_grad():
    # ... validation loop ...
Another silent failure: logging validation metrics inside the batch loop instead of after the full validation pass. You end up with 50 logged values per epoch, and the chart looks like noise.
W&B vs MLflow: Where Each One Wins
I’ve used both extensively. Here’s when I pick each:
Use W&B if:
– You want zero-config cloud dashboards (no server setup)
– You’re doing hyperparameter sweeps (W&B Sweeps is cleaner than MLflow’s)
– You need real-time collaboration (share a dashboard link with your team)
– You’re iterating fast and don’t care about self-hosting
Use MLflow if:
– You need to self-host everything (compliance, airgapped environments)
– You’re deploying models to production (MLflow’s model registry is more mature)
– You want a single tool for tracking + serving (as I covered in MLflow + FastAPI: $2K/Month Model Serving Side Project)
W&B’s free tier gives you unlimited runs but caps storage at 100GB. MLflow is free and unlimited if you host it yourself, but then you’re managing a server.
Logging Custom Metrics and Media
W&B supports more than scalar metrics. Here’s how to log images, histograms, and tables.
Images (useful for generative models or segmentation):
import matplotlib.pyplot as plt
# Log a matplotlib figure
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [0.5, 0.7, 0.9])
wandb.log({"learning_curve": wandb.Image(fig)})
plt.close(fig)
# Log a PIL image or numpy array
import numpy as np
fake_img = np.random.rand(128, 128, 3)
wandb.log({"sample_output": wandb.Image(fake_img)})
Histograms (useful for weight distributions):
for name, param in model.named_parameters():
wandb.log({f"weights/{name}": wandb.Histogram(param.data.cpu().numpy())})
Tables (useful for per-class metrics):
class_names = ["cat", "dog", "bird"]
precisions = [0.91, 0.87, 0.83]
recalls = [0.89, 0.85, 0.80]
table = wandb.Table(columns=["class", "precision", "recall"])
for cls, prec, rec in zip(class_names, precisions, recalls):
    table.add_data(cls, prec, rec)
wandb.log({"per_class_metrics": table})
I rarely log histograms in every run (they’re expensive), but I’ll turn them on when debugging weight initialization issues.
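A simple way to keep them switchable is a config flag. A sketch, assuming you add a log_histograms key to the wandb.config at init:
# Only pay the histogram cost when explicitly enabled in the run config
if config.log_histograms:
    for name, param in model.named_parameters():
        wandb.log({f"weights/{name}": wandb.Histogram(param.data.cpu().numpy())})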
The One Thing I Wish I’d Known Earlier
Use wandb.config to store everything that could affect your results — not just hyperparameters. This includes:
- Random seeds (torch.manual_seed(42))
- Data augmentation settings (crop size, flip probability)
- Hardware (GPU type, if you’re comparing runs across machines)
- Library versions (torch.__version__)
Here’s a config that saved me when I tried to reproduce a result 3 months later:
import torch
import torchvision

wandb.init(
    project="image-classifier",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "seed": 42,
        "torch_version": torch.__version__,
        "torchvision_version": torchvision.__version__,
        "cuda_version": torch.version.cuda,
        "augmentation": "random_crop_flip",
        "model_checkpoint": "resnet18_imagenet"
    }
)
Without this, I would’ve spent hours figuring out why my “identical” rerun gave different results (spoiler: PyTorch 1.12 changed the default behavior of torch.Generator).
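Storing the seed only helps if you also set it. A minimal sketch that covers the usual RNGs, applied right after reading the config:
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every RNG that can affect a training run
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(wandb.config.seed)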
Where W&B Falls Short
Two pain points I hit regularly:
- Slow dashboard loading when you have 500+ runs. The parallel coordinates plot takes 10+ seconds to render. MLflow’s UI is faster here.
- No native support for multi-objective optimization. If you’re optimizing for both accuracy and inference speed, you’ll need to manually filter runs. (This is where hyperparameter sweeps come in — more on that in Part 2.)
Also, the free tier’s 100GB storage cap is tighter than it sounds if you’re logging images or model checkpoints. I hit it after about 80 runs of a generative model. The workaround: log fewer images, or pay for the Team plan ($50/month).
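What “log fewer images” looks like in practice, as a sketch: gate media logging on the epoch counter and cap the number of samples (sample_batch here stands in for whatever tensor of example outputs you already have).
# Log a handful of samples every few epochs instead of every batch
if epoch % 5 == 0:
    wandb.log({"samples": [wandb.Image(img) for img in sample_batch[:4]]})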
Wrapping Up: Start Logging Before You Think You Need It
The pattern I follow now: I add W&B to every training script from day one, even if it’s just a toy experiment. The time I save not having to re-run experiments because I forgot to save the config always pays off.
Use W&B for fast iteration and cloud dashboards. Switch to MLflow if you need self-hosting or production model serving.
In Part 2, we’ll cover W&B Sweeps — the automated hyperparameter search that turns 50 manual runs into a single command. I’m still not entirely sure how the Bayesian optimizer decides which hyperparameters to try next, but the results speak for themselves.