- W&B integrates with PyTorch in 3 lines: wandb.init(), wandb.log(), and wandb.finish() — setup takes under 3 minutes.
- Log training loss every N batches, validation metrics per epoch, and gradient norms to catch silent failures like frozen layers.
- W&B's parallel coordinates plot reveals which hyperparameter combinations work, but only if you also log epoch count to spot late convergence.
- Always store random seeds, library versions, and hardware info in wandb.config to ensure reproducibility months later.
- Use W&B for fast iteration and cloud dashboards; switch to MLflow for self-hosting or production model serving.
The Problem Nobody Talks About
You’re training a model, tweaking hyperparameters, and two hours later you can’t remember which learning rate gave you that 0.87 validation accuracy. You’ve got a folder full of model_v3_final_ACTUAL.pt files and a spreadsheet you stopped updating around run 12.
Weights & Biases (W&B) fixes this by automatically logging metrics, hyperparameters, and system stats to a centralized dashboard. The setup takes about 3 minutes. The hard part is figuring out what to log and when to trust the numbers.
Here’s what happens when you run 50 experiments without tracking versus with W&B — and the specific logging patterns that separate useful dashboards from cluttered noise.

Installing W&B and the 3-Line Integration
First, install the library and authenticate:
pip install wandb
wandb login
The wandb login command opens your browser to grab an API key. Paste it into the terminal. If you’re on a remote server, use wandb login --relogin and copy the key manually.
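If even that is awkward (a headless box, a CI job), setting the WANDB_API_KEY environment variable before the run works too; wandb picks it up without an interactive login. For example:
export WANDB_API_KEY=<your-api-key>
python train.py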
Here’s a minimal PyTorch training loop with W&B:
import torch
import torch.nn as nn
import wandb
from torch.utils.data import DataLoader, TensorDataset

# Initialize W&B run
wandb.init(
    project="image-classifier",
    config={
        "learning_rate": 0.001,
        "epochs": 10,
        "batch_size": 32,
        "architecture": "resnet18"
    }
)
config = wandb.config

# Dummy dataset for demo
X = torch.randn(1000, 3, 64, 64)
y = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=config.batch_size, shuffle=True)

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=False)
model.fc = nn.Linear(model.fc.in_features, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
criterion = nn.CrossEntropyLoss()

for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Log every 10 batches to avoid overwhelming the dashboard
        if batch_idx % 10 == 0:
            wandb.log({"loss": loss.item(), "epoch": epoch})

wandb.finish()
That’s it. Three lines: wandb.init(), wandb.log(), and wandb.finish(). The dashboard appears at wandb.ai/<username>/<project> and updates in real time.
What to Log (and What Not to Log)
The mistake I see most often: logging everything at every step. Your dashboard becomes a wall of 47 metrics and you can’t tell which ones matter.
Here’s what actually helps:
Always log:
– Training loss (every N batches, not every single step)
– Validation loss and metrics (after each epoch)
– Learning rate (if using a scheduler)
– Epoch number or global step
Log when debugging:
– Gradient norms (catches exploding gradients)
– Weight norms (catches dead neurons)
– Layer-wise statistics (mean and std of activations; see the sketch after these lists)
Don’t log:
– Raw predictions (use wandb.Table for samples, not continuous logging)
– Per-sample losses (aggregate them first)
– Metrics at training-batch frequency when you’re already validating each epoch (it’s redundant noise)
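For the debugging bucket, here’s a minimal sketch of logging weight norms and layer-wise activation statistics. It assumes the model and active run from the snippet above, and it hooks nn.Linear layers purely as an example:
import torch.nn as nn
import wandb

def log_weight_norms(model, step):
    # L2 norm per parameter tensor; a norm that never moves hints at a frozen layer
    wandb.log({f"weight_norm/{name}": p.detach().norm(2).item()
               for name, p in model.named_parameters()}, step=step)

activation_stats = {}

def make_hook(name):
    # Forward hook that records the mean and std of a layer's output
    def hook(module, inputs, output):
        activation_stats[f"act_mean/{name}"] = output.detach().mean().item()
        activation_stats[f"act_std/{name}"] = output.detach().std().item()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

# Then log activation_stats alongside your other batch metrics:
# wandb.log(activation_stats)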
Here’s a more realistic training loop with gradient monitoring:
import torch.nn.utils as nn_utils

for epoch in range(config.epochs):
    model.train()
    train_loss = 0.0

    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()

        # Gradient clipping to prevent exploding gradients
        grad_norm = nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        train_loss += loss.item()

        if batch_idx % 50 == 0:
            wandb.log({
                "batch_loss": loss.item(),
                "grad_norm": grad_norm.item(),
                "step": epoch * len(train_loader) + batch_idx
            })

    # Validation after each epoch
    model.eval()
    val_loss = 0.0
    correct = 0

    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            val_loss += criterion(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    val_loss /= len(val_loader)
    val_acc = correct / len(val_loader.dataset)

    wandb.log({
        "epoch": epoch,
        "train_loss_epoch": train_loss / len(train_loader),
        "val_loss": val_loss,
        "val_accuracy": val_acc
    })
Notice the distinction: batch-level metrics go into wandb.log() inside the training loop, epoch-level metrics get logged once per epoch. This keeps the charts readable.
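If you want the epoch-level metrics plotted against epochs instead of the global step, W&B’s define_metric lets you bind them to a custom x-axis. A sketch, using the metric names from the loop above; call these once right after wandb.init():
# Declare "epoch" as a step metric and attach the epoch-level metrics to it
wandb.define_metric("epoch")
wandb.define_metric("val_loss", step_metric="epoch")
wandb.define_metric("val_accuracy", step_metric="epoch")
wandb.define_metric("train_loss_epoch", step_metric="epoch")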
The Metric That Saved Me 6 Hours of Debugging
I was training a transformer model and the loss kept plateauing around 2.3. No amount of learning rate tuning helped. Then I added this:
# After loss.backward(), before optimizer.step()
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
wandb.log({"gradient_norm": total_norm})
The gradient norm was stuck at 0.0001. Turns out I had accidentally set requires_grad=False on the embedding layer during a refactor. The model was training, but only the final linear layer was updating.
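A quick sanity check for this class of bug is a one-liner that lists every parameter tensor that will never receive gradients, run right after building the model:
# Any accidentally frozen layers show up here before training starts
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]
if frozen:
    print(f"Frozen parameters ({len(frozen)}): {frozen[:5]} ...")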
This is why I always log gradient norms now — it catches silent failures that don’t throw errors.
Comparing Runs: The Real Value of W&B
The dashboard’s parallel coordinates plot is where W&B becomes indispensable. You can filter runs by hyperparameters and see which combinations work.
Let’s say you sweep a grid of learning rates and batch sizes (12 combinations here):
for lr in [1e-4, 5e-4, 1e-3, 5e-3]:
    for bs in [16, 32, 64]:
        wandb.init(
            project="image-classifier",
            config={"learning_rate": lr, "batch_size": bs},
            reinit=True  # Start a new run each iteration
        )
        # ... train model ...
        wandb.finish()
In the W&B dashboard, go to the “Parallel Coordinates” view. Drag val_accuracy to the far right. You’ll see lines connecting hyperparameter values to final accuracy. Runs with accuracy > 0.85 will cluster around specific learning rates.
In my case, lr=5e-4 with bs=32 consistently outperformed everything else. But here’s the catch: lr=1e-3 with bs=64 also worked — it just needed 5 more epochs to converge. The dashboard doesn’t tell you that unless you also plot epoch as a coordinate.
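One way to surface that without eyeballing curves is to write the best epoch into the run summary, which shows up as a column you can add to the parallel coordinates plot. A sketch, assuming you track best_val_acc and best_epoch yourself during training:
# Summary values appear in the runs table and can be used as plot columns
wandb.run.summary["best_val_accuracy"] = best_val_acc
wandb.run.summary["best_epoch"] = best_epoch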
When W&B Lied to Me (and How to Catch It)
I once ran a sweep where every single run reported val_accuracy=0.99. Impossible. Turned out I was logging training accuracy instead of validation accuracy because of a copy-paste error:
# WRONG: this logs training accuracy as val_accuracy
model.train()  # Still in training mode during validation
with torch.no_grad():
    for data, target in val_loader:
        output = model(data)
        # ... compute accuracy ...

wandb.log({"val_accuracy": acc})  # But model.train() is still on!
The fix:
model.eval() # Disable dropout and batchnorm updates
with torch.no_grad():
    # ... validation loop ...
Another silent failure: logging validation metrics inside the batch loop instead of after the full validation pass. You end up with 50 logged values per epoch, and the chart looks like noise.
W&B vs MLflow: Where Each One Wins
I’ve used both extensively. Here’s when I pick each:
Use W&B if:
– You want zero-config cloud dashboards (no server setup)
– You’re doing hyperparameter sweeps (W&B Sweeps is cleaner than MLflow’s)
– You need real-time collaboration (share a dashboard link with your team)
– You’re iterating fast and don’t care about self-hosting
Use MLflow if:
– You need to self-host everything (compliance, airgapped environments)
– You’re deploying models to production (MLflow’s model registry is more mature)
– You want a single tool for tracking + serving (as I covered in MLflow + FastAPI: $2K/Month Model Serving Side Project)
W&B’s free tier gives you unlimited runs but caps storage at 100GB. MLflow is free and unlimited if you host it yourself, but then you’re managing a server.
Logging Custom Metrics and Media
W&B supports more than scalar metrics. Here’s how to log images, histograms, and tables.
Images (useful for generative models or segmentation):
import matplotlib.pyplot as plt
# Log a matplotlib figure
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [0.5, 0.7, 0.9])
wandb.log({"learning_curve": wandb.Image(fig)})
plt.close(fig)
# Log a PIL image or numpy array
import numpy as np
fake_img = np.random.rand(128, 128, 3)
wandb.log({"sample_output": wandb.Image(fake_img)})
Histograms (useful for weight distributions):
for name, param in model.named_parameters():
wandb.log({f"weights/{name}": wandb.Histogram(param.data.cpu().numpy())})
Tables (useful for per-class metrics):
class_names = ["cat", "dog", "bird"]
precisions = [0.91, 0.87, 0.83]
recalls = [0.89, 0.85, 0.80]
table = wandb.Table(columns=["class", "precision", "recall"])
for cls, prec, rec in zip(class_names, precisions, recalls):
    table.add_data(cls, prec, rec)
wandb.log({"per_class_metrics": table})
I rarely log histograms in every run (they’re expensive), but I’ll turn them on when debugging weight initialization issues.
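A simple way to keep them switchable is a config flag. A sketch, assuming you add a log_histograms key to the wandb.config at init:
# Only pay the histogram cost when explicitly enabled in the run config
if config.log_histograms:
    for name, param in model.named_parameters():
        wandb.log({f"weights/{name}": wandb.Histogram(param.data.cpu().numpy())})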
The One Thing I Wish I’d Known Earlier
Use wandb.config to store everything that could affect your results — not just hyperparameters. This includes:
- Random seeds (torch.manual_seed(42))
- Data augmentation settings (crop size, flip probability)
- Hardware (GPU type, if you’re comparing runs across machines)
- Library versions (torch.__version__)
Here’s a config that saved me when I tried to reproduce a result 3 months later:
import torch
import torchvision

wandb.init(
    project="image-classifier",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "seed": 42,
        "torch_version": torch.__version__,
        "torchvision_version": torchvision.__version__,
        "cuda_version": torch.version.cuda,
        "augmentation": "random_crop_flip",
        "model_checkpoint": "resnet18_imagenet"
    }
)
Without this, I would’ve spent hours figuring out why my “identical” rerun gave different results (spoiler: PyTorch 1.12 changed the default behavior of torch.Generator).
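Storing the seed only helps if you also set it. A minimal sketch that covers the usual RNGs, applied right after reading the config:
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every RNG that can affect a training run
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(wandb.config.seed)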
Where W&B Falls Short
Two pain points I hit regularly:
- Slow dashboard loading when you have 500+ runs. The parallel coordinates plot takes 10+ seconds to render. MLflow’s UI is faster here.
- No native support for multi-objective optimization. If you’re optimizing for both accuracy and inference speed, you’ll need to manually filter runs. (This is where hyperparameter sweeps come in — more on that in Part 2.)
Also, the free tier’s 100GB storage cap is tighter than it sounds if you’re logging images or model checkpoints. I hit it after about 80 runs of a generative model. The workaround: log fewer images, or pay for the Team plan ($50/month).
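What “log fewer images” looks like in practice, as a sketch: gate media logging on the epoch counter and cap the number of samples (sample_batch here stands in for whatever tensor of example outputs you already have).
# Log a handful of samples every few epochs instead of every batch
if epoch % 5 == 0:
    wandb.log({"samples": [wandb.Image(img) for img in sample_batch[:4]]})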
Wrapping Up: Start Logging Before You Think You Need It
The pattern I follow now: I add W&B to every training script from day one, even if it’s just a toy experiment. The time I save not having to re-run experiments because I forgot to save the config always pays off.
Use W&B for fast iteration and cloud dashboards. Switch to MLflow if you need self-hosting or production model serving.
In Part 2, we’ll cover W&B Sweeps — the automated hyperparameter search that turns 50 manual runs into a single command. I’m still not entirely sure how the Bayesian optimizer decides which hyperparameters to try next, but the results speak for themselves.