The Dirty Secret About Medical Image Anomaly Detection
Most anomaly detection pipelines in medical imaging are fragile. I don’t mean “occasionally fails on edge cases” fragile — I mean “retrain the entire thing when you switch from chest X-rays to retinal scans” fragile. After spending three months building a defect detection system for histopathology slides at work, I’m now convinced that self-supervised foundation models like DINOv2 (Oquab et al., 2023) are the only sane default for medical anomaly detection in 2025. Not fine-tuned classifiers. Not autoencoders. Not GANs.
DINOv2’s frozen features, pulled straight from a model that never saw a single medical image during pretraining, outperformed my carefully fine-tuned ResNet50 autoencoder on two out of three pathology datasets I tested. That result genuinely surprised me.
Why Autoencoders Keep Letting Us Down
The classic approach goes like this: train an autoencoder on “normal” images, then flag anything with high reconstruction error as anomalous. It’s elegant in theory. In practice, I’ve watched autoencoders learn to reconstruct anomalies just fine — especially when those anomalies share texture patterns with normal tissue. A subtle tumor infiltration in a histopathology patch? The autoencoder happily reconstructs it because it shares 90% of its low-level features with healthy tissue.
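For reference, the scoring half of that classic pipeline is nothing more than per-image reconstruction error. A minimal sketch, assuming autoencoder is whatever reconstruction model you trained on normal patches:

import torch
import torch.nn.functional as F

def reconstruction_score(autoencoder, x):
    # Classic AE anomaly score: mean squared reconstruction error per image.
    # x: (B, C, H, W) batch of patches; autoencoder: any model trained on normals only.
    autoencoder.eval()
    with torch.no_grad():
        recon = autoencoder(x)
    # average over channels and pixels, keeping one score per image in the batch
    return F.mse_loss(recon, x, reduction='none').mean(dim=(1, 2, 3))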
I tried the f-AnoGAN approach (Schlegl et al., 2019) early on, and the GAN training was a nightmare on 512×512 pathology patches. Mode collapse every other run, and the latent space mapping added another failure point. After two weeks of hyperparameter tuning, I got it working — with an AUROC of 0.73 on our internal breast histopathology dataset. Not terrible, but not something I’d put in front of a pathologist.
The PatchCore method (Roth et al., 2022) was a step up — it uses a pretrained backbone’s intermediate features and a coreset-subsampled memory bank. I got 0.87 AUROC with a WideResNet50 backbone on the same dataset. But here’s the thing: PatchCore’s memory bank grows with your normal training set, and on whole-slide images where you might have 50,000+ patches per slide, that memory bank becomes unwieldy fast.
Enter DINOv2 Features — No Fine-Tuning Required
DINOv2 is Meta’s self-supervised vision transformer, trained on 142 million curated images with no labels. The key insight that made me try it for medical anomaly detection: DINOv2’s intermediate features encode semantic structure at multiple granularities. Patch-level features from the ViT capture both local texture and global context in a way that ImageNet-pretrained CNNs simply don’t.
Here’s my actual extraction pipeline. Nothing fancy — that’s the point.
import torch
import numpy as np
from PIL import Image
from torchvision import transforms
# DINOv2 ViT-B/14 — 86M params, ~330MB checkpoint
# ViT-L/14 is 300M params, ~1.2GB — better but needs 8GB+ VRAM
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
model.eval().cuda()
# IMPORTANT: DINOv2 expects 518×518 for ViT-B/14 (14*37=518), NOT 224×224
# I wasted a full day on this — using 224 gives you 16×16 patch tokens
# instead of 37×37, and your spatial resolution tanks
preprocess = transforms.Compose([
    transforms.Resize((518, 518), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
def extract_patch_features(image_path):
    img = Image.open(image_path).convert('RGB')
    x = preprocess(img).unsqueeze(0).cuda()
    with torch.no_grad():
        # get_intermediate_layers returns features BEFORE the head
        # n=1 gives last layer; I found n=4 (last 4 layers averaged) works better
        features = model.get_intermediate_layers(x, n=4)
    # Average across layers — each is (1, 1369, 768) for ViT-B/14 at 518px
    patch_tokens = torch.stack(features).mean(dim=0)  # (1, 1369, 768)
    return patch_tokens.squeeze(0).cpu().numpy()  # (1369, 768)
That n=4 multi-layer averaging took me a while to figure out. Using only the last layer gave worse anomaly localization — my best guess is that earlier layers preserve more fine-grained spatial information that matters for detecting small lesions.
The Preprocessing Pitfall That Almost Killed My Pipeline
Medical images aren’t natural photos. This sounds obvious, but the implications for DINOv2 are subtle.
Histopathology slides are typically stained with Hematoxylin and Eosin (H&E), and the staining varies wildly between labs. My first batch of results looked incredible on slides from one hospital and completely fell apart on slides from another. The AUROC dropped from 0.91 to 0.68 — just from stain variation.
I tried Macenko stain normalization, which projects the image into a stain color space and re-normalizes. It helped, but introduced artifacts on images that were already well-stained. What actually worked was much simpler:
import cv2

def normalize_stain_simple(img_rgb):
    # Convert to LAB, normalize L channel only
    lab = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2LAB)
    l_channel = lab[:, :, 0]
    # CLAHE on luminance — adaptive histogram equalization
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab[:, :, 0] = clahe.apply(l_channel)
    result = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)
    return result
Just CLAHE on the L channel in LAB color space. It doesn’t fix stain variation perfectly, but it stabilizes the luminance enough that DINOv2’s features become more consistent across sources. My cross-site AUROC went from 0.68 to 0.82 with this single change.
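Wiring it in is a one-line change before the DINOv2 transforms. Something like this wrapper works; it's a sketch that reuses model, preprocess, and normalize_stain_simple from above, and the function name is just for illustration:

import numpy as np
import torch
from PIL import Image

def extract_patch_features_clahe(image_path):
    # Hypothetical wrapper: CLAHE first, then the usual DINOv2 preprocessing from above
    img = Image.open(image_path).convert('RGB')
    img = Image.fromarray(normalize_stain_simple(np.array(img)))
    x = preprocess(img).unsqueeze(0).cuda()
    with torch.no_grad():
        features = model.get_intermediate_layers(x, n=4)
    return torch.stack(features).mean(dim=0).squeeze(0).cpu().numpy()  # (1369, 768)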
And here’s something I didn’t expect: applying DINOv2’s standard ImageNet normalization (mean=[0.485, 0.456, 0.406]) to medical images actually works fine. I computed the per-channel statistics of our pathology dataset and the means were [0.72, 0.54, 0.66] — quite different from ImageNet. But using dataset-specific normalization made results worse. I’m not entirely sure why, but I suspect DINOv2’s internal representations are calibrated to ImageNet statistics, and deviating from them shifts the feature distribution in unhelpful ways.
Building the Anomaly Scorer
Once you have DINOv2 patch features, you need a way to score anomalies. I tested three approaches on our breast histopathology dataset (4,200 normal patches, 1,100 anomalous patches, all 512×512 from 40× magnification):
k-NN in feature space. Dead simple. Collect patch features from normal training images, then score test images by their distance to the k-th nearest neighbor. I used k=5 with cosine distance.
from sklearn.neighbors import NearestNeighbors

# normal_features: (N, 1369, 768) — flatten to per-image
# I tried both: (a) average-pool patch tokens, (b) keep spatial structure
# Average pooling: (N, 768) per image — loses spatial info but fast
normal_pooled = normal_features.mean(axis=1)  # (N, 768)

knn = NearestNeighbors(n_neighbors=5, metric='cosine')
knn.fit(normal_pooled)

def anomaly_score(test_features):
    test_pooled = test_features.mean(axis=0, keepdims=True)  # (1, 768)
    distances, _ = knn.kneighbors(test_pooled)
    return distances[0, -1]  # distance to 5th neighbor
Gaussian density estimation. Fit a multivariate Gaussian to the normal features (or a low-rank approximation via PCA + Gaussian on the reduced space). This is essentially what PaDiM (Defard et al., 2021) does, but with DINOv2 features instead of a pretrained ResNet.
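A minimal sketch of that option, using the same average-pooled per-image features as the k-NN code above: PCA down to 512 dimensions, then Mahalanobis distance to a single Gaussian fit on normals. The regularization constant is just a reasonable default, not a tuned value:

import numpy as np
from sklearn.decomposition import PCA

# normal_pooled: (N, 768) average-pooled DINOv2 features from normal images
pca = PCA(n_components=512)
normal_reduced = pca.fit_transform(normal_pooled)

mean = normal_reduced.mean(axis=0)
cov = np.cov(normal_reduced, rowvar=False) + 1e-6 * np.eye(512)  # regularize for invertibility
cov_inv = np.linalg.inv(cov)

def gaussian_anomaly_score(test_pooled_single):
    # Mahalanobis distance of one pooled test feature (768,) to the normal Gaussian
    z = pca.transform(test_pooled_single.reshape(1, -1))[0] - mean
    return float(np.sqrt(z @ cov_inv @ z))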
Patch-level memory bank (PatchCore-style). Keep a subsampled coreset of individual patch tokens and score each test patch independently, then aggregate. This gives you anomaly localization for free.
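Here's a sketch of that patch-level variant, with plain random subsampling standing in for PatchCore's greedy coreset selection (a simplification, but it shows the shape of the thing):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# all_normal_patches: (num_images * 1369, 768), every patch token from the normal set
rng = np.random.default_rng(0)
keep = rng.choice(len(all_normal_patches), size=len(all_normal_patches) // 10, replace=False)
patch_bank = NearestNeighbors(n_neighbors=1, metric='cosine').fit(all_normal_patches[keep])

def patch_bank_score(test_patch_features):
    # test_patch_features: (1369, 768). Each patch is scored by distance to its
    # nearest normal patch; the max over patches gives the image-level score,
    # and the per-patch distances double as a localization map.
    distances, _ = patch_bank.kneighbors(test_patch_features)
    return distances.max()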
Here’s what I got on our internal dataset (5-fold cross-validation, evaluated on an RTX 3090 with 24GB VRAM):
| Method | Backbone | Image AUROC | Pixel AUROC | Inference (ms/img) |
|---|---|---|---|---|
| k-NN (k=5) | DINOv2 ViT-B/14 | 0.91 | — | 12 |
| Gaussian (PCA-512) | DINOv2 ViT-B/14 | 0.89 | — | 8 |
| Patch Memory Bank | DINOv2 ViT-B/14 | 0.93 | 0.88 | 47 |
| Patch Memory Bank | DINOv2 ViT-L/14 | 0.95 | 0.91 | 83 |
| PatchCore (original) | WideResNet50 | 0.87 | 0.83 | 35 |
| f-AnoGAN | Custom DCGAN | 0.73 | 0.61 | 28 |
The ViT-L/14 backbone with patch-level scoring hit 0.95 AUROC without seeing a single medical image during backbone training. That’s the number that convinced me.
But I should be honest about the limitations of this evaluation. 4,200 normal + 1,100 anomalous patches is small. The anomalies were verified by a pathologist, but they skew toward obvious cases. I haven’t tested this at the scale of a real clinical validation study, and I suspect performance on subtle, early-stage anomalies would be lower.
Anomaly Localization — Where It Gets Interesting
Image-level detection is useful, but pathologists want to know where the anomaly is. The patch-level approach naturally gives you a spatial anomaly map.
def generate_anomaly_map(test_image_path, normal_patch_bank, knn_index):
    patch_features = extract_patch_features(test_image_path)  # (1369, 768)
    # Score each patch token independently
    distances, _ = knn_index.kneighbors(patch_features)  # (1369, k)
    patch_scores = distances[:, -1]  # (1369,)
    # Reshape to spatial grid — 37×37 for 518px input with patch_size=14
    h = w = int(np.sqrt(len(patch_scores)))  # should be 37
    if h * w != len(patch_scores):
        # this shouldn't happen with correct input size, but just in case
        print(f"WARNING: patch count {len(patch_scores)} != {h}×{w}, truncating")
        patch_scores = patch_scores[:h * w]
    anomaly_map = patch_scores.reshape(h, w)
    # Upsample to original resolution
    anomaly_map = cv2.resize(anomaly_map, (518, 518), interpolation=cv2.INTER_CUBIC)
    # Gaussian smoothing — sigma=4 worked well empirically
    anomaly_map = cv2.GaussianBlur(anomaly_map, (0, 0), sigmaX=4)
    return anomaly_map
The localization maps are surprisingly clean. On tumor regions, the anomaly scores spike precisely at the tumor boundary — not in the center of the mass, interestingly, but at the interface between normal and abnormal tissue. My interpretation is that DINOv2’s patch features capture local context, so patches at the boundary see the sharpest feature distribution shift compared to normal tissue.
One practical issue: the 37×37 spatial resolution from ViT-B/14 means each “pixel” in your anomaly map covers roughly 14×14 pixels of the input. For 512×512 pathology patches that’s acceptable, but if you’re working with full-resolution whole-slide images at 100,000+ pixels per side, you’ll need a tiling strategy. I process overlapping 512×512 tiles with 50% overlap and average the anomaly maps in the overlapping regions.
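The tiling itself is mundane bookkeeping, but for completeness, here's roughly what the overlap-and-average step looks like. This is a sketch: tile_score_fn stands in for the generate_anomaly_map logic above (returning a map resized to the tile size), and edge handling for regions that don't divide evenly is omitted:

import numpy as np

def tiled_anomaly_map(region, tile_score_fn, tile=512, stride=256):
    # region: (H, W, 3) array cropped from the slide; stride=256 gives 50% overlap
    H, W = region.shape[:2]
    acc = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for y in range(0, H - tile + 1, stride):
        for x in range(0, W - tile + 1, stride):
            acc[y:y + tile, x:x + tile] += tile_score_fn(region[y:y + tile, x:x + tile])
            count[y:y + tile, x:x + tile] += 1
    # average wherever tiles overlapped; avoid division by zero at uncovered borders
    return acc / np.maximum(count, 1)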
The VRAM Reality Check
DINOv2 ViT-B/14 needs about 3.5GB VRAM for inference at 518×518 with batch size 1. Bump that to ViT-L/14 and you’re at 6GB. ViT-G/14 (the giant variant) needs 14GB+ and I couldn’t get it to run on a single RTX 3090 with batch size > 1 at 518px resolution.
For the memory bank approach, you also need RAM for storing the coreset. With 4,200 training images and a 10% coreset sampling rate, that’s about 420 × 1369 × 768 × 4 bytes ≈ 1.7GB of feature vectors in memory. On our 1GB Oracle Cloud server? Absolutely not happening — this is a workstation-only pipeline.
Batch processing helps with throughput but watch out for a subtle bug I hit:
# WRONG — DINOv2's get_intermediate_layers behaves differently with batch>1
# in some torch.hub versions (I hit this on torchvision 0.16.0)
batch = torch.stack([preprocess(img) for img in images]).cuda()
features = model.get_intermediate_layers(batch, n=4)
# This gave me (4, B, N, D) on some versions and (4, B*N, D) on others

# SAFE — process individually and stack
all_features = []
for img in images:
    x = preprocess(img).unsqueeze(0).cuda()
    feats = model.get_intermediate_layers(x, n=4)
    all_features.append(torch.stack(feats).mean(dim=0).squeeze(0))
features = torch.stack(all_features)
Is this slower? Yes. Did it save me from a shape mismatch bug that produced silently wrong results for a week? Also yes.
What DINOv2 Can’t Do (Yet)
I want to be fair about where this approach falls short.
First, truly subtle anomalies. Microcalcifications in mammograms, for instance, are sometimes just 2-3 pixels across at typical imaging resolutions. DINOv2’s 14×14 patch size means these can fall within a single patch token, and if the surrounding tissue looks normal, the anomaly gets diluted. I tested on a small mammography dataset (200 images from CBIS-DDSM) and got 0.79 AUROC — decent but not clinical-grade.
Second, 3D medical imaging. CT and MRI volumes need slice-by-slice or volumetric processing, and naively applying DINOv2 to individual slices throws away inter-slice context. There’s ongoing work on extending vision transformers to 3D (I’ve seen some papers from MICCAI 2024 on this), but I haven’t tried it myself.
Third, domain-specific fine-tuning. BiomedCLIP and other medical foundation models exist for a reason. On my breast histopathology dataset, a BiomedCLIP backbone gave 0.94 AUROC with the same k-NN scoring — close to DINOv2 ViT-L/14’s 0.95. But BiomedCLIP is much smaller (ViT-B/16, ~86M params) and faster at inference. If you’re working exclusively in one medical imaging domain and have the data to validate, a domain-specific model might be the pragmatic choice.
Where DINOv2's generalist nature wins is when you need to switch between imaging modalities, say from histopathology to dermatology to chest X-rays, without retraining anything. That's the real argument here.
So When Should You Actually Use This?
Use DINOv2 + patch-level k-NN as your anomaly detection baseline when you’re starting a new medical imaging anomaly detection project and don’t have tens of thousands of labeled anomalies. Specifically: if you have fewer than 500 labeled anomalous examples, don’t bother training a supervised classifier. Extract DINOv2 features from your normal samples, build a memory bank, and score new images by nearest-neighbor distance. You’ll get a working system in an afternoon.
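For concreteness, that afternoon version is essentially three steps. A sketch that reuses extract_patch_features from earlier; the glob pattern is a placeholder for wherever your normal patches live:

import glob
import numpy as np
from sklearn.neighbors import NearestNeighbors

# 1. Pooled DINOv2 features for every normal image you have
normal_pooled = np.stack([
    extract_patch_features(p).mean(axis=0) for p in glob.glob('normal/*.png')
])

# 2. The "memory bank" is just a k-NN index over those features
knn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(normal_pooled)

# 3. Score new images by distance to the 5th nearest normal neighbor
def score(image_path):
    feat = extract_patch_features(image_path).mean(axis=0, keepdims=True)
    return knn.kneighbors(feat)[0][0, -1]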
If you have abundant labeled data (5,000+ anomalous examples per class), a supervised approach will likely beat this. Fine-tune a DINOv2 backbone with a classification head — but at that point, you’re not really doing anomaly detection anymore, you’re doing classification.
For production deployment where inference speed matters: use ViT-B/14, not ViT-L/14. The 0.02 AUROC difference isn’t worth the 2× latency increase in most clinical workflows. And consider distilling the features into a smaller model once you’ve validated the approach — running a full ViT-B/14 on every image in a high-throughput screening setting is expensive.
What I’m personally curious about next: combining DINOv2 features with lightweight normalizing flows instead of k-NN for the scoring step. FastFlow and CFlow-AD have shown promising results on industrial anomaly detection, and I suspect they’d handle the high-dimensional feature space more gracefully than a naive nearest-neighbor search. I’ve started some experiments but don’t have conclusive results yet — maybe that’s another post.