I stopped chasing labeled data after this worked
Here’s the truth: you don’t need 10,000 labeled bearing fault samples to build a decent classifier. I know that sounds ridiculous when every paper benchmarks on CWRU with its neatly labeled categories, but contrastive learning proved me wrong about what’s actually necessary.
I had 200 labeled vibration samples across four classes (inner race fault, outer race fault, ball fault, normal) and about 8,000 unlabeled recordings from the same gearbox over six months. The labeled set came from controlled experiments. The unlabeled pile? That’s production data—noisy, uncurated, and probably gold if I could use it.
Standard supervised CNN got me 73% accuracy on the test set. Not terrible, but not production-ready either. Then I tried SimCLR-style contrastive pretraining on the unlabeled data before fine-tuning on the labels, and accuracy jumped to 89%. Same architecture. Same labeled data. Just a different training recipe.
How contrastive learning actually works (without the academic fluff)
The core idea: teach the model to recognize that two augmented versions of the same signal are more similar than two different signals. You don’t need labels for this—just the raw data and some creative data augmentation.
Here’s the pipeline I used:
- Take an unlabeled vibration signal
- Apply two different random augmentations to create two “views”
- Pass both through the encoder network
- Pull the representations of the same signal closer in embedding space
- Push representations of different signals apart
The augmentations matter more than you’d think. For vibration data, I used:
- Time masking (zero out random 10-50ms chunks)
- Amplitude scaling (0.8x to 1.2x)
- Additive Gaussian noise (SNR between 20-30 dB)
- Time shifting (circular shift by ±500 samples)
- Frequency masking in STFT (randomly zero out 2-5 frequency bins)
These augmentations preserve the fault signature while introducing enough variation that the model has to learn robust features. I’m not entirely sure why frequency masking worked so well—my guess is it forces the model to not rely on any single harmonic component.
import numpy as np
import torch
import torch.nn as nn
from scipy import signal


class VibrationAugmenter:
    def __init__(self, fs=12000):
        self.fs = fs

    def time_mask(self, x, max_mask_len=600):
        # Zero out random chunks (up to 50ms at 12kHz)
        mask_len = np.random.randint(120, max_mask_len)
        mask_start = np.random.randint(0, len(x) - mask_len)
        x_aug = x.copy()
        x_aug[mask_start:mask_start + mask_len] = 0
        return x_aug

    def amplitude_scale(self, x):
        scale = np.random.uniform(0.8, 1.2)
        return x * scale

    def add_noise(self, x, snr_db_range=(20, 30)):
        snr_db = np.random.uniform(*snr_db_range)
        signal_power = np.mean(x ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        noise = np.random.normal(0, np.sqrt(noise_power), len(x))
        return x + noise

    def time_shift(self, x, max_shift=500):
        shift = np.random.randint(-max_shift, max_shift)
        return np.roll(x, shift)

    def freq_mask(self, x, n_masks=3):
        # STFT -> mask random freq bins -> iSTFT
        f, t, Zxx = signal.stft(x, fs=self.fs, nperseg=256)
        for _ in range(n_masks):
            mask_idx = np.random.randint(0, len(f))
            Zxx[mask_idx, :] = 0
        _, x_aug = signal.istft(Zxx, fs=self.fs)
        return x_aug[:len(x)]  # ensure same length

    def augment(self, x):
        # Apply 2-3 random augmentations
        ops = [self.time_mask, self.amplitude_scale,
               self.add_noise, self.time_shift, self.freq_mask]
        chosen = np.random.choice(ops, size=np.random.randint(2, 4), replace=False)
        for op in chosen:
            x = op(x)
        return x


class ContrastiveEncoder(nn.Module):
    def __init__(self, signal_len=2048, embed_dim=128):
        super().__init__()
        # 1D CNN encoder for vibration signals
        self.conv1 = nn.Conv1d(1, 32, kernel_size=15, stride=2, padding=7)
        self.conv2 = nn.Conv1d(32, 64, kernel_size=11, stride=2, padding=5)
        self.conv3 = nn.Conv1d(64, 128, kernel_size=7, stride=2, padding=3)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):
        # x: (batch, 2048)
        x = x.unsqueeze(1)  # (batch, 1, 2048)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = self.pool(x).squeeze(-1)  # (batch, 128)
        x = self.fc(x)  # (batch, embed_dim)
        return x


class ContrastiveLoss(nn.Module):
    def __init__(self, temperature=0.5):
        super().__init__()
        self.temperature = temperature

    def forward(self, z_i, z_j):
        # z_i, z_j: (batch, embed_dim) - two views of same batch
        batch_size = z_i.shape[0]
        z = torch.cat([z_i, z_j], dim=0)  # (2*batch, embed_dim)
        # Cosine similarity matrix
        z_norm = nn.functional.normalize(z, dim=1)
        sim_matrix = torch.mm(z_norm, z_norm.t()) / self.temperature  # (2*batch, 2*batch)
        # Mask out self-similarity
        mask = torch.eye(2 * batch_size, device=z.device).bool()
        sim_matrix.masked_fill_(mask, -1e9)
        # Positive pairs: (i, i+batch) and (i+batch, i)
        pos_sim = torch.cat([
            sim_matrix[range(batch_size), range(batch_size, 2 * batch_size)],
            sim_matrix[range(batch_size, 2 * batch_size), range(batch_size)]
        ])
        # Denominator: sum over all negatives
        exp_sim = torch.exp(sim_matrix)
        exp_sim_sum = exp_sim.sum(dim=1) - torch.exp(torch.diagonal(sim_matrix))  # exclude self
        # NT-Xent loss
        loss = -torch.log(torch.exp(pos_sim) / exp_sim_sum).mean()
        return loss
The training loop is straightforward: for each batch of unlabeled signals, generate two augmented views, encode both, compute contrastive loss, backprop. I ran this for 100 epochs on my 8,000 unlabeled samples (batch size 128, took about 2 hours on a single RTX 3080).
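Spelled out, that loop is only a few lines. Here’s a minimal sketch, assuming a DataLoader called unlabeled_loader (a placeholder name, not shown above) that yields float32 tensors of shape (batch, 2048), plus the classes defined earlier; the Adam learning rate of 1e-3 is an illustrative choice.

augmenter = VibrationAugmenter(fs=12000)
encoder = ContrastiveEncoder(signal_len=2048, embed_dim=128).cuda()
criterion = ContrastiveLoss(temperature=0.5)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for epoch in range(100):
    for batch in unlabeled_loader:  # batch: (128, 2048) float32 tensor
        x = batch.numpy()
        # Two independently augmented views of every signal in the batch
        v1 = np.stack([augmenter.augment(sig) for sig in x]).astype(np.float32)
        v2 = np.stack([augmenter.augment(sig) for sig in x]).astype(np.float32)
        z_i = encoder(torch.from_numpy(v1).cuda())
        z_j = encoder(torch.from_numpy(v2).cuda())
        loss = criterion(z_i, z_j)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()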
One gotcha: the temperature parameter in the contrastive loss matters way more than I expected. I tried 0.1, 0.5, and 1.0. At 0.1, the model collapsed—all embeddings converged to the same point. At 1.0, training was slow and noisy. 0.5 was the sweet spot, though I suspect this is dataset-dependent.
Fine-tuning on labeled data (where the magic happens)
After pretraining, I froze the encoder and added a simple classifier head:
class FaultClassifier(nn.Module):
    def __init__(self, encoder, num_classes=4):
        super().__init__()
        self.encoder = encoder
        # Freeze encoder during initial fine-tuning
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.classifier = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, num_classes)
        )

    def forward(self, x):
        # No torch.no_grad() here: requires_grad=False already keeps the frozen
        # encoder from updating, and leaving the graph intact lets you unfreeze
        # it later for end-to-end fine-tuning.
        z = self.encoder(x)
        return self.classifier(z)
I trained the classifier head for 50 epochs on the 200 labeled samples. Learning rate 1e-3, Adam optimizer, cross-entropy loss. Validation accuracy plateaued at 87%.
Then I unfroze the encoder and fine-tuned end-to-end for another 20 epochs at a lower learning rate (1e-4). This bumped accuracy to 89%. Without the contrastive pretraining, the same end-to-end supervised training got stuck at 73%.
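For completeness, here’s the two-stage recipe as a sketch; labeled_loader is a placeholder DataLoader over the 200 labeled windows and their integer class labels.

model = FaultClassifier(encoder, num_classes=4).cuda()
criterion = nn.CrossEntropyLoss()

# Stage 1: 50 epochs on the classifier head only (encoder frozen in __init__)
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
for epoch in range(50):
    for x, y in labeled_loader:
        loss = criterion(model(x.cuda()), y.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 2: unfreeze the encoder and fine-tune end-to-end at a lower learning rate
for param in model.encoder.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(20):
    for x, y in labeled_loader:
        loss = criterion(model(x.cuda()), y.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()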
What surprised me: the pretrained encoder learned to separate fault types even without seeing labels. I visualized the embeddings using t-SNE, and the four fault classes formed distinct clusters before any supervised fine-tuning. The contrastive loss essentially did unsupervised clustering based on signal similarity.
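If you want to reproduce that picture, the sketch below is roughly what I mean; signals and labels are placeholders for the labeled windows as an (N, 2048) float array and their integer classes.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

enc = encoder.cpu().eval()
with torch.no_grad():
    emb = enc(torch.tensor(signals, dtype=torch.float32)).numpy()

coords = TSNE(n_components=2, perplexity=30).fit_transform(emb)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=12)
plt.title("Contrastive embeddings before supervised fine-tuning")
plt.show()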
When this approach fails (and it definitely can)
Contrastive learning isn’t magic. Here are the failure modes I hit:
1. Bad augmentations destroy the signal.
Early on, I tried frequency-domain augmentations that randomly scaled entire frequency bands. This obliterated fault harmonics, and the model learned nothing useful. Augmentations need to preserve task-relevant structure. For bearing faults, that means keeping the characteristic frequencies intact while varying everything else.
2. Class imbalance in unlabeled data.
My unlabeled set was 90% normal operation, 10% faults (unknown types). The encoder learned great representations for normal signals but mediocre ones for faults. I partially fixed this by oversampling the labeled fault examples during fine-tuning, but ideally, you’d want more balanced unlabeled data.
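The oversampling itself is a few lines with PyTorch’s WeightedRandomSampler. A sketch, assuming X_labeled and y_labeled (placeholder names) are tensors holding the labeled windows and their integer classes; batch_size=32 is illustrative.

from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Inverse-frequency weights so the rare fault classes are drawn as often as "normal"
class_counts = torch.bincount(y_labeled)
sample_weights = 1.0 / class_counts[y_labeled].float()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(y_labeled), replacement=True)
labeled_loader = DataLoader(TensorDataset(X_labeled, y_labeled),
                            batch_size=32, sampler=sampler)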
3. Distribution shift between unlabeled and labeled data.
The unlabeled data came from a gearbox running at variable speed (1200-1800 RPM). The labeled data was all at 1500 RPM. The pretrained encoder struggled with fixed-speed test samples because it had learned speed-invariant features. This actually helped generalization in production (where speed varies), but hurt benchmark accuracy. If your test set is very different from your unlabeled pool, contrastive pretraining might not help.
4. Computational cost.
Pretraining 100 epochs on 8,000 samples took 2 hours. That’s fine for offline work, but if you’re retraining frequently (concept drift, sensor replacement), the overhead adds up. I haven’t tested this on an edge device, but I’d guess it’s not feasible for on-device training on something like a Raspberry Pi.
Comparing to other semi-supervised approaches
I also tried pseudo-labeling (train on labeled data, predict on unlabeled data, add high-confidence predictions to training set, repeat). Final accuracy: 81%. Better than pure supervised (73%), worse than contrastive pretraining (89%).
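For reference, one round of that pseudo-labeling loop looks roughly like this; the 0.95 confidence cutoff is an illustrative value, not something I tuned.

def pseudo_label_round(model, unlabeled_x, threshold=0.95):
    # unlabeled_x: (N, 2048) float32 tensor of unlabeled windows
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled_x), dim=1)
    conf, pred = probs.max(dim=1)
    keep = conf >= threshold
    # Append the surviving windows and their predicted labels to the
    # training set, retrain, and repeat until nothing new qualifies
    return unlabeled_x[keep], pred[keep]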
My best guess for why contrastive learning won: pseudo-labeling amplifies the model’s existing biases. If the initial model is wrong about a particular signal type, it’ll confidently mislabel unlabeled examples and reinforce the mistake. Contrastive learning, by contrast, learns from the data structure itself without committing to class boundaries prematurely.
I didn’t try other contrastive methods (MoCo, BYOL, SwAV) because SimCLR was the easiest to implement and worked well enough. MoCo uses a momentum encoder and memory bank, which might help with smaller batch sizes (I was limited to 128 due to GPU memory). BYOL doesn’t use negative pairs at all, which could be more stable. But I’m skeptical they’d improve accuracy by more than 1-2% given the effort.
Real-world deployment considerations
In production, I deployed the pretrained encoder + classifier as a frozen model (no online learning). Inference is fast: ~5ms per 2048-sample window on a single CPU core. For a 12 kHz sampling rate, that’s real-time with plenty of headroom.
One issue: sensor drift. After three months, the vibration sensor’s baseline noise floor increased (probably loose mounting), and classification accuracy dropped to 82%. I retrained the classifier head (not the encoder) on a small set of newly labeled drifted data, which brought accuracy back to 87%. The pretrained encoder’s features were robust enough to adapt with minimal new labels.
Another consideration: the model outputs a probability distribution over four classes. In production, I don’t just take argmax—I use a confidence threshold. If max(P) < 0.7, I flag the sample as “uncertain” and route it to a human operator. About 8% of samples fall into this category, which is manageable.
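The routing logic is just a thresholded softmax; here’s a minimal sketch of the per-window check.

def classify_window(model, window, threshold=0.7):
    # window: (2048,) float32 tensor for one vibration snippet
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(window.unsqueeze(0)), dim=1).squeeze(0)
    conf, pred = probs.max(dim=0)
    if conf.item() < threshold:
        return "uncertain", probs  # routed to a human operator
    return int(pred.item()), probs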
What I’d do differently next time
If I were starting over, I’d invest more in augmentation design. I used generic augmentations (noise, masking, scaling), but domain-specific ones might work better. For example:
- Speed variation augmentation: resample the signal to simulate different RPMs (sketched after this list)
- Load variation: scale amplitude in a way that mimics load changes
- Multi-sensor fusion: if you have temperature or current data, augment by randomly dropping one sensor modality
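Here’s a rough sketch of the speed-variation idea, using scipy’s FFT-based resample to mimic a ±10% RPM change; the 10% range is a guess, and a phase-vocoder approach might preserve transients better.

def speed_variation(x, max_pct=0.10):
    # Resample to simulate a faster or slower shaft, then crop or zero-pad
    # back to the original window length
    factor = 1.0 + np.random.uniform(-max_pct, max_pct)
    n_new = int(round(len(x) * factor))
    x_rs = signal.resample(x, n_new)
    if len(x_rs) >= len(x):
        return x_rs[:len(x)]
    return np.pad(x_rs, (0, len(x) - len(x_rs)))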
I’d also experiment with mixing contrastive and supervised losses during pretraining. Some recent papers (I think Chen et al., 2021, but don’t quote me) suggest that even a small amount of labeled data in the pretraining stage can guide the encoder toward task-relevant features. I didn’t try this because I wanted a clean comparison, but it’s probably worth 1-2% accuracy.
Finally, I’d run this on a larger dataset. 8,000 unlabeled samples sounds like a lot, but modern contrastive methods are data-hungry. ImageNet pretraining uses millions of images. I suspect with 50,000+ unlabeled vibration signals, you could push accuracy even higher.
Use contrastive learning if you have unlabeled data and scarce labels
If your labeled set is tiny (<500 samples per class) and you have a pile of unlabeled data from the same distribution, contrastive pretraining is probably worth it. The 16-point accuracy gain I saw (73% → 89%) is massive in fault diagnosis, where false negatives can mean unplanned downtime.
Don’t bother if:
- You already have 10,000+ labeled samples (just train supervised)
- Your unlabeled data is from a totally different distribution (you’ll learn the wrong features)
- You can’t invest time in augmentation design (bad augmentations = useless pretraining)
I’m curious whether this scales to multivariate sensor fusion. What if you have vibration, temperature, and current, but labels only for vibration-based faults? Could you pretrain a joint encoder on all three modalities using contrastive learning, then fine-tune on single-modality labels? I haven’t tested it, but the idea is floating around in my head for the next project.