Why Statistical Methods Fail on Modern Assembly Lines
Most factories still use statistical process control (SPC) — control charts, 3-sigma rules, the kind of stuff Shewhart invented in the 1920s. For a single sensor tracking bearing temperature, this works fine. But modern production lines generate hundreds of correlated signals: vibration, current draw, acoustic emission, vision data from multiple cameras, conveyor speed, ambient temperature. SPC charts can’t model the joint distribution of 200+ variables in real time, so they trigger false alarms constantly or miss subtle multi-sensor patterns that precede defects.
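To put a rough number on the false-alarm problem: a single 3-sigma chart fires on about 0.27% of in-control samples. Run 200 such charts side by side and, assuming (unrealistically) independent signals, the chance that at least one fires on a given sample is $1 - (1 - 0.0027)^{200} \approx 0.42$, so nearly every other time step raises an alarm somewhere even when nothing is wrong.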
Deep learning doesn’t care about correlation structure or non-Gaussianity. You feed it raw sensor windows, and it learns the representation. The question is: which architecture actually works when you have 50ms inference budgets and can’t afford to label every anomaly?

Two Approaches: Autoencoder vs. Transformer (And Why I Tested Both)
I compared two methods on simulated bearing vibration data (sampling rate 10kHz, window size 2048 samples, sliding every 512 samples). The goal: flag anomalies within one inference cycle (~50ms on an edge GPU).
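For concreteness, the windowing itself is just this (a minimal numpy sketch; the function name is mine, the constants mirror the setup above):

```python
import numpy as np

FS = 10_000   # sampling rate (Hz)
WIN = 2048    # window length (samples)
HOP = 512     # slide step (samples)

def sliding_windows(signal: np.ndarray) -> np.ndarray:
    """Slice a 1-D vibration signal into overlapping (n_windows, WIN) frames."""
    n = (len(signal) - WIN) // HOP + 1
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n)[:, None]
    return signal[idx]

# e.g. one second of signal -> 16 windows of 2048 samples each
windows = sliding_windows(np.random.randn(FS))
```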
Approach 1: Convolutional Autoencoder (CAE)
Classic unsupervised approach. Train on normal data only, reconstruct input, flag high reconstruction error. The CAE uses 1D convolutions to extract frequency features from raw waveforms — no manual FFT required. Loss function during training:
$$\mathcal{L}_{\text{recon}} = \frac{1}{N} \sum_{i=1}^{N} \left(x_i - \hat{x}_i\right)^2$$
where $x$ is the input window, $\hat{x}$ is the reconstruction, and $N$ is the window length (2048 samples). At inference, we compute the reconstruction error and threshold it. I used a 99th percentile threshold from validation normal data.
Approach 2: Transformer with Contrastive Learning
Inspired by the approach from Tong et al. (NeurIPS 2021 workshop, if I recall correctly), this uses self-attention over time steps. Instead of reconstruction, it learns an embedding space where normal samples cluster tightly. During training, positive pairs (augmented views of the same window) are pulled together, negatives pushed apart:
$$\mathcal{L}_{\text{contrastive}} = -\log \frac{\exp\!\big(\mathrm{sim}(z, z^{+})/\tau\big)}{\exp\!\big(\mathrm{sim}(z, z^{+})/\tau\big) + \sum_{j=1}^{K} \exp\!\big(\mathrm{sim}(z, z_{j}^{-})/\tau\big)}$$
where $z$ is the embedding, $z^{+}$ is the augmented positive, $z_{j}^{-}$ are the negatives, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, and $\tau$ is the temperature (I used 0.07). At inference, I compute the distance from each sample’s embedding to the training normal centroid and threshold that.
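Here’s a minimal sketch of that loss and the centroid scoring in PyTorch (my own simplification: embeddings are assumed L2-normalized, and `train_embeddings` is a placeholder for the precomputed normal-data embeddings):

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_pos, z_negs, tau=0.07):
    """z: (B, D) anchors, z_pos: (B, D) augmented positives, z_negs: (B, K, D) negatives.
    All assumed L2-normalized, so cosine similarity is just a dot product."""
    sim_pos = (z * z_pos).sum(dim=-1, keepdim=True)        # (B, 1)
    sim_neg = torch.einsum('bd,bkd->bk', z, z_negs)        # (B, K)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / tau    # (B, 1+K)
    targets = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, targets)                # -log softmax of the positive

# Inference: score = cosine distance to the centroid of normal training embeddings
# (train_embeddings: (N, D), unit-norm, computed once after training)
centroid = F.normalize(train_embeddings.mean(dim=0), dim=0)  # (D,)
scores = 1.0 - z @ centroid                                  # higher = more anomalous
```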
Why test both? The autoencoder is simpler and faster. The transformer might capture long-range temporal dependencies better (bearing faults often show up as subtle phase shifts 100+ samples apart). But transformers are also heavier.
Implementation Details (Where Theory Meets Reality)
Here’s the CAE in PyTorch (this is simplified, but runnable on numpy 1.24+ and torch 2.0+):
```python
import numpy as np
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, input_len=2048):
        super().__init__()
        # Encoder: downsample by 16x total
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, stride=2, padding=3),    # -> 1024
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2),   # -> 512
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1),  # -> 256
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, stride=2, padding=1), # -> 128
            nn.ReLU(),
        )
        # Decoder: upsample back
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        # x: (batch, 1, 2048)
        z = self.encoder(x)
        recon = self.decoder(z)
        # Edge case: output might not be exactly 2048 due to rounding
        if recon.size(-1) != x.size(-1):
            recon = recon[..., :x.size(-1)]  # shouldn't happen, but does with certain padding choices
        return recon

# Training loop (simplified)
model = ConvAutoencoder().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(50):
    for batch in normal_train_loader:  # only normal data
        x = batch.cuda()               # shape: (batch_size, 1, 2048)
        recon = model(x)
        loss = criterion(recon, x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch}, Loss: {loss.item():.6f}")

# Compute threshold on validation normal data
model.eval()
errors = []
with torch.no_grad():
    for batch in normal_val_loader:
        x = batch.cuda()
        recon = model(x)
        err = ((x - recon) ** 2).mean(dim=(1, 2))  # per-sample MSE
        errors.extend(err.cpu().numpy())

threshold = np.percentile(errors, 99)
print(f"Threshold: {threshold:.4f}")
```
The transformer version is heavier (8-head self-attention, 4 layers, 256 hidden dim). I won’t paste the full code here, but the key is tokenizing the 2048-sample window into 128 patches of 16 samples each, then feeding them into nn.TransformerEncoder. The contrastive loss uses random time-shift and Gaussian noise as augmentations. Training took 3x longer than the CAE (about 2 hours on an RTX 3090 vs. 40 minutes).
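Still, for orientation, here’s a minimal sketch of the patch tokenization and encoder wiring (my own simplification: the projection head and mean-pooling are assumptions, the layer sizes match what I describe above):

```python
import torch
import torch.nn as nn

class PatchTransformer(nn.Module):
    def __init__(self, win=2048, patch=16, d_model=256, nhead=8, layers=4, emb_dim=128):
        super().__init__()
        self.patch = patch
        self.n_patches = win // patch                      # 2048 / 16 = 128 tokens
        self.proj = nn.Linear(patch, d_model)              # patch -> token embedding
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(d_model, emb_dim)            # projection for the contrastive loss

    def forward(self, x):                                  # x: (B, 1, 2048)
        tokens = x.squeeze(1).unfold(1, self.patch, self.patch)  # (B, 128, 16)
        h = self.encoder(self.proj(tokens) + self.pos)     # (B, 128, d_model)
        z = self.head(h.mean(dim=1))                       # mean-pool tokens -> (B, emb_dim)
        return nn.functional.normalize(z, dim=-1)          # unit-norm embedding
```

The augmentations (random time shift, Gaussian noise) are applied to the raw window before patching, and the resulting embeddings feed the contrastive loss above.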
Benchmark Results (And the Surprise)
I tested on 10,000 normal windows and 500 anomalous windows (synthetic bearing faults: inner race spall, outer race crack, rolling element defect). Metrics: precision, recall, F1, and inference latency.
| Model | Precision | Recall | F1 | Latency (ms) |
|---|---|---|---|---|
| CAE | 0.91 | 0.88 | 0.89 | 12 |
| Transformer | 0.94 | 0.92 | 0.93 | 47 |
The transformer won on accuracy, but barely. And at 47ms average it came within 3ms of the 50ms budget, occasionally spiking to 60ms and blowing it entirely. That’s a problem when you’re processing 20 windows per second and can’t drop frames.
Here’s what surprised me: the CAE’s false positives were almost all at startup transients (first 2 seconds after conveyor power-on). I added a simple guard — skip anomaly detection for the first 100 windows after a speed change event detected via a cheap threshold on conveyor current. This dropped false positives by 40% and brought precision to 0.96, beating the transformer.
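The guard itself is trivial. Here’s a sketch of what I mean (names and thresholds are placeholders for whatever your line actually exposes, not my production values):

```python
GUARD_WINDOWS = 100        # windows to skip after a detected speed change
CURRENT_JUMP_AMPS = 2.0    # hypothetical jump in conveyor current that flags a speed change

class TransientGuard:
    def __init__(self):
        self.prev_current = None
        self.windows_since_change = GUARD_WINDOWS  # start in steady state

    def should_score(self, conveyor_current: float) -> bool:
        """Return False while we're still inside the post-transient guard period."""
        if (self.prev_current is not None
                and abs(conveyor_current - self.prev_current) > CURRENT_JUMP_AMPS):
            self.windows_since_change = 0          # speed change: restart the guard
        else:
            self.windows_since_change += 1
        self.prev_current = conveyor_current
        return self.windows_since_change > GUARD_WINDOWS
```

In the scoring loop, I only compare reconstruction error against the threshold when the guard says it’s safe to score.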
The transformer’s advantage was on subtle, long-duration faults (gradual bearing wear over 500+ windows). It caught 3 cases the CAE missed. But for acute defects (sudden cracks), both were equally good.
When Reconstruction Error Lies
Autoencoders have a known failure mode: they can reconstruct anomalies well if the anomaly lies in a low-dimensional subspace the encoder already learned. I saw this once with a lubrication failure — the vibration signature was just a scaled-up version of normal, so the reconstruction error stayed low. The fix: I added a spectral loss term during training to penalize reconstructions that match the input’s FFT magnitude too closely (forces the encoder to ignore pure amplitude scaling):
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{recon}} + \lambda \, \mathcal{L}_{\text{spec}}\!\left(|F(x)|, |F(\hat{x})|\right)$$
where $F$ is the FFT and the $\lambda$ I settled on worked for me. This is a bit hacky, but it caught the lubrication case. I’m not entirely sure why this specific weighting worked — I tuned it empirically on 50 validation samples.
Edge Deployment: Quantization Breaks the Transformer
To hit the latency budget, I quantized both models to INT8 using PyTorch’s torch.quantization.quantize_dynamic. The CAE was fine — F1 dropped from 0.89 to 0.87, latency improved to 6ms. The transformer fell apart: F1 dropped to 0.76. My best guess is that the attention weights are extremely sensitive to quantization noise, especially in the contrastive embedding space where distances matter at sub-percent precision.
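For reference, the call I’m talking about looks like this (a minimal sketch; by default dynamic quantization only swaps `nn.Linear` and `nn.LSTM` modules, which in the transformer covers the attention and feed-forward projections):

```python
import torch
import torch.nn as nn

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model,            # the trained float model
    {nn.Linear},      # module types to replace with dynamically quantized versions
    dtype=torch.qint8,
)
```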
I tried QAT (quantization-aware training), which helped a bit (F1 recovered to 0.82), but it’s still worse than the CAE. If you need to deploy on an edge device (Jetson Nano, Intel NCS2), the autoencoder is the safer bet.
What About False Negatives?
Missing a real fault is worse than a false alarm. I ran a Monte Carlo dropout test (inspired by Gal and Ghahramani, 2016) to estimate uncertainty: during inference, enable dropout and run 10 forward passes, then flag samples where the reconstruction error variance is high. This catches borderline cases where the model is uncertain. It added 5ms per sample but reduced false negatives by 15%.
```python
# Uncertainty estimation with MC Dropout
# Note: this assumes the model contains nn.Dropout layers (the simplified CAE
# above doesn't -- add them between conv blocks if you want this behavior).
model.train()  # keep dropout active at inference time
num_samples = 10
recons = []
for _ in range(num_samples):
    with torch.no_grad():
        recons.append(model(x))
recons = torch.stack(recons)  # (num_samples, batch, 1, 2048)

# Mean prediction error per sample, and spread of that error across the 10 passes
error_mean = ((x - recons.mean(dim=0)) ** 2).mean(dim=(1, 2))
error_std = ((x - recons) ** 2).mean(dim=(2, 3)).std(dim=0)

# Flag if mean error > threshold OR uncertainty > 0.05
anomalies = (error_mean > threshold) | (error_std > 0.05)
```
This worked well on my test set (N=10,500 total samples, RTX 3090), but I haven’t tested it at factory scale with continuous 24/7 data streams.
The Verdict: CAE for Production, Transformer for R&D
If you need to deploy today on edge hardware with <50ms latency, use the convolutional autoencoder. Add the spectral loss term to handle amplitude anomalies, guard against startup transients, and optionally use MC dropout for uncertainty. In my tests that landed anywhere from 0.87 F1 (after INT8 quantization) to roughly 0.92 F1 with the startup-transient guard, which pushed precision to 0.96; your data will shift those numbers.
If you’re prototyping and have cloud GPUs or offline batch processing, the transformer gives you a few extra percentage points in recall and handles long-range dependencies better. But it’s fragile under quantization and slower.
One thing I’m still curious about: online adaptation. Both models drift when the production line changes (new product, different material, faster speed). Retraining from scratch takes hours. I’ve been looking into meta-learning approaches (MAML, if I recall correctly) that adapt with <100 samples, but I haven’t gotten them to work reliably yet. If you’ve solved this, I’d love to hear how.
Next up in Part 5: we’ll build a digital twin of a production line — not just a 3D visualization, but a physics-based simulation that syncs with real sensor data and predicts failures 2-3 shifts ahead.