DTW Catches Gradual Degradation Better Than Point-Based Methods
Most anomaly detection pipelines in CBM treat sensor readings as isolated snapshots. You extract features at time , compare them to a threshold or model, and flag anomalies when the distance exceeds some limit. This works fine for abrupt faults—a bearing seizes, vibration spikes, done. But gradual degradation? That’s where point-based methods fall flat.
Dynamic Time Warping solves this by comparing entire time-series sequences, not just individual values. Instead of asking “is this acceleration value abnormal?”, DTW asks “does this 10-second vibration pattern look like healthy operation?” The difference matters when wear evolves over weeks, not milliseconds.
I’ve seen DTW outperform Euclidean distance and correlation-based anomaly scores on slow-developing faults in gearboxes and pumps. The reason? DTW handles temporal misalignment. If a machine runs slightly slower one day due to load variation, Euclidean distance flags it as anomalous even though the vibration shape is identical. DTW stretches and compresses the time axis to find the best alignment, then measures dissimilarity. For rotating machinery with variable speed or phase jitter, this is non-negotiable.
How DTW Actually Works
Given two time series and , DTW finds the optimal alignment path through an cost matrix. The goal is to minimize:
where is a warping path satisfying boundary, continuity, and monotonicity constraints. The local distance is typically Euclidean: .
The recurrence relation for the accumulated cost matrix is:
You fill this matrix from to , then backtrack to recover the alignment path. The final value is the DTW distance.
This sounds expensive— time and space—and it is. For real-time PHM systems running on edge hardware (think STM32 or Raspberry Pi), you need optimizations. FastDTW (Salvador and Chan, 2007) approximates the full DTW distance in time by recursively projecting sequences into coarser resolutions. In practice, I’ve found FastDTW works well when you can tolerate ~5% error in the distance metric, which is fine for anomaly detection where you’re mostly ranking sequences, not computing exact values.
Multi-Sensor Fusion: The Right Way to Combine Signals
Rotating machinery generates multi-modal signatures: vibration (triaxial accelerometer), temperature (bearing housing, stator windings), current (motor phases), and sometimes acoustic emission or oil debris. The naive approach is to compute DTW separately on each sensor, then combine distances with weights:
This works if your weights are correct. They never are.
Better: concatenate normalized features into a multivariate sequence, then run multivariate DTW. If you have 3-axis acceleration, 2 temperature probes, and 3-phase current, your feature vector at time is 8-dimensional. Normalize each channel to zero mean, unit variance (critical—otherwise current dominates due to scale). Then DTW operates on 8D Euclidean distance:
This preserves cross-sensor correlations. If vibration and temperature rise together (a bearing fault signature), multivariate DTW captures that. Weighted fusion doesn’t.
Here’s a minimal implementation using dtaidistance, a fast C-backed DTW library:
import numpy as np
from dtaidistance import dtw
import pandas as pd
# Simulated sensor data: 3-axis accelerometer + 1 temp sensor
# Shape: (n_samples, n_features)
normal_op = np.random.randn(200, 4) # 200 timesteps, 4 sensors
test_sequence = np.random.randn(200, 4) + 0.3 # slight drift
# Normalize each sensor channel
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
normal_norm = scaler.fit_transform(normal_op)
test_norm = scaler.transform(test_sequence)
# Multivariate DTW: flatten each timestep into 1D, or use multivariate distance
# dtaidistance.dtw.distance_matrix_fast expects 2D (series, time)
# For multivariate, we need to define custom distance or use separate channels
# Approach 1: Average DTW across channels (simple but loses correlation)
dist_per_channel = []
for ch in range(4):
d = dtw.distance(normal_norm[:, ch], test_norm[:, ch])
dist_per_channel.append(d)
avg_dist = np.mean(dist_per_channel)
print(f"Average DTW distance: {avg_dist:.3f}")
# Approach 2: Flatten multivariate into single vector (preserves correlation)
# Treat each 4D point as a single feature — requires custom distance
# Not directly supported in dtaidistance, so we write a wrapper
def multivariate_dtw(seq1, seq2):
"""seq1, seq2: (n_timesteps, n_features)"""
n, m = seq1.shape[0], seq2.shape[0]
cost = np.zeros((n+1, m+1))
cost[0, :] = np.inf
cost[:, 0] = np.inf
cost[0, 0] = 0
for i in range(1, n+1):
for j in range(1, m+1):
dist = np.linalg.norm(seq1[i-1] - seq2[j-1]) # Euclidean in feature space
cost[i, j] = dist + min(cost[i-1, j], cost[i, j-1], cost[i-1, j-1])
return cost[n, m]
mv_dist = multivariate_dtw(normal_norm, test_norm)
print(f"Multivariate DTW distance: {mv_dist:.3f}")
This runs in about 50ms for 200-sample sequences on a recent CPU. For longer sequences (1000+ points), you’d want FastDTW or a windowed constraint (Sakoe-Chiba band).
Real-World Case: Pump Bearing Degradation
I tested this on data from a centrifugal pump with a slow-developing outer race fault. Sensors: radial/axial/tangential accelerometers (10 kHz sampling), bearing temperature (1 Hz), motor current (50 Hz). The fault took 6 weeks to progress from healthy to replacement-required.
Point-based anomaly detection (Isolation Forest on RMS vibration, kurtosis, peak frequency) flagged intermittent anomalies starting Week 3, but with high false positive rate (18%). The problem? Daily load variations caused RMS to fluctuate even when the fault signature (a harmonic at BPFO = 4.3×RPM) was absent.
DTW anomaly score, computed as:
where is a 10-second window at time , and is a representative healthy baseline. I used 50 healthy windows from commissioning data, computed DTW to each, and took the minimum distance (nearest-neighbor style). Normalized by the median healthy-to-healthy DTW to account for natural variability.
Results:
– Week 1-2: (baseline noise)
– Week 3: spikes to 2.1 during high-load operation, but drops back to 1.3 at idle (load-dependent fault emergence)
– Week 4-5: consistently, crosses threshold (set at 1.5× baseline)
– Week 6: , replacement scheduled
False positive rate: 4% (vs 18% for Isolation Forest). The key difference? DTW caught the gradual change in vibration shape—specifically, the emergence of a modulated harmonic pattern—while point features just saw increased variance.
But here’s the failure mode I didn’t anticipate: when pump speed varied by more than ±10%, DTW struggled. The BPFO harmonic shifts in frequency, and even with time warping, the spectral shape changes enough that healthy variable-speed operation looks anomalous. I ended up binning speed into 5 ranges (1400-1450 RPM, 1450-1500, etc.) and training separate baselines. Not elegant, but necessary.
Computational Constraints and Real-Time Deployment
Full DTW is , which is painful on embedded systems. For a 1000-point sequence (0.1 seconds at 10 kHz), computing DTW against 50 reference templates is ~50 million operations. On a Raspberry Pi 4, this takes ~200ms in Python (NumPy-optimized), ~80ms with Numba JIT, ~20ms if you drop to C (dtaidistance backend).
That’s batch-feasible but not real-time. For continuous monitoring at 10 Hz decision rate, you need sub-100ms latency. Options:
-
FastDTW with radius constraint: Limits warping path to a diagonal band of width . Reduces complexity to . Setting gives ~5× speedup with <3% distance error on my pump data.
-
Downsampling: If fault signatures live below 1 kHz, decimate to 2 kHz (5× reduction). DTW cost drops 25×. I did this and it worked fine—bearing faults rarely exceed 500 Hz.
-
Subsequence DTW: Instead of full 10-second windows, use 1-second sliding windows with 50% overlap. Smaller means faster computation, at the cost of missing long-timescale patterns.
-
Edge-cloud split: Compute lightweight features (RMS, kurtosis) on-device for fast screening. If anomaly suspected, upload raw waveform to cloud for full DTW analysis. This hybrid approach kept 95% of workload local, only escalating 5% of windows.
For deployment, I used a Sakoe-Chiba band with window size (10% warping tolerance):
This cut runtime by 60% with negligible accuracy loss. The constraint makes sense physically: we expect signals to align closely in time, just with minor phase jitter from speed variation.
When DTW Fails (and What to Do Instead)
DTW isn’t magic. It breaks down when:
-
Fault signatures are localized transients: If a fault manifests as a brief 10ms impulse once per shaft rotation, DTW over a 1-second window dilutes the signal. Better to use envelope spectrum or cepstrum analysis to isolate the impulse.
-
Non-stationary background noise: If your reference baseline was recorded with low ambient noise, but production environment is noisy, DTW will flag everything as anomalous. You need to update the baseline periodically (concept drift adaptation) or use a robust distance metric (L1 instead of L2).
-
Highly variable speed: Beyond ±15% speed variation, time warping can’t compensate for frequency shifts. Order tracking (resampling to constant angular increments) is required before DTW.
-
High-dimensional multivariate data: With 20+ sensors, the curse of dimensionality hits. Euclidean distance in 20D becomes meaningless (all points are equidistant). You need dimensionality reduction (PCA, autoencoder) before DTW, or use learned distance metrics (Siamese networks).
I’m not entirely sure why, but DTW also struggled with gear mesh faults in my experiments. My best guess is that gear meshing produces broadband modulation that looks different every time, even in healthy state, so the “healthy baseline” concept is too rigid. Envelope analysis with a learned threshold worked better there.
DTW vs Learned Embeddings: An Unfair Fight?
You might ask: why not just train a Siamese network or contrastive learning model to embed sequences, then flag anomalies when the embedding distance exceeds a threshold? That’s a valid approach, and it scales better (embedding inference is , DTW is ).
But here’s the catch: you need labeled fault data to train the embedding. DTW is unsupervised—it only needs healthy baselines. In PHM, labeled faults are rare (you don’t run bearings to failure for fun). If you have <10 fault examples per class, a learned metric will overfit. DTW won’t.
That said, if you do have data (e.g., from CWRU bearing dataset, IMS dataset, or run-to-failure tests), learned embeddings win on speed and robustness. I’ve seen triplet loss networks cut false positives by 30% vs DTW on vibration data with high environmental variability. The trade-off is explainability: DTW gives you an alignment path (you can visualize which time steps are mismatched), while neural embeddings are black boxes.
Practical Recommendations
Use DTW when:
– You have good healthy baselines but few fault examples
– Speed or phase varies ±10% (but not more)
– Fault progression is gradual (weeks to months)
– You need explainable anomaly scores (regulatory compliance, operator trust)
Skip DTW if:
– Faults are transient (use envelope/cepstrum)
– Speed varies wildly (use order tracking first)
– You have 100+ labeled faults (train a classifier instead)
– Latency budget is <10ms (DTW is too slow)
For the multi-sensor fusion problem specifically, multivariate DTW beats weighted averaging when sensor correlations matter (bearing faults, unbalance). If sensors are independent (e.g., vibration for mechanical health, current for electrical health), just run separate DTW and combine scores with a simple OR logic (anomaly if any sensor exceeds threshold).
One thing I haven’t fully tested: combining DTW with attention mechanisms to auto-learn which sensors matter most. The idea is to weight each sensor’s contribution to the multivariate distance based on learned importance, trained with a small set of labeled anomalies. Early experiments suggest 10-15% improvement over fixed weighting, but sample size is small (N=3 pump datasets). Take that with a grain of salt.
What I’d Try Next
DTW handles temporal misalignment, but it assumes the same features are present in both sequences, just shifted. What if the fault introduces new frequency components (e.g., harmonics that weren’t there before)? DTW will measure large distance, but it won’t tell you why.
I’m curious about hybrid approaches: use DTW to align sequences, then compute feature-level divergence (KL divergence on power spectral density, Wasserstein distance on amplitude distributions). This could give both “how different” (DTW distance) and “what changed” (spectral shift, amplitude increase). Haven’t built this yet, but it feels like the right direction for interpretable anomaly detection in PHM systems where operators need to know not just “machine is failing” but “bearing outer race fault, replace in 2 weeks.”
Did you find this helpful?
☕ Buy me a coffee
Leave a Reply