Why vibration data is harder than you think
Most CBM tutorials start with clean CSV files and skip the part where you actually collect data from a real machine. That’s like teaching someone to cook by handing them a plated meal.
The truth is, sensor data collection is where half your problems live. Sampling rates are never quite right, timestamps drift, sensors fail mid-run, and nobody tells you that a loose cable looks identical to bearing failure in your downstream model. I’ve seen production systems flag maintenance alerts because someone bumped an accelerometer during a shift change.
This part covers the messy reality of building a CBM pipeline from sensor to preprocessed feature-ready dataset. We’re simulating a motor test bench with vibration and temperature sensors, but the patterns here apply whether you’re monitoring pumps, turbines, or conveyor belts.

The obvious approach: log everything at maximum resolution
When you first instrument a machine, the instinct is to crank up the sampling rate. Vibration analysis textbooks say you need at least 2× your highest frequency of interest (Nyquist), so if you’re looking for bearing defects around 1-5 kHz, you’d want 10 kHz sampling minimum. Temperature changes slower, so maybe 1 Hz is fine.
Here’s what a naive data logger looks like:
```python
import numpy as np
import pandas as pd
from datetime import datetime, timedelta


class NaiveDataLogger:
    def __init__(self, vibration_hz=10000, temp_hz=1):
        self.vib_rate = vibration_hz
        self.temp_rate = temp_hz
        self.buffer = []

    def collect_run(self, duration_sec=60):
        """Collect one run of sensor data."""
        start = datetime.now()
        vib_interval = 1.0 / self.vib_rate
        # Simulate sensor readings
        for i in range(int(duration_sec * self.vib_rate)):
            timestamp = start + timedelta(seconds=i * vib_interval)
            vib_value = self._read_vibration()  # hypothetical sensor read
            self.buffer.append({
                'timestamp': timestamp,
                'vibration': vib_value,
                'temperature': np.nan  # sparse, filled separately at temp_rate
            })
        # This creates a 60-second run at 10 kHz = 600,000 rows
        return pd.DataFrame(self.buffer)

    def _read_vibration(self):
        # Simulated accelerometer (m/s²)
        baseline = np.random.normal(0, 0.5)  # healthy vibration
        return baseline
```
Run this for one minute and you’ve got 600,000 vibration samples. A full 8-hour shift generates 288 million rows. At 16 bytes per timestamp + 8 bytes per float, that’s ~7 GB per day per machine, before any redundancy or metadata.
And here’s the kicker: 95% of that data is useless. Steady-state operation with nothing interesting happening. But you don’t know which 5% matters until after you’ve collected it.
What breaks first: storage, or your sanity
The storage math gets ugly fast. If you’re monitoring 10 machines across a plant, you’re looking at 70 GB/day raw vibration data. Edge devices (Raspberry Pi, industrial PLCs) typically have 32-64 GB storage. You hit capacity in under a day, and shipping terabytes to the cloud costs real money on cellular or satellite links.
But storage isn’t even the worst part. The worst part is timestamp drift.
Most cheap DAQ systems use system clock timestamps, which can drift 10-50 ms per hour depending on temperature and clock quality. Over an 8-hour shift, your timestamps might be off by several seconds. When you’re trying to correlate a temperature spike with a vibration transient, that error destroys causality.
I’ve debugged systems where the “failure signature” disappeared entirely because preprocessing aligned data by wall-clock time instead of machine state. The temperature sensor and vibration sensor were running on separate clocks, drifting opposite directions, and by the time we resampled to a common timebase, the 200 ms temperature spike that preceded the vibration fault was 3 seconds off and got averaged out.
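When you only discover drift after the fact, the crude salvage is to model it as linear between two known sync points and subtract it out. Here's a sketch, assuming you logged your local clock's offset against a reference (NTP, GPS) at the start and end of the run — real drift isn't perfectly linear, so this shrinks the error rather than eliminating it:

```python
import numpy as np
import pandas as pd

def correct_linear_drift(timestamps, offset_start_ms, offset_end_ms):
    """Subtract a linearly interpolated clock offset from a run's timestamps.

    timestamps: pandas DatetimeIndex for one run
    offset_*_ms: measured local-clock error at run start/end, in milliseconds
    """
    frac = np.linspace(0.0, 1.0, len(timestamps))  # position within the run
    offset_ms = offset_start_ms + frac * (offset_end_ms - offset_start_ms)
    return timestamps - pd.to_timedelta(offset_ms, unit='ms')
```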
The fix: trigger-based collection with synchronized clocks
Instead of logging continuously, log strategically. Most machines spend most of their time in steady state — you don’t need 10 kHz data to know a motor is running normally. What you need is:
- Baseline snapshots (low rate, 10 Hz): continuous monitoring to detect state changes
- High-resolution bursts (10 kHz): triggered when baseline crosses a threshold
- Hardware-synchronized timestamps: GPS, PTP, or at minimum NTP with millisecond accuracy
Here’s a revised logger:
```python
import numpy as np
import pandas as pd
from collections import deque
from datetime import datetime


class TriggeredLogger:
    def __init__(self, baseline_hz=10, burst_hz=10000,
                 burst_duration=2.0, trigger_threshold=2.0):
        self.baseline_rate = baseline_hz
        self.burst_rate = burst_hz
        self.burst_len = int(burst_duration * burst_hz)
        self.threshold = trigger_threshold  # m/s² RMS
        # Ring buffer for pre-trigger context
        self.pretrigger = deque(maxlen=int(baseline_hz * 5))  # 5 sec history

    def monitor(self, duration_sec=3600):
        """Run monitoring loop for specified duration."""
        bursts = []
        baseline_log = []
        start = datetime.now()
        interval = 1.0 / self.baseline_rate
        for i in range(int(duration_sec * self.baseline_rate)):
            timestamp = start + pd.Timedelta(seconds=i * interval)
            # Read baseline (downsampled or RMS windowed)
            vib_rms = self._read_baseline_vibration()
            temp = self._read_temperature()
            baseline_log.append({
                'timestamp': timestamp,
                'vib_rms': vib_rms,
                'temperature': temp
            })
            self.pretrigger.append((timestamp, vib_rms, temp))
            # Trigger detection
            if vib_rms > self.threshold:
                print(f"Trigger at {timestamp}: {vib_rms:.2f} m/s² RMS")
                burst = self._capture_burst(timestamp)
                bursts.append(burst)
        return pd.DataFrame(baseline_log), bursts

    def _capture_burst(self, trigger_time):
        """High-resolution capture around trigger event."""
        # In a real system: switch the DAQ to high-rate mode.
        # Include 0.5 sec of pre-trigger context from the ring buffer.
        # These are baseline-rate RMS samples, flagged so downstream
        # code doesn't confuse them with raw high-rate samples.
        burst_data = [
            {'timestamp': ts, 'vibration': rms, 'pretrigger': True}
            for ts, rms, _ in list(self.pretrigger)[-int(self.baseline_rate * 0.5):]
        ]
        # Capture 2 sec at burst rate
        for i in range(self.burst_len):
            t = trigger_time + pd.Timedelta(seconds=i / self.burst_rate)
            vib_raw = self._read_vibration_raw()
            burst_data.append({'timestamp': t, 'vibration': vib_raw,
                               'pretrigger': False})
        return pd.DataFrame(burst_data)

    def _read_baseline_vibration(self):
        # RMS over 0.1 sec window (simulated)
        samples = np.random.normal(0, 0.5, size=1000)  # healthy baseline
        # Occasionally inject a transient
        if np.random.random() < 0.001:  # 0.1% chance
            samples += np.random.normal(5, 2, size=1000)  # fault spike
        return np.sqrt(np.mean(samples ** 2))

    def _read_vibration_raw(self):
        return np.random.normal(0, 0.5)

    def _read_temperature(self):
        # Bearing temperature, °C
        return 45 + np.random.normal(0, 2)
```
This approach drops storage from 7 GB/day to maybe 50 MB/day (depends on trigger frequency). A typical healthy machine might trigger 10-20 times per shift on startup transients or load changes. A degrading bearing might trigger 100+ times. That’s the signal you want.
Preprocessing: resampling, alignment, and the curse of NaNs
Once you’ve got data, you need to align multi-rate sensors onto a common timeline. Temperature at 1 Hz, vibration RMS at 10 Hz, high-res bursts at 10 kHz — they all need to merge for feature engineering.
The naive approach is `df.resample('100ms').mean()`, which works until you hit edge cases:
- Forward-fill vs interpolation: Temperature changes slowly, so forward-fill makes sense. Vibration peaks get destroyed by averaging. You need context-aware resampling.
- Burst alignment: High-res bursts aren’t on regular intervals. You can’t just resample them — you need to extract features (RMS, peak, kurtosis) first, then align those features.
- Missing data: Sensors fail. Cables get unplugged. Your pipeline needs to handle gaps without silently filling them with garbage.
Here’s a preprocessor that handles these cases:
```python
import numpy as np
import pandas as pd


class CBMPreprocessor:
    def __init__(self, target_rate='100ms'):
        self.target_rate = target_rate

    def align_sensors(self, baseline_df, bursts):
        """Merge baseline and burst-derived features onto common timeline."""
        baseline_df = baseline_df.set_index('timestamp')
        # Resample: average vibration RMS over each window, forward-fill the
        # slow-moving temperature. ('ffill' isn't a valid .agg() function,
        # so resample each column separately.)
        resampled = pd.concat([
            baseline_df['vib_rms'].resample(self.target_rate).mean(),
            baseline_df['temperature'].resample(self.target_rate).ffill()
        ], axis=1)
        # Extract features from bursts and merge
        burst_features = self._extract_burst_features(bursts)
        feature_cols = ['burst_peak', 'burst_rms', 'burst_kurtosis',
                        'bearing_band_energy']
        if not burst_features.empty:
            # Merge on nearest timestamp (within tolerance);
            # merge_asof requires both frames sorted by the key
            merged = pd.merge_asof(
                resampled.reset_index(),
                burst_features.sort_values('timestamp'),
                on='timestamp',
                direction='nearest',
                tolerance=pd.Timedelta('1s')
            )
        else:
            merged = resampled.reset_index()
            for col in feature_cols:
                merged[col] = np.nan
        # Quality flags: mark forward-filled temperature regions
        merged['temp_interpolated'] = merged['temperature'].isna()
        merged['temperature'] = merged['temperature'].ffill()
        # Drop rows with critical missing data
        merged = merged.dropna(subset=['vib_rms'])
        return merged

    def _extract_burst_features(self, bursts):
        """Compute statistical features from high-res bursts."""
        if not bursts:
            return pd.DataFrame()
        features = []
        for burst_df in bursts:
            if burst_df.empty:
                continue
            # Drop low-rate pre-trigger context rows: the FFT below
            # assumes uniform 10 kHz samples
            if 'pretrigger' in burst_df.columns:
                burst_df = burst_df[~burst_df['pretrigger']]
            timestamp = burst_df['timestamp'].iloc[len(burst_df) // 2]  # mid-burst
            vib = burst_df['vibration'].values
            # Time-domain features
            peak = np.max(np.abs(vib))
            rms = np.sqrt(np.mean(vib ** 2))
            kurtosis = self._kurtosis(vib)
            # Frequency-domain features (FFT for bearing fault detection)
            fft_mag = np.abs(np.fft.rfft(vib))
            freq = np.fft.rfftfreq(len(vib), d=1 / 10000)
            # Bearing defect bands (example: 100-500 Hz for outer race)
            # This is domain-specific; adjust for your machinery
            bearing_band = fft_mag[(freq > 100) & (freq < 500)]
            bearing_energy = np.sum(bearing_band ** 2)
            features.append({
                'timestamp': timestamp,
                'burst_peak': peak,
                'burst_rms': rms,
                'burst_kurtosis': kurtosis,
                'bearing_band_energy': bearing_energy
            })
        return pd.DataFrame(features)

    def _kurtosis(self, x):
        """Kurtosis (fourth moment): sensitive to transients."""
        # K = E[(X - μ)^4] / σ^4  (non-excess form: Gaussian ≈ 3)
        mean = np.mean(x)
        std = np.std(x)
        if std == 0:  # shouldn't happen, but guard anyway
            return 0.0
        return np.mean(((x - mean) / std) ** 4)
```
Kurtosis is gold for fault detection. Healthy vibration is roughly Gaussian (kurtosis ≈ 3). Bearing spalls create impulses that spike kurtosis to 10+. It’s one of the first features to react to a developing fault, well before RMS catches up.
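To make that concrete, here's a quick standalone check (not part of the pipeline above) comparing kurtosis on a simulated healthy signal against the same signal with periodic impacts injected:

```python
import numpy as np

rng = np.random.default_rng(42)

def kurtosis(x):
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4)

healthy = rng.normal(0, 0.5, 20_000)   # Gaussian baseline vibration
faulty = healthy.copy()
faulty[::1000] += 8.0                  # impulse every 1000 samples, like spall strikes

print(f"healthy: {kurtosis(healthy):.1f}")  # ≈ 3 (Gaussian)
print(f"faulty:  {kurtosis(faulty):.1f}")   # far above 3; impulses dominate the 4th moment
```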
Bearing defect energy is even more targeted. Rolling element bearings produce harmonics at specific frequencies based on geometry and RPM. For a typical motor at 1800 RPM with a bearing that has a Ball Pass Frequency Outer race (BPFO) defect, you’d see peaks around:

BPFO = (N_b / 2) · f_s · (1 − (d/D) · cos φ)

where N_b is the number of balls, f_s is the shaft frequency (30 Hz for 1800 RPM), d/D is the ball-to-pitch diameter ratio, and φ is the contact angle. For a typical bearing, this lands around 100-200 Hz. Monitoring energy in that band beats looking at broadband RMS.
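As a sanity check on the formula — the geometry here (9 balls, d/D ≈ 0.2, zero contact angle) is made up for illustration; pull the real numbers from your bearing's datasheet:

```python
import numpy as np

def bpfo_hz(n_balls, shaft_hz, d_over_D, contact_angle_deg=0.0):
    """Ball Pass Frequency, Outer race, in Hz."""
    return (n_balls / 2.0) * shaft_hz * (1.0 - d_over_D * np.cos(np.radians(contact_angle_deg)))

print(bpfo_hz(n_balls=9, shaft_hz=30.0, d_over_D=0.2))  # 108.0 Hz, inside the 100-200 Hz band
```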
Data quality: the silent killer
You can have perfect algorithms and still fail if your data quality is bad. Here are the gotchas I’ve hit:
Sensor mounting: An accelerometer that’s slightly loose will read lower than reality and add noise. In one case, a magnetic mount lost 20% grip due to bearing heat, and the “degradation signature” was actually the sensor sliding around. We only caught it by cross-checking with a second sensor.
Electrical noise: Motors generate EMI. Unshielded cables pick it up. You’ll see 60 Hz (or 50 Hz in Europe) spikes in your FFT that have nothing to do with mechanical condition. Solution: twisted-pair shielded cables, grounded properly, and notch filters in preprocessing. Or just ignore the 60 Hz bin in your features.
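If you go the notch-filter route, here's a minimal sketch using SciPy, assuming 10 kHz burst data and 60 Hz mains (swap in 50 Hz for Europe):

```python
from scipy.signal import iirnotch, filtfilt

FS = 10_000      # burst sampling rate, Hz
MAINS_HZ = 60    # 50 in Europe

# Narrow IIR notch at the mains frequency; higher Q = narrower notch
b, a = iirnotch(w0=MAINS_HZ, Q=30, fs=FS)

def remove_mains(vib):
    # filtfilt applies the filter forward and backward: zero phase shift,
    # so transients aren't displaced in time
    return filtfilt(b, a, vib)
```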
Temperature lag: Thermocouples have thermal mass. A bearing temperature spike takes 5-10 seconds to show up in your sensor reading. If you’re correlating with vibration events, account for that lag or you’ll miss the signature. I’ve seen people try to predict temperature from vibration and get terrible results because they didn’t shift the temperature signal back in time.
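One way to handle the lag is to estimate it empirically: slide the temperature series against vibration RMS and pick the shift with the highest correlation. A sketch, assuming both series are already on the same sample grid — `estimate_temp_lag` is a hypothetical helper, not part of the pipeline above:

```python
import numpy as np

def estimate_temp_lag(vib_rms, temp, max_lag=200):
    """Estimate how many samples temperature lags vibration (hypothetical helper)."""
    v = (vib_rms - vib_rms.mean()) / vib_rms.std()
    t = (temp - temp.mean()) / temp.std()
    scores = []
    for lag in range(max_lag + 1):
        # correlate vibration now against temperature `lag` samples later
        a = v[:len(v) - lag] if lag else v
        scores.append(np.corrcoef(a, t[lag:])[0, 1])
    return int(np.argmax(scores))

# Then shift temperature back before correlating or engineering features:
# lag = estimate_temp_lag(df['vib_rms'].values, df['temperature'].values)
# df['temperature'] = df['temperature'].shift(-lag)
```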
Sampling jitter: Most DAQ systems aren’t hard real-time. Your “10 kHz” sampling might actually be 9.8-10.2 kHz depending on CPU load. Over a 2-second burst, that’s up to 400 samples of accumulated offset. For FFT-based features, that smears frequency bins. If you’re doing precise fault frequency detection, you need either hardware-timed sampling or post-processing to correct jitter (resample to exact intervals via interpolation, as sketched below).
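The interpolation fix is a few lines, assuming your DAQ gives you actual per-sample timestamps and they're monotonically increasing:

```python
import numpy as np

def resample_uniform(t_sec, values, target_hz=10_000):
    """Re-grid jittered samples onto exact intervals via linear interpolation.

    t_sec: actual sample times in seconds, monotonically increasing.
    """
    t_uniform = np.arange(t_sec[0], t_sec[-1], 1.0 / target_hz)
    return t_uniform, np.interp(t_uniform, t_sec, values)
```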
What I wish I’d known earlier
Start with less data, higher quality. One week of clean, well-labeled, synchronized data from a machine with known faults is worth a year of noisy unstructured logs.
Log your preprocessing pipeline version alongside your data. When you change your resampling method or add a new feature six months in, you need to know which models were trained on which preprocessing version. I’ve seen teams retrain models on subtly different data and wonder why performance dropped.
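There's no standard for this; the lightweight convention I'd use is a JSON sidecar written next to every exported dataset. The version string format and helper below are my own, not an established tool:

```python
import json
from datetime import datetime, timezone

PIPELINE_VERSION = "preproc-v2"  # bump whenever resampling or features change

def write_sidecar(data_path, preprocessor):
    """Write a metadata sidecar next to an exported dataset."""
    meta = {
        'pipeline_version': PIPELINE_VERSION,
        'target_rate': preprocessor.target_rate,
        'exported_at': datetime.now(timezone.utc).isoformat(),
    }
    with open(f"{data_path}.meta.json", 'w') as f:
        json.dump(meta, f, indent=2)
```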
And test your sensor failures explicitly. Unplug a cable mid-run and make sure your pipeline doesn’t silently forward-fill 10 minutes of NaNs and call it healthy. Inject synthetic dropouts and verify your quality flags trigger.
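Here's a sketch of the kind of test I mean, reusing the `CBMPreprocessor` from above — the 10-minute NaN gap simulates an unplugged thermocouple:

```python
import numpy as np
import pandas as pd

def test_dropout_is_flagged(preprocessor, baseline_df):
    """Inject a synthetic 10-minute thermocouple dropout; the quality flag must fire."""
    broken = baseline_df.copy()
    mid = broken['timestamp'].iloc[len(broken) // 2]
    gap = (broken['timestamp'] > mid) & \
          (broken['timestamp'] <= mid + pd.Timedelta(minutes=10))
    broken.loc[gap, 'temperature'] = np.nan
    merged = preprocessor.align_sensors(broken, bursts=[])
    # The gap must surface as a quality flag, not be silently forward-filled
    assert merged['temp_interpolated'].sum() > 0
```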
I’m still not entirely sure what the right balance is between edge processing (compute features on the device, send only summaries) versus cloud processing (send raw bursts, centralize feature engineering). Edge saves bandwidth and lets you react faster, but cloud gives you flexibility to retroactively reprocess data when you discover a better feature. For this project, I’m doing hybrid: baseline + burst features on edge, raw bursts uploaded for model development.
Next: turning signals into health indicators
You’ve got clean, aligned, quality-flagged data. Next part covers feature engineering: extracting degradation-sensitive features from vibration and temperature, labeling data with Remaining Useful Life (RUL) targets, and handling the class imbalance problem (healthy data vastly outnumbers fault data). We’ll also get into domain-specific features like envelope analysis and cepstrum-based gear mesh detection — techniques that make the difference between a model that works in a lab and one that works in production.