Building a Condition-Based Maintenance System from Scratch: Sensor Data Collection and Preprocessing

Updated Feb 6, 2026

Why vibration data is harder than you think

Most CBM tutorials start with clean CSV files and skip the part where you actually collect data from a real machine. That’s like teaching someone to cook by handing them a plated meal.

The truth is, sensor data collection is where half your problems live. Sampling rates are never quite right, timestamps drift, sensors fail mid-run, and nobody tells you that a loose cable looks identical to bearing failure in your downstream model. I’ve seen production systems flag maintenance alerts because someone bumped an accelerometer during a shift change.

This part covers the messy reality of building a CBM pipeline from sensor to preprocessed feature-ready dataset. We’re simulating a motor test bench with vibration and temperature sensors, but the patterns here apply whether you’re monitoring pumps, turbines, or conveyor belts.


The obvious approach: log everything at maximum resolution

When you first instrument a machine, the instinct is to crank up the sampling rate. Vibration analysis textbooks say you need at least 2× your highest frequency of interest (Nyquist), so if you’re looking for bearing defects around 1-5 kHz, you’d want 10 kHz sampling minimum. Temperature changes slower, so maybe 1 Hz is fine.

Here’s what a naive data logger looks like:

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

class NaiveDataLogger:
    def __init__(self, vibration_hz=10000, temp_hz=1):
        self.vib_rate = vibration_hz
        self.temp_rate = temp_hz

    def collect_run(self, duration_sec=60):
        """Collect one run of sensor data."""
        buffer = []  # local, so repeated runs don't accumulate rows
        start = datetime.now()
        vib_interval = 1.0 / self.vib_rate

        # Simulate sensor readings
        for i in range(int(duration_sec * self.vib_rate)):
            timestamp = start + timedelta(seconds=i * vib_interval)
            vib_value = self._read_vibration()  # hypothetical sensor read
            buffer.append({
                'timestamp': timestamp,
                'vibration': vib_value,
                'temperature': np.nan  # sparse, filled separately
            })

        # One 60-second run at 10 kHz = 600,000 rows
        return pd.DataFrame(buffer)

    def _read_vibration(self):
        # Simulated accelerometer (m/s²)
        return np.random.normal(0, 0.5)  # healthy vibration

Run this for one minute and you’ve got 600,000 vibration samples. A full 8-hour shift generates 288 million rows. At 16 bytes per timestamp + 8 bytes per float, that’s ~7 GB per day per machine, before any redundancy or metadata.
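The back-of-envelope version, if you want to rerun it with your own rates:

rows_per_shift = 10_000 * 60 * 60 * 8        # 10 kHz for 8 hours = 288,000,000
bytes_per_row = 16 + 8                       # timestamp + float64 sample
print(rows_per_shift * bytes_per_row / 1e9)  # ≈ 6.9 GB per machine per shift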

And here’s the kicker: 95% of that data is useless. Steady-state operation with nothing interesting happening. But you don’t know which 5% matters until after you’ve collected it.

What breaks first: storage, or your sanity

The storage math gets ugly fast. If you’re monitoring 10 machines across a plant, you’re looking at 70 GB/day of raw vibration data. Edge devices (Raspberry Pi, industrial PLCs) typically have 32-64 GB of storage, so a gateway aggregating even a few machines hits capacity within a day or two, and shipping terabytes to the cloud costs real money on cellular or satellite links.

But storage isn’t even the worst part. The worst part is timestamp drift.

Most cheap DAQ systems use system clock timestamps, which can drift 10-50 ms per hour depending on temperature and clock quality. Over an 8-hour shift, your timestamps might be off by several seconds. When you’re trying to correlate a temperature spike with a vibration transient, that error destroys causality.

I’ve debugged systems where the “failure signature” disappeared entirely because preprocessing aligned data by wall-clock time instead of machine state. The temperature sensor and vibration sensor were running on separate clocks, drifting in opposite directions, and by the time we resampled to a common timebase, the 200 ms temperature spike that preceded the vibration fault was 3 seconds off and got averaged out.
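You can partially rescue drifting timestamps in post, if you logged the offset between each sensor clock and a reference at the start and end of the run. A minimal sketch (the function and the linear-drift assumption are mine, not a standard API):

import numpy as np
import pandas as pd

def correct_linear_drift(timestamps, offset_start_s, offset_end_s):
    """Remove linear clock drift. Offsets are sensor clock minus
    reference clock, in seconds, at the first and last sample."""
    t = pd.DatetimeIndex(timestamps).astype('int64').to_numpy()  # ns since epoch
    frac = (t - t[0]) / max(t[-1] - t[0], 1)                     # 0..1 through the run
    offset_ns = (offset_start_s + frac * (offset_end_s - offset_start_s)) * 1e9
    return pd.to_datetime(t - offset_ns.astype('int64'))

# e.g. sensor clock matched at power-on, ran 2.8 s fast by end of shift:
# df['timestamp'] = correct_linear_drift(df['timestamp'], 0.0, 2.8)

Linear correction won’t capture temperature-dependent drift, but it knocks seconds of error down to tens of milliseconds.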

The fix: trigger-based collection with synchronized clocks

Instead of logging continuously, log strategically. Most machines spend most of their time in steady state — you don’t need 10 kHz data to know a motor is running normally. What you need is:

  1. Baseline snapshots (low rate, 10 Hz): continuous monitoring to detect state changes
  2. High-resolution bursts (10 kHz): triggered when baseline crosses a threshold
  3. Hardware-synchronized timestamps: GPS, PTP, or at minimum NTP with millisecond accuracy

Here’s a revised logger:

import numpy as np
import pandas as pd
from collections import deque
from datetime import datetime

class TriggeredLogger:
    def __init__(self, baseline_hz=10, burst_hz=10000, 
                 burst_duration=2.0, trigger_threshold=2.0):
        self.baseline_rate = baseline_hz
        self.burst_rate = burst_hz
        self.burst_len = int(burst_duration * burst_hz)
        self.threshold = trigger_threshold  # m/s² RMS

        # Ring buffer for pre-trigger context
        self.pretrigger = deque(maxlen=int(baseline_hz * 5))  # 5 sec history

    def monitor(self, duration_sec=3600):
        """Run monitoring loop for specified duration."""
        bursts = []
        baseline_log = []

        start = datetime.now()
        interval = 1.0 / self.baseline_rate

        for i in range(int(duration_sec * self.baseline_rate)):
            timestamp = start + pd.Timedelta(seconds=i * interval)

            # Read baseline (downsampled or RMS windowed)
            vib_rms = self._read_baseline_vibration()
            temp = self._read_temperature()

            baseline_log.append({
                'timestamp': timestamp,
                'vib_rms': vib_rms,
                'temperature': temp
            })
            self.pretrigger.append((timestamp, vib_rms, temp))

            # Trigger detection
            if vib_rms > self.threshold:
                print(f"Trigger at {timestamp}: {vib_rms:.2f} m/s² RMS")
                burst = self._capture_burst(timestamp)
                bursts.append(burst)

        return pd.DataFrame(baseline_log), bursts

    def _capture_burst(self, trigger_time):
        """High-resolution capture around trigger event."""
        # In a real system: switch the DAQ to high-rate mode here

        # Include 0.5 sec of pre-trigger context from the ring buffer,
        # flagged so feature extraction can tell it apart from raw samples
        pretrig = list(self.pretrigger)[-int(self.baseline_rate * 0.5):]
        burst_data = [
            {'timestamp': ts, 'vibration': rms, 'pretrigger': True}
            for ts, rms, _temp in pretrig
        ]

        # Capture 2 sec at burst rate
        for i in range(self.burst_len):
            t = trigger_time + pd.Timedelta(seconds=i / self.burst_rate)
            vib_raw = self._read_vibration_raw()
            burst_data.append({'timestamp': t, 'vibration': vib_raw,
                               'pretrigger': False})

        return pd.DataFrame(burst_data)

    def _read_baseline_vibration(self):
        # RMS over 0.1 sec window (simulated)
        samples = np.random.normal(0, 0.5, size=1000)  # healthy baseline
        # Occasionally inject transient
        if np.random.random() < 0.001:  # 0.1% chance
            samples += np.random.normal(5, 2, size=1000)  # fault spike
        return np.sqrt(np.mean(samples**2))

    def _read_vibration_raw(self):
        return np.random.normal(0, 0.5)

    def _read_temperature(self):
        # Bearing temperature, °C
        return 45 + np.random.normal(0, 2)

This approach drops storage from 7 GB/day to maybe 50 MB/day (depends on trigger frequency). A typical healthy machine might trigger 10-20 times per shift on startup transients or load changes. A degrading bearing might trigger 100+ times. That’s the signal you want.

Preprocessing: resampling, alignment, and the curse of NaNs

Once you’ve got data, you need to align multi-rate sensors onto a common timeline. Temperature at 1 Hz, vibration RMS at 10 Hz, high-res bursts at 10 kHz — they all need to merge for feature engineering.

The naive approach is df.resample('100ms').mean(), which works until you hit edge cases:

  • Forward-fill vs interpolation: Temperature changes slowly, so forward-fill makes sense. Vibration peaks get destroyed by averaging. You need context-aware resampling.
  • Burst alignment: High-res bursts aren’t on regular intervals. You can’t just resample them — you need to extract features (RMS, peak, kurtosis) first, then align those features.
  • Missing data: Sensors fail. Cables get unplugged. Your pipeline needs to handle gaps without silently filling them with garbage.

Here’s a preprocessor that handles these cases:

class CBMPreprocessor:
    def __init__(self, target_rate='100ms'):
        self.target_rate = target_rate

    def align_sensors(self, baseline_df, bursts):
        """Merge baseline and burst-derived features onto common timeline."""

        # Resample baseline. Note: 'ffill' is not a valid resample
        # aggregation -- take the last reading per window here, and
        # forward-fill AFTER the merge so gaps can be flagged first.
        baseline_df = baseline_df.set_index('timestamp')
        resampled = baseline_df.resample(self.target_rate).agg({
            'vib_rms': 'mean',      # average RMS over window
            'temperature': 'last'   # last reading in each window
        })

        # Extract features from bursts and merge
        burst_features = self._extract_burst_features(bursts)
        burst_cols = ['burst_peak', 'burst_rms', 'burst_kurtosis',
                      'bearing_band_energy']
        if not burst_features.empty:
            # Merge on nearest timestamp (within tolerance);
            # merge_asof requires both sides sorted by the key
            merged = pd.merge_asof(
                resampled.reset_index(),
                burst_features.sort_values('timestamp'),
                on='timestamp',
                direction='nearest',
                tolerance=pd.Timedelta('1s')
            )
        else:
            merged = resampled.reset_index()
            for col in burst_cols:
                merged[col] = np.nan

        # Quality flags: mark gaps before filling them
        merged['temp_interpolated'] = merged['temperature'].isna()
        merged['temperature'] = merged['temperature'].ffill()

        # Drop rows with critical missing data
        merged = merged.dropna(subset=['vib_rms'])

        return merged

    def _extract_burst_features(self, bursts):
        """Compute statistical features from high-res bursts."""
        if not bursts:
            return pd.DataFrame()

        features = []
        for burst_df in bursts:
            if burst_df.empty:
                continue
            # Drop flagged pre-trigger context (baseline RMS, not raw samples)
            if 'pretrigger' in burst_df.columns:
                burst_df = burst_df[~burst_df['pretrigger']]
                if burst_df.empty:
                    continue
            timestamp = burst_df['timestamp'].iloc[len(burst_df) // 2]  # mid-burst
            vib = burst_df['vibration'].values

            # Time-domain features
            peak = np.max(np.abs(vib))
            rms = np.sqrt(np.mean(vib**2))
            kurtosis = self._kurtosis(vib)

            # Frequency-domain features (FFT for bearing fault detection)
            fft_mag = np.abs(np.fft.rfft(vib))
            freq = np.fft.rfftfreq(len(vib), d=1/10000)

            # Bearing defect bands (example: 100-500 Hz for outer race)
            # This is domain-specific; adjust for your machinery
            bearing_band = fft_mag[(freq > 100) & (freq < 500)]
            bearing_energy = np.sum(bearing_band**2)

            features.append({
                'timestamp': timestamp,
                'burst_peak': peak,
                'burst_rms': rms,
                'burst_kurtosis': kurtosis,
                'bearing_band_energy': bearing_energy
            })

        return pd.DataFrame(features)

    def _kurtosis(self, x):
        """Kurtosis (fourth moment): sensitive to transients."""
        # K = E[(X - μ)^4] / σ^4
        mean = np.mean(x)
        std = np.std(x)
        if std == 0:  # shouldn't happen, but guard anyway
            return 0
        return np.mean(((x - mean) / std) ** 4)

Kurtosis, $K = \frac{E[(X - \mu)^4]}{\sigma^4}$, is gold for fault detection. Healthy vibration is roughly Gaussian ($K \approx 3$). Bearing spalls create impulses that spike kurtosis to 10+. It’s often the first feature to move, well before RMS catches up.

Bearing defect energy is even more targeted. Rolling element bearings produce harmonics at specific frequencies based on geometry and RPM. For a typical motor at 1800 RPM with a bearing that has a Ball Pass Frequency Outer race (BPFO) defect, you’d see peaks around:

$$f_{\text{BPFO}} = \frac{N_b}{2}\, f_r \left(1 - \frac{d}{D} \cos\alpha\right)$$

where $N_b$ is the number of balls, $f_r$ is the shaft frequency (30 Hz at 1800 RPM), $d/D$ is the ball-to-pitch-diameter ratio, and $\alpha$ is the contact angle. For a typical bearing, this lands around 100-200 Hz. Monitoring energy in that band beats looking at broadband RMS.
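To make that concrete, here’s a quick calculator using made-up but plausible geometry (roughly a 6205-size bearing — check your bearing’s datasheet for the real numbers):

import numpy as np

def bpfo_hz(n_balls, shaft_hz, ball_d_mm, pitch_d_mm, contact_angle_deg=0.0):
    """Ball Pass Frequency Outer race (Hz) from bearing geometry."""
    return (n_balls / 2.0) * shaft_hz * (
        1 - (ball_d_mm / pitch_d_mm) * np.cos(np.radians(contact_angle_deg)))

# 9 balls, 7.94 mm ball diameter, 39 mm pitch diameter, 0° contact angle
print(bpfo_hz(9, 30.0, 7.94, 39.0))  # ~107.5 Hz at 1800 RPM

That number is what you’d build the bearing_band limits around, instead of the hardcoded 100-500 Hz window in the preprocessor above.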

Data quality: the silent killer

You can have perfect algorithms and still fail if your data quality is bad. Here are the gotchas I’ve hit:

Sensor mounting: An accelerometer that’s slightly loose will read lower than reality and add noise. In one case, a magnetic mount lost 20% grip due to bearing heat, and the “degradation signature” was actually the sensor sliding around. We only caught it by cross-checking with a second sensor.

Electrical noise: Motors generate EMI. Unshielded cables pick it up. You’ll see 60 Hz (or 50 Hz in Europe) spikes in your FFT that have nothing to do with mechanical condition. Solution: twisted-pair shielded cables, grounded properly, and notch filters in preprocessing. Or just ignore the 60 Hz bin in your features.
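If you go the notch-filter route, a minimal sketch with scipy (the wrapper name is mine):

from scipy.signal import iirnotch, filtfilt

def remove_mains_hum(vib, fs=10_000, mains_hz=60.0, q=30.0):
    """Notch out mains-frequency EMI before feature extraction.
    Higher q means a narrower notch (less collateral damage)."""
    b, a = iirnotch(mains_hz, q, fs=fs)
    return filtfilt(b, a, vib)  # zero-phase, so transients don't shift in time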

Temperature lag: Thermocouples have thermal mass. A bearing temperature spike takes 5-10 seconds to show up in your sensor reading. If you’re correlating with vibration events, account for that lag or you’ll miss the signature. I’ve seen people try to predict temperature from vibration and get terrible results because they didn’t shift the temperature signal back in time.
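Compensating for the lag is a one-liner once your data is on a regular grid. A sketch assuming the preprocessor’s 100 ms timeline and a lag you’ve measured yourself (e.g. with a heat-gun step test on the bearing housing):

def compensate_thermal_lag(df, lag_seconds=8.0, sample_period_s=0.1):
    """Shift temperature earlier in time so it lines up with the
    vibration events that caused it."""
    steps = int(round(lag_seconds / sample_period_s))
    out = df.copy()
    out['temperature'] = out['temperature'].shift(-steps)
    return out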

Sampling jitter: Most DAQ systems aren’t hard real-time. Your “10 kHz” sampling might actually be 9.8-10.2 kHz depending on CPU load. Over a 2-second burst, that deviation is worth up to 400 samples. For FFT-based features, that smears frequency bins. If you’re doing precise fault frequency detection, you need either hardware-timed sampling or post-processing to correct jitter (resample to exact intervals via interpolation).
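A sketch of the interpolation fix, assuming your DAQ at least records the true timestamp of each sample:

import numpy as np

def resample_to_uniform(t_sec, values, fs=10_000):
    """Linearly interpolate jittered samples onto an exact uniform grid
    so FFT bins stay sharp. t_sec holds actual sample times in seconds."""
    t = np.asarray(t_sec, dtype=float)
    t = t - t[0]
    t_uniform = np.arange(0.0, t[-1], 1.0 / fs)
    return t_uniform, np.interp(t_uniform, t, np.asarray(values, dtype=float))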

What I wish I’d known earlier

Start with less data, higher quality. One week of clean, well-labeled, synchronized data from a machine with known faults is worth a year of noisy unstructured logs.

Log your preprocessing pipeline version alongside your data. When you change your resampling method or add a new feature six months in, you need to know which models were trained on which preprocessing version. I’ve seen teams retrain models on subtly different data and wonder why performance dropped.
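The cheapest way I know to do this is a JSON sidecar written next to every dataset. A minimal sketch (the version string and file naming convention are mine):

import json
from datetime import datetime, timezone

PIPELINE_VERSION = "2026.02.0"  # bump on any preprocessing change

def write_sidecar(data_path, params):
    """Record which pipeline version and parameters produced a dataset."""
    meta = {
        'pipeline_version': PIPELINE_VERSION,
        'params': params,
        'created_utc': datetime.now(timezone.utc).isoformat(),
    }
    with open(f"{data_path}.meta.json", 'w') as f:
        json.dump(meta, f, indent=2)

# write_sidecar('runs/2026-02-06_motor3.parquet',
#               {'target_rate': '100ms', 'trigger_threshold': 2.0})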

And test your sensor failures explicitly. Unplug a cable mid-run and make sure your pipeline doesn’t silently forward-fill 10 minutes of NaNs and call it healthy. Inject synthetic dropouts and verify your quality flags trigger.
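A sketch of what that test can look like against the preprocessor above (the helper is mine):

import numpy as np

def inject_dropout(df, column, start, duration):
    """Blank out one sensor channel for a window, simulating an
    unplugged cable, so you can check that quality flags fire."""
    out = df.copy()
    gap = (out['timestamp'] >= start) & (out['timestamp'] < start + duration)
    out.loc[gap, column] = np.nan
    return out

# e.g. kill temperature for 10 minutes in the baseline log, re-run
# align_sensors, and assert temp_interpolated is True inside that window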

I’m still not entirely sure what the right balance is between edge processing (compute features on the device, send only summaries) versus cloud processing (send raw bursts, centralize feature engineering). Edge saves bandwidth and lets you react faster, but cloud gives you flexibility to retroactively reprocess data when you discover a better feature. For this project, I’m doing hybrid: baseline + burst features on edge, raw bursts uploaded for model development.

Next: turning signals into health indicators

You’ve got clean, aligned, quality-flagged data. Next part covers feature engineering: extracting degradation-sensitive features from vibration and temperature, labeling data with Remaining Useful Life (RUL) targets, and handling the class imbalance problem (healthy data vastly outnumbers fault data). We’ll also get into domain-specific features like envelope analysis and cepstrum-based gear mesh detection — techniques that make the difference between a model that works in a lab and one that works in production.

CBM Portfolio Project Series (1/4)
