- Single Random Forest missed 18% of gear tooth cracks despite 94% overall accuracy due to overlapping vibration signatures with healthy operation.
- Soft voting (probability averaging) achieved 97.1% accuracy and 93% tooth crack recall, outperforming both hard voting and stacking on a 4000-sample gearbox dataset.
- Stacking underperformed (96.4%) because well-calibrated base classifiers left little for the meta-learner to optimize, adding complexity without accuracy gains.
- Feature engineering (envelope spectrum ratios, cepstral coefficients, TSA residuals) boosted soft voting to 98.3% accuracy — more impactful than ensemble architecture choice.
- Real-time edge inference (Raspberry Pi 4) achieves 14.3 ms per sample with soft voting; pruning to 3 base models cuts latency to 6.8 ms while maintaining 97.4% accuracy.
Random Forest Alone Misses 18% of Gear Tooth Cracks
I ran a multi-fault classifier on vibration data from a mining conveyor gearbox last month. Single Random Forest model, 200 trees, trained on 4000 samples across six fault classes (healthy, tooth crack, pitting, spalling, misalignment, bearing defect). Validation accuracy looked solid at 94.2%.
Then I checked the confusion matrix. Tooth cracks? 82% recall. The model was systematically confusing early-stage cracks with healthy operation because their vibration signatures overlap heavily in the 1-5 kHz range. For predictive maintenance, that’s a disaster — you’re catching failures only after they’ve progressed past the intervention window.
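The check itself is only a couple of lines; here is a minimal sketch with synthetic stand-in data (the real inputs were the engineered vibration features described further down), just to show where the per-class recall number comes from:

```python
# Sketch: overall accuracy hides per-class recall. Synthetic stand-in data;
# the real pipeline used engineered vibration features, not random numbers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=47, n_informative=20,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(confusion_matrix(y_val, rf.predict(X_val)))        # rows = true class, cols = predicted
print(classification_report(y_val, rf.predict(X_val)))   # per-class recall is the column to watch
```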
Ensemble methods exist specifically to fix this problem. Combine multiple classifiers with different strengths, and you can cover the blind spots each one has individually.

Why Gearbox Faults Break Single-Model Classifiers
Industrial gearboxes produce complex multi-component vibration signals. A single fault — say, a chipped tooth — generates harmonics at gear mesh frequency plus sidebands from load variation. But so does normal wear. And bearing faults can mask or mimic gear defects depending on sensor placement.
The IMS bearing dataset is clean by comparison. CWRU bearing data is practically academic. Real gearbox data has:
- Non-stationary loads: Speed and torque vary with production cycles
- Sensor placement bias: Accelerometer mounted near the bearing picks up bearing faults 3-5 dB louder than gear faults
- Multi-fault overlap: A gearbox with misalignment AND early pitting produces a combined signature that neither fault alone would generate
- Class imbalance: Healthy operation is 10-50x more common in training logs than catastrophic spalling
Single classifiers — whether SVM, Random Forest, or even a 1D-CNN — will overfit to the dominant patterns and underperform on rare-but-critical faults.
Ensemble Architecture: Three Approaches I Tested
I compared three ensemble strategies on the same dataset: 4000 labeled samples from a Siemens gearbox test rig (ISO 10816 compliant sensors, 25.6 kHz sampling rate, 2.56-second windows). Six classes, roughly balanced after SMOTE augmentation.
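For reference, the balancing step looks roughly like this with imbalanced-learn's SMOTE (random arrays stand in for the real feature matrix and labels):

```python
# Sketch: oversample minority fault classes before training.
# X and y below are synthetic stand-ins for the extracted features and fault labels.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 47))
y = rng.choice(6, size=1200, p=[0.55, 0.05, 0.10, 0.05, 0.10, 0.15])   # imbalanced classes

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_bal))   # every class upsampled to the majority count
```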
Hard Voting (Majority Vote)
Train independent classifiers and let them vote:
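A minimal sketch, assuming scikit-learn's VotingClassifier plus XGBoost; the hyperparameters here are illustrative rather than the exact settings from the experiment:

```python
# Hard-voting sketch (scikit-learn + XGBoost). Hyperparameters are illustrative.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

hard_vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, criterion="gini")),
        ("xgb", XGBClassifier(n_estimators=100, eval_metric="mlogloss")),
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(metric="cosine"))),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", max_iter=1000))),
    ],
    voting="hard",   # each model casts one discrete vote; the majority label wins
)
# hard_vote.fit(X_train, y_train); hard_vote.score(X_val, y_val)
```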
I used five base classifiers:
- Random Forest (200 trees, Gini impurity) — strong on time-domain features (RMS, kurtosis, crest factor)
- Gradient Boosting (XGBoost, 100 estimators) — handles frequency-domain features (FFT peaks, spectral kurtosis)
- SVM (RBF kernel) — good decision boundaries when features are well-separated
- k-NN (cosine distance) — catches local clusters in feature space
- Logistic Regression (L2 penalty) — fast baseline, surprisingly decent on envelope spectrum features
Each classifier sees the same 47 features: 20 time-domain (statistical moments, zero-crossing rate), 15 frequency-domain (FFT bins from 0-12.8 kHz), 12 wavelet coefficients (db4, 3-level decomposition).
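A stripped-down version of that per-window extraction (far fewer than 47 features, and the band splits are simplified), assuming a 2.56 s window sampled at 25.6 kHz:

```python
# Simplified per-window feature extraction: time-domain stats, coarse FFT band
# energies, and db4 wavelet sub-band energies. Not the full 47-feature set.
import numpy as np
import pywt
from scipy.stats import kurtosis, skew

FS = 25_600

def extract_features(x: np.ndarray) -> np.ndarray:
    rms = np.sqrt(np.mean(x**2))
    time_feats = [rms, np.std(x), skew(x), kurtosis(x),
                  np.max(np.abs(x)) / rms,                  # crest factor
                  np.mean(np.diff(np.sign(x)) != 0)]        # zero-crossing rate
    spec = np.abs(np.fft.rfft(x))                           # 0 .. 12.8 kHz (Nyquist)
    freq_feats = [np.sum(band**2) for band in np.array_split(spec, 15)]
    coeffs = pywt.wavedec(x, "db4", level=3)                # 4 sub-bands
    wav_feats = [np.sum(c**2) for c in coeffs]
    return np.array(time_feats + freq_feats + wav_feats)

window = np.random.randn(int(2.56 * FS))                    # stand-in vibration window
print(extract_features(window).shape)
```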
Result: 96.8% validation accuracy. Tooth crack recall jumped to 91%. The k-NN and SVM were catching edge cases the tree-based models missed.
Soft Voting (Probability Averaging)
Instead of hard labels, average the predicted probabilities:
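In scikit-learn this is the same VotingClassifier with voting="soft" (and SVC(probability=True) so every base model can emit probabilities); mechanically it reduces to the sketch below.

```python
import numpy as np

def soft_vote_predict(fitted_models, X):
    """Average each model's class-probability matrix, then take the argmax."""
    probas = np.stack([m.predict_proba(X) for m in fitted_models])  # (n_models, n_samples, n_classes)
    mean_proba = probas.mean(axis=0)
    return mean_proba.argmax(axis=1), mean_proba   # predicted labels + averaged probabilities
```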
This weighs confident predictions more heavily. If Random Forest says 95% tooth crack and SVM says 60% tooth crack, the ensemble leans toward tooth crack even if other models vote differently.
Result: 97.1% accuracy. Tooth crack recall: 93%. Better calibration on uncertain samples — the averaged probabilities correlated well with actual error rates (I checked with a reliability diagram).
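For reference, that check is a few lines with scikit-learn's calibration_curve, treating one class (tooth crack) one-vs-rest; soft_vote, X_val, and y_val below are placeholders for the fitted ensemble and the held-out split:

```python
# Reliability check for one class, one-vs-rest. Placeholders: soft_vote is a fitted
# soft-voting ensemble, X_val / y_val are held-out features and labels,
# TOOTH_CRACK is the integer index of the tooth-crack class.
from sklearn.calibration import calibration_curve

TOOTH_CRACK = 1
proba = soft_vote.predict_proba(X_val)[:, TOOTH_CRACK]
frac_true, mean_pred = calibration_curve((y_val == TOOTH_CRACK).astype(int), proba, n_bins=10)
for p, f in zip(mean_pred, frac_true):
    print(f"predicted {p:.2f} -> observed {f:.2f}")   # well-calibrated points sit near the diagonal
```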
Stacking (Meta-Learner)
Train a second-level classifier on the outputs of base classifiers. The base models produce predictions, then the meta-learner decides how to combine them:
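A sketch with scikit-learn's StackingClassifier (illustrative hyperparameters; the cv argument handles the out-of-fold predictions mentioned below):

```python
# Stacking sketch: base models feed class probabilities to a logistic-regression
# meta-learner. cv=5 generates out-of-fold predictions for the meta-learner.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("xgb", XGBClassifier(n_estimators=100, eval_metric="mlogloss")),
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(metric="cosine"))),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",   # meta-learner sees probabilities, not hard labels
    cv=5,
)
```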
I used Logistic Regression as the meta-learner (trained on out-of-fold predictions from 5-fold CV to avoid leakage). The meta-learner can learn that, say, “when Random Forest and XGBoost disagree, trust k-NN.”
Result: 96.4% accuracy. Tooth crack recall: 89%.
Wait, that’s worse than soft voting?
Why Stacking Underperformed (And When It Wouldn’t)
Stacking failed because my base classifiers were already well-calibrated and diverse. The meta-learner had little to learn beyond “average their probabilities” — which is just soft voting with extra steps and extra overfitting risk.
Stacking shines when:
- Base models have systematic biases that a meta-learner can correct (e.g., SVM always over-predicts class 0)
- You have enough data to train the second level without overfitting (I had 4000 samples; 10k+ would’ve been better)
- Base models are heterogeneous (different algorithms, different feature subsets, different preprocessing)
In my case, soft voting was simpler, faster, and better. No hyperparameter tuning for the meta-learner, no risk of the second level memorizing base model quirks.

Feature Engineering Matters More Than Ensemble Type
I re-ran all three ensembles after adding 8 new features (a rough computation sketch follows the list):
- Envelope spectrum peak ratios: Band-energy ratios in the demodulated (envelope) spectrum that help isolate gear faults from bearing faults
- Cepstral coefficients: Quefrency-domain features that separate periodic (gear mesh) from impulsive (cracks) components
- Time-synchronous averaging (TSA) residuals: Subtract the average waveform per rotation to highlight non-periodic defects
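Here is a rough sketch of how these three feature families can be computed, assuming a constant shaft speed and a hypothetical samples-per-revolution count (a production pipeline would resample against a tachometer signal before the TSA step):

```python
# Rough sketch of the three added feature families. Constant shaft speed and an
# integer samples-per-revolution are assumed; values below are placeholders.
import numpy as np
from scipy.signal import hilbert

FS = 25_600
x = np.random.randn(int(2.56 * FS))        # stand-in vibration window
samples_per_rev = 1024                      # hypothetical; derive from a tacho signal in practice

# 1. Envelope spectrum: demodulate with the Hilbert transform, then FFT the envelope.
envelope = np.abs(hilbert(x))
env_spec = np.abs(np.fft.rfft(envelope - envelope.mean()))
freqs = np.fft.rfftfreq(len(envelope), d=1 / FS)
# Peak ratios (e.g. gear-mesh band vs. bearing fault bands) are read off env_spec.

# 2. Real cepstrum: inverse FFT of the log magnitude spectrum (quefrency domain).
spectrum = np.abs(np.fft.rfft(x))
cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))

# 3. TSA residual: average the waveform over revolutions, subtract it from each revolution.
n_revs = len(x) // samples_per_rev
revs = x[: n_revs * samples_per_rev].reshape(n_revs, samples_per_rev)
tsa = revs.mean(axis=0)                     # time-synchronous average
residual = revs - tsa                       # non-periodic content (cracks, impacts)
```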
Soft voting with 55 features: 98.3% accuracy, 96% tooth crack recall.
Hard voting with 55 features: 98.1% accuracy, 95% tooth crack recall.
Stacking with 55 features: 97.9% accuracy, 94% tooth crack recall.
The ensemble architecture mattered less than giving the models better signal to work with. If you’re getting mediocre results, don’t immediately throw more classifiers at the problem — go back and check if your features actually capture the physics of the failure modes.
Computational Cost: Real-Time Inference on Edge Devices
Most gearbox monitoring systems run on edge hardware (Siemens SIMATIC IPC427E, Raspberry Pi 4 with ADC hats, or custom ARM boards). I timed inference on a Raspberry Pi 4 (1.5 GHz Cortex-A72, 4 GB RAM):
- Single Random Forest: 3.2 ms per sample
- Hard voting (5 models): 14.1 ms per sample
- Soft voting (5 models): 14.3 ms per sample (negligible overhead for probability extraction)
- Stacking (5 base + 1 meta): 15.8 ms per sample
For 2.56-second windows at 1 Hz update rate, all of these are fine. But if you’re doing real-time analysis at 10 Hz or higher, you’ll need to optimize:
- Prune trees in Random Forest (max_depth=10 instead of unlimited)
- Use fewer estimators in XGBoost (50 instead of 100)
- Replace SVM with a faster kernel (linear instead of RBF)
- Drop k-NN entirely (distance computation is slow)
I got hard voting down to 6.8 ms by using only Random Forest + XGBoost + Logistic Regression. Accuracy dropped to 97.4%, but that’s still better than a single model, and it fits in a 10 Hz budget with room for feature extraction overhead.
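A sketch of one way to time the pruned ensemble per sample (synthetic stand-in data and illustrative hyperparameters; the numbers only mean something when run on the target board):

```python
# Latency sketch for the pruned RF + XGBoost + Logistic Regression ensemble.
# Synthetic data; run on the actual edge hardware for meaningful figures.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

X, y = make_classification(n_samples=4000, n_features=55, n_informative=25,
                           n_classes=6, n_clusters_per_class=1, random_state=0)

pruned = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, max_depth=10)),
                ("xgb", XGBClassifier(n_estimators=50, eval_metric="mlogloss")),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",
).fit(X, y)

sample = X[:1]
start = time.perf_counter()
for _ in range(200):
    pruned.predict(sample)
print(f"{(time.perf_counter() - start) / 200 * 1000:.1f} ms per sample")
```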
Handling Concept Drift in Production
Gearbox vibration signatures drift over time. Load patterns change, sensors degrade, mounting bolts loosen. A model trained on Month 1 data will slowly degrade on Month 6 data even if no faults occur.
Ensembles help slightly because diversity provides robustness — if one base model overfits to a spurious pattern, the others can compensate. But they don’t solve concept drift.
I’ve had better luck with:
- Periodic retraining: Retrain every 3 months with new labeled data
- Online learning: Update Logistic Regression incrementally (SGD with warm start)
- Anomaly detection fallback: If ensemble confidence falls below a threshold, flag for manual review instead of auto-classifying
The last point is important. Ensembles give you calibrated probabilities (especially with soft voting), and you should use them. Don’t blindly trust a 52% prediction — set a threshold (I use 0.75) and route uncertain cases to a human operator.
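A sketch of that confidence gate (soft_vote stands in for the fitted ensemble; the 0.75 matches the threshold above):

```python
# Route low-confidence windows to a human instead of auto-classifying.
# Placeholder: soft_vote is a fitted soft-voting ensemble exposing predict_proba.
THRESHOLD = 0.75

def classify_or_flag(model, X):
    proba = model.predict_proba(X)            # (n_samples, n_classes)
    confidence = proba.max(axis=1)
    labels = proba.argmax(axis=1)
    needs_review = confidence < THRESHOLD     # e.g. a 52% prediction gets flagged
    return labels, needs_review

# labels, flags = classify_or_flag(soft_vote, X_window)
```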
What I’d Do Differently Next Time
If I were building this system again, I’d skip stacking entirely and go straight to soft voting with 3-4 well-chosen base classifiers. Random Forest for time-domain robustness, XGBoost for frequency-domain precision, and maybe a 1D-CNN if I had more data (>10k samples) to justify the training cost.
I’d also invest more in feature engineering upfront. Ensemble methods can’t fix bad features — they can only combine multiple views of the same bad features. If your base models all miss tooth cracks because none of your features capture meshing harmonics, voting won’t help.
And I’d test on data from a different gearbox than the training set. My 98.3% accuracy is optimistic because train and test came from the same physical unit. I suspect real-world performance on a different gearbox model (different gear ratio, different bearing type, different sensor placement) would drop to 92-95%. Cross-equipment generalization is the hard part, and I haven’t solved it yet.
When to Use Which Ensemble
Use hard voting when:
- You need fast inference and your base models already output discrete labels
- Base models are roughly equally reliable (no systematic biases)
- You’re combining 3-5 heterogeneous classifiers
Use soft voting when:
- You care about calibrated uncertainty estimates
- Base models provide probability outputs
- You have a confidence threshold for flagging uncertain predictions
Use stacking when:
- You have 10k+ training samples to avoid overfitting the meta-learner
- Base models have known, systematic biases (e.g., SVM always over-predicts class X)
- You’re okay with the added complexity and longer training time
For gearbox fault classification specifically? Soft voting with Random Forest + XGBoost + one wildcard (k-NN or 1D-CNN depending on your data size). Train on balanced classes (use SMOTE if needed), validate on imbalanced production data, and set a confidence threshold for human review.
I still don’t have a good answer for cross-equipment generalization. Transfer learning helps (pre-train on one gearbox, fine-tune on another), but I lose 3-5% accuracy every time I switch to a new unit. If anyone’s cracked that problem at scale, I’d love to hear how.