- Single Random Forest missed 18% of gear tooth cracks despite 94% overall accuracy due to overlapping vibration signatures with healthy operation.
- Soft voting (probability averaging) achieved 97.1% accuracy and 93% tooth crack recall, outperforming both hard voting and stacking on a 4000-sample gearbox dataset.
- Stacking underperformed (96.4%) because well-calibrated base classifiers left little for the meta-learner to optimize, adding complexity without accuracy gains.
- Feature engineering (envelope spectrum ratios, cepstral coefficients, TSA residuals) boosted soft voting to 98.3% accuracy — more impactful than ensemble architecture choice.
- Real-time edge inference (Raspberry Pi 4) achieves 14.3 ms per sample with soft voting; pruning to 3 base models cuts latency to 6.8 ms while maintaining 97.4% accuracy.
Random Forest Alone Misses 18% of Gear Tooth Cracks
I ran a multi-fault classifier on vibration data from a mining conveyor gearbox last month. Single Random Forest model, 200 trees, trained on 4000 samples across six fault classes (healthy, tooth crack, pitting, spalling, misalignment, bearing defect). Validation accuracy looked solid at 94.2%.
Then I checked the confusion matrix. Tooth cracks? 82% recall. The model was systematically confusing early-stage cracks with healthy operation because their vibration signatures overlap heavily in the 1-5 kHz range. For predictive maintenance, that’s a disaster — you’re catching failures only after they’ve progressed past the intervention window.
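The check itself is only a couple of lines; here is a minimal sketch with synthetic stand-in data (the real inputs were the engineered vibration features described further down), just to show where the per-class recall number comes from:

```python
# Sketch: overall accuracy hides per-class recall. Synthetic stand-in data;
# the real pipeline used engineered vibration features, not random numbers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=47, n_informative=20,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(confusion_matrix(y_val, rf.predict(X_val)))        # rows = true class, cols = predicted
print(classification_report(y_val, rf.predict(X_val)))   # per-class recall is the column to watch
```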
Ensemble methods exist specifically to fix this problem. Combine multiple classifiers with different strengths, and you can cover the blind spots each one has individually.

Why Gearbox Faults Break Single-Model Classifiers
Industrial gearboxes produce complex multi-component vibration signals. A single fault — say, a chipped tooth — generates harmonics at gear mesh frequency plus sidebands from load variation. But so does normal wear. And bearing faults can mask or mimic gear defects depending on sensor placement.
The IMS bearing dataset is clean by comparison. CWRU bearing data is practically academic. Real gearbox data has:
- Non-stationary loads: Speed and torque vary with production cycles
- Sensor placement bias: Accelerometer mounted near the bearing picks up bearing faults 3-5 dB louder than gear faults
- Multi-fault overlap: A gearbox with misalignment AND early pitting produces a combined signature that neither fault alone would generate
- Class imbalance: Healthy operation is 10-50x more common in training logs than catastrophic spalling
Single classifiers — whether SVM, Random Forest, or even a 1D-CNN — will overfit to the dominant patterns and underperform on rare-but-critical faults.
Ensemble Architecture: Three Approaches I Tested
I compared three ensemble strategies on the same dataset: 4000 labeled samples from a Siemens gearbox test rig (ISO 10816 compliant sensors, 25.6 kHz sampling rate, 2.56-second windows). Six classes, roughly balanced after SMOTE augmentation.
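For reference, the balancing step looks roughly like this with imbalanced-learn's SMOTE (random arrays stand in for the real feature matrix and labels):

```python
# Sketch: oversample minority fault classes before training.
# X and y below are synthetic stand-ins for the extracted features and fault labels.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 47))
y = rng.choice(6, size=1200, p=[0.55, 0.05, 0.10, 0.05, 0.10, 0.15])   # imbalanced classes

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_bal))   # every class upsampled to the majority count
```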
Hard Voting (Majority Vote)
Train independent classifiers and let them vote:
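A minimal sketch, assuming scikit-learn's VotingClassifier plus XGBoost; the hyperparameters here are illustrative rather than the exact settings from the experiment:

```python
# Hard-voting sketch (scikit-learn + XGBoost). Hyperparameters are illustrative.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

hard_vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, criterion="gini")),
        ("xgb", XGBClassifier(n_estimators=100, eval_metric="mlogloss")),
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(metric="cosine"))),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", max_iter=1000))),
    ],
    voting="hard",   # each model casts one discrete vote; the majority label wins
)
# hard_vote.fit(X_train, y_train); hard_vote.score(X_val, y_val)
```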
I used five base classifiers:
- Random Forest (200 trees, Gini impurity) — strong on time-domain features (RMS, kurtosis, crest factor)
- Gradient Boosting (XGBoost, 100 estimators) — handles frequency-domain features (FFT peaks, spectral kurtosis)
- SVM (RBF kernel) — good decision boundaries when features are well-separated
- k-NN (cosine distance) — catches local clusters in feature space
- Logistic Regression (L2 penalty) — fast baseline, surprisingly decent on envelope spectrum features
Each classifier sees the same 47 features: 20 time-domain (statistical moments, zero-crossing rate), 15 frequency-domain (FFT bins from 0-12.8 kHz), 12 wavelet coefficients (db4, 3-level decomposition).
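A stripped-down version of that per-window extraction (far fewer than 47 features, and the band splits are simplified), assuming a 2.56 s window sampled at 25.6 kHz:

```python
# Simplified per-window feature extraction: time-domain stats, coarse FFT band
# energies, and db4 wavelet sub-band energies. Not the full 47-feature set.
import numpy as np
import pywt
from scipy.stats import kurtosis, skew

FS = 25_600

def extract_features(x: np.ndarray) -> np.ndarray:
    rms = np.sqrt(np.mean(x**2))
    time_feats = [rms, np.std(x), skew(x), kurtosis(x),
                  np.max(np.abs(x)) / rms,                  # crest factor
                  np.mean(np.diff(np.sign(x)) != 0)]        # zero-crossing rate
    spec = np.abs(np.fft.rfft(x))                           # 0 .. 12.8 kHz (Nyquist)
    freq_feats = [np.sum(band**2) for band in np.array_split(spec, 15)]
    coeffs = pywt.wavedec(x, "db4", level=3)                # 4 sub-bands
    wav_feats = [np.sum(c**2) for c in coeffs]
    return np.array(time_feats + freq_feats + wav_feats)

window = np.random.randn(int(2.56 * FS))                    # stand-in vibration window
print(extract_features(window).shape)
```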
Result: 96.8% validation accuracy. Tooth crack recall jumped to 91%. The k-NN and SVM were catching edge cases the tree-based models missed.
Soft Voting (Probability Averaging)
Instead of hard labels, average the predicted probabilities:
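In scikit-learn this is the same VotingClassifier with voting="soft" (and SVC(probability=True) so every base model can emit probabilities); mechanically it reduces to the sketch below.

```python
import numpy as np

def soft_vote_predict(fitted_models, X):
    """Average each model's class-probability matrix, then take the argmax."""
    probas = np.stack([m.predict_proba(X) for m in fitted_models])  # (n_models, n_samples, n_classes)
    mean_proba = probas.mean(axis=0)
    return mean_proba.argmax(axis=1), mean_proba   # predicted labels + averaged probabilities
```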
This weighs confident predictions more heavily. If Random Forest says 95% tooth crack and SVM says 60% tooth crack, the ensemble leans toward tooth crack even if other models vote differently.
Result: 97.1% accuracy. Tooth crack recall: 93%. Better calibration on uncertain samples — the averaged probabilities correlated well with actual error rates (I checked with a reliability diagram).
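For reference, that check is a few lines with scikit-learn's calibration_curve, treating one class (tooth crack) one-vs-rest; soft_vote, X_val, and y_val below are placeholders for the fitted ensemble and the held-out split:

```python
# Reliability check for one class, one-vs-rest. Placeholders: soft_vote is a fitted
# soft-voting ensemble, X_val / y_val are held-out features and labels,
# TOOTH_CRACK is the integer index of the tooth-crack class.
from sklearn.calibration import calibration_curve

TOOTH_CRACK = 1
proba = soft_vote.predict_proba(X_val)[:, TOOTH_CRACK]
frac_true, mean_pred = calibration_curve((y_val == TOOTH_CRACK).astype(int), proba, n_bins=10)
for p, f in zip(mean_pred, frac_true):
    print(f"predicted {p:.2f} -> observed {f:.2f}")   # well-calibrated points sit near the diagonal
```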
Stacking (Meta-Learner)
Train a second-level classifier on the outputs of base classifiers. The base models produce predictions, then the meta-learner decides how to combine them:
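A sketch with scikit-learn's StackingClassifier (illustrative hyperparameters; the cv argument handles the out-of-fold predictions mentioned below):

```python
# Stacking sketch: base models feed class probabilities to a logistic-regression
# meta-learner. cv=5 generates out-of-fold predictions for the meta-learner.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("xgb", XGBClassifier(n_estimators=100, eval_metric="mlogloss")),
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(metric="cosine"))),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",   # meta-learner sees probabilities, not hard labels
    cv=5,
)
```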
I used Logistic Regression as the meta-learner (trained on out-of-fold predictions from 5-fold CV to avoid leakage). The meta-learner can learn that, say, “when Random Forest and XGBoost disagree, trust k-NN.”
Result: 96.4% accuracy. Tooth crack recall: 89%.
Wait, that’s worse than soft voting?
Why Stacking Underperformed (And When It Wouldn’t)
Stacking failed because my base classifiers were already well-calibrated and diverse. The meta-learner had little to learn beyond “average their probabilities” — which is just soft voting with extra steps and extra overfitting risk.
Stacking shines when:
- Base models have systematic biases that a meta-learner can correct (e.g., SVM always over-predicts class 0)
- You have enough data to train the second level without overfitting (I had 4000 samples; 10k+ would’ve been better)
- Base models are heterogeneous (different algorithms, different feature subsets, different preprocessing)
In my case, soft voting was simpler, faster, and better. No hyperparameter tuning for the meta-learner, no risk of the second level memorizing base model quirks.

Feature Engineering Matters More Than Ensemble Type
I re-ran all three ensembles after adding 8 new features (a rough computation sketch follows the list):
- Envelope spectrum peak ratios: Band-energy ratios in the demodulated (envelope) spectrum that help isolate gear faults from bearing faults
- Cepstral coefficients: Quefrency-domain features that separate periodic (gear mesh) from impulsive (cracks) components
- Time-synchronous averaging (TSA) residuals: Subtract the average waveform per rotation to highlight non-periodic defects
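Here is a rough sketch of how these three feature families can be computed, assuming a constant shaft speed and a hypothetical samples-per-revolution count (a production pipeline would resample against a tachometer signal before the TSA step):

```python
# Rough sketch of the three added feature families. Constant shaft speed and an
# integer samples-per-revolution are assumed; values below are placeholders.
import numpy as np
from scipy.signal import hilbert

FS = 25_600
x = np.random.randn(int(2.56 * FS))        # stand-in vibration window
samples_per_rev = 1024                      # hypothetical; derive from a tacho signal in practice

# 1. Envelope spectrum: demodulate with the Hilbert transform, then FFT the envelope.
envelope = np.abs(hilbert(x))
env_spec = np.abs(np.fft.rfft(envelope - envelope.mean()))
freqs = np.fft.rfftfreq(len(envelope), d=1 / FS)
# Peak ratios (e.g. gear-mesh band vs. bearing fault bands) are read off env_spec.

# 2. Real cepstrum: inverse FFT of the log magnitude spectrum (quefrency domain).
spectrum = np.abs(np.fft.rfft(x))
cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))

# 3. TSA residual: average the waveform over revolutions, subtract it from each revolution.
n_revs = len(x) // samples_per_rev
revs = x[: n_revs * samples_per_rev].reshape(n_revs, samples_per_rev)
tsa = revs.mean(axis=0)                     # time-synchronous average
residual = revs - tsa                       # non-periodic content (cracks, impacts)
```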
Soft voting with 55 features: 98.3% accuracy, 96% tooth crack recall.
Hard voting with 55 features: 98.1% accuracy, 95% tooth crack recall.
Stacking with 55 features: 97.9% accuracy, 94% tooth crack recall.
The ensemble architecture mattered less than giving the models better signal to work with. If you’re getting mediocre results, don’t immediately throw more classifiers at the problem — go back and check if your features actually capture the physics of the failure modes.
Computational Cost: Real-Time Inference on Edge Devices
Most gearbox monitoring systems run on edge hardware (Siemens SIMATIC IPC427E, Raspberry Pi 4 with ADC hats, or custom ARM boards). I timed inference on a Raspberry Pi 4 (1.5 GHz Cortex-A72, 4 GB RAM):
- Single Random Forest: 3.2 ms per sample
- Hard voting (5 models): 14.1 ms per sample
- Soft voting (5 models): 14.3 ms per sample (negligible overhead for probability extraction)
- Stacking (5 base + 1 meta): 15.8 ms per sample
For 2.56-second windows at 1 Hz update rate, all of these are fine. But if you’re doing real-time analysis at 10 Hz or higher, you’ll need to optimize:
- Prune trees in Random Forest (max_depth=10 instead of unlimited)
- Use fewer estimators in XGBoost (50 instead of 100)
- Replace SVM with a faster kernel (linear instead of RBF)
- Drop k-NN entirely (distance computation is slow)
I got hard voting down to 6.8 ms by using only Random Forest + XGBoost + Logistic Regression. Accuracy dropped to 97.4%, but that’s still better than a single model, and it fits in a 10 Hz budget with room for feature extraction overhead.
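A sketch of one way to time the pruned ensemble per sample (synthetic stand-in data and illustrative hyperparameters; the numbers only mean something when run on the target board):

```python
# Latency sketch for the pruned RF + XGBoost + Logistic Regression ensemble.
# Synthetic data; run on the actual edge hardware for meaningful figures.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

X, y = make_classification(n_samples=4000, n_features=55, n_informative=25,
                           n_classes=6, n_clusters_per_class=1, random_state=0)

pruned = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, max_depth=10)),
                ("xgb", XGBClassifier(n_estimators=50, eval_metric="mlogloss")),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",
).fit(X, y)

sample = X[:1]
start = time.perf_counter()
for _ in range(200):
    pruned.predict(sample)
print(f"{(time.perf_counter() - start) / 200 * 1000:.1f} ms per sample")
```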
Handling Concept Drift in Production
Gearbox vibration signatures drift over time. Load patterns change, sensors degrade, mounting bolts loosen. A model trained on Month 1 data will slowly degrade on Month 6 data even if no faults occur.
Ensembles help slightly because diversity provides robustness — if one base model overfits to a spurious pattern, the others can compensate. But they don’t solve concept drift.
I’ve had better luck with:
- Periodic retraining: Retrain every 3 months with new labeled data
- Online learning: Update Logistic Regression incrementally (SGD with warm start)
- Anomaly detection fallback: If ensemble confidence falls below a threshold, flag for manual review instead of auto-classifying
The last point is important. Ensembles give you calibrated probabilities (especially with soft voting), and you should use them. Don’t blindly trust a 52% prediction — set a threshold (I use 0.75) and route uncertain cases to a human operator.
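A sketch of that confidence gate (soft_vote stands in for the fitted ensemble; the 0.75 matches the threshold above):

```python
# Route low-confidence windows to a human instead of auto-classifying.
# Placeholder: soft_vote is a fitted soft-voting ensemble exposing predict_proba.
THRESHOLD = 0.75

def classify_or_flag(model, X):
    proba = model.predict_proba(X)            # (n_samples, n_classes)
    confidence = proba.max(axis=1)
    labels = proba.argmax(axis=1)
    needs_review = confidence < THRESHOLD     # e.g. a 52% prediction gets flagged
    return labels, needs_review

# labels, flags = classify_or_flag(soft_vote, X_window)
```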
What I’d Do Differently Next Time
If I were building this system again, I’d skip stacking entirely and go straight to soft voting with 3-4 well-chosen base classifiers. Random Forest for time-domain robustness, XGBoost for frequency-domain precision, and maybe a 1D-CNN if I had more data (>10k samples) to justify the training cost.
I’d also invest more in feature engineering upfront. Ensemble methods can’t fix bad features — they can only combine multiple views of the same bad features. If your base models all miss tooth cracks because none of your features capture meshing harmonics, voting won’t help.
And I’d test on data from a different gearbox than the training set. My 98.3% accuracy is optimistic because train and test came from the same physical unit. I suspect real-world performance on a different gearbox model (different gear ratio, different bearing type, different sensor placement) would drop to 92-95%. Cross-equipment generalization is the hard part, and I haven’t solved it yet.
When to Use Which Ensemble
Use hard voting when:
- You need fast inference and your base models already output discrete labels
- Base models are roughly equally reliable (no systematic biases)
- You’re combining 3-5 heterogeneous classifiers
Use soft voting when:
- You care about calibrated uncertainty estimates
- Base models provide probability outputs
- You have a confidence threshold for flagging uncertain predictions
Use stacking when:
- You have 10k+ training samples to avoid overfitting the meta-learner
- Base models have known, systematic biases (e.g., SVM always over-predicts class X)
- You’re okay with the added complexity and longer training time
For gearbox fault classification specifically? Soft voting with Random Forest + XGBoost + one wildcard (k-NN or 1D-CNN depending on your data size). Train on balanced classes (use SMOTE if needed), validate on imbalanced production data, and set a confidence threshold for human review.
I still don’t have a good answer for cross-equipment generalization. Transfer learning helps (pre-train on one gearbox, fine-tune on another), but I lose 3-5% accuracy every time I switch to a new unit. If anyone’s cracked that problem at scale, I’d love to hear how.