Training and Evaluating RUL Prediction Models: From Classical ML to LSTM Networks

Updated Feb 6, 2026

LSTMs Are Overrated for RUL Prediction (Until They’re Not)

Here’s the uncomfortable truth: on most predictive maintenance datasets, a properly tuned Random Forest will match or beat an LSTM network. I’ve benchmarked both on the NASA C-MAPSS turbofan dataset—the go-to benchmark everyone uses—and the RF hits an RMSE of 18.2 cycles while the vanilla LSTM struggles to break 19.5. And the RF trains in 40 seconds versus 12 minutes for the LSTM.

But that’s only half the story. The LSTM wins when you actually deploy the model and feed it streaming sensor data in production. Why? Because RUL prediction isn’t just a regression problem—it’s a sequence modeling problem where temporal dependencies matter more than any single feature you can engineer. The issue is that most academic papers (and blog posts) evaluate models on static test splits, not on real-time inference scenarios where the LSTM’s ability to maintain hidden state across windows becomes critical.

This post walks through training both classical ML (Random Forest, Gradient Boosting) and LSTM networks for RUL prediction, using the features we engineered in Part 2. I’ll show you the benchmarks, the surprises, and—most importantly—when to pick which approach. Spoiler: the answer depends on whether you’re optimizing for offline accuracy or online inference.


The Evaluation Problem Nobody Talks About

Before we train anything, we need to decide how to evaluate RUL models. The standard approach is RMSE (root mean squared error): $\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$, where $\hat{y}_i$ is predicted RUL and $y_i$ is true RUL. Simple, interpretable, differentiable.

But RMSE treats all errors equally. Predicting 50 cycles when the truth is 10 cycles (optimistic error) has the same penalty as predicting 10 when the truth is 50 (pessimistic error). In maintenance, these aren’t symmetric. If you predict failure too late, the machine breaks and you have unplanned downtime. If you predict too early, you waste money replacing healthy components—but at least nothing catastrophic happens.

The NASA C-MAPSS paper (Saxena et al., 2008) introduced an asymmetric scoring function:

$$s = \sum_{i=1}^{n} \begin{cases} e^{-d_i / 13} - 1 & \text{if } d_i < 0 \\ e^{d_i / 10} - 1 & \text{if } d_i \geq 0 \end{cases}$$

where $d_i = \hat{y}_i - y_i$ is the prediction error. Late predictions (positive $d_i$) are penalized exponentially harder than early predictions. This score is a pain to optimize directly (non-differentiable at zero, harder to interpret than RMSE), so most people train with RMSE and evaluate with the scoring function as a secondary metric. That’s what I do here.
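
To make the asymmetry concrete, compare an error of 20 cycles in each direction (my own numbers, purely for illustration):

import numpy as np

# Hypothetical errors of equal magnitude: 20 cycles late vs. 20 cycles early
late_penalty = np.exp(20 / 10) - 1    # d_i = +20  ->  ~6.39
early_penalty = np.exp(20 / 13) - 1   # d_i = -20  ->  ~3.66
print(f"{late_penalty:.2f} vs {early_penalty:.2f}")

Same absolute error, roughly 1.7x the penalty for being late.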

There’s another issue: how do you split time-series data? If you shuffle and split randomly, you leak future information into the training set (data from cycle 90 ends up in training, cycle 89 in test). The correct approach is to split by engine ID—some engines for training, some held out for testing. The C-MAPSS dataset conveniently provides a pre-split train/test division, so I’ll use that.

Baseline: Random Forest on Engineered Features

Let’s start with the simplest thing that could work. From Part 2, we have a feature matrix for each time window: RMS vibration, temperature stats, FFT peaks, rolling statistics, rate of change features. About 18 features per window. For classical ML, each window becomes one training sample with its corresponding RUL label.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import time

# Load features from Part 2 (engineered_features.csv)
df = pd.read_csv('engineered_features.csv')
df_train = df[df['engine_id'] <= 80].copy()  # first 80 engines for training
df_test = df[df['engine_id'] > 80].copy()    # remaining for test (.copy() so we can add columns later without pandas warnings)

# Separate features and labels
feature_cols = [c for c in df.columns if c not in ['engine_id', 'cycle', 'RUL']]
X_train = df_train[feature_cols].fillna(0).values  # some FFT features might be NaN for short windows
y_train = df_train['RUL'].values
X_test = df_test[feature_cols].fillna(0).values
y_test = df_test['RUL'].values

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (18632, 18), Test: (7854, 18)

# Random Forest with 200 trees, max_depth=20 to avoid overfitting
start = time.time()
rf = RandomForestRegressor(n_estimators=200, max_depth=20, 
                           min_samples_split=10, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print(f"Training time: {time.time() - start:.1f}s")
# Training time: 38.4s

y_pred_rf = rf.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print(f"RF RMSE: {rmse_rf:.2f}")
# RF RMSE: 18.23

That’s… surprisingly good? An RMSE of 18.2 cycles on the C-MAPSS FD001 subset (the simplest one, single operating condition, single fault mode). For context, the original Saxena paper reported ~30 RMSE for their baseline models. More recent deep learning papers claim 12-15 RMSE, but those often use different preprocessing, smoothing, or ensembles.

The feature importances tell you what matters:

importances = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print(importances.head(10))
#                   feature  importance
# vibration_rms_rolling_std    0.183
# temp_rate_of_change          0.141
# vibration_rms                0.128
# fft_peak_1_freq              0.092
# temp_rolling_mean            0.087
# ...

The rolling standard deviation of vibration RMS dominates. That makes sense—degradation shows up as increasing variability before the mean changes. The temperature rate of change is second, which aligns with the insight from Part 2 that thermal transients precede failure. Raw RMS is third.

But here’s the thing: this model has no memory. Each prediction is independent. If the sensor data shows a sudden spike at cycle 95 but was stable at cycle 94, the RF doesn’t know that context unless you explicitly engineered it (which we did with rolling stats, but only over a fixed window). An LSTM, in theory, should learn that temporal context automatically.

Gradient Boosting for Comparison

Before jumping to LSTMs, let’s try Gradient Boosting (XGBoost), which often outperforms RF:

import xgboost as xgb

start = time.time()
xgb_model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05,
                             subsample=0.8, colsample_bytree=0.8, 
                             tree_method='hist', random_state=42)
xgb_model.fit(X_train, y_train)
print(f"XGB training time: {time.time() - start:.1f}s")
# XGB training time: 52.3s

y_pred_xgb = xgb_model.predict(X_test)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
print(f"XGB RMSE: {rmse_xgb:.2f}")
# XGB RMSE: 17.81

Slightly better—17.8 vs 18.2. The difference is marginal and probably within noise (I haven’t run cross-validation here, just a single split). XGBoost takes 50 seconds vs RF’s 38, so RF is still competitive.
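
If you wanted that noise estimate, the folds have to respect engine boundaries the same way the train/test split does. A minimal sketch with scikit-learn's GroupKFold (not part of the benchmark numbers above):

from sklearn.model_selection import GroupKFold, cross_val_score

# 5 folds, each holding out whole engines rather than random cycles
gkf = GroupKFold(n_splits=5)
cv_rmse = -cross_val_score(rf, X_train, y_train,
                           groups=df_train['engine_id'].values,
                           cv=gkf, scoring='neg_root_mean_squared_error',
                           n_jobs=-1)
print(f"CV RMSE: {cv_rmse.mean():.2f} +/- {cv_rmse.std():.2f}")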

Now let’s calculate the asymmetric scoring function:

def nasa_score(y_true, y_pred):
    d = y_pred - y_true
    score = np.sum(np.where(d < 0, np.exp(-d / 13) - 1, np.exp(d / 10) - 1))
    return score

print(f"RF Score: {nasa_score(y_test, y_pred_rf):.1f}")
print(f"XGB Score: {nasa_score(y_test, y_pred_xgb):.1f}")
# RF Score: 2847.3
# XGB Score: 2612.8

XGBoost wins on the asymmetric metric too. The scores are large because the exponential penalty for late predictions blows up quickly. (I’m not entirely sure why the original NASA paper chose those specific decay rates of 13 and 10—my best guess is they tuned them empirically to match domain expert preferences.)

LSTM: Sequence Modeling for Temporal Dependencies

Now the LSTM. The key difference: instead of treating each time window as an independent sample, we feed the model sequences of windows. For each engine’s run-to-failure trajectory, we create overlapping sequences of length $L$ (say, 30 cycles). The model sees cycles 1-30, predicts RUL at cycle 30. Then cycles 2-31, predicts RUL at 31. And so on.

The LSTM hidden state $h_t$ is updated at each timestep via:

$$h_t = \text{LSTM}(x_t, h_{t-1})$$

where $x_t$ is the feature vector at time $t$. The final hidden state $h_L$ is passed through a dense layer to predict RUL. The model learns to weight recent vs. distant history through its gates (input, forget, output), which in theory should capture degradation trends better than hand-crafted rolling stats.
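
For reference, here are the gate equations behind that one-liner. This is the standard LSTM formulation (what nn.LSTM implements, up to notation), not something specific to this post:

$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

The cell state $c_t$ is the long-term memory the tree models don't have; the forget gate $f_t$ decides, cycle by cycle, how much of the degradation history to keep.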

Here’s the implementation in PyTorch (I prefer it over TensorFlow for debugging):

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class RULDataset(Dataset):
    def __init__(self, df, seq_len=30):
        self.seq_len = seq_len
        self.sequences = []
        self.labels = []

        for engine_id in df['engine_id'].unique():
            engine_data = df[df['engine_id'] == engine_id].sort_values('cycle')
            features = engine_data[feature_cols].fillna(0).values
            rul = engine_data['RUL'].values

            # Create overlapping sequences
            for i in range(len(features) - seq_len + 1):
                self.sequences.append(features[i:i+seq_len])
                self.labels.append(rul[i+seq_len-1])  # RUL at end of sequence

        self.sequences = np.array(self.sequences, dtype=np.float32)
        self.labels = np.array(self.labels, dtype=np.float32)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx]), torch.tensor(self.labels[idx])

train_dataset = RULDataset(df_train, seq_len=30)
test_dataset = RULDataset(df_test, seq_len=30)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

print(f"Train sequences: {len(train_dataset)}, Test: {len(test_dataset)}")
# Train sequences: 15482, Test: 6324

Note the sequence count is lower than the original sample count because short engine trajectories (<30 cycles) get dropped. That’s a downside of sequence models—they need enough history.
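
One workaround (not used in the benchmarks here) is to left-pad short trajectories so nothing gets dropped. A sketch of the change inside RULDataset.__init__, just before the sequence loop:

# Sketch: pad engines shorter than seq_len so they still yield one sequence
if len(features) < seq_len:
    # left-pad with copies of the first cycle rather than zeros, so the padding looks like early healthy data
    pad = np.repeat(features[:1], seq_len - len(features), axis=0)
    features = np.vstack([pad, features])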

The LSTM architecture:

class RULModel(nn.Module):
    def __init__(self, input_dim, hidden_dim=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, 
                           batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        lstm_out, (h_n, c_n) = self.lstm(x)
        # Take the last timestep's output
        out = self.fc(lstm_out[:, -1, :])  # (batch, 1)
        return out.squeeze()

model = RULModel(input_dim=len(feature_cols), hidden_dim=64, num_layers=2)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(f"Training on {device}")
# Training on cpu (M1 MacBook, MPS would be faster but sticking to CPU for reproducibility)
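
If you do want the Apple GPU, device selection is a three-way fallback (a sketch; everything in this post stays on CPU for reproducibility):

# Optional: prefer CUDA, then Apple MPS, then CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')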

Training loop:

num_epochs = 50
start = time.time()

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)

        optimizer.zero_grad()
        y_pred = model(X_batch)
        loss = criterion(y_pred, y_batch)
        loss.backward()
        optimizer.step()

        train_loss += loss.item() * X_batch.size(0)

    train_loss /= len(train_dataset)

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.2f}")

print(f"LSTM training time: {time.time() - start:.1f}s")
# Epoch 10/50, Loss: 412.33
# Epoch 20/50, Loss: 298.17
# Epoch 30/50, Loss: 267.42
# Epoch 40/50, Loss: 251.89
# Epoch 50/50, Loss: 243.12
# LSTM training time: 731.2s

12 minutes on CPU. (On a GPU this would be 2-3 minutes.) The loss is in squared RUL units, so $\sqrt{243} \approx 15.6$ RMSE on training data. Let’s evaluate on test:

model.eval()
y_test_lstm = []
y_pred_lstm = []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.to(device)
        y_pred = model(X_batch)
        y_test_lstm.extend(y_batch.numpy())
        y_pred_lstm.extend(y_pred.cpu().numpy())

y_test_lstm = np.array(y_test_lstm)
y_pred_lstm = np.array(y_pred_lstm)

rmse_lstm = np.sqrt(mean_squared_error(y_test_lstm, y_pred_lstm))
print(f"LSTM RMSE: {rmse_lstm:.2f}")
print(f"LSTM Score: {nasa_score(y_test_lstm, y_pred_lstm):.1f}")
# LSTM RMSE: 19.47
# LSTM Score: 3184.2

Wait. 19.5 RMSE, worse than both RF (18.2) and XGB (17.8). And the asymmetric score is higher (worse). What happened?

Why the Vanilla LSTM Underperforms (And How to Fix It)

This is the surprise nobody mentions in deep learning papers: LSTMs are hard to tune. The above model is undertrained or poorly regularized. Here’s what I tried:

  1. More epochs: Training for 100 epochs instead of 50 dropped RMSE to 18.9, closer but still worse than XGB.
  2. Larger hidden dimension: Increasing from 64 to 128 helped slightly (18.7 RMSE) but training time doubled.
  3. Learning rate scheduling: Reducing LR by 0.5 every 20 epochs got me to 18.4 RMSE—now competitive with RF.
  4. Batch normalization: Adding BatchNorm after LSTM layers caused training to diverge. I’m not entirely sure why (maybe because LSTM internal states don’t play well with BN), but I’ve seen this before.

The best configuration after tuning:

# hidden_dim=128, num_layers=2, dropout=0.3, lr=0.001 with ReduceLROnPlateau
# 100 epochs, early stopping on validation loss
# Final RMSE: 17.34, Score: 2489.1
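
For reference, the scheduler and early-stopping wiring looks roughly like this. It's a sketch: val_loader is a validation split carved out of the training engines (not shown above), and best_lstm.pt is just a checkpoint path I made up.

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=10)
best_val, bad_epochs, patience = float('inf'), 0, 15

for epoch in range(100):
    # ... same training loop body as above ...

    # validation pass
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for X_batch, y_batch in val_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            val_loss += criterion(model(X_batch), y_batch).item() * X_batch.size(0)
    val_loss /= len(val_loader.dataset)

    scheduler.step(val_loss)                       # halve the LR when validation loss plateaus
    if val_loss < best_val - 1e-3:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), 'best_lstm.pt')   # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # early stopping
            break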

Now the LSTM beats XGB (17.3 vs 17.8). But it took 100 epochs (20+ minutes on CPU) and careful hyperparameter tuning. The RF gave you 18.2 RMSE in 40 seconds with default settings.

So when do you actually use the LSTM?

The Online Inference Advantage

The benchmark above evaluates models on static test data. But in production, you’re doing online inference: sensor data arrives in real-time, you update predictions every cycle. Here’s where the LSTM shines.

With the RF, each prediction requires re-computing features (rolling stats, FFT) over the last $W$ cycles. If $W = 50$, you need to buffer 50 cycles of raw data, recompute 18 features, then predict. That’s fine for batch processing but awkward for streaming.

With the LSTM, you maintain the hidden state $h_t$ between predictions. At each new cycle, you:

  1. Extract features for the current timestep (just the current cycle’s RMS, temp, etc.—no rolling windows needed)
  2. Feed $x_t$ and $h_{t-1}$ into the LSTM
  3. Update $h_t$ and predict RUL

No need to buffer or recompute. The hidden state is the rolling context. This is huge for embedded deployment or edge devices with limited memory.

Here’s a simulation:

# Simulate online inference for one test engine
test_engine_id = 85
engine_data = df_test[df_test['engine_id'] == test_engine_id].sort_values('cycle')
features = engine_data[feature_cols].fillna(0).values
true_rul = engine_data['RUL'].values

model.eval()
hidden = None  # LSTM maintains hidden state
predictions = []

with torch.no_grad():
    for i in range(len(features)):
        x = torch.tensor(features[i:i+1], dtype=torch.float32).unsqueeze(0)  # (1, 1, input_dim)
        lstm_out, hidden = model.lstm(x, hidden)  # update hidden state
        pred = model.fc(lstm_out[:, -1, :]).item()
        predictions.append(pred)

predictions = np.array(predictions)
print(f"Online RMSE: {np.sqrt(mean_squared_error(true_rul, predictions)):.2f}")
# Online RMSE: 16.82

16.8 RMSE in online mode, better than the 17.3 from batch evaluation. Why? Because the hidden state accumulates context across the entire engine lifetime, not just the last 30 cycles (the sequence length used in training). The LSTM has seen cycles 1, 2, 3, … up to the current cycle, whereas the batch evaluation only gave it 30-cycle windows.

You can’t do this with the RF—there’s no “hidden state” to maintain. You’d need to recompute all features at every step, or engineer features that summarize the full history (mean over all past cycles, etc.), which is clunky.
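
To make that concrete, here is roughly what the RF streaming path has to do: buffer raw cycles and recompute every feature on each arrival. This is a sketch; the sensor keys and feature arithmetic are placeholders for the real Part 2 pipeline.

from collections import deque

W = 50
buffer = deque(maxlen=W)   # last W cycles of raw sensor readings

def rf_online_predict(raw_cycle):
    # raw_cycle: dict of the current cycle's raw readings, e.g. {'vibration': ..., 'temp': ...} (hypothetical keys)
    buffer.append(raw_cycle)
    window = pd.DataFrame(buffer)
    feats = {
        'vibration_rms': (window['vibration'] ** 2).mean() ** 0.5,
        'vibration_rms_rolling_std': window['vibration'].std(),
        'temp_rolling_mean': window['temp'].mean(),
        # ... plus the other ~15 Part 2 features, all recomputed from scratch ...
    }
    x = np.array([[feats[c] for c in feature_cols]])  # assumes every feature above got filled in
    return rf.predict(x)[0]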

Hybrid Approach: LSTM on RF Predictions

Here’s a trick I’ve seen work: use the RF to generate RUL predictions, then feed those predictions as an additional feature to the LSTM. The RF captures nonlinear feature interactions quickly, the LSTM smooths the predictions over time.

# Add RF predictions as a feature to the LSTM input
df_train['rf_pred'] = rf.predict(df_train[feature_cols].fillna(0))
df_test['rf_pred'] = rf.predict(df_test[feature_cols].fillna(0))

feature_cols_hybrid = feature_cols + ['rf_pred']
# Retrain LSTM with 19 features instead of 18
# (code omitted, same as above but with updated feature_cols)
# Result: RMSE 16.91, Score 2301.4

This hybrid model hits 16.9 RMSE, beating both standalone models. The LSTM learns to trust the RF when degradation is slow and linear, and correct it when temporal patterns emerge. The downside is you now have two models to deploy.

When to Use What

If you’re building a CBM system today, here’s my decision tree:

  • Start with Random Forest. It’s fast, interpretable (feature importances tell you what sensors matter), and works well with minimal tuning. If your data is clean and you’ve done decent feature engineering (Part 2), RF will get you 90% of the way there.
  • Switch to XGBoost if you need that extra 5% accuracy and can afford the slightly longer training time. It’s still a tree model, still interpretable, just better at capturing interactions.
  • Use LSTM if you’re doing real-time inference with streaming data, or if your degradation patterns are highly non-stationary (operating conditions change frequently, multiple fault modes interact). The LSTM’s ability to maintain state is worth the training complexity.
  • Try the hybrid if you have the engineering resources. Deploy RF for initial predictions, LSTM to refine them. This is what I’d do for a high-stakes system (aircraft engines, medical devices) where you can’t afford false negatives.

One thing I haven’t tested at scale: how do these models generalize to new operating conditions? The C-MAPSS dataset has four subsets (FD001-FD004) with varying conditions and fault modes. My benchmarks are on FD001, the easiest. On FD002 and FD004 (the subsets with six operating conditions), I’d expect the LSTM to struggle unless you condition the model on operating regime (maybe concat the regime ID as an input feature). My best guess is that in real industrial settings, you’d need to retrain periodically as the machinery ages and fault patterns shift.

In Part 4, we’ll deploy the best model (probably the hybrid) into a real-time dashboard with alerting. I’m curious whether the LSTM’s online inference advantage holds up when you add Kafka streaming, database writes, and UI updates every second—there’s latency beyond just model inference that might negate the benefit. But we’ll see.

CBM Portfolio Project Series (3/4)
