Technical Indicators and Feature Engineering with Pandas for Quant Trading

Updated Feb 6, 2026

The Feature That Broke My Backtest

Most technical indicators you’ll find in tutorials are wrong. Not mathematically wrong — they compute the right numbers — but wrong in a way that’ll silently destroy your backtest results. The problem is lookahead bias, and it’s so easy to introduce that I’d estimate half the “profitable” strategies on Medium wouldn’t survive a proper walk-forward test.

Here’s what a production-grade feature engineering pipeline actually looks like, and then we’ll work backwards to see where the naive approaches fall apart.

import pandas as pd
import numpy as np
from typing import Optional

def build_feature_matrix(df: pd.DataFrame, price_col: str = 'close') -> pd.DataFrame:
    """
    Build technical indicator features with proper alignment.
    Expects OHLCV DataFrame with DatetimeIndex, sorted ascending.
    """
    feat = pd.DataFrame(index=df.index)
    close = df[price_col]
    high = df['high']
    low = df['low']
    volume = df['volume']

    # Returns at multiple horizons
    for lag in [1, 5, 10, 21]:
        feat[f'ret_{lag}d'] = close.pct_change(lag)

    # Volatility (21-day rolling std of daily returns, annualized)
    daily_ret = close.pct_change()
    feat['volatility_21d'] = daily_ret.rolling(21, min_periods=15).std() * np.sqrt(252)

    # RSI — the version that actually matches TradingView
    feat['rsi_14'] = _rsi_ewm(close, period=14)

    # MACD
    ema12 = close.ewm(span=12, adjust=False).mean()
    ema26 = close.ewm(span=26, adjust=False).mean()
    feat['macd'] = ema12 - ema26
    feat['macd_signal'] = feat['macd'].ewm(span=9, adjust=False).mean()
    feat['macd_hist'] = feat['macd'] - feat['macd_signal']

    # Bollinger Band width (normalized)
    sma20 = close.rolling(20).mean()
    std20 = close.rolling(20).std()
    feat['bb_width'] = (2 * std20) / sma20
    feat['bb_pctb'] = (close - (sma20 - 2 * std20)) / (4 * std20)

    # Volume features
    feat['volume_sma_ratio'] = volume / volume.rolling(20).mean()
    feat['volume_zscore'] = (
        (volume - volume.rolling(60, min_periods=40).mean())
        / volume.rolling(60, min_periods=40).std()
    )

    # ATR (Average True Range)
    tr = pd.concat([
        high - low,
        (high - close.shift(1)).abs(),
        (low - close.shift(1)).abs()
    ], axis=1).max(axis=1)
    feat['atr_14'] = tr.ewm(span=14, adjust=False).mean()
    feat['atr_pct'] = feat['atr_14'] / close  # as percentage of price

    # Price position within N-day range
    for window in [10, 50]:
        roll_high = high.rolling(window).max()
        roll_low = low.rolling(window).min()
        denom = roll_high - roll_low
        # Guard against zero range (happens in illiquid stocks)
        feat[f'price_position_{window}d'] = np.where(
            denom < 1e-8, 0.5, (close - roll_low) / denom
        )

    # Shift ALL features by 1 day to prevent lookahead
    # This is the line that matters most in this entire function
    feat = feat.shift(1)

    return feat


def _rsi_ewm(close: pd.Series, period: int = 14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)
    # Wilder's smoothing = EWM with alpha=1/period
    avg_gain = gain.ewm(alpha=1/period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1/period, min_periods=period, adjust=False).mean()
    rs = avg_gain / avg_loss
    # avg_loss can be zero if price only went up
    return 100 - (100 / (1 + rs))

That feat.shift(1) at the bottom? That single line is the difference between a strategy that looks like it prints money and one that actually works. Everything computed from today’s close gets aligned to tomorrow’s row, because in reality you can’t trade on the closing price at the moment of close — you’d be placing orders on the next open at best.

Close-up of a computer screen displaying cryptocurrency market trends and trading data. — Photo by AlphaTradeZone on Pexels

Why “Just Use TA-Lib” Isn’t the Answer

TA-Lib is fast. It’s battle-tested C code under the hood, and for a single indicator computation it’ll beat pure pandas by 10-50x. But here’s the thing: speed doesn’t matter much when you’re computing features once for a backtest over a few thousand daily bars.

What does matter is consistency in how you handle alignment and NaN propagation. TA-Lib functions return numpy arrays with the same length as input, padded with NaN at the front. That’s fine. But the moment you start mixing TA-Lib outputs with pandas rolling operations, you’re juggling two different NaN conventions, and bugs creep in at the edges.

import talib

# TA-Lib RSI
rsi_talib = talib.RSI(close_prices.values, timeperiod=14)

# pandas EWM RSI (our version above)
rsi_pandas = _rsi_ewm(close_prices, period=14)

# They don't match exactly for the first ~100 bars
print(np.max(np.abs(rsi_talib[14:100] - rsi_pandas.values[14:100])))
# Output: 2.847 (on SPY daily data)

That 2.8-point discrepancy in RSI during the warm-up period isn’t a bug per se — it’s a difference in initialization. TA-Lib uses a simple moving average for the first period bars, then switches to exponential smoothing. Our EWM version uses exponential from the start. After about 100 bars they converge to within 0.01. But if you’re running cross-sectional signals on hundreds of stocks, some of which have short histories, those warm-up differences can introduce a subtle selection bias.

I’d recommend picking one approach and sticking with it across your entire pipeline. If you go with pandas (which I prefer for readability and debugging), set min_periods explicitly on every rolling and ewm call. The default min_periods=1 for ewm is a trap — it’ll happily return a “14-day EMA” computed from a single data point.

The Feature Engineering Part That Actually Matters

Raw indicators are a starting point, not features. An RSI of 72 means something different for a momentum stock that’s been above 70 for three weeks versus a mean-reverting ETF that just spiked. The transformation from raw indicator to useful feature is where the actual alpha lives.

Normalization: Z-Scores vs. Rank vs. Percentile

For cross-sectional strategies (comparing stocks against each other), raw values are almost useless. An ATR of 3.5 on a $200 stock is nothing; on a$ 10 stock it’s enormous. Even percentage-based indicators like RSI suffer from distributional differences across assets.

The approach that works best in my experience is rolling z-score normalization:

$z_{t} = \frac{x_{t} – \bar{x}_{t,w}}{\sigma_{t,w}}$

where $\bar{x}_{t,w}$ and $\sigma_{t,w}$ are the rolling mean and standard deviation over a window $w$ . This centers each feature relative to its own recent history.

def rolling_zscore(series: pd.Series, window: int = 63) -> pd.Series:
    """63 trading days ≈ 1 quarter"""
    mu = series.rolling(window, min_periods=int(window * 0.7)).mean()
    sigma = series.rolling(window, min_periods=int(window * 0.7)).std()
    # Clamp extreme z-scores. Outliers beyond ±4 are almost always data errors
    # or one-off events you don't want driving your model.
    z = ((series - mu) / sigma).clip(-4, 4)
    return z

Why 63 days? A quarter gives you enough history to be statistically meaningful while still adapting to regime changes. Shorter windows (21 days) make features too noisy; longer ones (252 days) make them too slow to react. But honestly, this is one of those parameters where the exact value matters less than being consistent. Pick one, use it everywhere, and move on.

For tree-based models (random forests, gradient boosting), you can skip normalization entirely. They don’t care about scale. But if you’re feeding features into a linear model or neural network, z-scoring is non-negotiable.

Interaction Features

Some of the strongest signals come from combining indicators that individually look mediocre. The classic example: RSI alone has weak predictive power for next-day returns. Volume alone is mostly noise. But RSI divergence combined with a volume spike? That’s a setup that shows up in virtually every technical analysis textbook for a reason.

# Price making new 20-day high while RSI is NOT making new high = bearish divergence
price_at_high = (close >= high.rolling(20).max())
rsi_below_prev_high = feat['rsi_14'] < feat['rsi_14'].rolling(20).max()
feat['bearish_divergence'] = (price_at_high & rsi_below_prev_high).astype(int)

# Volume confirmation: signal is stronger when volume is above average
feat['divergence_vol_confirm'] = (
    feat['bearish_divergence'] * feat['volume_sma_ratio']
)

Why does this combination work better than either signal alone? My best guess is that high volume during a divergence indicates institutional participation — larger players unwinding positions while retail pushes price to new highs. But I’m not entirely sure this interpretation holds across all market regimes. The empirical edge is more reliable than the narrative.

Lagged Features and Autocorrelation

Financial returns exhibit very weak autocorrelation at the daily level — the serial correlation of S&P 500 daily returns is somewhere around 0.01 to 0.03 depending on the period you measure. But volatility is highly autocorrelated (this is the foundation of GARCH models, introduced by Bollerslev in 1986), and so are volume patterns.

Adding lagged features captures this temporal structure:

# Lagged volatility features
for lag in [1, 2, 3, 5]:
    feat[f'volatility_lag{lag}'] = feat['volatility_21d'].shift(lag)

# Rolling return autocorrelation (slow to compute — worth caching)
feat['ret_autocorr_21d'] = (
    daily_ret.rolling(21)
    .apply(lambda x: x.autocorr(lag=1), raw=False)
)

That .apply(lambda x: x.autocorr(lag=1), raw=False) is painfully slow on long series. On 20 years of daily data (~5000 rows), it takes about 2 seconds on my machine (Python 3.11, pandas 2.1). For a universe of 500 stocks, that’s over 15 minutes for a single feature. If this becomes a bottleneck, you can rewrite it with numpy’s np.corrcoef in a vectorized rolling loop, which cuts the time by roughly 8x. But for prototyping on a single stock, the pandas version is fine.

Handling the Ugly Parts

Real market data has gaps, splits, and regime changes that will quietly corrupt your features if you don’t handle them.

Stock splits and dividends. Always use adjusted close prices for computing returns and indicators. If you’re pulling data from Yahoo Finance (via yfinance), the Adj Close column handles this. But watch out — yfinance retroactively adjusts the entire history when a new split happens, which means your feature values from last month can change. This breaks reproducibility. The workaround is to cache raw data and apply adjustments yourself using a known split calendar.

Gaps from halts and holidays. A stock that’s halted for three days will show up as missing rows in daily data (or as identical OHLC values, depending on the data source). Rolling calculations treat these differently:

# This 5-day return might actually span 8 calendar days if there's a halt
ret_5d = close.pct_change(5)

# More robust: use business day count
feat['trading_days_gap'] = (
    df.index.to_series().diff().dt.days
)
# Flag rows where the gap is suspiciously large
feat['has_gap'] = (feat['trading_days_gap'] > 4).astype(int)

I’ve seen strategies that accidentally generate signals right after a trading halt because the sudden “gap” in price looks like a breakout to momentum indicators. Adding a simple gap flag and filtering those rows in your training data saves headaches downstream.

NaN propagation. This is the one that bites hardest. After all your rolling computations, the first N rows of your feature matrix will be NaN. After the shift(1) for lookahead prevention, you lose one more row. Then if you add lagged features, you lose more. A 21-day rolling window with a 5-day lag on top loses 27 rows. On daily data that’s about a month of data — no big deal. But on hourly bars for a stock that only has 6 months of history? You’ve just lost 10% of your dataset.

The principled approach: compute the effective NaN frontier for each feature and drop rows where any required feature is missing.

def drop_nan_frontier(feat: pd.DataFrame, target: pd.Series) -> tuple:
    """Drop rows where features or target have NaN. Return aligned pair."""
    mask = feat.notna().all(axis=1) & target.notna()
    print(f"Dropping {(~mask).sum()} rows ({(~mask).mean():.1%} of data)")
    return feat.loc[mask], target.loc[mask]

# Usage
features, target = drop_nan_frontier(feat, forward_returns)
# Output: Dropping 34 rows (0.7% of data)

Do not use fillna(0) or forward-fill on features. A missing RSI isn’t zero RSI — it’s undefined. Imputing it introduces a signal where none exists.

Putting It Together: A Complete Feature Matrix

Here’s what the full pipeline looks like, from raw OHLCV data to a clean feature matrix ready for modeling. This builds on the data fetching we set up in Part 2.

import yfinance as yf

ticker = yf.download('AAPL', start='2015-01-01', end='2024-01-01', progress=False)
print(f"Raw data shape: {ticker.shape}")
# Raw data shape: (2264, 6)

# Flatten MultiIndex columns if yfinance returns them (it does since 0.2.31)
if isinstance(ticker.columns, pd.MultiIndex):
    ticker.columns = ticker.columns.get_level_values(0)

ticker.columns = ticker.columns.str.lower()

# Build features
features = build_feature_matrix(ticker, price_col='adj close' if 'adj close' in ticker.columns else 'close')

# Z-score normalize for modeling
features_z = features.apply(lambda col: rolling_zscore(col, window=63))

# Forward return as target (5-day)
target = ticker['close'].pct_change(5).shift(-5)  # what happens NEXT 5 days

features_clean, target_clean = drop_nan_frontier(features_z, target)
print(f"Clean feature matrix: {features_clean.shape}")
print(f"Feature columns: {list(features_clean.columns)}")
# Clean feature matrix: (2097, 21)
# Feature columns: ['ret_1d', 'ret_5d', 'ret_10d', 'ret_21d', 'volatility_21d',
#   'rsi_14', 'macd', 'macd_signal', 'macd_hist', 'bb_width', 'bb_pctb',
#   'volume_sma_ratio', 'volume_zscore', 'atr_14', 'atr_pct',
#   'price_position_10d', 'price_position_50d', 'bearish_divergence',
#   'divergence_vol_confirm', 'volatility_lag1', 'ret_autocorr_21d']

Twenty-one features from six raw columns. And critically, every single feature at row $t$ is computed using only data available at or before time $t-1$ . No lookahead, no data leakage.

A quick sanity check I always run: compute the correlation between each feature and the target, then verify none of them are suspiciously high. If any feature has $|r| > 0.1$ with 5-day forward returns on daily stock data, something is probably wrong — either lookahead bias or a bug in the computation.

correlations = features_clean.corrwith(target_clean).sort_values()
print(correlations)
# ret_1d              -0.031
# macd_hist           -0.024
# rsi_14              -0.019
# ...
# volume_zscore        0.008
# volatility_21d       0.015
# atr_pct              0.022

Those weak correlations (-0.03 to +0.02) look right. Individual features having low predictive power is expected — if a single RSI value could predict next-week returns with $r = 0.3$ , every quant fund would trade the same signal and it’d be arbitraged away in days. The value comes from combining weak signals, which is exactly what the ML models in Part 6 will do.

What About Hundreds of Features?

There’s a temptation to throw in every indicator you can think of — Ichimoku clouds, Elder-Ray, Chaikin Money Flow, the entire TA-Lib catalog. More features, more signal, right?

Not really. With a universe of, say, 500 features computed on 2000 daily bars, you’re practically guaranteed to find features that correlate with forward returns purely by chance. The multiple testing problem is severe here. If you test 500 features at $p < 0.05$ , you’d expect roughly 25 false positives even under the null hypothesis of zero predictability. This is closely related to the work by Harvey, Liu, and Zhu (2016) in their paper on $p$ -hacking in finance — they argued that the critical $t$ -statistic for a new factor should be around 3.0, not the traditional 1.96, to account for the hundreds of factors that have been data-mined.

So instead of maximizing feature count, focus on features that capture different types of information: momentum (returns), mean reversion (RSI, Bollinger), volatility (ATR, realized vol), and volume. Those four families cover most of the information content in OHLCV data. Adding the 15th variation of a momentum indicator gives you collinearity, not alpha.

A practical dimensionality check: compute the correlation matrix of your features and look for pairs with $|r| > 0.85$ . In our 21-feature set, macd and macd_signal will be highly correlated — that’s expected and fine for tree models, but you’d want to drop one for linear models. The bb_pctb and price_position_10d features also overlap conceptually. Keep both for now; let the model sort it out, or use PCA if you’re worried about it.

$\text{VIF}_j = \frac{1}{1 – R_j^2}$

Variance Inflation Factor is another diagnostic worth running. A VIF above 10 indicates serious multicollinearity. In practice, I’ve found that keeping VIF under 5 for each feature gives cleaner results with linear models, though tree-based models are largely immune to this issue.

But here’s the honest truth: I haven’t rigorously tested whether VIF-based feature selection actually improves out-of-sample Sharpe ratios compared to just throwing everything into a gradient boosting model. My instinct says it helps for smaller datasets, but with enough data the regularization in XGBoost or LightGBM handles collinearity on its own. Take that with a grain of salt.

What Changes for Intraday Data

Everything above assumes daily bars. If you’re working with intraday data (1-minute, 5-minute), three things change dramatically.

First, the window sizes need to shrink. A 20-period SMA on daily data covers a month of market dynamics. A 20-period SMA on 1-minute bars covers 20 minutes — basically noise. For intraday work, I’d scale windows by the ratio of bar frequency to daily: a 20-day SMA becomes roughly a 20 × 78 = 1560-bar SMA on 5-minute data (78 five-minute bars per 6.5-hour trading day). Though in practice, most people use shorter windows intraday because the signal decay is much faster.

Second, overnight gaps become a feature rather than a nuisance. The gap between yesterday’s close and today’s open carries real information — it reflects overnight news, pre-market trading, and institutional order flow.

$\text{gap}_{t} = \frac{\text{open}_{t} – \text{close}_{t-1}}{\text{close}_{t-1}}$

Third, microstructure features start mattering. Bid-ask spread, order book imbalance, trade-to-quote ratio — none of these are computable from OHLCV data, but they’re the bread and butter of high-frequency strategies. That’s a different world from what we’re building here, and honestly one where I don’t have deep expertise.

For this series, we’ll stick with daily data. The concepts transfer to other frequencies, but the parameter choices and feature importance rankings change substantially.

Where This Leads

The feature matrix we built here is the foundation for everything that follows. In Part 4, we’ll feed these features into a backtesting framework and see how a simple signal — long when RSI z-score is below -1.5 and volume z-score is above 1.0 — performs on real data. Spoiler: the raw signal is mediocre, but it gets interesting once you add proper position sizing.

My recommendation for this stage: start with fewer features and add complexity only when you can demonstrate improved out-of-sample performance. The 21 features above are a solid baseline. Resist the urge to add more until your backtest infrastructure (Part 4) is in place to actually measure whether they help.

And if there’s one thing to take away from this part, it’s this: shift(1) your features. Everything else is optimization. Getting the temporal alignment wrong is the error that invalidates entire research pipelines, and it’s invisible unless you’re specifically looking for it.

Quant Investment with Python Series (3/8)

← Previous: Data Collection and Preprocessing for Quant Trading in Python Next: Backtesting Frameworks: Building Your First Trading Strategy →

Did you find this helpful?

☕ Buy me a coffee

Technical Indicators and Feature Engineering with Pandas for Quant Trading

The Feature That Broke My Backtest

Why “Just Use TA-Lib” Isn’t the Answer

The Feature Engineering Part That Actually Matters

Normalization: Z-Scores vs. Rank vs. Percentile

Interaction Features

Lagged Features and Autocorrelation

Handling the Ugly Parts

Putting It Together: A Complete Feature Matrix

What About Hundreds of Features?

What Changes for Intraday Data

Where This Leads

Comments

Leave a Reply Cancel reply

Technical Indicators and Feature Engineering with Pandas for Quant Trading

The Feature That Broke My Backtest

Why “Just Use TA-Lib” Isn’t the Answer

The Feature Engineering Part That Actually Matters

Normalization: Z-Scores vs. Rank vs. Percentile

Interaction Features

Lagged Features and Autocorrelation

Handling the Ugly Parts

Putting It Together: A Complete Feature Matrix

What About Hundreds of Features?

What Changes for Intraday Data

Where This Leads

Related Posts

Comments

Leave a Reply Cancel reply