Advanced Feature Engineering for Financial Time-Series

Updated Feb 6, 2026

Introduction

In the previous episode, we explored how to conduct effective EDA on stock price data, identifying patterns, trends, and anomalies that inform our modeling approach. Now we move into one of the most critical phases of any financial machine learning project: feature engineering.

Feature engineering is where domain knowledge meets data science. Raw stock prices alone rarely capture the complex dynamics of financial markets. By constructing technical indicators, statistical transformations, and temporal features, we create a richer representation that helps our models understand market behavior.

This episode focuses on practical feature engineering techniques specifically designed for financial time-series data, with hands-on examples using real Kaggle datasets.

Setting Up the Environment

Let’s start by loading a financial dataset from Kaggle. We’ll use the S&P 500 Stock Data dataset as our primary example:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Load sample S&P 500 data
df = pd.read_csv('sp500_stocks.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df.set_index('Date', inplace=True)

# Focus on a single stock for demonstration
stock_data = df[df['Symbol'] == 'AAPL'].copy()
print(stock_data.head())

Our baseline dataframe typically contains:
Date: Trading date
Open: Opening price
High: Highest price during the day
Low: Lowest price during the day
Close: Closing price
Volume: Number of shares traded

Understanding Returns: The Foundation

Simple Returns vs. Log Returns

Before diving into technical indicators, we must understand how to properly calculate returns. Financial analysts use two primary types:

Simple Return:

R_t = \frac{P_t - P_{t-1}}{P_{t-1}}

Where:
R_t = return at time t
P_t = price at time t
P_{t-1} = price at time t-1

Log Return:

r_t = \ln\left(\frac{P_t}{P_{t-1}}\right)

Where:
r_t = logarithmic return at time t
\ln = natural logarithm

Why prefer log returns?

  1. Time additivity: Log returns can be summed over time periods (verified numerically below)
  2. Statistical properties: More closely approximate a normal distribution
  3. Symmetry: A log return of +x followed by -x returns the price exactly to its original level; with simple returns, a 50% gain followed by a 50% loss leaves you down 25%
  4. Mathematical convenience: Easier to work with in statistical models
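
Both the additivity and symmetry claims are easy to verify numerically. Here is a minimal sanity check using only NumPy (no new data or libraries assumed):

# Sanity check: log returns add across periods, simple returns do not
prices = np.array([100.0, 150.0, 75.0])    # +50%, then -50%

simple = prices[1:] / prices[:-1] - 1      # [0.5, -0.5]
log_r = np.log(prices[1:] / prices[:-1])   # [0.4055, -0.6931]

print(simple.sum())                        # 0.0 -- misleading
print(prices[-1] / prices[0] - 1)          # -0.25, the actual two-period return
print(log_r.sum())                         # -0.2877
print(np.log(prices[-1] / prices[0]))      # -0.2877, matches exactly

With those properties in mind, the helper below computes both kinds of return: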
def calculate_returns(df, price_col='Close'):
    """
    Calculate both simple and log returns.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with price data
    price_col : str
        Column name containing prices

    Returns:
    --------
    pd.DataFrame
        DataFrame with added return columns
    """
    # Simple returns
    df['simple_return'] = df[price_col].pct_change()

    # Log returns
    df['log_return'] = np.log(df[price_col] / df[price_col].shift(1))

    return df

stock_data = calculate_returns(stock_data)

# Compare distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].hist(stock_data['simple_return'].dropna(), bins=50, alpha=0.7, color='blue')
axes[0].set_title('Simple Returns Distribution')
axes[0].set_xlabel('Return')

axes[1].hist(stock_data['log_return'].dropna(), bins=50, alpha=0.7, color='green')
axes[1].set_title('Log Returns Distribution')
axes[1].set_xlabel('Log Return')
plt.tight_layout()
plt.show()
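
The histograms make the fat tails visible, and we can quantify the departure from normality with scipy.stats (already imported above). A quick check; note that daily returns almost always fail a strict normality test, so "more normal" is relative:

# Quantify non-normality: skew, excess kurtosis, Jarque-Bera test
for col in ['simple_return', 'log_return']:
    r = stock_data[col].dropna()
    jb_stat, jb_p = stats.jarque_bera(r)
    print(f"{col}: skew={stats.skew(r):.3f}, "
          f"excess kurtosis={stats.kurtosis(r):.3f}, "
          f"Jarque-Bera p={jb_p:.4f}")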

Technical Indicators: Capturing Market Psychology

Relative Strength Index (RSI)

RSI measures the magnitude of recent price changes to evaluate overbought or oversold conditions. It ranges from 0 to 100.

RSI = 100 - \frac{100}{1 + RS}

Where:
RS = average gain / average loss over the lookback period (typically 14 days)

Interpretation:
– RSI > 70: Potentially overbought (sell signal)
– RSI < 30: Potentially oversold (buy signal)

def calculate_rsi(df, column='Close', period=14):
    """
    Calculate Relative Strength Index (RSI).

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with price data
    column : str
        Column to calculate RSI on
    period : int
        Lookback period (default 14)

    Returns:
    --------
    pd.Series
        RSI values
    """
    # Calculate price changes
    delta = df[column].diff()

    # Separate gains and losses
    gain = delta.where(delta > 0, 0)
    loss = -delta.where(delta < 0, 0)

    # Wilder's smoothing: an EMA with alpha = 1/period (hence com = period - 1)
    avg_gain = gain.ewm(com=period-1, min_periods=period).mean()
    avg_loss = loss.ewm(com=period-1, min_periods=period).mean()

    # Calculate RS and RSI
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))

    return rsi

stock_data['RSI_14'] = calculate_rsi(stock_data)

# Visualize RSI with price
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

ax1.plot(stock_data.index, stock_data['Close'], label='Close Price', color='black')
ax1.set_ylabel('Price ($)')
ax1.set_title('Stock Price and RSI Indicator')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.plot(stock_data.index, stock_data['RSI_14'], label='RSI (14)', color='purple')
ax2.axhline(y=70, color='r', linestyle='--', label='Overbought (70)')
ax2.axhline(y=30, color='g', linestyle='--', label='Oversold (30)')
ax2.fill_between(stock_data.index, 30, 70, alpha=0.1, color='gray')
ax2.set_ylabel('RSI')
ax2.set_xlabel('Date')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
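
The 70/30 thresholds can also be encoded directly as binary regime features, which tree-based models often pick up more readily than the raw oscillator value. A minimal sketch (the column names are our own choice):

# Binary overbought/oversold flags derived from the RSI thresholds
stock_data['RSI_overbought'] = (stock_data['RSI_14'] > 70).astype(int)
stock_data['RSI_oversold'] = (stock_data['RSI_14'] < 30).astype(int)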

Moving Average Convergence Divergence (MACD)

MACD is a trend-following momentum indicator that shows the relationship between two moving averages.

MACD = EMA_{12} - EMA_{26}
Signal = EMA_9(MACD)
Histogram = MACD - Signal

Where:
EMA_{12} = 12-period exponential moving average of price
EMA_{26} = 26-period exponential moving average of price
Signal = 9-period EMA of the MACD line

Trading signals:
– MACD crosses above signal line: Bullish signal
– MACD crosses below signal line: Bearish signal

def calculate_macd(df, column='Close', fast=12, slow=26, signal=9):
    """
    Calculate MACD, Signal line, and Histogram.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with price data
    column : str
        Column to calculate MACD on
    fast : int
        Fast EMA period (default 12)
    slow : int
        Slow EMA period (default 26)
    signal : int
        Signal line period (default 9)

    Returns:
    --------
    tuple
        (MACD, Signal, Histogram)
    """
    # Calculate EMAs
    ema_fast = df[column].ewm(span=fast, adjust=False).mean()
    ema_slow = df[column].ewm(span=slow, adjust=False).mean()

    # MACD line
    macd_line = ema_fast - ema_slow

    # Signal line
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()

    # Histogram
    histogram = macd_line - signal_line

    return macd_line, signal_line, histogram

stock_data['MACD'], stock_data['MACD_signal'], stock_data['MACD_hist'] = calculate_macd(stock_data)

# Visualize MACD
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

ax1.plot(stock_data.index, stock_data['Close'], label='Close Price', color='black')
ax1.set_ylabel('Price ($)')
ax1.set_title('Stock Price and MACD Indicator')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.plot(stock_data.index, stock_data['MACD'], label='MACD', color='blue')
ax2.plot(stock_data.index, stock_data['MACD_signal'], label='Signal', color='red')
ax2.bar(stock_data.index, stock_data['MACD_hist'], label='Histogram', color='gray', alpha=0.3)
ax2.set_ylabel('MACD')
ax2.set_xlabel('Date')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
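
The crossover signals described above can likewise be materialized as features. One way to flag them, sketched below (column names are illustrative):

# Flag MACD/signal-line crossovers as binary event features
above = stock_data['MACD'] > stock_data['MACD_signal']
prev = above.shift(1, fill_value=False)
stock_data['MACD_bullish_cross'] = (above & ~prev).astype(int)
stock_data['MACD_bearish_cross'] = (~above & prev).astype(int)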

Bollinger Bands

Bollinger Bands measure market volatility and provide relative price levels.

Upper Band = SMA_{20} + 2\sigma_{20}
Middle Band = SMA_{20}
Lower Band = SMA_{20} - 2\sigma_{20}

Where:
SMA_{20} = 20-period simple moving average
\sigma_{20} = 20-period standard deviation

Interpretation:
– Price near upper band: Potentially overbought
– Price near lower band: Potentially oversold
– Band width indicates volatility

def calculate_bollinger_bands(df, column='Close', period=20, num_std=2):
    """
    Calculate Bollinger Bands.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with price data
    column : str
        Column to calculate bands on
    period : int
        Moving average period (default 20)
    num_std : float
        Number of standard deviations (default 2)

    Returns:
    --------
    tuple
        (Middle band, Upper band, Lower band)
    """
    # Middle band (SMA)
    middle_band = df[column].rolling(window=period).mean()

    # Standard deviation
    std = df[column].rolling(window=period).std()

    # Upper and lower bands
    upper_band = middle_band + (num_std * std)
    lower_band = middle_band - (num_std * std)

    return middle_band, upper_band, lower_band

stock_data['BB_middle'], stock_data['BB_upper'], stock_data['BB_lower'] = calculate_bollinger_bands(stock_data)

# Calculate bandwidth as a feature
stock_data['BB_bandwidth'] = (stock_data['BB_upper'] - stock_data['BB_lower']) / stock_data['BB_middle']

# Visualize Bollinger Bands
plt.figure(figsize=(14, 7))
plt.plot(stock_data.index, stock_data['Close'], label='Close Price', color='black', linewidth=1.5)
plt.plot(stock_data.index, stock_data['BB_upper'], label='Upper Band', color='red', linestyle='--', alpha=0.7)
plt.plot(stock_data.index, stock_data['BB_middle'], label='Middle Band (SMA 20)', color='blue', linestyle='--', alpha=0.7)
plt.plot(stock_data.index, stock_data['BB_lower'], label='Lower Band', color='green', linestyle='--', alpha=0.7)
plt.fill_between(stock_data.index, stock_data['BB_upper'], stock_data['BB_lower'], alpha=0.1, color='gray')
plt.title('Bollinger Bands')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
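
A closely related feature is %B, which locates the close within the bands (0 = lower band, 1 = upper band, values outside that range on breakouts). A minimal sketch:

# %B: relative position of the close within the Bollinger Bands
stock_data['BB_pct_b'] = (
    (stock_data['Close'] - stock_data['BB_lower']) /
    (stock_data['BB_upper'] - stock_data['BB_lower'])
)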

Rolling Statistics: Capturing Temporal Dynamics

Rolling statistics help capture changing market conditions over time windows.

Rolling Mean and Volatility

def calculate_rolling_features(df, column='Close', windows=[5, 10, 20, 50, 200]):
    """
    Calculate rolling statistics for multiple windows.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with price data
    column : str
        Column to calculate statistics on
    windows : list
        List of window sizes

    Returns:
    --------
    pd.DataFrame
        DataFrame with added rolling features
    """
    for window in windows:
        # Rolling mean
        df[f'MA_{window}'] = df[column].rolling(window=window).mean()

        # Rolling standard deviation (volatility)
        df[f'STD_{window}'] = df[column].rolling(window=window).std()

        # Rolling min and max
        df[f'MIN_{window}'] = df[column].rolling(window=window).min()
        df[f'MAX_{window}'] = df[column].rolling(window=window).max()

        # Price position within range
        df[f'RANGE_POS_{window}'] = (
            (df[column] - df[f'MIN_{window}']) / 
            (df[f'MAX_{window}'] - df[f'MIN_{window}'])
        )

    return df

stock_data = calculate_rolling_features(stock_data, windows=[10, 20, 50])

# Visualize multiple moving averages
plt.figure(figsize=(14, 7))
plt.plot(stock_data.index, stock_data['Close'], label='Close Price', color='black', linewidth=1.5)
plt.plot(stock_data.index, stock_data['MA_10'], label='MA 10', alpha=0.7)
plt.plot(stock_data.index, stock_data['MA_20'], label='MA 20', alpha=0.7)
plt.plot(stock_data.index, stock_data['MA_50'], label='MA 50', alpha=0.7)
plt.title('Price with Multiple Moving Averages')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
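
Raw moving averages live on the same scale as the price itself, so many models do better with the price expressed relative to each average. A short sketch of scale-free variants (the column naming is our own):

# Scale-free trend features: price relative to each moving average
for window in [10, 20, 50]:
    stock_data[f'close_over_MA_{window}'] = (
        stock_data['Close'] / stock_data[f'MA_{window}'] - 1
    )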

Rolling Returns and Momentum

def calculate_momentum_features(df, price_col='Close', windows=[5, 10, 20, 60]):
    """
    Calculate momentum-based features.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with price data
    price_col : str
        Column containing prices
    windows : list
        List of lookback periods

    Returns:
    --------
    pd.DataFrame
        DataFrame with momentum features
    """
    for window in windows:
        # Momentum (price change over period)
        df[f'momentum_{window}'] = df[price_col].diff(window)

        # Rate of change
        df[f'ROC_{window}'] = df[price_col].pct_change(window)

        # Cumulative return over window (numerically identical to ROC
        # above; kept for explicit naming)
        df[f'cumulative_return_{window}'] = (
            (df[price_col] / df[price_col].shift(window)) - 1
        )

    return df

stock_data = calculate_momentum_features(stock_data)

Lag Features: Using Historical Information

Lag features allow models to learn from past values. This is crucial for time-series predictions.

def create_lag_features(df, columns, lags=[1, 2, 3, 5, 10]):
    """
    Create lagged features for specified columns.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with time-series data
    columns : list
        List of column names to create lags for
    lags : list
        List of lag periods

    Returns:
    --------
    pd.DataFrame
        DataFrame with lagged features
    """
    for col in columns:
        for lag in lags:
            df[f'{col}_lag_{lag}'] = df[col].shift(lag)

    return df

# Create lags for key features
lag_columns = ['Close', 'Volume', 'log_return', 'RSI_14']
stock_data = create_lag_features(stock_data, lag_columns, lags=[1, 2, 3, 5, 10])

print(f"Total features created: {len(stock_data.columns)}")
print("\nSample lag features:")
print(stock_data[['Close', 'Close_lag_1', 'Close_lag_2', 'Close_lag_3']].head(10))
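
Because shift(lag) only ever looks backward, lag features cannot leak future information into a row. A one-line sanity check makes this explicit (purely a verification, not part of the pipeline):

# Sanity check: Close_lag_1 is exactly Close shifted one row into the past
assert stock_data['Close_lag_1'].equals(stock_data['Close'].shift(1))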

Volume-Based Features

Trading volume provides insight into the strength of price movements.

def calculate_volume_features(df):
    """
    Calculate volume-based features.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with OHLCV data

    Returns:
    --------
    pd.DataFrame
        DataFrame with volume features
    """
    # Volume moving averages
    df['volume_MA_5'] = df['Volume'].rolling(window=5).mean()
    df['volume_MA_20'] = df['Volume'].rolling(window=20).mean()

    # Volume ratio
    df['volume_ratio'] = df['Volume'] / df['volume_MA_20']

    # On-Balance Volume (OBV): add volume on up days, subtract on down days,
    # and ignore unchanged closes (np.sign yields -1/0/+1)
    df['price_change'] = df['Close'].diff()
    df['direction'] = np.sign(df['price_change']).fillna(0)
    df['volume_signed'] = df['Volume'] * df['direction']
    df['OBV'] = df['volume_signed'].cumsum()

    # Volume-Weighted Average Price (VWAP), cumulative from the series start
    df['typical_price'] = (df['High'] + df['Low'] + df['Close']) / 3
    df['VWAP'] = (df['typical_price'] * df['Volume']).cumsum() / df['Volume'].cumsum()

    # Cleanup temporary columns
    df.drop(['price_change', 'direction', 'volume_signed', 'typical_price'], axis=1, inplace=True)

    return df

stock_data = calculate_volume_features(stock_data)
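
One caveat: the VWAP above accumulates from the start of the series, whereas VWAP is conventionally reset each trading session and computed intraday. On daily bars, a rolling variant is a common proxy; a sketch assuming an arbitrary 20-day window:

# Rolling 20-day VWAP as a proxy for session-reset VWAP on daily bars
typical = (stock_data['High'] + stock_data['Low'] + stock_data['Close']) / 3
pv_sum = (typical * stock_data['Volume']).rolling(window=20).sum()
stock_data['VWAP_20'] = pv_sum / stock_data['Volume'].rolling(window=20).sum()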

Price Action Features

def calculate_price_action_features(df):
    """
    Calculate price action and candlestick-based features.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with OHLC data

    Returns:
    --------
    pd.DataFrame
        DataFrame with price action features
    """
    # Daily range
    df['daily_range'] = df['High'] - df['Low']
    df['daily_range_pct'] = (df['High'] - df['Low']) / df['Close']

    # Body size (open to close)
    df['body_size'] = abs(df['Close'] - df['Open'])
    df['body_size_pct'] = df['body_size'] / df['Close']

    # Upper and lower shadows
    df['upper_shadow'] = df['High'] - df[['Open', 'Close']].max(axis=1)
    df['lower_shadow'] = df[['Open', 'Close']].min(axis=1) - df['Low']

    # Gap features
    df['gap'] = df['Open'] - df['Close'].shift(1)
    df['gap_pct'] = df['gap'] / df['Close'].shift(1)

    # True Range (for ATR calculation)
    df['high_low'] = df['High'] - df['Low']
    df['high_close'] = abs(df['High'] - df['Close'].shift(1))
    df['low_close'] = abs(df['Low'] - df['Close'].shift(1))
    df['true_range'] = df[['high_low', 'high_close', 'low_close']].max(axis=1)

    # Average True Range (ATR); a simple rolling mean here, whereas Wilder's
    # original definition uses his exponential smoothing
    df['ATR_14'] = df['true_range'].rolling(window=14).mean()

    # Cleanup
    df.drop(['high_low', 'high_close', 'low_close'], axis=1, inplace=True)

    return df

stock_data = calculate_price_action_features(stock_data)

Cyclical Time Features

Markets exhibit cyclical patterns based on time of day, day of week, and month.

def create_time_features(df):
    """
    Create cyclical time-based features.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with datetime index

    Returns:
    --------
    pd.DataFrame
        DataFrame with time features
    """
    # Extract time components
    df['day_of_week'] = df.index.dayofweek
    df['day_of_month'] = df.index.day
    df['week_of_year'] = df.index.isocalendar().week.astype(int)
    df['month'] = df.index.month
    df['quarter'] = df.index.quarter

    # Cyclical encoding (sine/cosine transformation)
    df['day_of_week_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
    df['day_of_week_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

    # Is Monday/Friday (potential day-of-week effects)
    df['is_monday'] = (df['day_of_week'] == 0).astype(int)
    df['is_friday'] = (df['day_of_week'] == 4).astype(int)

    # Is month end/start
    df['is_month_start'] = df.index.is_month_start.astype(int)
    df['is_month_end'] = df.index.is_month_end.astype(int)

    return df

stock_data = create_time_features(stock_data)
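
Why the sine/cosine transform? Encoded as raw integers, December (12) and January (1) look eleven steps apart even though they are adjacent months; in sin/cos space the distance wraps around correctly. A quick numeric illustration:

# Distance between December and January: raw encoding vs. cyclical encoding
months = np.array([12, 1])
raw_dist = abs(months[0] - months[1])               # 11 -- misleadingly far
enc = np.column_stack([np.sin(2 * np.pi * months / 12),
                       np.cos(2 * np.pi * months / 12)])
cyc_dist = np.linalg.norm(enc[0] - enc[1])          # ~0.52 -- correctly close
print(raw_dist, round(cyc_dist, 2))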

Putting It All Together: Feature Engineering Pipeline

class FinancialFeatureEngineer:
    """
    Complete feature engineering pipeline for financial time-series.
    """

    def __init__(self, rsi_period=14, macd_fast=12, macd_slow=26, macd_signal=9,
                 bb_period=20, bb_std=2):
        self.rsi_period = rsi_period
        self.macd_fast = macd_fast
        self.macd_slow = macd_slow
        self.macd_signal = macd_signal
        self.bb_period = bb_period
        self.bb_std = bb_std

    def fit_transform(self, df):
        """
        Apply all feature engineering steps.

        Parameters:
        -----------
        df : pd.DataFrame
            Raw OHLCV dataframe

        Returns:
        --------
        pd.DataFrame
            Dataframe with engineered features
        """
        df = df.copy()

        # Returns
        df = calculate_returns(df)

        # Technical indicators
        df['RSI_14'] = calculate_rsi(df, period=self.rsi_period)
        df['MACD'], df['MACD_signal'], df['MACD_hist'] = calculate_macd(
            df, fast=self.macd_fast, slow=self.macd_slow, signal=self.macd_signal
        )
        df['BB_middle'], df['BB_upper'], df['BB_lower'] = calculate_bollinger_bands(
            df, period=self.bb_period, num_std=self.bb_std
        )
        df['BB_bandwidth'] = (df['BB_upper'] - df['BB_lower']) / df['BB_middle']

        # Rolling statistics
        df = calculate_rolling_features(df, windows=[10, 20, 50])

        # Momentum
        df = calculate_momentum_features(df, windows=[5, 10, 20])

        # Volume features
        df = calculate_volume_features(df)

        # Price action
        df = calculate_price_action_features(df)

        # Time features
        df = create_time_features(df)

        # Lag features
        lag_cols = ['Close', 'log_return', 'RSI_14', 'Volume']
        df = create_lag_features(df, lag_cols, lags=[1, 2, 3, 5])

        return df

# Apply the pipeline
engine = FinancialFeatureEngineer()
stock_data_enriched = engine.fit_transform(stock_data)

print(f"\nOriginal features: 6 (OHLCV + Date)")
print(f"Engineered features: {len(stock_data_enriched.columns)}")
print(f"\nFeature categories created:")
print("- Returns (simple, log)")
print("- Technical indicators (RSI, MACD, Bollinger Bands)")
print("- Rolling statistics (MA, STD, MIN, MAX)")
print("- Momentum features")
print("- Volume features (OBV, VWAP)")
print("- Price action (ranges, shadows, ATR)")
print("- Time features (cyclical encoding)")
print("- Lag features")

Feature Importance and Selection

Not all engineered features will be useful. Let’s identify the most predictive ones:

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

def calculate_feature_importance(df, target_col='log_return', top_n=20):
    """
    Calculate feature importance using Random Forest.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with features
    target_col : str
        Target variable column
    top_n : int
        Number of top features to display

    Returns:
    --------
    pd.DataFrame
        Feature importance rankings
    """
    # Prepare data (copy so the target assignment below does not trigger
    # a SettingWithCopyWarning)
    df_clean = df.dropna().copy()

    # Create target (next day's return)
    df_clean['target'] = df_clean[target_col].shift(-1)
    df_clean = df_clean.dropna()

    # Separate features and target (exclude raw OHLCV and any non-numeric
    # columns such as Symbol, which would break the Random Forest fit)
    exclude = ['target', 'Open', 'High', 'Low', 'Close', 'Volume']
    feature_cols = [col for col in df_clean.columns if col not in exclude]
    X = df_clean[feature_cols].select_dtypes(include=[np.number])
    feature_cols = X.columns.tolist()
    y = df_clean['target']

    # Train Random Forest
    rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
    rf.fit(X, y)

    # Get feature importance
    importance_df = pd.DataFrame({
        'feature': feature_cols,
        'importance': rf.feature_importances_
    }).sort_values('importance', ascending=False)

    # Visualize top features
    plt.figure(figsize=(10, 8))
    top_features = importance_df.head(top_n)
    plt.barh(range(len(top_features)), top_features['importance'])
    plt.yticks(range(len(top_features)), top_features['feature'])
    plt.xlabel('Feature Importance')
    plt.title(f'Top {top_n} Most Important Features')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

    return importance_df

importance_df = calculate_feature_importance(stock_data_enriched, top_n=20)
print("\nTop 10 Most Important Features:")
print(importance_df.head(10))
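
One caution before trusting this ranking: the forest is fit on the full history, so the importances reflect in-sample fit. For time-series data, a more honest procedure fits on a chronological prefix and holds out the tail; a sketch of the split (the 80% cutoff is an arbitrary choice):

# Chronological split: never shuffle time-series rows before fitting
df_model = stock_data_enriched.dropna().copy()
df_model['target'] = df_model['log_return'].shift(-1)
df_model = df_model.dropna()

cutoff = int(len(df_model) * 0.8)
train, test = df_model.iloc[:cutoff], df_model.iloc[cutoff:]
print(f"Train: {train.index.min()} to {train.index.max()}")
print(f"Test:  {test.index.min()} to {test.index.max()}")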

Handling Missing Values and Data Quality

def clean_features(df, method='forward_fill', max_missing_pct=0.3):
    """
    Clean engineered features and handle missing values.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with engineered features
    method : str
        Filling method ('forward_fill', 'interpolate', 'drop')
    max_missing_pct : float
        Maximum allowed missing percentage per column

    Returns:
    --------
    pd.DataFrame
        Cleaned dataframe
    """
    df_clean = df.copy()

    # Check missing values
    missing_pct = df_clean.isnull().sum() / len(df_clean)

    # Drop columns with too many missing values
    cols_to_drop = missing_pct[missing_pct > max_missing_pct].index.tolist()
    if cols_to_drop:
        print(f"Dropping {len(cols_to_drop)} columns with >{max_missing_pct*100}% missing:")
        print(cols_to_drop)
        df_clean = df_clean.drop(columns=cols_to_drop)

    # Handle remaining missing values (.ffill()/.bfill() replace the
    # deprecated fillna(method=...) API; bfill only patches the initial
    # warm-up rows, but note it does borrow from the future there)
    if method == 'forward_fill':
        df_clean = df_clean.ffill().bfill()
    elif method == 'interpolate':
        df_clean = df_clean.interpolate(method='linear').bfill()
    elif method == 'drop':
        df_clean = df_clean.dropna()

    # Remove infinite values (e.g., from division by zero in ratio features)
    df_clean = df_clean.replace([np.inf, -np.inf], np.nan)
    df_clean = df_clean.ffill().bfill()

    return df_clean

stock_data_clean = clean_features(stock_data_enriched, method='forward_fill')
print(f"\nFinal dataset shape: {stock_data_clean.shape}")
print(f"Remaining missing values: {stock_data_clean.isnull().sum().sum()}")

Real-World Example: Multi-Stock Feature Engineering

# Load full S&P 500 dataset
df_full = pd.read_csv('sp500_stocks.csv')
df_full['Date'] = pd.to_datetime(df_full['Date'])

# Process multiple stocks
stocks = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA']
engineered_data = {}

for symbol in stocks:
    print(f"Processing {symbol}...")
    stock_df = df_full[df_full['Symbol'] == symbol].copy()
    stock_df = stock_df.sort_values('Date').set_index('Date')

    # Apply feature engineering
    engine = FinancialFeatureEngineer()
    stock_df_enriched = engine.fit_transform(stock_df)
    stock_df_clean = clean_features(stock_df_enriched)

    engineered_data[symbol] = stock_df_clean
    print(f"  Features: {stock_df_clean.shape[1]}, Rows: {stock_df_clean.shape[0]}")

# Save processed data
for symbol, data in engineered_data.items():
    data.to_csv(f'features_{symbol}.csv')
    print(f"Saved features_{symbol}.csv")

Summary Table: Key Features Created

| Category | Features | Formula/Method | Use Case |
| --- | --- | --- | --- |
| Returns | log_return, simple_return | r_t = \ln(P_t/P_{t-1}) | Target variable, momentum |
| RSI | RSI_14 | RSI = 100 - \frac{100}{1+RS} | Overbought/oversold detection |
| MACD | MACD, MACD_signal, MACD_hist | MACD = EMA_{12} - EMA_{26} | Trend and momentum |
| Bollinger Bands | BB_upper, BB_middle, BB_lower, BB_bandwidth | BB = SMA_{20} \pm 2\sigma_{20} | Volatility and price extremes |
| Rolling Stats | MA_10, MA_20, STD_20, etc. | Rolling windows | Trend and volatility |
| Momentum | ROC_10, momentum_20 | Price changes over periods | Trend strength |
| Volume | OBV, VWAP, volume_ratio | Cumulative signed volume | Confirmation of price moves |
| Price Action | ATR_14, daily_range, gaps | True range calculations | Volatility measurement |
| Lags | Close_lag_1, RSI_lag_5 | Shifted values | Historical context |
| Time | day_of_week_sin, month_cos | Cyclical encoding | Seasonal patterns |

Conclusion

In this episode, we’ve transformed raw OHLCV data into a rich feature set ready for machine learning models. We covered:

  • Returns calculation: Understanding the difference between simple and log returns
  • Technical indicators: RSI, MACD, and Bollinger Bands with full mathematical explanations
  • Rolling statistics: Capturing temporal dynamics through moving averages and volatility
  • Lag features: Providing historical context to models
  • Volume analysis: Understanding the conviction behind price movements
  • Feature importance: Identifying the most predictive features

These engineered features capture market psychology, momentum, volatility, and temporal patterns—the essential ingredients for predictive financial models.

Next Episode Preview: In episode 4, we’ll apply these features to build a Credit Risk Scoring Model. We’ll explore classification algorithms, handle imbalanced datasets, and create interpretable risk scores using techniques like logistic regression, gradient boosting, and neural networks. We’ll also dive into model evaluation metrics specific to credit risk, including precision-recall trade-offs and business impact analysis.

Mastering Financial Data Science with Kaggle Series (3/6)
