Exploratory Data Analysis (EDA) for Stock Price Prediction

Updated Feb 6, 2026

Introduction

In the previous episode, we explored the rich landscape of financial datasets available on Kaggle, learning how to identify high-quality data sources for our projects. Now that we have access to financial data, it’s time to roll up our sleeves and dive deep into Exploratory Data Analysis (EDA)—the critical first step in any data science project.

EDA for financial time-series data differs significantly from traditional exploratory analysis. Stock prices exhibit unique characteristics like trends, seasonality, volatility clustering, and non-stationarity that require specialized techniques. In this episode, we’ll explore how to visualize stock price data effectively, calculate and interpret moving averages, analyze volatility patterns, and uncover correlations between different assets.

By the end of this tutorial, you’ll have a comprehensive toolkit for exploring financial data and extracting actionable insights that form the foundation for predictive modeling.

Setting Up Our Environment

Let’s start by importing the necessary libraries and loading a real Kaggle dataset. We’ll use the popular S&P 500 Stock Data dataset, which contains historical price information for major US companies.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings

warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (14, 7)

# Load stock data (example using S&P 500 dataset from Kaggle)
df = pd.read_csv('all_stocks_5yr.csv')

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")
print(f"\nNumber of unique stocks: {df['Name'].nunique()}")

# Preview the data
df.head()

Understanding Stock Price Data Structure

Financial market data typically contains several key components:

Column          | Description                                    | Importance for Analysis
Open            | Price at market opening                        | Captures overnight sentiment shifts
High            | Highest intraday price                         | Indicates bullish pressure
Low             | Lowest intraday price                          | Shows bearish pressure
Close           | Price at market closing                        | Most commonly used for analysis
Volume          | Number of shares traded                        | Measures market interest and liquidity
Adjusted Close  | Close price adjusted for splits and dividends  | Best for long-term analysis

Let’s examine a single stock in detail:

# Filter data for Apple (AAPL) as an example
aapl = df[df['Name'] == 'AAPL'].copy()
aapl = aapl.sort_values('date').reset_index(drop=True)

print(f"AAPL data points: {len(aapl)}")
print(f"\nBasic statistics:")
print(aapl[['open', 'high', 'low', 'close', 'volume']].describe())

# Check for missing values
print(f"\nMissing values:\n{aapl.isnull().sum()}")

Visualizing Stock Price Movements

Basic Price Chart with Volume

The foundation of financial EDA is visualizing price movements over time. A candlestick chart or line chart combined with volume provides essential insights.

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10), 
                                gridspec_kw={'height_ratios': [3, 1]})

# Price chart
ax1.plot(aapl['date'], aapl['close'], linewidth=2, label='Close Price', color='#2E86DE')
ax1.fill_between(aapl['date'], aapl['low'], aapl['high'], 
                  alpha=0.2, color='#54A0FF', label='Daily Range')
ax1.set_title('AAPL Stock Price (2013-2018)', fontsize=16, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12)
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)

# Volume chart
colors = ['#10AC84' if aapl['close'].iloc[i] >= aapl['open'].iloc[i] 
          else '#EE5A6F' for i in range(len(aapl))]
ax2.bar(aapl['date'], aapl['volume'], color=colors, alpha=0.6)
ax2.set_ylabel('Volume', fontsize=12)
ax2.set_xlabel('Date', fontsize=12)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Key Insight: Volume spikes often accompany significant price movements. High volume during price increases suggests strong buying interest, while high volume during declines indicates aggressive selling.
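
To make this concrete, here is a minimal sketch (using the aapl DataFrame prepared above, with an arbitrary 2x threshold chosen purely for illustration) that flags unusually heavy trading days and compares the size of the intraday move on those days against a typical day:

# Flag unusually heavy trading days: volume more than twice its 20-day average
# (the 2x threshold is an illustrative choice, not a standard value)
vol_ma_20 = aapl['volume'].rolling(window=20).mean()
spike_mask = aapl['volume'] > 2 * vol_ma_20

# Compare absolute open-to-close moves on spike days vs. all days
intraday_move = (aapl['close'] / aapl['open'] - 1).abs() * 100
print(f"Volume spike days: {spike_mask.sum()} of {len(aapl)}")
print(f"Avg absolute intraday move on spike days: {intraday_move[spike_mask].mean():.2f}%")
print(f"Avg absolute intraday move on all days:   {intraday_move.mean():.2f}%")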

When analyzing stocks over extended periods with significant growth, logarithmic scaling reveals percentage changes more clearly:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Linear scale
ax1.plot(aapl['date'], aapl['close'], linewidth=2, color='#2E86DE')
ax1.set_title('Linear Scale', fontsize=14, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12)
ax1.grid(True, alpha=0.3)

# Logarithmic scale
ax2.plot(aapl['date'], aapl['close'], linewidth=2, color='#8E44AD')
ax2.set_yscale('log')
ax2.set_title('Logarithmic Scale', fontsize=14, fontweight='bold')
ax2.set_ylabel('Price ($, log scale)', fontsize=12)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Moving Averages: Smoothing the Noise

Moving averages are fundamental technical indicators that smooth out price fluctuations to reveal underlying trends. The most common types are:

Simple Moving Average (SMA)

The SMA is calculated as:

SMA_t = \frac{1}{n} \sum_{i=0}^{n-1} P_{t-i}

Where:
SMA_t = Simple moving average at time t
n = Number of periods (window size)
P_{t-i} = Price at time t-i

Exponential Moving Average (EMA)

The EMA gives more weight to recent prices:

EMA_t = \alpha \cdot P_t + (1 - \alpha) \cdot EMA_{t-1}

Where:
\alpha = Smoothing factor (pandas' ewm(span=n) uses \alpha = 2 / (n + 1))
P_t = Current price
EMA_{t-1} = Previous EMA value

Let’s implement both:

# Calculate moving averages
aapl['SMA_20'] = aapl['close'].rolling(window=20).mean()
aapl['SMA_50'] = aapl['close'].rolling(window=50).mean()
aapl['SMA_200'] = aapl['close'].rolling(window=200).mean()
aapl['EMA_12'] = aapl['close'].ewm(span=12, adjust=False).mean()
aapl['EMA_26'] = aapl['close'].ewm(span=26, adjust=False).mean()

# Visualize moving averages
plt.figure(figsize=(14, 7))
plt.plot(aapl['date'], aapl['close'], linewidth=1.5, 
         label='Close Price', color='#34495E', alpha=0.7)
plt.plot(aapl['date'], aapl['SMA_20'], linewidth=2, 
         label='20-day SMA', color='#3498DB')
plt.plot(aapl['date'], aapl['SMA_50'], linewidth=2, 
         label='50-day SMA', color='#E74C3C')
plt.plot(aapl['date'], aapl['SMA_200'], linewidth=2, 
         label='200-day SMA', color='#2ECC71')

plt.title('AAPL Price with Multiple Moving Averages', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.legend(loc='upper left', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Moving Average Crossover Signals

Golden Cross and Death Cross are classic trading signals:

  • Golden Cross: 50-day SMA crosses above 200-day SMA (bullish signal)
  • Death Cross: 50-day SMA crosses below 200-day SMA (bearish signal)

# Detect crossover points
aapl['Signal'] = 0
aapl.loc[aapl['SMA_50'] > aapl['SMA_200'], 'Signal'] = 1
aapl['Position'] = aapl['Signal'].diff()

# Identify crossover dates
golden_crosses = aapl[aapl['Position'] == 1]
death_crosses = aapl[aapl['Position'] == -1]

print(f"Golden Crosses detected: {len(golden_crosses)}")
print(f"Death Crosses detected: {len(death_crosses)}")

# Visualize crossovers
plt.figure(figsize=(14, 7))
plt.plot(aapl['date'], aapl['close'], linewidth=1.5, 
         label='Close Price', color='#34495E', alpha=0.6)
plt.plot(aapl['date'], aapl['SMA_50'], linewidth=2, 
         label='50-day SMA', color='#3498DB')
plt.plot(aapl['date'], aapl['SMA_200'], linewidth=2, 
         label='200-day SMA', color='#E74C3C')

# Mark crossover points
plt.scatter(golden_crosses['date'], golden_crosses['close'], 
            color='green', s=100, marker='^', label='Golden Cross', zorder=5)
plt.scatter(death_crosses['date'], death_crosses['close'], 
            color='red', s=100, marker='v', label='Death Cross', zorder=5)

plt.title('Moving Average Crossover Signals', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.legend(loc='upper left', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Volatility Analysis: Measuring Market Risk

Volatility measures the degree of price variation over time—a critical metric for risk assessment. High volatility indicates greater uncertainty and risk.

Historical Volatility

Historical volatility is calculated using logarithmic returns:

r_t = \ln\left(\frac{P_t}{P_{t-1}}\right)
\sigma_{daily} = \sqrt{\frac{1}{n-1} \sum_{t=1}^{n} (r_t - \bar{r})^2}
\sigma_{annual} = \sigma_{daily} \cdot \sqrt{T}

Where:
r_t = Log return at time t
P_t = Price at time t
\bar{r} = Mean log return
\sigma_{daily} = Standard deviation of daily log returns
\sigma_{annual} = Annualized volatility
T = Trading days per year (typically 252)

# Calculate returns
aapl['returns'] = np.log(aapl['close'] / aapl['close'].shift(1))
aapl['returns_pct'] = aapl['close'].pct_change()

# Calculate rolling volatility (20-day window)
aapl['volatility_20'] = aapl['returns'].rolling(window=20).std() * np.sqrt(252)

# Calculate daily price range (High-Low)
aapl['daily_range'] = (aapl['high'] - aapl['low']) / aapl['close'] * 100

# Visualize volatility
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(14, 12))

# Price
ax1.plot(aapl['date'], aapl['close'], linewidth=2, color='#2E86DE')
ax1.set_title('AAPL Price', fontsize=14, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12)
ax1.grid(True, alpha=0.3)

# Returns distribution
ax2.plot(aapl['date'], aapl['returns_pct'] * 100, 
         linewidth=1, color='#8E44AD', alpha=0.7)
ax2.axhline(y=0, color='red', linestyle='--', linewidth=1)
ax2.set_title('Daily Returns (%)', fontsize=14, fontweight='bold')
ax2.set_ylabel('Return (%)', fontsize=12)
ax2.grid(True, alpha=0.3)

# Rolling volatility
ax3.plot(aapl['date'], aapl['volatility_20'] * 100, 
         linewidth=2, color='#E74C3C')
ax3.fill_between(aapl['date'], 0, aapl['volatility_20'] * 100, 
                  alpha=0.3, color='#E74C3C')
ax3.set_title('20-Day Rolling Volatility (Annualized)', fontsize=14, fontweight='bold')
ax3.set_ylabel('Volatility (%)', fontsize=12)
ax3.set_xlabel('Date', fontsize=12)
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics
print("\nReturn Statistics:")
print(f"Mean daily return: {aapl['returns_pct'].mean()*100:.4f}%")
print(f"Std dev of returns: {aapl['returns_pct'].std()*100:.4f}%")
print(f"Annualized volatility: {aapl['returns_pct'].std()*np.sqrt(252)*100:.2f}%")
print(f"\nSkewness: {aapl['returns_pct'].skew():.4f}")
print(f"Kurtosis: {aapl['returns_pct'].kurtosis():.4f}")

Key Insight: pandas' kurtosis() reports excess kurtosis, so a normal distribution scores 0. A clearly positive value indicates fat tails: financial returns contain far more extreme values than a normal distribution would predict, meaning large price swings occur more frequently than expected.
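
As a quick, illustrative check of this insight, we can compare the excess kurtosis of AAPL's returns with that of a simulated normal sample sharing the same mean and standard deviation (the simulated sample should sit near zero):

# Illustrative sketch: excess kurtosis of AAPL returns vs. a simulated
# normal sample with the same mean and standard deviation
np.random.seed(42)
real = aapl['returns_pct'].dropna()
simulated = np.random.normal(real.mean(), real.std(), size=len(real))

print(f"Excess kurtosis, AAPL returns:     {real.kurtosis():.2f}")
print(f"Excess kurtosis, simulated normal: {pd.Series(simulated).kurtosis():.2f}  (should be near 0)")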

Bollinger Bands

Bollinger Bands combine moving averages with volatility to identify overbought/oversold conditions:

Upper = SMA + (k \cdot \sigma)
Middle = SMA
Lower = SMA - (k \cdot \sigma)

Where k is typically 2 (representing 2 standard deviations) and \sigma is the rolling standard deviation of the closing price.

# Calculate Bollinger Bands
window = 20
aapl['BB_middle'] = aapl['close'].rolling(window=window).mean()
aapl['BB_std'] = aapl['close'].rolling(window=window).std()
aapl['BB_upper'] = aapl['BB_middle'] + 2 * aapl['BB_std']
aapl['BB_lower'] = aapl['BB_middle'] - 2 * aapl['BB_std']

# Calculate Band Width (volatility indicator)
aapl['BB_width'] = (aapl['BB_upper'] - aapl['BB_lower']) / aapl['BB_middle']

# Visualize Bollinger Bands
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10), 
                                gridspec_kw={'height_ratios': [3, 1]})

ax1.plot(aapl['date'], aapl['close'], linewidth=2, label='Close Price', color='#2E86DE')
ax1.plot(aapl['date'], aapl['BB_upper'], linewidth=1.5, 
         label='Upper Band', color='#E74C3C', linestyle='--')
ax1.plot(aapl['date'], aapl['BB_middle'], linewidth=1.5, 
         label='Middle Band (20-SMA)', color='#95A5A6')
ax1.plot(aapl['date'], aapl['BB_lower'], linewidth=1.5, 
         label='Lower Band', color='#2ECC71', linestyle='--')
ax1.fill_between(aapl['date'], aapl['BB_lower'], aapl['BB_upper'], 
                  alpha=0.1, color='#BDC3C7')
ax1.set_title('AAPL Bollinger Bands', fontsize=16, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12)
ax1.legend(loc='upper left', fontsize=11)
ax1.grid(True, alpha=0.3)

ax2.plot(aapl['date'], aapl['BB_width'], linewidth=2, color='#9B59B6')
ax2.set_title('Bollinger Band Width (Volatility)', fontsize=14, fontweight='bold')
ax2.set_ylabel('Band Width', fontsize=12)
ax2.set_xlabel('Date', fontsize=12)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Correlation Analysis Between Assets

Understanding correlations between different stocks is essential for portfolio diversification and risk management.

Preparing Multi-Asset Data

# Select multiple tech stocks for comparison
tickers = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'FB']

# Create a pivot table with closing prices
price_data = df[df['Name'].isin(tickers)].pivot(index='date', 
                                                  columns='Name', 
                                                  values='close')

# Calculate returns for each stock
returns_data = price_data.pct_change().dropna()

print(f"\nPrice data shape: {price_data.shape}")
print(f"Returns data shape: {returns_data.shape}")
returns_data.head()

Correlation Matrix and Heatmap

# Calculate correlation matrix
corr_matrix = returns_data.corr()

print("\nCorrelation Matrix:")
print(corr_matrix.round(3))

# Visualize correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='RdYlGn', 
            center=0, square=True, linewidths=2, 
            cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Asset Correlation Matrix (Daily Returns)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

Key Insight: High positive correlations (> 0.7) between assets mean they tend to move together, offering less diversification benefit. For risk reduction, seek assets with low or negative correlations.
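
To see why this matters, here is a small illustrative sketch that plugs the annualized volatilities of two of our stocks into the standard two-asset portfolio variance formula for a range of hypothetical correlation values (the equal weights and the correlation values are assumptions chosen for demonstration, not estimates from the data):

# Illustrative sketch: volatility of an equal-weight two-asset portfolio
# as a function of the correlation between the assets
vol_a = returns_data['AAPL'].std() * np.sqrt(252)
vol_b = returns_data['MSFT'].std() * np.sqrt(252)

for rho in [1.0, 0.7, 0.3, 0.0, -0.3]:
    # sigma_p^2 = w_a^2*s_a^2 + w_b^2*s_b^2 + 2*w_a*w_b*rho*s_a*s_b, with w_a = w_b = 0.5
    port_var = 0.25 * vol_a**2 + 0.25 * vol_b**2 + 0.5 * rho * vol_a * vol_b
    print(f"rho = {rho:+.1f} -> equal-weight portfolio volatility = {np.sqrt(port_var) * 100:.1f}%")

The lower the correlation, the more the two assets' fluctuations cancel out, and the lower the combined volatility.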

Rolling Correlation Analysis

Correlations change over time, especially during market crises:

# Calculate 60-day rolling correlation between AAPL and GOOGL
rolling_corr = returns_data['AAPL'].rolling(window=60).corr(returns_data['GOOGL'])

plt.figure(figsize=(14, 6))
plt.plot(rolling_corr.index, rolling_corr.values, linewidth=2, color='#8E44AD')
plt.axhline(y=0.5, color='red', linestyle='--', linewidth=1, alpha=0.7, label='0.5 threshold')
plt.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
plt.fill_between(rolling_corr.index, 0, rolling_corr.values, 
                  where=(rolling_corr.values > 0.5), alpha=0.3, color='#E74C3C')
plt.title('60-Day Rolling Correlation: AAPL vs GOOGL', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Correlation Coefficient', fontsize=12)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Scatter Plot Matrix (Pair Plot)

Visually examine pairwise relationships:

# Create pair plot for returns
sns.pairplot(returns_data * 100, diag_kind='kde', corner=True, 
             plot_kws={'alpha': 0.6, 's': 20}, 
             diag_kws={'linewidth': 2})
plt.suptitle('Pairwise Returns Scatter Matrix (%)', 
             fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

Covariance Analysis

The covariance matrix is fundamental for portfolio optimization:

Cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})

# Calculate covariance matrix (annualized)
cov_matrix = returns_data.cov() * 252

print("\nAnnualized Covariance Matrix:")
print(cov_matrix.round(6))

# Visualize covariance
plt.figure(figsize=(10, 8))
sns.heatmap(cov_matrix, annot=True, fmt='.6f', cmap='coolwarm', 
            square=True, linewidths=2, 
            cbar_kws={'label': 'Covariance'})
plt.title('Asset Covariance Matrix (Annualized)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
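
As an illustration of that use, the sketch below computes the annualized variance of a hypothetical equal-weight portfolio as the quadratic form w·Σ·w, using the covariance matrix we just calculated (the equal weights are an assumption for demonstration only):

# Illustrative sketch: annualized variance and volatility of an
# equal-weight portfolio, computed as w' * Sigma * w
n_assets = len(cov_matrix)
weights = np.full(n_assets, 1 / n_assets)

port_variance = weights @ cov_matrix.values @ weights
port_volatility = np.sqrt(port_variance)

print(f"Equal-weight portfolio variance (annualized):   {port_variance:.6f}")
print(f"Equal-weight portfolio volatility (annualized): {port_volatility * 100:.2f}%")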

Advanced EDA Techniques

Autocorrelation Analysis

Autocorrelation measures whether past returns predict future returns:

from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Price autocorrelation
autocorrelation_plot(aapl['close'].dropna(), ax=axes[0, 0])
axes[0, 0].set_title('Price Autocorrelation', fontsize=14, fontweight='bold')

# Returns autocorrelation
autocorrelation_plot(aapl['returns'].dropna(), ax=axes[0, 1])
axes[0, 1].set_title('Returns Autocorrelation', fontsize=14, fontweight='bold')

# ACF plot for returns
plot_acf(aapl['returns'].dropna(), lags=40, ax=axes[1, 0])
axes[1, 0].set_title('ACF of Returns', fontsize=14, fontweight='bold')

# PACF plot for returns
plot_pacf(aapl['returns'].dropna(), lags=40, ax=axes[1, 1])
axes[1, 1].set_title('PACF of Returns', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

Key Insight: Price levels show strong autocorrelation simply because they are non-stationary (each day's price sits close to the previous day's), whereas returns typically show only weak autocorrelation, consistent with the weak form of the efficient market hypothesis.
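
One way to back this observation up (a minimal sketch using statsmodels, which is already installed if the ACF/PACF plots above worked) is the Augmented Dickey-Fuller test: the non-stationary price series typically fails to reject the unit-root null hypothesis, while returns reject it decisively.

from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test
# Null hypothesis: the series has a unit root (i.e., it is non-stationary)
for label, series in [('Close price', aapl['close']), ('Log returns', aapl['returns'])]:
    adf_stat, p_value = adfuller(series.dropna())[:2]
    print(f"{label}: ADF statistic = {adf_stat:.3f}, p-value = {p_value:.4f}")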

Distribution Analysis

Compare returns distribution to normal distribution:

from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Histogram with normal distribution overlay
aapl_returns_clean = aapl['returns_pct'].dropna() * 100
mu, sigma = aapl_returns_clean.mean(), aapl_returns_clean.std()

axes[0].hist(aapl_returns_clean, bins=50, density=True, 
             alpha=0.7, color='#3498DB', edgecolor='black')
xmin, xmax = axes[0].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, sigma)
axes[0].plot(x, p, 'r-', linewidth=2, label=f'Normal: μ={mu:.3f}, σ={sigma:.3f}')
axes[0].set_title('Returns Distribution vs Normal', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Daily Return (%)', fontsize=12)
axes[0].set_ylabel('Density', fontsize=12)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Q-Q plot
stats.probplot(aapl_returns_clean, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical tests
print("\nNormality Tests:")
shapiro_stat, shapiro_p = stats.shapiro(aapl_returns_clean)
print(f"Shapiro-Wilk test: statistic={shapiro_stat:.6f}, p-value={shapiro_p:.6f}")
ks_stat, ks_p = stats.kstest(aapl_returns_clean, 'norm', args=(mu, sigma))
print(f"Kolmogorov-Smirnov test: statistic={ks_stat:.6f}, p-value={ks_p:.6f}")

Practical EDA Workflow Summary

Here’s a complete EDA pipeline you can reuse:

def comprehensive_stock_eda(ticker_symbol, data):
    """
    Perform comprehensive EDA on a single stock.

    Parameters:
    - ticker_symbol: Stock ticker (e.g., 'AAPL')
    - data: Full dataset containing the stock

    Returns:
    - Dictionary with key statistics and insights
    """
    # Filter and prepare data
    stock = data[data['Name'] == ticker_symbol].copy()
    stock = stock.sort_values('date').reset_index(drop=True)

    # Calculate metrics
    stock['returns'] = stock['close'].pct_change()
    stock['log_returns'] = np.log(stock['close'] / stock['close'].shift(1))
    stock['volatility_20'] = stock['log_returns'].rolling(20).std() * np.sqrt(252)
    stock['SMA_20'] = stock['close'].rolling(20).mean()
    stock['SMA_50'] = stock['close'].rolling(50).mean()
    stock['volume_ma'] = stock['volume'].rolling(20).mean()

    # Summary statistics
    results = {
        'ticker': ticker_symbol,
        'data_points': len(stock),
        'date_range': (stock['date'].min(), stock['date'].max()),
        'price_start': stock['close'].iloc[0],
        'price_end': stock['close'].iloc[-1],
        'total_return': ((stock['close'].iloc[-1] / stock['close'].iloc[0]) - 1) * 100,
        'mean_daily_return': stock['returns'].mean() * 100,
        'volatility_annual': stock['returns'].std() * np.sqrt(252) * 100,
        'max_drawdown': ((stock['close'].cummax() - stock['close']) / stock['close'].cummax()).max() * 100,
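        # Note: the Sharpe ratio below assumes a risk-free rate of zero (a simplification)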
        'sharpe_ratio': (stock['returns'].mean() / stock['returns'].std()) * np.sqrt(252),
        'avg_volume': stock['volume'].mean(),
        'skewness': stock['returns'].skew(),
        'kurtosis': stock['returns'].kurtosis()
    }

    return stock, results

# Example usage
aapl_analyzed, aapl_stats = comprehensive_stock_eda('AAPL', df)

print("\n=== AAPL EDA Summary ===")
for key, value in aapl_stats.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")

Conclusion

In this episode, we’ve built a comprehensive toolkit for exploratory analysis of financial time-series data. We covered:

  • Data visualization techniques for price, volume, and trends using linear and logarithmic scales
  • Moving averages (SMA and EMA) and crossover signals for trend identification
  • Volatility analysis using historical volatility, Bollinger Bands, and return distributions
  • Correlation analysis between multiple assets for portfolio diversification insights
  • Advanced techniques like autocorrelation, normality testing, and reusable EDA pipelines

These EDA techniques form the foundation for effective feature engineering and predictive modeling. Understanding price patterns, volatility regimes, and inter-asset relationships is critical before building any machine learning model.

In the next episode, we’ll take these insights further by diving into Advanced Feature Engineering for Financial Time-Series. We’ll explore how to create powerful predictive features including technical indicators (RSI, MACD, ATR), time-based features, lag features, and domain-specific transformations that capture market microstructure. We’ll also discuss feature selection techniques to avoid overfitting in financial models.

Happy analyzing! 📊

Mastering Financial Data Science with Kaggle Series (2/6)
