Introduction
In the previous episode, we explored the rich landscape of financial datasets available on Kaggle, learning how to identify high-quality data sources for our projects. Now that we have access to financial data, it’s time to roll up our sleeves and dive deep into Exploratory Data Analysis (EDA)—the critical first step in any data science project.
EDA for financial time-series data differs significantly from traditional exploratory analysis. Stock prices exhibit unique characteristics like trends, seasonality, volatility clustering, and non-stationarity that require specialized techniques. In this episode, we’ll explore how to visualize stock price data effectively, calculate and interpret moving averages, analyze volatility patterns, and uncover correlations between different assets.
By the end of this tutorial, you’ll have a comprehensive toolkit for exploring financial data and extracting actionable insights that form the foundation for predictive modeling.
Setting Up Our Environment
Let’s start by importing the necessary libraries and loading a real Kaggle dataset. We’ll use the popular S&P 500 Stock Data dataset, which contains historical price information for major US companies.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')  # hide noisy library warnings for cleaner tutorial output
# Set visualization style
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (14, 7)
# Load stock data (example using S&P 500 dataset from Kaggle)
df = pd.read_csv('all_stocks_5yr.csv')
# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])
# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")
print(f"\nNumber of unique stocks: {df['Name'].nunique()}")
# Preview the data
df.head()
Understanding Stock Price Data Structure
Financial market data typically contains several key components:
| Column | Description | Importance for Analysis |
|---|---|---|
| Open | Price at market opening | Captures overnight sentiment shifts |
| High | Highest intraday price | Indicates bullish pressure |
| Low | Lowest intraday price | Shows bearish pressure |
| Close | Price at market closing | Most commonly used for analysis |
| Volume | Number of shares traded | Measures market interest and liquidity |
| Adjusted Close | Close price adjusted for splits/dividends | Best for long-term analysis |
Let’s examine a single stock in detail. Note that this particular Kaggle dataset uses lowercase column names (open, high, low, close, volume) plus a Name column for the ticker, and does not include an adjusted close:
# Filter data for Apple (AAPL) as an example
aapl = df[df['Name'] == 'AAPL'].copy()
aapl = aapl.sort_values('date').reset_index(drop=True)
print(f"AAPL data points: {len(aapl)}")
print(f"\nBasic statistics:")
print(aapl[['open', 'high', 'low', 'close', 'volume']].describe())
# Check for missing values
print(f"\nMissing values:\n{aapl.isnull().sum()}")
Visualizing Stock Price Movements
Basic Price Chart with Volume
The foundation of financial EDA is visualizing price movements over time. A candlestick chart or line chart combined with volume provides essential insights.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10),
gridspec_kw={'height_ratios': [3, 1]})
# Price chart
ax1.plot(aapl['date'], aapl['close'], linewidth=2, label='Close Price', color='#2E86DE')
ax1.fill_between(aapl['date'], aapl['low'], aapl['high'],
alpha=0.2, color='#54A0FF', label='Daily Range')
ax1.set_title('AAPL Stock Price (2013-2018)', fontsize=16, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12)
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)
# Volume chart
colors = np.where(aapl['close'] >= aapl['open'], '#10AC84', '#EE5A6F')
ax2.bar(aapl['date'], aapl['volume'], color=colors, alpha=0.6)
ax2.set_ylabel('Volume', fontsize=12)
ax2.set_xlabel('Date', fontsize=12)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Key Insight: Volume spikes often accompany significant price movements. High volume during price increases suggests strong buying interest, while high volume during declines indicates aggressive selling.
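To make the volume-price relationship concrete, here is a minimal sketch on purely synthetic data (all numbers are made up for illustration): we inject a handful of high-volume "event" days with larger moves, then check that days with unusually high volume also show larger absolute returns.

```python
import numpy as np

# Synthetic illustration: days with unusually high volume tend to coincide
# with larger absolute price moves. All numbers here are hypothetical.
rng = np.random.default_rng(42)
n = 500
volume = rng.lognormal(mean=15, sigma=0.3, size=n)    # synthetic share volume
returns = rng.normal(0, 0.01, size=n)                 # quiet trading days
spike_days = rng.choice(n, size=25, replace=False)    # injected "event" days
returns[spike_days] += rng.normal(0, 0.04, size=25)   # bigger moves...
volume[spike_days] *= 3                               # ...on much heavier volume

high_volume = volume > 2 * np.median(volume)          # simple spike flag
print(f"Mean |return| on high-volume days: {np.abs(returns[high_volume]).mean():.4f}")
print(f"Mean |return| on normal days:      {np.abs(returns[~high_volume]).mean():.4f}")
```

The same comparison on the real `aapl` frame only needs the spike flag swapped in for `aapl['volume']` and the returns column.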
Log-Scale Visualization for Long-Term Trends
When analyzing stocks over extended periods with significant growth, logarithmic scaling reveals percentage changes more clearly:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# Linear scale
ax1.plot(aapl['date'], aapl['close'], linewidth=2, color='#2E86DE')
ax1.set_title('Linear Scale', fontsize=14, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12)
ax1.grid(True, alpha=0.3)
# Logarithmic scale
ax2.plot(aapl['date'], aapl['close'], linewidth=2, color='#8E44AD')
ax2.set_yscale('log')
ax2.set_title('Logarithmic Scale', fontsize=14, fontweight='bold')
ax2.set_ylabel('Price ($, log scale)', fontsize=12)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Moving Averages: Smoothing the Noise
Moving averages are fundamental technical indicators that smooth out price fluctuations to reveal underlying trends. The most common types are:
Simple Moving Average (SMA)
The SMA is calculated as:

$$\mathrm{SMA}_t = \frac{1}{n} \sum_{i=0}^{n-1} P_{t-i}$$

Where:
- $\mathrm{SMA}_t$ = simple moving average at time $t$
- $n$ = number of periods (window size)
- $P_{t-i}$ = price at time $t-i$
Exponential Moving Average (EMA)
The EMA gives more weight to recent prices:

$$\mathrm{EMA}_t = \alpha P_t + (1 - \alpha)\,\mathrm{EMA}_{t-1}, \qquad \alpha = \frac{2}{n + 1}$$

Where:
- $\alpha$ = smoothing factor (for an $n$-period EMA)
- $P_t$ = current price
- $\mathrm{EMA}_{t-1}$ = previous EMA value
Let’s implement both:
# Calculate moving averages
aapl['SMA_20'] = aapl['close'].rolling(window=20).mean()
aapl['SMA_50'] = aapl['close'].rolling(window=50).mean()
aapl['SMA_200'] = aapl['close'].rolling(window=200).mean()
aapl['EMA_12'] = aapl['close'].ewm(span=12, adjust=False).mean()
aapl['EMA_26'] = aapl['close'].ewm(span=26, adjust=False).mean()
# Visualize moving averages
plt.figure(figsize=(14, 7))
plt.plot(aapl['date'], aapl['close'], linewidth=1.5,
label='Close Price', color='#34495E', alpha=0.7)
plt.plot(aapl['date'], aapl['SMA_20'], linewidth=2,
label='20-day SMA', color='#3498DB')
plt.plot(aapl['date'], aapl['SMA_50'], linewidth=2,
label='50-day SMA', color='#E74C3C')
plt.plot(aapl['date'], aapl['SMA_200'], linewidth=2,
label='200-day SMA', color='#2ECC71')
plt.title('AAPL Price with Multiple Moving Averages', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.legend(loc='upper left', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
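The EMA recursion above can be sanity-checked against pandas: with `adjust=False`, `ewm(span=n)` applies exactly $\mathrm{EMA}_t = \alpha P_t + (1-\alpha)\mathrm{EMA}_{t-1}$ with $\alpha = 2/(n+1)$, seeded with the first price. A small verification on toy prices:

```python
import numpy as np
import pandas as pd

# Verify that pandas ewm(span=12, adjust=False) matches the EMA recursion
# EMA_t = alpha * P_t + (1 - alpha) * EMA_{t-1}, with alpha = 2 / (span + 1).
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0])
span = 12
alpha = 2 / (span + 1)

manual = [prices.iloc[0]]                      # seed with the first price
for p in prices.iloc[1:]:
    manual.append(alpha * p + (1 - alpha) * manual[-1])

pandas_ema = prices.ewm(span=span, adjust=False).mean()
print(np.allclose(manual, pandas_ema))         # the two series agree
```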
Moving Average Crossover Signals
Golden Cross and Death Cross are classic trading signals:
- Golden Cross: 50-day SMA crosses above 200-day SMA (bullish signal)
- Death Cross: 50-day SMA crosses below 200-day SMA (bearish signal)
# Detect crossover points
aapl['Signal'] = 0
aapl.loc[aapl['SMA_50'] > aapl['SMA_200'], 'Signal'] = 1
aapl['Position'] = aapl['Signal'].diff()
# Identify crossover dates
golden_crosses = aapl[aapl['Position'] == 1]
death_crosses = aapl[aapl['Position'] == -1]
print(f"Golden Crosses detected: {len(golden_crosses)}")
print(f"Death Crosses detected: {len(death_crosses)}")
# Visualize crossovers
plt.figure(figsize=(14, 7))
plt.plot(aapl['date'], aapl['close'], linewidth=1.5,
label='Close Price', color='#34495E', alpha=0.6)
plt.plot(aapl['date'], aapl['SMA_50'], linewidth=2,
label='50-day SMA', color='#3498DB')
plt.plot(aapl['date'], aapl['SMA_200'], linewidth=2,
label='200-day SMA', color='#E74C3C')
# Mark crossover points
plt.scatter(golden_crosses['date'], golden_crosses['close'],
color='green', s=100, marker='^', label='Golden Cross', zorder=5)
plt.scatter(death_crosses['date'], death_crosses['close'],
color='red', s=100, marker='v', label='Death Cross', zorder=5)
plt.title('Moving Average Crossover Signals', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.legend(loc='upper left', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Volatility Analysis: Measuring Market Risk
Volatility measures the degree of price variation over time—a critical metric for risk assessment. High volatility indicates greater uncertainty and risk.
Historical Volatility
Historical volatility is calculated from logarithmic returns:

$$r_t = \ln\!\left(\frac{P_t}{P_{t-1}}\right), \qquad \sigma_{\text{annual}} = \sqrt{\frac{1}{n-1} \sum_{t=1}^{n} \left(r_t - \bar{r}\right)^2} \times \sqrt{T}$$

Where:
- $r_t$ = log return at time $t$
- $P_t$ = price at time $t$
- $\sigma_{\text{annual}}$ = annualized volatility
- $T$ = trading days per year (typically 252)
- $\bar{r}$ = mean return
# Calculate returns
aapl['returns'] = np.log(aapl['close'] / aapl['close'].shift(1))
aapl['returns_pct'] = aapl['close'].pct_change()
# Calculate rolling volatility (20-day window)
aapl['volatility_20'] = aapl['returns'].rolling(window=20).std() * np.sqrt(252)
# Calculate daily price range (High-Low)
aapl['daily_range'] = (aapl['high'] - aapl['low']) / aapl['close'] * 100
# Visualize volatility
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(14, 12))
# Price
ax1.plot(aapl['date'], aapl['close'], linewidth=2, color='#2E86DE')
ax1.set_title('AAPL Price', fontsize=14, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12)
ax1.grid(True, alpha=0.3)
# Returns distribution
ax2.plot(aapl['date'], aapl['returns_pct'] * 100,
linewidth=1, color='#8E44AD', alpha=0.7)
ax2.axhline(y=0, color='red', linestyle='--', linewidth=1)
ax2.set_title('Daily Returns (%)', fontsize=14, fontweight='bold')
ax2.set_ylabel('Return (%)', fontsize=12)
ax2.grid(True, alpha=0.3)
# Rolling volatility
ax3.plot(aapl['date'], aapl['volatility_20'] * 100,
linewidth=2, color='#E74C3C')
ax3.fill_between(aapl['date'], 0, aapl['volatility_20'] * 100,
alpha=0.3, color='#E74C3C')
ax3.set_title('20-Day Rolling Volatility (Annualized)', fontsize=14, fontweight='bold')
ax3.set_ylabel('Volatility (%)', fontsize=12)
ax3.set_xlabel('Date', fontsize=12)
ax3.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Summary statistics
print("\nReturn Statistics:")
print(f"Mean daily return: {aapl['returns_pct'].mean()*100:.4f}%")
print(f"Std dev of returns: {aapl['returns_pct'].std()*100:.4f}%")
print(f"Annualized volatility: {aapl['returns_pct'].std()*np.sqrt(252)*100:.2f}%")
print(f"\nSkewness: {aapl['returns_pct'].skew():.4f}")
print(f"Kurtosis: {aapl['returns_pct'].kurtosis():.4f}")
Key Insight: Pandas reports excess kurtosis, so a normal distribution scores 0 and any clearly positive value indicates fat tails. Financial returns produce extreme values far more often than a normal distribution would predict, meaning large price swings occur more frequently than expected.
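A quick synthetic illustration of what fat tails mean in practice: Student-t returns (a common stand-in for equity returns) produce many more 3-sigma days than a normal distribution with the same variance.

```python
import numpy as np
from scipy import stats

# Fat-tails illustration on synthetic data: compare the frequency of
# 3-sigma moves under a normal versus a Student-t(3) return distribution.
rng = np.random.default_rng(0)
n = 100_000
normal_r = rng.normal(0, 1, n)
fat_r = rng.standard_t(df=3, size=n)
fat_r /= fat_r.std()                          # rescale to unit variance

print(f"Normal: {np.mean(np.abs(normal_r) > 3) * 100:.3f}% of days beyond 3 sigma")
print(f"t(3):   {np.mean(np.abs(fat_r) > 3) * 100:.3f}% of days beyond 3 sigma")
print(f"Excess kurtosis of the t(3) sample: {stats.kurtosis(fat_r):.2f}")  # well above 0
```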
Bollinger Bands
Bollinger Bands combine moving averages with volatility to identify overbought/oversold conditions:

$$\text{Upper} = \mathrm{SMA}_n + k\sigma_n, \qquad \text{Lower} = \mathrm{SMA}_n - k\sigma_n$$

Where $\sigma_n$ is the rolling $n$-period standard deviation of the close and $k$ is typically 2 (representing 2 standard deviations).
# Calculate Bollinger Bands
window = 20
aapl['BB_middle'] = aapl['close'].rolling(window=window).mean()
aapl['BB_std'] = aapl['close'].rolling(window=window).std()
aapl['BB_upper'] = aapl['BB_middle'] + 2 * aapl['BB_std']
aapl['BB_lower'] = aapl['BB_middle'] - 2 * aapl['BB_std']
# Calculate Band Width (volatility indicator)
aapl['BB_width'] = (aapl['BB_upper'] - aapl['BB_lower']) / aapl['BB_middle']
# Visualize Bollinger Bands
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10),
gridspec_kw={'height_ratios': [3, 1]})
ax1.plot(aapl['date'], aapl['close'], linewidth=2, label='Close Price', color='#2E86DE')
ax1.plot(aapl['date'], aapl['BB_upper'], linewidth=1.5,
label='Upper Band', color='#E74C3C', linestyle='--')
ax1.plot(aapl['date'], aapl['BB_middle'], linewidth=1.5,
label='Middle Band (20-SMA)', color='#95A5A6')
ax1.plot(aapl['date'], aapl['BB_lower'], linewidth=1.5,
label='Lower Band', color='#2ECC71', linestyle='--')
ax1.fill_between(aapl['date'], aapl['BB_lower'], aapl['BB_upper'],
alpha=0.1, color='#BDC3C7')
ax1.set_title('AAPL Bollinger Bands', fontsize=16, fontweight='bold')
ax1.set_ylabel('Price ($)', fontsize=12)
ax1.legend(loc='upper left', fontsize=11)
ax1.grid(True, alpha=0.3)
ax2.plot(aapl['date'], aapl['BB_width'], linewidth=2, color='#9B59B6')
ax2.set_title('Bollinger Band Width (Volatility)', fontsize=14, fontweight='bold')
ax2.set_ylabel('Band Width', fontsize=12)
ax2.set_xlabel('Date', fontsize=12)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Correlation Analysis Between Assets
Understanding correlations between different stocks is essential for portfolio diversification and risk management.
Preparing Multi-Asset Data
# Select multiple tech stocks for comparison
tickers = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'FB']
# Create a pivot table with closing prices
price_data = df[df['Name'].isin(tickers)].pivot(index='date',
columns='Name',
values='close')
# Calculate returns for each stock
returns_data = price_data.pct_change().dropna()
print(f"\nPrice data shape: {price_data.shape}")
print(f"Returns data shape: {returns_data.shape}")
returns_data.head()
Correlation Matrix and Heatmap
# Calculate correlation matrix
corr_matrix = returns_data.corr()
print("\nCorrelation Matrix:")
print(corr_matrix.round(3))
# Visualize correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='RdYlGn',
center=0, square=True, linewidths=2,
cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Asset Correlation Matrix (Daily Returns)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
Key Insight: High positive correlations (> 0.7) between assets mean they tend to move together, offering less diversification benefit. For risk reduction, seek assets with low or negative correlations.
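The diversification benefit can be quantified with the two-asset portfolio variance formula, $\sigma_p^2 = w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2 w_1 w_2 \rho \sigma_1 \sigma_2$. A small sketch with hypothetical volatilities shows how portfolio risk falls as correlation drops:

```python
import numpy as np

# Two-asset example: portfolio volatility falls as correlation drops,
# holding individual volatilities fixed (50/50 weights, hypothetical numbers).
sigma1, sigma2 = 0.25, 0.30          # annualized vols of the two assets
w1, w2 = 0.5, 0.5

vols = {}
for rho in [0.9, 0.5, 0.0, -0.5]:
    var_p = (w1 * sigma1) ** 2 + (w2 * sigma2) ** 2 + 2 * w1 * w2 * rho * sigma1 * sigma2
    vols[rho] = np.sqrt(var_p)
    print(f"rho = {rho:+.1f} -> portfolio vol = {vols[rho] * 100:.1f}%")
```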
Rolling Correlation Analysis
Correlations change over time, especially during market crises:
# Calculate 60-day rolling correlation between AAPL and GOOGL
rolling_corr = returns_data['AAPL'].rolling(window=60).corr(returns_data['GOOGL'])
plt.figure(figsize=(14, 6))
plt.plot(rolling_corr.index, rolling_corr.values, linewidth=2, color='#8E44AD')
plt.axhline(y=0.5, color='red', linestyle='--', linewidth=1, alpha=0.7, label='0.5 threshold')
plt.axhline(y=0, color='gray', linestyle='-', linewidth=0.5)
plt.fill_between(rolling_corr.index, 0, rolling_corr.values,
where=(rolling_corr.values > 0.5), alpha=0.3, color='#E74C3C')
plt.title('60-Day Rolling Correlation: AAPL vs GOOGL', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Correlation Coefficient', fontsize=12)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Scatter Plot Matrix (Pair Plot)
Visually examine pairwise relationships:
# Create pair plot for returns
sns.pairplot(returns_data * 100, diag_kind='kde', corner=True,
plot_kws={'alpha': 0.6, 's': 20},
diag_kws={'linewidth': 2})
plt.suptitle('Pairwise Returns Scatter Matrix (%)',
fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
Covariance Analysis
The covariance matrix is fundamental for portfolio optimization:
# Calculate covariance matrix (annualized)
cov_matrix = returns_data.cov() * 252
print("\nAnnualized Covariance Matrix:")
print(cov_matrix.round(6))
# Visualize covariance
plt.figure(figsize=(10, 8))
sns.heatmap(cov_matrix, annot=True, fmt='.6f', cmap='coolwarm',
square=True, linewidths=2,
cbar_kws={'label': 'Covariance'})
plt.title('Asset Covariance Matrix (Annualized)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
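What makes the covariance matrix "fundamental" here is the quadratic form $\sigma_p^2 = w^\top \Sigma\, w$, which turns a covariance matrix and a weight vector into a portfolio variance. A minimal sketch with hypothetical numbers (the matrix and weights below are made up for illustration, not taken from the dataset):

```python
import numpy as np

# Portfolio volatility from a covariance matrix via w' Sigma w.
# Covariance values and weights are hypothetical, in annualized return units.
cov = np.array([
    [0.040, 0.018, 0.015],
    [0.018, 0.050, 0.020],
    [0.015, 0.020, 0.045],
])
weights = np.array([0.4, 0.3, 0.3])    # hypothetical weights summing to 1

port_var = weights @ cov @ weights     # quadratic form w' Sigma w
port_vol = np.sqrt(port_var)
print(f"Portfolio volatility: {port_vol * 100:.1f}% annualized")
```

In the episode's setting you would substitute `cov_matrix.values` and your chosen weights; diversification shows up as `port_vol` sitting below the volatility of the riskiest single asset.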
Advanced EDA Techniques
Autocorrelation Analysis
Autocorrelation measures whether past returns predict future returns:
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Price autocorrelation
autocorrelation_plot(aapl['close'].dropna(), ax=axes[0, 0])
axes[0, 0].set_title('Price Autocorrelation', fontsize=14, fontweight='bold')
# Returns autocorrelation
autocorrelation_plot(aapl['returns'].dropna(), ax=axes[0, 1])
axes[0, 1].set_title('Returns Autocorrelation', fontsize=14, fontweight='bold')
# ACF plot for returns
plot_acf(aapl['returns'].dropna(), lags=40, ax=axes[1, 0])
axes[1, 0].set_title('ACF of Returns', fontsize=14, fontweight='bold')
# PACF plot for returns
plot_pacf(aapl['returns'].dropna(), lags=40, ax=axes[1, 1])
axes[1, 1].set_title('PACF of Returns', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Key Insight: Price levels show strong autocorrelation simply because each day’s price sits close to the previous day’s, but returns typically show little autocorrelation, consistent with the weak form of the efficient market hypothesis.
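This contrast is easy to reproduce with a simulated random walk: the price level is strongly autocorrelated even though its increments carry no signal at all.

```python
import numpy as np

# Simulated geometric random walk: the price level is highly autocorrelated
# at lag 1, while the returns that generate it are (by construction) not.
rng = np.random.default_rng(1)
returns = rng.normal(0, 0.01, 2000)
prices = 100 * np.exp(np.cumsum(returns))

def lag1_autocorr(x):
    """Lag-1 sample autocorrelation of a 1-D array."""
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

print(f"Lag-1 autocorr of prices:  {lag1_autocorr(prices):.3f}")   # near 1
print(f"Lag-1 autocorr of returns: {lag1_autocorr(returns):.3f}")  # near 0
```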
Distribution Analysis
Compare returns distribution to normal distribution:
from scipy import stats
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Histogram with normal distribution overlay
aapl_returns_clean = aapl['returns_pct'].dropna() * 100
mu, sigma = aapl_returns_clean.mean(), aapl_returns_clean.std()
axes[0].hist(aapl_returns_clean, bins=50, density=True,
alpha=0.7, color='#3498DB', edgecolor='black')
xmin, xmax = axes[0].get_xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, sigma)
axes[0].plot(x, p, 'r-', linewidth=2, label=f'Normal: μ={mu:.3f}, σ={sigma:.3f}')
axes[0].set_title('Returns Distribution vs Normal', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Daily Return (%)', fontsize=12)
axes[0].set_ylabel('Density', fontsize=12)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Q-Q plot
stats.probplot(aapl_returns_clean, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Statistical tests
print("\nNormality Tests:")
shapiro_stat, shapiro_p = stats.shapiro(aapl_returns_clean)
print(f"Shapiro-Wilk test: statistic={shapiro_stat:.6f}, p-value={shapiro_p:.6f}")
ks_stat, ks_p = stats.kstest(aapl_returns_clean, 'norm', args=(mu, sigma))
print(f"Kolmogorov-Smirnov test: statistic={ks_stat:.6f}, p-value={ks_p:.6f}")
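Another standard normality check for return series is the Jarque-Bera test, which is built directly from the sample skewness and excess kurtosis discussed above. A hedged sketch on synthetic data (the real `aapl_returns_clean` series can be passed in the same way):

```python
import numpy as np
from scipy import stats

# Jarque-Bera test: flags non-normality via skewness and excess kurtosis.
# Synthetic samples for illustration; substitute real returns in practice.
rng = np.random.default_rng(7)
normal_sample = rng.normal(0, 1, 5000)
fat_sample = rng.standard_t(df=4, size=5000)   # heavy-tailed stand-in for returns

jb_norm_stat, jb_norm_p = stats.jarque_bera(normal_sample)
jb_fat_stat, jb_fat_p = stats.jarque_bera(fat_sample)
print(f"Normal sample:   JB={jb_norm_stat:.2f}, p={jb_norm_p:.4f}")
print(f"Fat-tailed t(4): JB={jb_fat_stat:.2f}, p={jb_fat_p:.4f}")
```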
Practical EDA Workflow Summary
Here’s a complete EDA pipeline you can reuse:
def comprehensive_stock_eda(ticker_symbol, data):
"""
Perform comprehensive EDA on a single stock.
Parameters:
- ticker_symbol: Stock ticker (e.g., 'AAPL')
- data: Full dataset containing the stock
Returns:
- Dictionary with key statistics and insights
"""
# Filter and prepare data
stock = data[data['Name'] == ticker_symbol].copy()
stock = stock.sort_values('date').reset_index(drop=True)
# Calculate metrics
stock['returns'] = stock['close'].pct_change()
stock['log_returns'] = np.log(stock['close'] / stock['close'].shift(1))
stock['volatility_20'] = stock['log_returns'].rolling(20).std() * np.sqrt(252)
stock['SMA_20'] = stock['close'].rolling(20).mean()
stock['SMA_50'] = stock['close'].rolling(50).mean()
stock['volume_ma'] = stock['volume'].rolling(20).mean()
# Summary statistics
results = {
'ticker': ticker_symbol,
'data_points': len(stock),
'date_range': (stock['date'].min(), stock['date'].max()),
'price_start': stock['close'].iloc[0],
'price_end': stock['close'].iloc[-1],
'total_return': ((stock['close'].iloc[-1] / stock['close'].iloc[0]) - 1) * 100,
'mean_daily_return': stock['returns'].mean() * 100,
'volatility_annual': stock['returns'].std() * np.sqrt(252) * 100,
'max_drawdown': ((stock['close'].cummax() - stock['close']) / stock['close'].cummax()).max() * 100,
'sharpe_ratio': (stock['returns'].mean() / stock['returns'].std()) * np.sqrt(252),  # assumes a zero risk-free rate
'avg_volume': stock['volume'].mean(),
'skewness': stock['returns'].skew(),
'kurtosis': stock['returns'].kurtosis()
}
return stock, results
# Example usage
aapl_analyzed, aapl_stats = comprehensive_stock_eda('AAPL', df)
print("\n=== AAPL EDA Summary ===")
for key, value in aapl_stats.items():
if isinstance(value, float):
print(f"{key}: {value:.4f}")
else:
print(f"{key}: {value}")
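The max-drawdown expression inside `comprehensive_stock_eda` is worth sanity-checking on a toy price path where the answer is known by hand: a peak of 120 followed by a trough of 90 is a 25% drawdown.

```python
import pandas as pd

# Sanity check of the max-drawdown formula on a toy series:
# peak 120 -> trough 90 is a 25% drawdown, the largest in this path.
prices = pd.Series([100.0, 120.0, 90.0, 110.0, 105.0])
drawdown = (prices.cummax() - prices) / prices.cummax()
print(f"Max drawdown: {drawdown.max() * 100:.1f}%")   # 25.0%
```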
Conclusion
In this episode, we’ve built a comprehensive toolkit for exploratory analysis of financial time-series data. We covered:
- Data visualization techniques for price, volume, and trends using linear and logarithmic scales
- Moving averages (SMA and EMA) and crossover signals for trend identification
- Volatility analysis using historical volatility, Bollinger Bands, and return distributions
- Correlation analysis between multiple assets for portfolio diversification insights
- Advanced techniques like autocorrelation, normality testing, and reusable EDA pipelines
These EDA techniques form the foundation for effective feature engineering and predictive modeling. Understanding price patterns, volatility regimes, and inter-asset relationships is critical before building any machine learning model.
In the next episode, we’ll take these insights further by diving into Advanced Feature Engineering for Financial Time-Series. We’ll explore how to create powerful predictive features including technical indicators (RSI, MACD, ATR), time-based features, lag features, and domain-specific transformations that capture market microstructure. We’ll also discuss feature selection techniques to avoid overfitting in financial models.
Happy analyzing! 📊