Introduction
In Part 1, we explored how interactive dashboards transform static charts into engaging data stories. But even the most beautiful visualization can mislead if the underlying analysis is statistically flawed. This episode tackles the most common statistical pitfalls in data analysis—and how to avoid them.
We’ll dissect spurious correlations, unmask confounding variables, and learn when correlation actually hints at causation. Using real Kaggle datasets and Python’s statistical toolkit, you’ll build the rigor needed to make defensible claims from data.
The Spurious Correlation Trap
What Makes a Correlation Spurious?
A spurious correlation occurs when two variables appear related, but the relationship is either coincidental or driven by a hidden third variable. The classic example: ice cream sales and drowning deaths both peak in summer. Does ice cream cause drownings? No—the confounding variable is temperature.
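To see this numerically, here is a minimal synthetic sketch: the temperature, ice cream sales, and drowning figures below are all made-up numbers chosen so that temperature drives both outcomes while they never influence each other.
import numpy as np
import pandas as pd
# Synthetic illustration: temperature drives both variables; they never touch each other
rng = np.random.default_rng(0)
temperature = rng.normal(25, 8, 365)                                # daily temperature (invented)
ice_cream_sales = 200 + 10 * temperature + rng.normal(0, 20, 365)   # depends only on temperature
drownings = 1 + 0.1 * temperature + rng.normal(0, 0.3, 365)         # depends only on temperature
sim = pd.DataFrame({'temperature': temperature,
                    'ice_cream_sales': ice_cream_sales,
                    'drownings': drownings})
print(sim.corr().round(2))  # sales and drownings correlate strongly, purely via temperature
Nothing about ice cream appears in the drowning equation, yet the two columns end up strongly correlated because they share a common cause.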
Real-World Example: Kaggle Medical Cost Dataset
Let’s load the Medical Cost Personal dataset and explore potential spurious relationships:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
# Load dataset
df = pd.read_csv('insurance.csv')
# Add synthetic time variable for demonstration
np.random.seed(42)
df['year'] = np.random.choice(range(2015, 2024), len(df))
# Aggregate mean charges and BMI by the synthetic year
df_yearly = df.groupby('year').agg({'charges': 'mean', 'bmi': 'mean'}).reset_index()
print(df.head())
print(f"\nCorrelation between year and charges: {df['year'].corr(df['charges']):.3f}")
If we find a correlation between year and charges, is it meaningful? Not necessarily—inflation, policy changes, or dataset sampling could drive both upward independently.
Interactive Visualization: Correlation Does Not Imply Causation
# Create interactive scatter with regression line
fig = px.scatter(df, x='bmi', y='charges', color='smoker',
                 trendline='ols', opacity=0.6,
                 title='BMI vs Medical Charges: Confounding by Smoking Status',
                 labels={'bmi': 'Body Mass Index', 'charges': 'Medical Charges ($)'})
fig.update_layout(height=500, hovermode='closest')
fig.show()
# Calculate correlations within groups
corr_smoker = df[df['smoker']=='yes']['bmi'].corr(df[df['smoker']=='yes']['charges'])
corr_nonsmoker = df[df['smoker']=='no']['bmi'].corr(df[df['smoker']=='no']['charges'])
corr_overall = df['bmi'].corr(df['charges'])
print(f"Overall correlation: {corr_overall:.3f}")
print(f"Smokers only: {corr_smoker:.3f}")
print(f"Non-smokers only: {corr_nonsmoker:.3f}")
Run the numbers and compare: the pooled BMI-charges correlation is modest, but it hides the real story. Smoking status acts as both a confounder and an effect modifier here: smokers pay far more at every BMI level, and the BMI-charges relationship is much steeper among smokers than among non-smokers, so the pooled correlation is not a trustworthy summary of the BMI effect.
Simpson’s Paradox: When Aggregation Misleads
Understanding the Paradox
Simpson’s Paradox occurs when a trend appears in aggregate data but reverses within subgroups. Let’s demonstrate with medical cost data:
# Create age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100],
                         labels=['Young', 'Middle', 'Senior'])
# Overall correlation: age vs BMI
overall_corr = df['age'].corr(df['bmi'])
print(f"Overall age-BMI correlation: {overall_corr:.3f}")
# Within-group correlations
for group in ['Young', 'Middle', 'Senior']:
    subset = df[df['age_group'] == group]
    corr = subset['age'].corr(subset['bmi'])
    print(f"{group}: {corr:.3f}")
Interactive Simpson’s Paradox Visualization
fig = make_subplots(rows=1, cols=2,
                    subplot_titles=('Aggregated View', 'Grouped View'))
# Left: overall trend
fig.add_trace(
    go.Scatter(x=df['age'], y=df['bmi'], mode='markers',
               marker=dict(color='gray', opacity=0.5),
               name='All data'),
    row=1, col=1
)
# Add overall regression line
z = np.polyfit(df['age'], df['bmi'], 1)
p = np.poly1d(z)
fig.add_trace(
    go.Scatter(x=df['age'].sort_values(),
               y=p(df['age'].sort_values()),
               mode='lines', line=dict(color='red', width=3),
               name=f'Overall trend (r={overall_corr:.2f})'),
    row=1, col=1
)
# Right: grouped trends
colors = {'Young': 'green', 'Middle': 'blue', 'Senior': 'purple'}
for group in ['Young', 'Middle', 'Senior']:
    subset = df[df['age_group'] == group]
    fig.add_trace(
        go.Scatter(x=subset['age'], y=subset['bmi'], mode='markers',
                   marker=dict(color=colors[group], size=6),
                   name=group),
        row=1, col=2
    )
    # Add group regression line
    z = np.polyfit(subset['age'], subset['bmi'], 1)
    p = np.poly1d(z)
    fig.add_trace(
        go.Scatter(x=subset['age'].sort_values(),
                   y=p(subset['age'].sort_values()),
                   mode='lines', line=dict(color=colors[group], dash='dash'),
                   showlegend=False),
        row=1, col=2
    )
fig.update_xaxes(title_text='Age', row=1, col=1)
fig.update_xaxes(title_text='Age', row=1, col=2)
fig.update_yaxes(title_text='BMI', row=1, col=1)
fig.update_layout(height=500, title_text='Simpson\'s Paradox in Medical Data')
fig.show()
This visualization shows how an aggregate trend can differ from, and sometimes even reverse, the patterns within subgroups: a critical insight for stratified analysis.
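Whether the real data produce a full reversal depends on the dataset; here is a small synthetic sketch (all numbers invented) in which the reversal is guaranteed: every subgroup trends downward while the pooled data trend upward.
# Synthetic Simpson's paradox: within-group slopes are negative,
# but the group centers line up so that the pooled slope is positive
rng = np.random.default_rng(1)
groups = []
for label, (x_center, y_center) in zip('ABC', [(20, 10), (40, 20), (60, 30)]):
    x = rng.normal(x_center, 3, 200)
    y = y_center - 0.5 * (x - x_center) + rng.normal(0, 1, 200)   # negative slope within each group
    groups.append(pd.DataFrame({'x': x, 'y': y, 'group': label}))
sim = pd.concat(groups, ignore_index=True)
print(f"Pooled correlation: {sim['x'].corr(sim['y']):.2f}")           # positive
print(sim.groupby('group').apply(lambda g: g['x'].corr(g['y'])))      # negative in every group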
Statistical Testing Done Right
Beyond the p-value: The Full Picture
A p-value alone is insufficient. You need:
1. p-value: the probability of observing data at least this extreme if the null hypothesis is true (conventional threshold: 0.05)
2. Confidence interval: a range of plausible effect sizes
3. Effect size: the magnitude of the difference (Cohen's d, r², etc.)
Hypothesis Testing Example: Do Smokers Pay More?
from scipy.stats import ttest_ind, mannwhitneyu
from numpy import sqrt
# Separate groups
smokers = df[df['smoker'] == 'yes']['charges']
non_smokers = df[df['smoker'] == 'no']['charges']
# t-test (assumes normality)
t_stat, p_val = ttest_ind(smokers, non_smokers)
# Mann-Whitney U test (non-parametric alternative)
u_stat, p_val_mw = mannwhitneyu(smokers, non_smokers)
# Cohen's d (effect size)
pooled_std = sqrt(((len(smokers)-1)*smokers.std()**2 +
                   (len(non_smokers)-1)*non_smokers.std()**2) /
                  (len(smokers) + len(non_smokers) - 2))
cohens_d = (smokers.mean() - non_smokers.mean()) / pooled_std
print(f"t-test: t={t_stat:.3f}, p={p_val:.4e}")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_val_mw:.4e}")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"Mean difference: ${smokers.mean() - non_smokers.mean():.2f}")
Interpretation:
– p-value < 0.001: Highly significant difference
– Cohen’s d > 0.8: Large effect size (rule of thumb: 0.2=small, 0.5=medium, 0.8=large)
– Mean difference: $23,000+ higher charges for smokers
Confidence Intervals: Quantifying Uncertainty
from scipy.stats import t as t_dist
def mean_confidence_interval(data, confidence=0.95):
    """Calculate mean and confidence interval"""
    n = len(data)
    mean = data.mean()
    se = data.sem()  # standard error
    margin = se * t_dist.ppf((1 + confidence) / 2, n - 1)
    return mean, mean - margin, mean + margin
# Calculate CIs
smoker_mean, smoker_ci_low, smoker_ci_high = mean_confidence_interval(smokers)
nonsmoker_mean, nonsmoker_ci_low, nonsmoker_ci_high = mean_confidence_interval(non_smokers)
# Interactive CI plot
fig = go.Figure()
groups = ['Smokers', 'Non-Smokers']
means = [smoker_mean, nonsmoker_mean]
errors = [[smoker_mean - smoker_ci_low, nonsmoker_mean - nonsmoker_ci_low],
          [smoker_ci_high - smoker_mean, nonsmoker_ci_high - nonsmoker_mean]]
fig.add_trace(go.Bar(
    x=groups, y=means,
    error_y=dict(type='data', symmetric=False,
                 array=errors[1], arrayminus=errors[0]),
    marker_color=['#e74c3c', '#3498db'],
    text=[f'${m:,.0f}' for m in means],
    textposition='outside'
))
fig.update_layout(
    title='Mean Medical Charges with 95% Confidence Intervals',
    yaxis_title='Medical Charges ($)',
    showlegend=False,
    height=500
)
fig.show()
print(f"Smokers: ${smoker_mean:,.2f} [${smoker_ci_low:,.2f}, ${smoker_ci_high:,.2f}]")
print(f"Non-smokers: ${nonsmoker_mean:,.2f} [${nonsmoker_ci_low:,.2f}, ${nonsmoker_ci_high:,.2f}]")
The non-overlapping confidence intervals visually confirm the significant difference.
The Multiple Comparison Problem
Why Testing Everything is Dangerous
If you test 20 hypotheses at α = 0.05, you expect about 1 false positive by chance alone. The family-wise error rate (FWER), the probability of at least one false positive, grows with each test:
FWER = 1 - (1 - α)^m
where m is the number of tests. For m = 20 and α = 0.05:
FWER = 1 - (1 - 0.05)^20 ≈ 0.64
You have a 64% chance of at least one spurious finding!
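A two-line loop makes that growth concrete:
alpha = 0.05
for m in [1, 5, 10, 20, 50]:
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests -> FWER = {fwer:.2f}")
# 20 tests -> 0.64, 50 tests -> 0.92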
Bonferroni Correction: Conservative Approach
Divide the significance threshold by the number of tests, so each individual test is evaluated at α/m (here 0.05 / 3 ≈ 0.017 for the three comparisons below):
from statsmodels.stats.multitest import multipletests
# Test multiple correlations
variables = ['age', 'bmi', 'children']
p_values = []
for var in variables:
    # Test correlation with charges for smokers vs non-smokers
    smoker_corr = df[df['smoker']=='yes'][var].corr(df[df['smoker']=='yes']['charges'])
    nonsmoker_corr = df[df['smoker']=='no'][var].corr(df[df['smoker']=='no']['charges'])
    # Fisher z-transformation for correlation comparison
    n1, n2 = len(df[df['smoker']=='yes']), len(df[df['smoker']=='no'])
    z1 = np.arctanh(smoker_corr)
    z2 = np.arctanh(nonsmoker_corr)
    se = sqrt(1/(n1-3) + 1/(n2-3))
    z_stat = (z1 - z2) / se
    p = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    p_values.append(p)
    print(f"{var}: p={p:.4f}")
# Apply Bonferroni correction
reject, p_adj_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
# Apply Benjamini-Hochberg (FDR)
reject_fdr, p_adj_fdr, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
results = pd.DataFrame({
    'Variable': variables,
    'p-value': p_values,
    'Bonferroni adj.': p_adj_bonf,
    'FDR adj.': p_adj_fdr,
    'Significant (Bonf)': reject,
    'Significant (FDR)': reject_fdr
})
print("\n", results)
Benjamini-Hochberg (FDR): Less Conservative
The False Discovery Rate (FDR) controls the expected proportion of false positives among rejected hypotheses—less stringent than Bonferroni but still rigorous.
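For intuition, here is roughly what the Benjamini-Hochberg step does (a simplified sketch, not statsmodels' exact implementation): sort the p-values, compare the i-th smallest to (i/m)·α, and reject every hypothesis up to the largest one that passes.
def benjamini_hochberg(p_values, alpha=0.05):
    """Simplified BH procedure: reject all hypotheses up to the largest i
    whose sorted p-value satisfies p_(i) <= (i / m) * alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.where(below)[0])     # largest sorted index passing the threshold
        reject[order[:cutoff + 1]] = True
    return reject
print(benjamini_hochberg(p_values))  # compare against reject_fdr from multipletests above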
# Visualize correction impact
fig = go.Figure()
x_pos = np.arange(len(variables))
fig.add_trace(go.Bar(x=variables, y=p_values, name='Raw p-values',
                     marker_color='lightblue'))
fig.add_trace(go.Bar(x=variables, y=p_adj_bonf, name='Bonferroni',
                     marker_color='orange'))
fig.add_trace(go.Bar(x=variables, y=p_adj_fdr, name='FDR (BH)',
                     marker_color='green'))
fig.add_hline(y=0.05, line_dash='dash', line_color='red',
              annotation_text='α = 0.05')
fig.update_layout(
    title='Multiple Testing Corrections',
    yaxis_title='p-value',
    barmode='group',
    height=500
)
fig.show()
Bootstrap Confidence Intervals: When Theory Fails
Non-Parametric Approach
When data violates normality assumptions, bootstrapping provides robust confidence intervals by resampling:
def bootstrap_ci(data, statistic, n_bootstrap=10000, ci=0.95):
    """Calculate bootstrap confidence interval for any statistic"""
    bootstrap_stats = []
    n = len(data)
    for _ in range(n_bootstrap):
        sample = data.sample(n, replace=True)
        bootstrap_stats.append(statistic(sample))
    alpha = (1 - ci) / 2
    lower = np.percentile(bootstrap_stats, alpha * 100)
    upper = np.percentile(bootstrap_stats, (1 - alpha) * 100)
    return bootstrap_stats, lower, upper
# Bootstrap median charge for smokers
bootstrap_medians, ci_low, ci_high = bootstrap_ci(
    df[df['smoker']=='yes']['charges'],
    lambda x: x.median(),
    n_bootstrap=10000
)
print(f"Bootstrap 95% CI for median: [${ci_low:,.2f}, ${ci_high:,.2f}]")
# Interactive bootstrap distribution
fig = go.Figure()
fig.add_trace(go.Histogram(x=bootstrap_medians, nbinsx=50,
                           marker_color='steelblue', opacity=0.7,
                           name='Bootstrap distribution'))
fig.add_vline(x=ci_low, line_dash='dash', line_color='red',
              annotation_text=f'Lower CI: ${ci_low:,.0f}')
fig.add_vline(x=ci_high, line_dash='dash', line_color='red',
              annotation_text=f'Upper CI: ${ci_high:,.0f}')
fig.update_layout(
    title='Bootstrap Distribution of Median Charges (Smokers)',
    xaxis_title='Median Charge ($)',
    yaxis_title='Frequency',
    height=500
)
fig.show()
Visualizing Bootstrap Process
# Animate bootstrap sampling
from plotly.subplots import make_subplots
# Take 50 bootstrap samples for visualization
samples = [df[df['smoker']=='yes']['charges'].sample(100, replace=True).median()
           for _ in range(50)]
fig = go.Figure()
# Add one cumulative histogram trace per bootstrap step (revealed via the slider)
for i in range(1, len(samples)+1):
    fig.add_trace(go.Histogram(
        x=samples[:i],
        nbinsx=20,
        marker_color='teal',
        opacity=0.7,
        visible=(i==1)  # only the first trace is visible initially
    ))
# Create slider steps
steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": f"Bootstrap Sample {i+1}/{len(samples)}"}],
    )
    step["args"][0]["visible"][i] = True
    steps.append(step)
sliders = [dict(active=0, steps=steps)]
fig.update_layout(
    sliders=sliders,
    title="Bootstrap Sampling Animation",
    xaxis_title="Median Charge ($)",
    height=500
)
fig.show()
Permutation Tests: Testing Without Assumptions
The Concept
A permutation test determines if an observed difference is significant by randomly shuffling group labels and recomputing the statistic:
def permutation_test(group1, group2, statistic=lambda x, y: x.mean() - y.mean(),
                     n_permutations=10000):
    """Non-parametric permutation test"""
    observed_diff = statistic(group1, group2)
    combined = pd.concat([group1, group2])
    n1 = len(group1)
    perm_diffs = []
    for _ in range(n_permutations):
        shuffled = combined.sample(frac=1).reset_index(drop=True)
        perm_group1 = shuffled.iloc[:n1]
        perm_group2 = shuffled.iloc[n1:]
        perm_diffs.append(statistic(perm_group1, perm_group2))
    p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))
    return observed_diff, perm_diffs, p_value
# Test: Do males pay more than females?
male_charges = df[df['sex']=='male']['charges'].reset_index(drop=True)
female_charges = df[df['sex']=='female']['charges'].reset_index(drop=True)
obs_diff, perm_diffs, p_val = permutation_test(male_charges, female_charges)
print(f"Observed difference: ${obs_diff:,.2f}")
print(f"Permutation p-value: {p_val:.4f}")
# Visualize permutation distribution
fig = go.Figure()
fig.add_trace(go.Histogram(x=perm_diffs, nbinsx=50,
                           marker_color='lightgray', opacity=0.7,
                           name='Permutation distribution'))
fig.add_vline(x=obs_diff, line_color='red', line_width=3,
              annotation_text=f'Observed: ${obs_diff:,.0f}<br>p={p_val:.3f}')
fig.update_layout(
    title='Permutation Test: Male vs Female Medical Charges',
    xaxis_title='Mean Difference ($)',
    yaxis_title='Frequency',
    height=500
)
fig.show()
If the observed difference falls in the tail of the permutation distribution (beyond 95% of shuffled differences), we reject the null hypothesis.
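To make that tail check explicit, compare the observed difference with the central 95% band of the shuffled differences:
# Two-sided 95% band of the permutation (null) distribution
lower, upper = np.percentile(perm_diffs, [2.5, 97.5])
print(f"95% of shuffled differences fall within [${lower:,.0f}, ${upper:,.0f}]")
verdict = 'outside' if (obs_diff < lower or obs_diff > upper) else 'inside'
print(f"Observed difference ${obs_diff:,.0f} falls {verdict} that band")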
Effect Size Matters: Statistical vs Practical Significance
When p < 0.05 Isn’t Enough
With large samples, even tiny effects become “statistically significant.” Always report effect size:
# Compare statistical vs practical significance
from scipy.stats import pearsonr
# Age vs charges correlation
corr_coef, p_value = pearsonr(df['age'], df['charges'])
r_squared = corr_coef ** 2
print(f"Pearson r: {corr_coef:.4f}")
print(f"p-value: {p_value:.4e}")
print(f"R²: {r_squared:.4f} ({r_squared*100:.2f}% variance explained)")
# Interpretation guide
if abs(corr_coef) < 0.1:
    strength = "negligible"
elif abs(corr_coef) < 0.3:
    strength = "weak"
elif abs(corr_coef) < 0.5:
    strength = "moderate"
else:
    strength = "strong"
print(f"Effect size interpretation: {strength}")
Cohen’s d for Group Comparisons
# Compare regions
regions = df['region'].unique()
effect_sizes = []
for i, region1 in enumerate(regions):
    for region2 in regions[i+1:]:
        group1 = df[df['region']==region1]['charges']
        group2 = df[df['region']==region2]['charges']
        pooled_std = sqrt(((len(group1)-1)*group1.std()**2 +
                           (len(group2)-1)*group2.std()**2) /
                          (len(group1) + len(group2) - 2))
        d = (group1.mean() - group2.mean()) / pooled_std
        effect_sizes.append({
            'Comparison': f"{region1} vs {region2}",
            "Cohen's d": d,
            'Magnitude': 'Small' if abs(d)<0.5 else 'Medium' if abs(d)<0.8 else 'Large'
        })
effect_df = pd.DataFrame(effect_sizes).sort_values("Cohen's d", key=abs, ascending=False)
print(effect_df.to_string(index=False))
Checklist: 10 Questions Before Claiming a Relationship
Before declaring “X causes Y” or “X predicts Y,” ask:
- Is the correlation statistically significant? (p < 0.05 after correction)
- What is the effect size? (Cohen's d, r², odds ratio)
- Have I corrected for multiple comparisons? (Bonferroni, FDR)
- Is there a confounding variable? (Simpson’s paradox check)
- Does the relationship hold in subgroups? (Stratified analysis)
- Is the relationship causal or merely associative? (Temporal order, mechanism)
- Are my statistical assumptions met? (Normality, homoscedasticity, independence)
- Have I visualized the data? (Outliers, non-linearity, heteroscedasticity)
- Is the sample representative? (Selection bias, generalizability)
- Can I replicate the finding? (Cross-validation, hold-out set)
Implementing the Checklist as Code
def statistical_checklist(df, var1, var2, group_col=None, alpha=0.05):
    """Automated statistical rigor checklist"""
    results = {}
    # 1. Correlation test
    corr, p_val = pearsonr(df[var1], df[var2])
    results['correlation'] = corr
    results['p_value'] = p_val
    results['significant'] = p_val < alpha
    # 2. Effect size
    r_squared = corr ** 2
    results['r_squared'] = r_squared
    results['variance_explained'] = f"{r_squared*100:.1f}%"
    # 3. Check for confounding (if group provided)
    if group_col:
        group_corrs = df.groupby(group_col).apply(
            lambda x: x[var1].corr(x[var2])
        )
        results['group_correlations'] = group_corrs.to_dict()
        results['simpsons_paradox'] = (np.sign(corr) != np.sign(group_corrs)).any()
    # 4. Normality check (Shapiro-Wilk)
    _, p_norm1 = stats.shapiro(df[var1].sample(min(5000, len(df))))
    _, p_norm2 = stats.shapiro(df[var2].sample(min(5000, len(df))))
    results['normality_assumption'] = (p_norm1 > 0.05) and (p_norm2 > 0.05)
    # 5. Outlier detection (IQR method)
    def count_outliers(series):
        Q1, Q3 = series.quantile([0.25, 0.75])
        IQR = Q3 - Q1
        return ((series < Q1 - 1.5*IQR) | (series > Q3 + 1.5*IQR)).sum()
    results['outliers_var1'] = count_outliers(df[var1])
    results['outliers_var2'] = count_outliers(df[var2])
    return results
# Example usage
checklist_result = statistical_checklist(df, 'age', 'charges', group_col='smoker')
for key, value in checklist_result.items():
    print(f"{key}: {value}")
Advanced: Controlling for Confounders with Regression
Partial Correlation: Isolating the Effect
When you suspect confounding, use partial correlation to control for third variables:
from scipy.stats import pearsonr
def partial_correlation(df, x, y, control):
    """Calculate partial correlation controlling for a third variable"""
    # Residualize x and y against the control variable
    from sklearn.linear_model import LinearRegression
    lr = LinearRegression()
    lr.fit(df[[control]], df[x])
    resid_x = df[x] - lr.predict(df[[control]])
    lr.fit(df[[control]], df[y])
    resid_y = df[y] - lr.predict(df[[control]])
    # Correlation of residuals
    return pearsonr(resid_x, resid_y)
# Example: Age-charges correlation controlling for BMI
raw_corr, _ = pearsonr(df['age'], df['charges'])
partial_corr, partial_p = partial_correlation(df, 'age', 'charges', 'bmi')
print(f"Raw correlation (age-charges): {raw_corr:.3f}")
print(f"Partial correlation (controlling for BMI): {partial_corr:.3f}")
print(f"p-value: {partial_p:.4e}")
Multiple Regression: The Gold Standard
import statsmodels.api as sm
# Convert categorical variables
df_encoded = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True, dtype=float)  # dtype=float: boolean dummy columns can trip up sm.OLS with recent pandas
# Fit multiple regression
X = df_encoded[['age', 'bmi', 'children', 'sex_male', 'smoker_yes']]
y = df_encoded['charges']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())
# Extract key statistics
print(f"\nR²: {model.rsquared:.3f}")
print(f"Adjusted R²: {model.rsquared_adj:.3f}")
print(f"\nCoefficients with 95% CI:")
for var in X.columns:
    coef = model.params[var]
    ci = model.conf_int().loc[var]
    print(f"{var:15s}: {coef:8.2f} [{ci[0]:8.2f}, {ci[1]:8.2f}]")
The regression output shows each variable's effect after adjusting for the others, which is a necessary step toward causal claims, though adjustment for measured covariates alone does not establish causation.
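One last tie-back to checklist item 7: before trusting those coefficients, probe the residuals. Here is a quick sketch using statsmodels' built-in diagnostics (with large samples even mild violations yield tiny p-values, so read them alongside residual plots):
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import jarque_bera
# Homoscedasticity: Breusch-Pagan test on the residuals
bp_stat, bp_pval, _, _ = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {bp_pval:.4f} (small p suggests heteroscedastic residuals)")
# Normality of residuals: Jarque-Bera test
jb_stat, jb_pval, skew, kurtosis = jarque_bera(model.resid)
print(f"Jarque-Bera p-value: {jb_pval:.4f}, skew={skew:.2f}, kurtosis={kurtosis:.2f}")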
Conclusion
Statistical rigor is the foundation of trustworthy data science. In this episode, we’ve dissected spurious correlations, navigated Simpson’s Paradox, and learned to wield p-values, confidence intervals, and effect sizes correctly. We’ve seen how multiple testing inflates false positives and how to correct for it using Bonferroni and FDR methods.
Interactive visualizations—from confidence interval plots to bootstrap distributions—transform abstract statistical concepts into intuitive insights. The 10-question checklist provides a systematic framework to validate your findings before making claims.
In Part 3, we’ll tackle another critical challenge: imbalanced datasets. When 95% of your data belongs to one class, standard models fail spectacularly. We’ll explore SMOTE, ADASYN, and advanced resampling techniques to handle this pervasive problem. See you there!