Part 2: Statistical Rigor: Why Your Correlation Might Be Spurious

Updated Feb 6, 2026

Introduction

In Part 1, we explored how interactive dashboards transform static charts into engaging data stories. But even the most beautiful visualization can mislead if the underlying analysis is statistically flawed. This episode tackles the most common statistical pitfalls in data analysis—and how to avoid them.

We’ll dissect spurious correlations, unmask confounding variables, and learn when correlation actually hints at causation. Using real Kaggle datasets and Python’s statistical toolkit, you’ll build the rigor needed to make defensible claims from data.

The Spurious Correlation Trap

What Makes a Correlation Spurious?

A spurious correlation occurs when two variables appear related, but the relationship is either coincidental or driven by a hidden third variable. The classic example: ice cream sales and drowning deaths both peak in summer. Does ice cream cause drownings? No—the confounding variable is temperature.
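
A quick simulation makes the mechanism concrete. The numbers below are made up purely for illustration: a hidden driver (temperature) pushes both series upward, producing a strong correlation between two variables that never interact directly.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical daily data: temperature is the hidden third variable
temperature = rng.normal(25, 5, 365)
ice_cream = 50 + 3 * temperature + rng.normal(0, 10, 365)   # sales respond to temperature
drownings = 1 + 0.1 * temperature + rng.normal(0, 1, 365)   # drownings respond to temperature

sim = pd.DataFrame({'ice_cream': ice_cream, 'drownings': drownings})
print(sim.corr())  # strong positive correlation, yet no causal link between the two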

Real-World Example: Kaggle Medical Cost Dataset

Let’s load the Medical Cost Personal dataset and explore potential spurious relationships:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Load dataset
df = pd.read_csv('insurance.csv')

# Add synthetic time variable for demonstration
np.random.seed(42)
df['year'] = np.random.choice(range(2015, 2024), len(df))

# Aggregate by year to inspect average charges and BMI per year
df_yearly = df.groupby('year').agg({'charges': 'mean', 'bmi': 'mean'}).reset_index()
print(df_yearly)

print(df.head())
print(f"\nCorrelation between year and charges: {df['year'].corr(df['charges']):.3f}")

If we find a correlation between year and charges, is it meaningful? Not necessarily—inflation, policy changes, or dataset sampling could drive both upward independently.
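
To see how shared trends alone can manufacture correlation, here is another small simulation (synthetic numbers, not the insurance data): two series that each drift upward over the years, but are otherwise generated independently, end up strongly correlated.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
years = np.arange(2015, 2024)

# Two independently generated series that both trend upward over time
series_a = 10000 + 800 * (years - 2015) + rng.normal(0, 500, len(years))
series_b = 3 + 0.2 * (years - 2015) + rng.normal(0, 0.3, len(years))

r, p = pearsonr(series_a, series_b)
print(f"Correlation between two independent trending series: r={r:.2f}, p={p:.4f}")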

Interactive Visualization: Correlation Does Not Imply Causation

# Create interactive scatter with regression line
fig = px.scatter(df, x='bmi', y='charges', color='smoker',
                 trendline='ols', opacity=0.6,
                 title='BMI vs Medical Charges: Confounding by Smoking Status',
                 labels={'bmi': 'Body Mass Index', 'charges': 'Medical Charges ($)'})

fig.update_layout(height=500, hovermode='closest')
fig.show()

# Calculate correlations within groups
corr_smoker = df[df['smoker']=='yes']['bmi'].corr(df[df['smoker']=='yes']['charges'])
corr_nonsmoker = df[df['smoker']=='no']['bmi'].corr(df[df['smoker']=='no']['charges'])
corr_overall = df['bmi'].corr(df['charges'])

print(f"Overall correlation: {corr_overall:.3f}")
print(f"Smokers only: {corr_smoker:.3f}")
print(f"Non-smokers only: {corr_nonsmoker:.3f}")

The overall correlation between BMI and charges is strong, but smoking status is a major confounder. Smokers have higher charges regardless of BMI, inflating the apparent BMI effect.

Simpson’s Paradox: When Aggregation Misleads

Understanding the Paradox

Simpson’s Paradox occurs when a trend appears in aggregate data but reverses within subgroups. Let’s demonstrate with medical cost data:

# Create age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], 
                          labels=['Young', 'Middle', 'Senior'])

# Overall correlation: age vs BMI
overall_corr = df['age'].corr(df['bmi'])
print(f"Overall age-BMI correlation: {overall_corr:.3f}")

# Within-group correlations
for group in ['Young', 'Middle', 'Senior']:
    subset = df[df['age_group'] == group]
    corr = subset['age'].corr(subset['bmi'])
    print(f"{group}: {corr:.3f}")

Interactive Simpson’s Paradox Visualization

fig = make_subplots(rows=1, cols=2, 
                    subplot_titles=('Aggregated View', 'Grouped View'))

# Left: overall trend
fig.add_trace(
    go.Scatter(x=df['age'], y=df['bmi'], mode='markers',
               marker=dict(color='gray', opacity=0.5),
               name='All data'),
    row=1, col=1
)

# Add overall regression line
z = np.polyfit(df['age'], df['bmi'], 1)
p = np.poly1d(z)
fig.add_trace(
    go.Scatter(x=df['age'].sort_values(), 
               y=p(df['age'].sort_values()),
               mode='lines', line=dict(color='red', width=3),
               name=f'Overall trend (r={overall_corr:.2f})'),
    row=1, col=1
)

# Right: grouped trends
colors = {'Young': 'green', 'Middle': 'blue', 'Senior': 'purple'}
for group in ['Young', 'Middle', 'Senior']:
    subset = df[df['age_group'] == group]
    fig.add_trace(
        go.Scatter(x=subset['age'], y=subset['bmi'], mode='markers',
                   marker=dict(color=colors[group], size=6),
                   name=group),
        row=1, col=2
    )
    # Add group regression line
    z = np.polyfit(subset['age'], subset['bmi'], 1)
    p = np.poly1d(z)
    fig.add_trace(
        go.Scatter(x=subset['age'].sort_values(),
                   y=p(subset['age'].sort_values()),
                   mode='lines', line=dict(color=colors[group], dash='dash'),
                   showlegend=False),
        row=1, col=2
    )

fig.update_xaxes(title_text='Age', row=1, col=1)
fig.update_xaxes(title_text='Age', row=1, col=2)
fig.update_yaxes(title_text='BMI', row=1, col=1)
fig.update_layout(height=500, title_text='Simpson\'s Paradox in Medical Data')
fig.show()

This visualization reveals how aggregate trends can mask opposite patterns within subgroups—a critical insight for stratified analysis.

Statistical Testing Done Right

Beyond the p-value: The Full Picture

A p-value alone is insufficient. You need:
1. p-value: Probability of observing data at least this extreme if the null hypothesis is true (conventional threshold: 0.05)
2. Confidence interval: Range of plausible effect sizes
3. Effect size: Magnitude of the difference (Cohen’s d, r², etc.)

Hypothesis Testing Example: Do Smokers Pay More?

from scipy.stats import ttest_ind, mannwhitneyu
from numpy import sqrt

# Separate groups
smokers = df[df['smoker'] == 'yes']['charges']
non_smokers = df[df['smoker'] == 'no']['charges']

# t-test (assumes normality)
t_stat, p_val = ttest_ind(smokers, non_smokers)

# Mann-Whitney U test (non-parametric alternative)
u_stat, p_val_mw = mannwhitneyu(smokers, non_smokers)

# Cohen's d (effect size)
pooled_std = sqrt(((len(smokers)-1)*smokers.std()**2 + 
                   (len(non_smokers)-1)*non_smokers.std()**2) / 
                  (len(smokers) + len(non_smokers) - 2))
cohens_d = (smokers.mean() - non_smokers.mean()) / pooled_std

print(f"t-test: t={t_stat:.3f}, p={p_val:.4e}")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_val_mw:.4e}")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"Mean difference: ${smokers.mean() - non_smokers.mean():.2f}")

Interpretation:
p-value < 0.001: Highly significant difference
Cohen’s d > 0.8: Large effect size (rule of thumb: 0.2=small, 0.5=medium, 0.8=large)
Mean difference: $23,000+ higher charges for smokers

Confidence Intervals: Quantifying Uncertainty

from scipy.stats import t as t_dist

def mean_confidence_interval(data, confidence=0.95):
    """Calculate mean and confidence interval"""
    n = len(data)
    mean = data.mean()
    se = data.sem()  # standard error
    margin = se * t_dist.ppf((1 + confidence) / 2, n - 1)
    return mean, mean - margin, mean + margin

# Calculate CIs
smoker_mean, smoker_ci_low, smoker_ci_high = mean_confidence_interval(smokers)
nonsmoker_mean, nonsmoker_ci_low, nonsmoker_ci_high = mean_confidence_interval(non_smokers)

# Interactive CI plot
fig = go.Figure()

groups = ['Smokers', 'Non-Smokers']
means = [smoker_mean, nonsmoker_mean]
errors = [[smoker_mean - smoker_ci_low, nonsmoker_mean - nonsmoker_ci_low],
          [smoker_ci_high - smoker_mean, nonsmoker_ci_high - nonsmoker_mean]]

fig.add_trace(go.Bar(
    x=groups, y=means,
    error_y=dict(type='data', symmetric=False,
                 array=errors[1], arrayminus=errors[0]),
    marker_color=['#e74c3c', '#3498db'],
    text=[f'${m:,.0f}' for m in means],
    textposition='outside'
))

fig.update_layout(
    title='Mean Medical Charges with 95% Confidence Intervals',
    yaxis_title='Medical Charges ($)',
    showlegend=False,
    height=500
)
fig.show()

print(f"Smokers: ${smoker_mean:,.2f} [${smoker_ci_low:,.2f}, ${smoker_ci_high:,.2f}]")
print(f"Non-smokers: ${nonsmoker_mean:,.2f} [${nonsmoker_ci_low:,.2f}, ${nonsmoker_ci_high:,.2f}]")

The non-overlapping confidence intervals visually confirm the significant difference.

The Multiple Comparison Problem

Why Testing Everything is Dangerous

If you test 20 hypotheses at α = 0.05, you expect 1 false positive by chance alone. The family-wise error rate (FWER) grows with each test:

P(at least one false positive) = 1 - (1 - α)^m

where m is the number of tests. For m = 20 and α = 0.05:

P(at least one false positive) = 1 - 0.95^20 ≈ 0.64

You have a 64% chance of a spurious finding!
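
The growth is easy to check numerically; a couple of lines reproduce the 64% figure and show how fast the family-wise error rate climbs:

# Family-wise error rate for m independent tests at alpha = 0.05
alpha = 0.05
for m in (1, 5, 10, 20, 50):
    fwer = 1 - (1 - alpha) ** m
    print(f"m={m:>2}: P(at least one false positive) = {fwer:.2f}")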

Bonferroni Correction: Conservative Approach

Adjust the significance threshold: α_adj = α / m

from statsmodels.stats.multitest import multipletests

# Test multiple correlations
variables = ['age', 'bmi', 'children']
p_values = []

for var in variables:
    # Test correlation with charges for smokers vs non-smokers
    smoker_corr = df[df['smoker']=='yes'][var].corr(df[df['smoker']=='yes']['charges'])
    nonsmoker_corr = df[df['smoker']=='no'][var].corr(df[df['smoker']=='no']['charges'])

    # Fisher z-transformation for correlation comparison
    n1, n2 = len(df[df['smoker']=='yes']), len(df[df['smoker']=='no'])
    z1 = np.arctanh(smoker_corr)
    z2 = np.arctanh(nonsmoker_corr)
    se = sqrt(1/(n1-3) + 1/(n2-3))
    z_stat = (z1 - z2) / se
    p = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    p_values.append(p)
    print(f"{var}: p={p:.4f}")

# Apply Bonferroni correction
reject, p_adj_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')

# Apply Benjamini-Hochberg (FDR)
reject_fdr, p_adj_fdr, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

results = pd.DataFrame({
    'Variable': variables,
    'p-value': p_values,
    'Bonferroni adj.': p_adj_bonf,
    'FDR adj.': p_adj_fdr,
    'Significant (Bonf)': reject,
    'Significant (FDR)': reject_fdr
})
print("\n", results)

Benjamini-Hochberg (FDR): Less Conservative

The False Discovery Rate (FDR) controls the expected proportion of false positives among rejected hypotheses—less stringent than Bonferroni but still rigorous.
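
Under the hood, Benjamini-Hochberg sorts the p-values and compares each to a sliding threshold (i/m)·α. The sketch below reimplements that step-up rule by hand (with made-up p-values) so you can see roughly what multipletests(method='fdr_bh') is doing.

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of rejected hypotheses under the BH step-up rule."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                            # ranks, smallest p-value first
    thresholds = alpha * np.arange(1, m + 1) / m     # (i/m) * alpha for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below))                  # largest rank clearing its threshold
        reject[order[:k + 1]] = True                 # reject all hypotheses up to that rank
    return reject

print(benjamini_hochberg([0.001, 0.012, 0.04, 0.2, 0.8]))   # hypothetical p-values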

# Visualize correction impact
fig = go.Figure()

fig.add_trace(go.Bar(x=variables, y=p_values, name='Raw p-values',
                     marker_color='lightblue'))
fig.add_trace(go.Bar(x=variables, y=p_adj_bonf, name='Bonferroni',
                     marker_color='orange'))
fig.add_trace(go.Bar(x=variables, y=p_adj_fdr, name='FDR (BH)',
                     marker_color='green'))

fig.add_hline(y=0.05, line_dash='dash', line_color='red',
              annotation_text='α = 0.05')

fig.update_layout(
    title='Multiple Testing Corrections',
    yaxis_title='p-value',
    barmode='group',
    height=500
)
fig.show()

Bootstrap Confidence Intervals: When Theory Fails

Non-Parametric Approach

When data violates normality assumptions, bootstrapping provides robust confidence intervals by resampling:

def bootstrap_ci(data, statistic, n_bootstrap=10000, ci=0.95):
    """Calculate bootstrap confidence interval for any statistic"""
    bootstrap_stats = []
    n = len(data)

    for _ in range(n_bootstrap):
        sample = data.sample(n, replace=True)
        bootstrap_stats.append(statistic(sample))

    alpha = (1 - ci) / 2
    lower = np.percentile(bootstrap_stats, alpha * 100)
    upper = np.percentile(bootstrap_stats, (1 - alpha) * 100)

    return bootstrap_stats, lower, upper

# Bootstrap median charge for smokers
bootstrap_medians, ci_low, ci_high = bootstrap_ci(
    df[df['smoker']=='yes']['charges'],
    lambda x: x.median(),
    n_bootstrap=10000
)

print(f"Bootstrap 95% CI for median: [${ci_low:,.2f}, ${ci_high:,.2f}]")

# Interactive bootstrap distribution
fig = go.Figure()
fig.add_trace(go.Histogram(x=bootstrap_medians, nbinsx=50,
                           marker_color='steelblue', opacity=0.7,
                           name='Bootstrap distribution'))

fig.add_vline(x=ci_low, line_dash='dash', line_color='red',
              annotation_text=f'Lower CI: ${ci_low:,.0f}')
fig.add_vline(x=ci_high, line_dash='dash', line_color='red',
              annotation_text=f'Upper CI: ${ci_high:,.0f}')

fig.update_layout(
    title='Bootstrap Distribution of Median Charges (Smokers)',
    xaxis_title='Median Charge ($)',
    yaxis_title='Frequency',
    height=500
)
fig.show()

Visualizing Bootstrap Process

# Animate bootstrap sampling with a slider over successive samples

# Take 50 bootstrap samples for visualization
samples = [df[df['smoker']=='yes']['charges'].sample(100, replace=True).median() 
           for _ in range(50)]

fig = go.Figure()

# Add one histogram trace per step; the slider below toggles visibility
for i in range(1, len(samples)+1):
    fig.add_trace(go.Histogram(
        x=samples[:i],
        nbinsx=20,
        marker_color='teal',
        opacity=0.7,
        visible=(i==1)  # Only first frame visible initially
    ))

# Create slider steps
steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)},
              {"title": f"Bootstrap Sample {i+1}/{len(samples)}"}],
    )
    step["args"][0]["visible"][i] = True
    steps.append(step)

sliders = [dict(active=0, steps=steps)]

fig.update_layout(
    sliders=sliders,
    title=f"Bootstrap Sampling Animation",
    xaxis_title="Median Charge ($)",
    height=500
)
fig.show()

Permutation Tests: Testing Without Assumptions

The Concept

A permutation test determines if an observed difference is significant by randomly shuffling group labels and recomputing the statistic:

def permutation_test(group1, group2, statistic=lambda x, y: x.mean() - y.mean(),
                     n_permutations=10000):
    """Non-parametric permutation test"""
    observed_diff = statistic(group1, group2)
    combined = pd.concat([group1, group2])
    n1 = len(group1)

    perm_diffs = []
    for _ in range(n_permutations):
        shuffled = combined.sample(frac=1).reset_index(drop=True)
        perm_group1 = shuffled.iloc[:n1]
        perm_group2 = shuffled.iloc[n1:]
        perm_diffs.append(statistic(perm_group1, perm_group2))

    p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))
    return observed_diff, perm_diffs, p_value

# Test: Do males pay more than females?
male_charges = df[df['sex']=='male']['charges'].reset_index(drop=True)
female_charges = df[df['sex']=='female']['charges'].reset_index(drop=True)

obs_diff, perm_diffs, p_val = permutation_test(male_charges, female_charges)

print(f"Observed difference: ${obs_diff:,.2f}")
print(f"Permutation p-value: {p_val:.4f}")

# Visualize permutation distribution
fig = go.Figure()

fig.add_trace(go.Histogram(x=perm_diffs, nbinsx=50,
                           marker_color='lightgray', opacity=0.7,
                           name='Permutation distribution'))

fig.add_vline(x=obs_diff, line_color='red', line_width=3,
              annotation_text=f'Observed: ${obs_diff:,.0f}<br>p={p_val:.3f}')

fig.update_layout(
    title='Permutation Test: Male vs Female Medical Charges',
    xaxis_title='Mean Difference ($)',
    yaxis_title='Frequency',
    height=500
)
fig.show()

If the observed difference falls in the tail of the permutation distribution (beyond 95% of shuffled differences), we reject the null hypothesis.
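
You can read that rejection rule straight off the permutation distribution computed above. This snippet reuses perm_diffs and obs_diff from the previous block and compares the observed difference to the 95th percentile of the shuffled absolute differences:

# Two-sided rejection threshold from the permutation distribution
threshold = np.percentile(np.abs(perm_diffs), 95)
print(f"95th percentile of |shuffled differences|: ${threshold:,.2f}")
print(f"Observed |difference|: ${abs(obs_diff):,.2f}")
print("Reject H0 at alpha=0.05" if abs(obs_diff) > threshold else "Fail to reject H0 at alpha=0.05")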

Effect Size Matters: Statistical vs Practical Significance

When p < 0.05 Isn’t Enough

With large samples, even tiny effects become “statistically significant.” Always report effect size:

# Compare statistical vs practical significance
from scipy.stats import pearsonr

# Age vs charges correlation
corr_coef, p_value = pearsonr(df['age'], df['charges'])
r_squared = corr_coef ** 2

print(f"Pearson r: {corr_coef:.4f}")
print(f"p-value: {p_value:.4e}")
print(f"R²: {r_squared:.4f} ({r_squared*100:.2f}% variance explained)")

# Interpretation guide
if abs(corr_coef) < 0.1:
    strength = "negligible"
elif abs(corr_coef) < 0.3:
    strength = "weak"
elif abs(corr_coef) < 0.5:
    strength = "moderate"
else:
    strength = "strong"

print(f"Effect size interpretation: {strength}")

Cohen’s d for Group Comparisons

# Compare regions
regions = df['region'].unique()
effect_sizes = []

for i, region1 in enumerate(regions):
    for region2 in regions[i+1:]:
        group1 = df[df['region']==region1]['charges']
        group2 = df[df['region']==region2]['charges']

        pooled_std = sqrt(((len(group1)-1)*group1.std()**2 + 
                          (len(group2)-1)*group2.std()**2) / 
                         (len(group1) + len(group2) - 2))
        d = (group1.mean() - group2.mean()) / pooled_std

        effect_sizes.append({
            'Comparison': f"{region1} vs {region2}",
            "Cohen's d": d,
            'Magnitude': 'Small' if abs(d)<0.5 else 'Medium' if abs(d)<0.8 else 'Large'
        })

effect_df = pd.DataFrame(effect_sizes).sort_values("Cohen's d", key=abs, ascending=False)
print(effect_df.to_string(index=False))

Checklist: 10 Questions Before Claiming a Relationship

Before declaring “X causes Y” or “X predicts Y,” ask:

  1. Is the correlation statistically significant? (p < 0.05 after correction)
  2. What is the effect size? (Cohen’s d, r², odds ratio)
  3. Have I corrected for multiple comparisons? (Bonferroni, FDR)
  4. Is there a confounding variable? (Simpson’s paradox check)
  5. Does the relationship hold in subgroups? (Stratified analysis)
  6. Is the relationship causal or merely associative? (Temporal order, mechanism)
  7. Are my statistical assumptions met? (Normality, homoscedasticity, independence)
  8. Have I visualized the data? (Outliers, non-linearity, heteroscedasticity)
  9. Is the sample representative? (Selection bias, generalizability)
  10. Can I replicate the finding? (Cross-validation, hold-out set)

Implementing the Checklist as Code

def statistical_checklist(df, var1, var2, group_col=None, alpha=0.05):
    """Automated statistical rigor checklist"""
    results = {}

    # 1. Correlation test
    corr, p_val = pearsonr(df[var1], df[var2])
    results['correlation'] = corr
    results['p_value'] = p_val
    results['significant'] = p_val < alpha

    # 2. Effect size
    r_squared = corr ** 2
    results['r_squared'] = r_squared
    results['variance_explained'] = f"{r_squared*100:.1f}%"

    # 3. Check for confounding (if group provided)
    if group_col:
        group_corrs = df.groupby(group_col).apply(
            lambda x: x[var1].corr(x[var2])
        )
        results['group_correlations'] = group_corrs.to_dict()
        results['simpsons_paradox'] = (np.sign(corr) != np.sign(group_corrs)).any()

    # 4. Normality check (Shapiro-Wilk)
    _, p_norm1 = stats.shapiro(df[var1].sample(min(5000, len(df))))
    _, p_norm2 = stats.shapiro(df[var2].sample(min(5000, len(df))))
    results['normality_assumption'] = (p_norm1 > 0.05) and (p_norm2 > 0.05)

    # 5. Outlier detection (IQR method)
    def count_outliers(series):
        Q1, Q3 = series.quantile([0.25, 0.75])
        IQR = Q3 - Q1
        return ((series < Q1 - 1.5*IQR) | (series > Q3 + 1.5*IQR)).sum()

    results['outliers_var1'] = count_outliers(df[var1])
    results['outliers_var2'] = count_outliers(df[var2])

    return results

# Example usage
checklist_result = statistical_checklist(df, 'age', 'charges', group_col='smoker')
for key, value in checklist_result.items():
    print(f"{key}: {value}")

Advanced: Controlling for Confounders with Regression

Partial Correlation: Isolating the Effect

When you suspect confounding, use partial correlation to control for third variables:

from scipy.stats import pearsonr

def partial_correlation(df, x, y, control):
    """Calculate partial correlation controlling for a third variable"""
    # Residualize x and y against control
    from sklearn.linear_model import LinearRegression

    lr = LinearRegression()
    lr.fit(df[[control]], df[x])
    resid_x = df[x] - lr.predict(df[[control]])

    lr.fit(df[[control]], df[y])
    resid_y = df[y] - lr.predict(df[[control]])

    # Correlation of residuals
    return pearsonr(resid_x, resid_y)

# Example: Age-charges correlation controlling for BMI
raw_corr, _ = pearsonr(df['age'], df['charges'])
partial_corr, partial_p = partial_correlation(df, 'age', 'charges', 'bmi')

print(f"Raw correlation (age-charges): {raw_corr:.3f}")
print(f"Partial correlation (controlling for BMI): {partial_corr:.3f}")
print(f"p-value: {partial_p:.4e}")

Multiple Regression: The Gold Standard

import statsmodels.api as sm

# Convert categorical variables (dtype=float so statsmodels accepts the dummy columns)
df_encoded = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True, dtype=float)

# Fit multiple regression
X = df_encoded[['age', 'bmi', 'children', 'sex_male', 'smoker_yes']]
y = df_encoded['charges']
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
print(model.summary())

# Extract key statistics
print(f"\nR²: {model.rsquared:.3f}")
print(f"Adjusted R²: {model.rsquared_adj:.3f}")
print(f"\nCoefficients with 95% CI:")
for var in X.columns:
    coef = model.params[var]
    ci = model.conf_int().loc[var]
    print(f"{var:15s}: {coef:8.2f} [{ci[0]:8.2f}, {ci[1]:8.2f}]")

The regression output shows each variable’s independent effect after controlling for all others—crucial for causal inference.
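
As a quick sanity check on what "independent effect" means, compare the raw smoker/non-smoker gap with the smoker_yes coefficient from the fitted model above: the coefficient estimates the charge difference holding age, BMI, children, and sex fixed.

# Raw gap vs adjusted effect (reuses df and the fitted model from above)
raw_gap = df[df['smoker'] == 'yes']['charges'].mean() - df[df['smoker'] == 'no']['charges'].mean()
adjusted_gap = model.params['smoker_yes']

print(f"Raw smoker vs non-smoker gap:    ${raw_gap:,.0f}")
print(f"Adjusted gap (smoker_yes coef.): ${adjusted_gap:,.0f}")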

Conclusion

Statistical rigor is the foundation of trustworthy data science. In this episode, we’ve dissected spurious correlations, navigated Simpson’s Paradox, and learned to wield p-values, confidence intervals, and effect sizes correctly. We’ve seen how multiple testing inflates false positives and how to correct for it using Bonferroni and FDR methods.

Interactive visualizations—from confidence interval plots to bootstrap distributions—transform abstract statistical concepts into intuitive insights. The 10-question checklist provides a systematic framework to validate your findings before making claims.

In Part 3, we’ll tackle another critical challenge: imbalanced datasets. When 95% of your data belongs to one class, standard models fail spectacularly. We’ll explore SMOTE, ADASYN, and advanced resampling techniques to handle this pervasive problem. See you there!

The Art of Data Storytelling: Insights from Kaggle Datasets Series (2/4)
