Why Frequentist A/B Tests Fail with 50 Conversions
You run an A/B test for two weeks. Variant A: 47 conversions out of 312 visitors (15.1%). Variant B: 61 conversions out of 289 visitors (21.1%). The difference looks promising, but your t-test returns p=0.08. Not significant. “Keep running the test,” your manager says.
But what if you can’t wait another month? What if this is a B2B product where 500 visitors took three months to accumulate? Frequentist methods demand large samples for reliable p-values. Bayesian A/B testing doesn’t.
The Beta-Binomial model gives you probability distributions over conversion rates from day one. Instead of “is this significant?”, you ask “what’s the probability B beats A?” and “by how much?” With 50 conversions per variant, you get actionable answers. With 500, you get near-certainty.
The Math Behind Beta-Binomial Conjugacy
Conversion rate estimation is a Bernoulli process: each visitor either converts (1) or doesn’t (0). The Beta distribution is the conjugate prior for Bernoulli likelihood, which means posterior updates have a closed form.
Start with a prior belief about the conversion rate $\theta$:

$$\theta \sim \mathrm{Beta}(\alpha, \beta)$$

Observe $k$ conversions out of $n$ trials. The likelihood:

$$p(k \mid \theta, n) = \binom{n}{k}\, \theta^{k} (1 - \theta)^{n - k}$$

The posterior is another Beta distribution:

$$\theta \mid k, n \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k)$$

No MCMC. No numerical integration. Just add your successes to $\alpha$ and your failures to $\beta$.
The conjugacy means you can update beliefs incrementally. Day 1: Beta(1,1) prior. Day 2: observe 3 conversions, 27 non-conversions → Beta(4, 28). Day 3: 2 more conversions, 31 non-conversions → Beta(6, 59). The posterior from yesterday becomes today’s prior.
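A minimal sketch of that incremental update, using the day 2 and day 3 counts above (plain Python, variable names are mine):

alpha, beta_param = 1, 1                      # Day 1: Beta(1, 1) uniform prior
daily_data = [(3, 30), (2, 33)]               # (conversions, visitors) for days 2 and 3
for conversions, visitors in daily_data:
    alpha += conversions                      # successes go to alpha
    beta_param += visitors - conversions      # failures go to beta
    print(f"Posterior: Beta({alpha}, {beta_param})")
# Prints Beta(4, 28), then Beta(6, 59)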
Choosing Your Prior: Uniform vs Informative
Beta(1,1) is uniform over [0,1]. It says “any conversion rate is equally plausible.” For true cold starts, this works. But if you have historical data—say, your site typically converts at 8-12%—you should encode that.
A Beta(8, 92) prior has mean $8/(8+92) = 0.08$ and concentrates probability around 5-12%. With weak data (n=20), this prior dominates. With strong data (n=500), the likelihood overwhelms it. That’s the Bayesian compromise: strong opinions, weakly held.
How much does prior choice matter? Run the numbers:
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
# Observed data: 15 conversions out of 100 visitors
conversions = 15
visitors = 100
# Three priors: uniform, weak informative (10% expected), strong informative
priors = [
    (1, 1, "Uniform"),
    (5, 45, "Weak (10%)"),
    (20, 180, "Strong (10%)")
]
theta = np.linspace(0, 0.4, 500)
for alpha, beta_param, label in priors:
    posterior = beta(alpha + conversions, beta_param + visitors - conversions)
    plt.plot(theta, posterior.pdf(theta), label=f"{label}: mean={posterior.mean():.3f}")
plt.axvline(conversions / visitors, color='red', linestyle='--', label='MLE')
plt.legend()
plt.xlabel('Conversion Rate')
plt.ylabel('Density')
plt.title('Posterior Sensitivity to Prior Choice (n=100)')
plt.show()
With n=100, the uniform prior gives a posterior mean of 0.157, close to the MLE of 0.150. The weak prior pulls it down to 0.133, and the strong prior to 0.117. That’s a big enough shift to change a decision if you’re close to a threshold.
With n=1000 and 150 conversions, the posteriors tighten around the data: roughly 0.151 under the uniform prior, 0.148 under the weak prior, and 0.142 even under the strong prior. The likelihood dominates. If you’re worried about prior influence, collect more data, or use Beta(1,1) and live with higher uncertainty early on.
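Those posterior means are just the conjugate formula $(\alpha + k)/(\alpha + \beta + n)$, so you can check them without plotting anything:

for k, n in [(15, 100), (150, 1000)]:
    for a0, b0, label in [(1, 1, "Uniform"), (5, 45, "Weak (10%)"), (20, 180, "Strong (10%)")]:
        post_mean = (a0 + k) / (a0 + b0 + n)   # conjugate posterior mean
        print(f"n={n:5d}  {label:12s}  posterior mean = {post_mean:.3f}")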
Computing P(B > A) via Closed-Form Integration
You have two posteriors: $\theta_A \sim \mathrm{Beta}(\alpha_A, \beta_A)$ and $\theta_B \sim \mathrm{Beta}(\alpha_B, \beta_B)$. The key question: what is $P(\theta_B > \theta_A)$?
The exact formula involves the regularized incomplete beta function:

$$P(\theta_B > \theta_A) = \int_0^1 \frac{\theta^{\alpha_A - 1}(1 - \theta)^{\beta_A - 1}}{B(\alpha_A, \beta_A)} \left[1 - I_{\theta}(\alpha_B, \beta_B)\right] d\theta$$

where $I_{\theta}(\alpha_B, \beta_B)$ is the regularized incomplete beta function (the CDF of B’s posterior) and $B(\alpha_A, \beta_A)$ is the beta function. Implementing this from scratch is tedious and numerically unstable for large parameters.
Monte Carlo is easier and sufficiently accurate:
def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b, samples=100000):
    """
    Estimate P(theta_B > theta_A) via Monte Carlo sampling.
    For small samples (n<50), use at least 100k draws to reduce noise.
    For production decisions, 1M samples add negligible cost.
    """
    theta_a_samples = np.random.beta(alpha_a, beta_a, samples)
    theta_b_samples = np.random.beta(alpha_b, beta_b, samples)
    return (theta_b_samples > theta_a_samples).mean()
# Example: A has 47/312, B has 61/289 (from intro)
alpha_a, beta_a = 1 + 47, 1 + (312 - 47) # uniform prior
alpha_b, beta_b = 1 + 61, 1 + (289 - 61)
prob = prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b)
print(f"P(B > A) = {prob:.4f}") # Output: ~0.9512
With roughly 97% probability, B’s true conversion rate exceeds A’s. That’s actionable even though the frequentist p-value was 0.08.
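If you’d rather have an exact answer than a Monte Carlo estimate, you don’t need to implement the incomplete beta function yourself. One option, a sketch leaning on SciPy’s Beta CDF and numerical quadrature rather than the closed-form sum, is to integrate $\int_0^1 f_A(\theta)\,[1 - F_B(\theta)]\,d\theta$ directly:

from scipy import integrate
from scipy.stats import beta

def prob_b_beats_a_exact(alpha_a, beta_a, alpha_b, beta_b):
    """P(theta_B > theta_A) as the integral of f_A(theta) * (1 - F_B(theta)) over [0, 1]."""
    def integrand(t):
        return beta.pdf(t, alpha_a, beta_a) * (1 - beta.cdf(t, alpha_b, beta_b))
    value, _ = integrate.quad(integrand, 0, 1)
    return value

print(f"Exact: {prob_b_beats_a_exact(48, 266, 62, 229):.4f}")  # agrees with the Monte Carlo estimate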
But how much better is B? Reporting point probabilities without effect size is dangerous. Maybe B beats A by 0.1 percentage points—hardly worth the implementation cost.
Expected Loss and Practical Significance
The expected loss of choosing A (when B might be better) is:

$$\mathbb{E}[L_A] = \mathbb{E}\left[\max(0,\ \theta_B - \theta_A)\right] = \int_0^1\!\!\int_0^1 \max(0,\ \theta_B - \theta_A)\; p(\theta_A)\, p(\theta_B)\; d\theta_A\, d\theta_B$$

This averages the regret over all possible true conversion rates, weighted by posterior probability. If $\mathbb{E}[L_A] = 0.002$, you’d lose 0.2 percentage points in expectation by picking A over B.
Symmetrically, $\mathbb{E}[L_B] = \mathbb{E}\left[\max(0,\ \theta_A - \theta_B)\right]$ measures the loss of picking B when A is better. The decision rule: pick whichever has lower expected loss. In practice, $\mathbb{E}[L_A]$ and $\mathbb{E}[L_B]$ sum to the expected absolute difference $\mathbb{E}\,|\theta_B - \theta_A|$, and they differ by exactly the gap in posterior means, so you can compute one and infer the other.
def expected_loss(alpha_a, beta_a, alpha_b, beta_b, samples=100000):
    """
    Compute expected loss for choosing A over B (and vice versa).
    Returns: (loss_if_choose_A, loss_if_choose_B)
    """
    theta_a = np.random.beta(alpha_a, beta_a, samples)
    theta_b = np.random.beta(alpha_b, beta_b, samples)
    loss_a = np.maximum(0, theta_b - theta_a).mean()  # regret of choosing A
    loss_b = np.maximum(0, theta_a - theta_b).mean()  # regret of choosing B
    return loss_a, loss_b
loss_a, loss_b = expected_loss(alpha_a, beta_a, alpha_b, beta_b)
print(f"Expected loss if choose A: {loss_a:.5f}")
print(f"Expected loss if choose B: {loss_b:.5f}")
# Output: ~0.0605 vs ~0.0004
Picking A risks losing about six percentage points of conversion rate in expectation. Picking B risks only about 0.04 points. Clear winner.
You can also set a region of practical equivalence (ROPE). If the 95% credible interval for $\theta_B - \theta_A$ lies entirely above +0.01, you’d declare B “meaningfully better.” This prevents chasing statistical noise.
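With posterior draws you already know how to generate, the ROPE check is a few lines. The +0.01 threshold here is the illustrative one from the paragraph above, not a universal constant:

rope = 0.01                                              # minimum lift worth acting on
theta_a_draws = np.random.beta(alpha_a, beta_a, 100_000)
theta_b_draws = np.random.beta(alpha_b, beta_b, 100_000)
lift = theta_b_draws - theta_a_draws
lower, upper = np.percentile(lift, [2.5, 97.5])
if lower > rope:
    verdict = "B is meaningfully better"
elif upper < rope:
    verdict = "no practically significant lift"
else:
    verdict = "inconclusive: interval straddles the ROPE boundary"
print(f"95% credible interval for the lift: [{lower:.4f}, {upper:.4f}] -> {verdict}")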
When Bayesian A/B Testing Breaks Down
I’ve seen teams misuse this in two ways.
First: ignoring sample size entirely. Yes, Beta-Binomial updates from day one. But with n=10 per variant, your posterior is still wide. A high $P(B > A)$ sounds confident until you realize the 95% credible intervals overlap by 80%. If decisions are expensive (e.g., redesigning a checkout flow), wait for narrower posteriors.
A practical threshold: I won’t act unless the expected loss of the worse variant exceeds 0.5 percentage points and sample size per variant exceeds 100. Below that, I treat it as exploratory.
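That rule of thumb is easy to encode as a guard. The 0.005 and 100 thresholds are the ones from my rule above; tune them to your own cost of a wrong call:

def ready_to_decide(loss_worse, n_per_variant, min_loss=0.005, min_n=100):
    """Act only when the worse variant's expected loss and the sample size both clear the bar."""
    return loss_worse >= min_loss and n_per_variant >= min_n

print(ready_to_decide(loss_worse=max(loss_a, loss_b), n_per_variant=min(312, 289)))  # True for the intro data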
Second: optional stopping without adjustment. Bayesian methods tolerate peeking better than frequentist tests (no p-value inflation), but they’re not immune. If you check the test every hour and stop the instant $P(B > A)$ crosses your threshold, you’ll overestimate effects. The prior should reflect your peeking behavior: use a skeptical prior (more concentrated around the null) if you plan to stop early often.
Chris Stucchio’s 2015 blog post on this is worth reading if you’re running high-frequency tests. The fix: either commit to a sample size upfront or use a stronger prior to penalize early stopping.
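One way to build such a skeptical prior is to center it on your historical baseline with a large pseudo-sample, so a few hours of noisy data can’t swing the posterior far. The 15% baseline, the 200 pseudo-observations, and the early counts below are made-up numbers for illustration:

# Skeptical prior: ~200 pseudo-observations centered on a 15% baseline rate
prior_alpha, prior_beta = 30, 170
early_a = np.random.beta(prior_alpha + 9,  prior_beta + 50 - 9,  100_000)   # 9/50 early conversions
early_b = np.random.beta(prior_alpha + 15, prior_beta + 50 - 15, 100_000)   # 15/50 early conversions
print(f"P(B > A) under the skeptical prior: {(early_b > early_a).mean():.3f}")
# Noticeably less confident than the same data under a flat Beta(1, 1) prior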
Handling Non-Binary Conversions: Revenue Per Visitor
Conversions are binary, but revenue isn’t. Visitor A spends \$0. Visitor B spends \$47. Visitor C spends \$12. How do you model that?
The Beta-Binomial doesn’t apply. Revenue is continuous, often right-skewed (many zeros, a few whales). The Bayesian approach: model revenue with a Normal or Gamma likelihood, update a conjugate prior.
For a Normal likelihood with known variance $\sigma^2$, the conjugate prior on the mean is another Normal. For unknown variance, use a Normal-Inverse-Gamma prior (conjugate to a Normal likelihood with both mean and variance unknown). The math gets messier, but the logic is the same: prior + data → posterior.
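For intuition, the known-variance case fits in a few lines. This is a generic sketch, not the revenue model I’d ship: prior_mean, prior_var, and the assumption of a known noise variance are all illustrative.

def normal_conjugate_update(data, prior_mean, prior_var, noise_var):
    """Posterior over the mean of a Normal likelihood with known noise variance."""
    n = len(data)
    prior_precision = 1.0 / prior_var
    noise_precision = 1.0 / noise_var
    post_precision = prior_precision + n * noise_precision
    post_mean = (prior_precision * prior_mean + noise_precision * data.sum()) / post_precision
    return post_mean, 1.0 / post_precision       # posterior mean and variance of the revenue mean

post_mu, post_var = normal_conjugate_update(
    np.array([0.0, 47.0, 12.0, 0.0, 8.0]),       # a handful of revenue observations
    prior_mean=5.0, prior_var=25.0, noise_var=400.0
)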
In practice, I’ve had better luck with bootstrap approximations for revenue A/B tests. Draw 10k bootstrap samples from each variant’s revenue distribution, compute the difference, and build an empirical posterior. Not strictly Bayesian, but way more robust to outliers than assuming Normality.
# Revenue data (zeros included)
revenue_a = np.array([0, 0, 0, 47, 0, 12, 0, 0, 8, 0, 0, 23, ...]) # n=312
revenue_b = np.array([0, 15, 0, 0, 31, 0, 0, 0, 0, 19, ...]) # n=289
bootstrap_diffs = []
for _ in range(10000):
    sample_a = np.random.choice(revenue_a, size=len(revenue_a), replace=True)
    sample_b = np.random.choice(revenue_b, size=len(revenue_b), replace=True)
    bootstrap_diffs.append(sample_b.mean() - sample_a.mean())
bootstrap_diffs = np.array(bootstrap_diffs)
prob_b_better = (bootstrap_diffs > 0).mean()
lower, upper = np.percentile(bootstrap_diffs, [2.5, 97.5])
print(f"P(revenue_B > revenue_A): {prob_b_better:.3f}")
print(f"95% CI for difference: [{lower:.2f}, {upper:.2f}]")
This sidesteps distributional assumptions. If your revenue data has long tails or weird bimodality (common in freemium models), bootstrap inference is more trustworthy than a Normal approximation.
Building a Reusable A/B Test Class
Here’s a production-ready Beta-Binomial A/B test implementation. I’ve used this for product experiments at n=50 per variant and n=10,000 per variant—it scales fine.
import numpy as np
from scipy.stats import beta
from typing import Tuple
class BayesianABTest:
    def __init__(self, prior_alpha: float = 1.0, prior_beta: float = 1.0):
        """
        Beta-Binomial A/B test with conjugate updates.
        Args:
            prior_alpha: Alpha parameter of Beta prior (successes + 1)
            prior_beta: Beta parameter of Beta prior (failures + 1)
        """
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
        self.results = {}

    def add_variant(self, name: str, conversions: int, trials: int):
        """Record data for a variant."""
        alpha = self.prior_alpha + conversions
        beta_param = self.prior_beta + (trials - conversions)
        self.results[name] = {
            'alpha': alpha,
            'beta': beta_param,
            'conversions': conversions,
            'trials': trials,
            'posterior': beta(alpha, beta_param)
        }

    def prob_b_beats_a(self, variant_a: str, variant_b: str, samples: int = 100000) -> float:
        """Estimate P(B > A) via Monte Carlo."""
        a = self.results[variant_a]['posterior']
        b = self.results[variant_b]['posterior']
        theta_a = a.rvs(samples)
        theta_b = b.rvs(samples)
        return (theta_b > theta_a).mean()

    def expected_loss(self, variant_a: str, variant_b: str, samples: int = 100000) -> Tuple[float, float]:
        """Expected loss for choosing each variant."""
        a = self.results[variant_a]['posterior']
        b = self.results[variant_b]['posterior']
        theta_a = a.rvs(samples)
        theta_b = b.rvs(samples)
        loss_a = np.maximum(0, theta_b - theta_a).mean()
        loss_b = np.maximum(0, theta_a - theta_b).mean()
        return loss_a, loss_b

    def credible_interval(self, variant: str, alpha: float = 0.05) -> Tuple[float, float]:
        """Posterior credible interval for conversion rate."""
        posterior = self.results[variant]['posterior']
        return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)

    def summary(self, variant_a: str, variant_b: str):
        """Print decision summary."""
        prob = self.prob_b_beats_a(variant_a, variant_b)
        loss_a, loss_b = self.expected_loss(variant_a, variant_b)
        ci_a = self.credible_interval(variant_a)
        ci_b = self.credible_interval(variant_b)
        print(f"Variant {variant_a}: {self.results[variant_a]['conversions']}/{self.results[variant_a]['trials']}")
        print(f"  Posterior mean: {self.results[variant_a]['posterior'].mean():.4f}")
        print(f"  95% CI: [{ci_a[0]:.4f}, {ci_a[1]:.4f}]\n")
        print(f"Variant {variant_b}: {self.results[variant_b]['conversions']}/{self.results[variant_b]['trials']}")
        print(f"  Posterior mean: {self.results[variant_b]['posterior'].mean():.4f}")
        print(f"  95% CI: [{ci_b[0]:.4f}, {ci_b[1]:.4f}]\n")
        print(f"P({variant_b} > {variant_a}): {prob:.4f}")
        print(f"Expected loss if choose {variant_a}: {loss_a:.5f}")
        print(f"Expected loss if choose {variant_b}: {loss_b:.5f}")
        if loss_a < 0.005 and loss_b < 0.005:
            print("\n⚠ Variants are practically equivalent (both losses < 0.5%).")
        elif loss_a < loss_b:
            print(f"\n✓ Choose {variant_a} (lower expected loss).")
        else:
            print(f"\n✓ Choose {variant_b} (lower expected loss).")
# Usage
test = BayesianABTest(prior_alpha=1, prior_beta=1)
test.add_variant('Control', conversions=47, trials=312)
test.add_variant('Treatment', conversions=61, trials=289)
test.summary('Control', 'Treatment')
Output:
Variant Control: 47/312
  Posterior mean: 0.1529
  95% CI: [0.1145, 0.1939]

Variant Treatment: 61/289
  Posterior mean: 0.2131
  95% CI: [0.1684, 0.2639]

P(Treatment > Control): 0.9724
Expected loss if choose Control: 0.06055
Expected loss if choose Treatment: 0.00036

✓ Choose Treatment (lower expected loss).
This gives you everything: point estimates, uncertainty, decision rationale. I’ve extended this class to handle multi-armed bandits (Thompson sampling) by adding a sample_best() method that draws one sample from each posterior and returns the argmax.
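The sample_best() extension mentioned above can be as simple as this sketch (the implementation is mine; the class as written doesn’t include it):

def sample_best(self) -> str:
    """Thompson sampling: draw once from each posterior and return the winning variant's name."""
    draws = {name: result['posterior'].rvs() for name, result in self.results.items()}
    return max(draws, key=draws.get)

BayesianABTest.sample_best = sample_best   # attach for illustration
print(test.sample_best())                  # serve this variant to the next visitor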
Why I Still Use Frequentist Tests Sometimes
Bayesian A/B testing isn’t always the right tool. If your org has regulatory requirements (FDA trials, financial compliance), frequentist methods have decades of legal precedent. Auditors understand “p < 0.05” more readily than “95% probability of superiority.”
Also, if your sample size is huge (n > 10,000 per variant), Bayesian and frequentist methods converge. The prior becomes irrelevant, and $P(B > A) \approx 1 - p/2$ for a two-sided test. At that scale, I’d just use a z-test; it’s faster and doesn’t require justifying a prior choice to stakeholders.
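You can see that convergence with a quick check; the counts below are invented to illustrate the n > 10,000 regime:

from scipy.stats import norm

x_a, n_a, x_b, n_b = 1500, 10_000, 1600, 10_000
p_pooled = (x_a + x_b) / (n_a + n_b)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
z = (x_b / n_b - x_a / n_a) / se
p_two_sided = 2 * (1 - norm.cdf(abs(z)))
prob_b = prob_b_beats_a(1 + x_a, 1 + n_a - x_a, 1 + x_b, 1 + n_b - x_b)
print(f"1 - p/2 = {1 - p_two_sided / 2:.4f}, P(B > A) = {prob_b:.4f}")  # nearly identical at this scale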
But for small-sample product experiments (n=50-500), Bayesian methods give you actionable probabilities from day one. The conjugacy of Beta-Binomial makes it computationally trivial. And expected loss provides a natural cost-benefit framing that p-values never do.
If you’re still using t-tests for conversion rates, try Beta-Binomial on your next experiment. You’ll never go back to waiting for p=0.049.