Bayesian A/B Testing for Small Sample Sizes: Beta-Binomial Model Implementation

Updated Feb 6, 2026

Why Frequentist A/B Tests Fail with 50 Conversions

You run an A/B test for two weeks. Variant A: 47 conversions out of 312 visitors (15.1%). Variant B: 61 conversions out of 289 visitors (21.1%). The difference looks promising, but your t-test returns p=0.08. Not significant. “Keep running the test,” your manager says.

But what if you can’t wait another month? What if this is a B2B product where 500 visitors took three months to accumulate? Frequentist methods demand large samples for reliable p-values. Bayesian A/B testing doesn’t.

The Beta-Binomial model gives you probability distributions over conversion rates from day one. Instead of “is this significant?”, you ask “what’s the probability B beats A?” and “by how much?” With 50 conversions per variant, you get actionable answers. With 500, you get near-certainty.

The Math Behind Beta-Binomial Conjugacy

Conversion rate estimation is a Bernoulli process: each visitor either converts (1) or doesn’t (0). The Beta distribution is the conjugate prior for Bernoulli likelihood, which means posterior updates have a closed form.

Start with a prior belief about the conversion rate $\theta$:

$$\theta \sim \text{Beta}(\alpha, \beta)$$

Observe $k$ conversions out of $n$ trials. The likelihood:

$$P(k \mid \theta, n) = \binom{n}{k} \theta^k (1-\theta)^{n-k}$$

The posterior is another Beta distribution:

$$\theta \mid k, n \sim \text{Beta}(\alpha + k, \beta + n - k)$$

No MCMC. No numerical integration. Just add your successes to $\alpha$ and your failures to $\beta$.

The conjugacy means you can update beliefs incrementally. Day 1: Beta(1,1) prior. Day 2: observe 3 conversions, 27 non-conversions → Beta(4, 28). Day 3: 2 more conversions, 31 non-conversions → Beta(6, 59). The posterior from yesterday becomes today’s prior.
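
In code, the sequential update is just two running sums (a quick sketch of the three days above):

alpha, beta_count = 1, 1  # Day 1: uniform Beta(1, 1) prior
for day, (conversions, non_conversions) in enumerate([(3, 27), (2, 31)], start=2):
    alpha += conversions           # successes accumulate in alpha
    beta_count += non_conversions  # failures accumulate in beta
    print(f"Day {day} posterior: Beta({alpha}, {beta_count})")
# Day 2 posterior: Beta(4, 28)
# Day 3 posterior: Beta(6, 59)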

Choosing Your Prior: Uniform vs Informative

Beta(1,1) is uniform over [0,1]. It says “any conversion rate is equally plausible.” For true cold starts, this works. But if you have historical data—say, your site typically converts at 8-12%—you should encode that.

A Beta(8, 92) prior has mean $\frac{8}{8+92} = 0.08$ and concentrates probability around 5-12%. With weak data (n=20), this prior dominates. With strong data (n=500), the likelihood overwhelms it. That's the Bayesian compromise: strong opinions, weakly held.

How much does prior choice matter? Run the numbers:

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

# Observed data: 15 conversions out of 100 visitors
conversions = 15
visitors = 100

# Three priors: uniform, weak informative (10% expected), strong informative
priors = [
    (1, 1, "Uniform"),
    (5, 45, "Weak (10%)"),
    (20, 180, "Strong (10%)")
]

theta = np.linspace(0, 0.4, 500)

for alpha, beta_param, label in priors:
    posterior = beta(alpha + conversions, beta_param + visitors - conversions)
    plt.plot(theta, posterior.pdf(theta), label=f"{label}: mean={posterior.mean():.3f}")

plt.axvline(conversions / visitors, color='red', linestyle='--', label='MLE')
plt.legend()
plt.xlabel('Conversion Rate')
plt.ylabel('Density')
plt.title('Posterior Sensitivity to Prior Choice (n=100)')
plt.show()

With n=100, the priors pull the posterior mean away from the MLE of 0.150: the uniform prior gives 0.157, the weak prior 0.133, and the strong prior 0.117. That's enough to affect decisions if you're close to a threshold.

With n=1000 and 150 conversions, the gaps shrink: roughly 0.151 (uniform), 0.148 (weak), and 0.142 (strong). The likelihood increasingly dominates. If you're worried about prior influence, collect more data, or use Beta(1,1) and live with higher uncertainty early on.
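
A quick check of those posterior means, reusing the priors list defined above (the n=1000 row assumes the same 15% observed rate):

# Posterior mean = (alpha + k) / (alpha + beta + n) for each prior
for n, k in [(100, 15), (1000, 150)]:
    for a, b, label in priors:
        print(f"n={n:5d}  {label:12s} posterior mean = {(a + k) / (a + b + n):.3f}")
# n=100:  Uniform 0.157, Weak 0.133, Strong 0.117
# n=1000: Uniform 0.151, Weak 0.148, Strong 0.142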

Computing P(B > A) via Closed-Form Integration

You have two posteriors: $\theta_A \sim \text{Beta}(\alpha_A, \beta_A)$ and $\theta_B \sim \text{Beta}(\alpha_B, \beta_B)$. The key question: $P(\theta_B > \theta_A)$?

There is an exact closed form (valid when $\alpha_B$ is an integer):

$$P(\theta_B > \theta_A) = \sum_{i=0}^{\alpha_B - 1} \frac{B(\alpha_A + i,\ \beta_A + \beta_B)}{(\beta_B + i)\, B(1+i,\ \beta_B)\, B(\alpha_A, \beta_A)}$$

where $B(a,b)$ is the beta function. Implementing this from scratch is tedious and numerically unstable for large parameters.
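
If you do want the exact value, evaluating each term in log space sidesteps the instability; here's a sketch for integer $\alpha_B$ using scipy.special.betaln (the function name is mine):

import numpy as np
from scipy.special import betaln  # log of the beta function B(a, b)

def prob_b_beats_a_exact(alpha_a, beta_a, alpha_b, beta_b):
    """Exact P(theta_B > theta_A) for integer alpha_b, summed term by term in log space."""
    i = np.arange(int(alpha_b))
    log_terms = (betaln(alpha_a + i, beta_a + beta_b)
                 - np.log(beta_b + i)
                 - betaln(1 + i, beta_b)
                 - betaln(alpha_a, beta_a))
    return np.exp(log_terms).sum()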

Monte Carlo is easier and sufficiently accurate:

def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b, samples=100000):
    """
    Estimate P(theta_B > theta_A) via Monte Carlo sampling.

    For small samples (n<50), use at least 100k draws to reduce noise.
    For production decisions, 1M samples add negligible cost.
    """
    theta_a_samples = np.random.beta(alpha_a, beta_a, samples)
    theta_b_samples = np.random.beta(alpha_b, beta_b, samples)
    return (theta_b_samples > theta_a_samples).mean()

# Example: A has 47/312, B has 61/289 (from intro)
alpha_a, beta_a = 1 + 47, 1 + (312 - 47)  # uniform prior
alpha_b, beta_b = 1 + 61, 1 + (289 - 61)

prob = prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b)
print(f"P(B > A) = {prob:.4f}")  # Output: ~0.9512

With roughly 97% probability, B's true conversion rate exceeds A's. That's actionable even though the frequentist test came back non-significant.

But how much better is B? Reporting point probabilities without effect size is dangerous. Maybe B beats A by 0.1 percentage points—hardly worth the implementation cost.

Expected Loss and Practical Significance

The expected loss of choosing A (when B might be better) is:

$$L_A = E[\max(0,\ \theta_B - \theta_A)]$$

This averages the regret over all possible true conversion rates, weighted by posterior probability. If $L_A = 0.002$, you'd lose 0.2 percentage points in expectation by picking A over B.

Symmetrically, $L_B$ measures the loss of picking B when A is better. The decision rule: pick whichever variant has the lower expected loss. In practice, $L_A - L_B$ equals the difference in posterior means, $E[\theta_B] - E[\theta_A]$, so once you've computed one loss you can infer the other.

def expected_loss(alpha_a, beta_a, alpha_b, beta_b, samples=100000):
    """
    Compute expected loss for choosing A over B.
    Returns: (loss_if_choose_A, loss_if_choose_B)
    """
    theta_a = np.random.beta(alpha_a, beta_a, samples)
    theta_b = np.random.beta(alpha_b, beta_b, samples)

    loss_a = np.maximum(0, theta_b - theta_a).mean()  # regret of choosing A
    loss_b = np.maximum(0, theta_a - theta_b).mean()  # regret of choosing B

    return loss_a, loss_b

loss_a, loss_b = expected_loss(alpha_a, beta_a, alpha_b, beta_b)
print(f"Expected loss if choose A: {loss_a:.5f}")
print(f"Expected loss if choose B: {loss_b:.5f}")
# Output: ~0.060 vs ~0.0003

Picking A risks giving up about six percentage points of conversion rate in expectation. Picking B risks only about 0.03 points. Clear winner.

You can also set a region of practical equivalence (ROPE). If the 95% credible interval for $\theta_B - \theta_A$ lies entirely above +0.01, you'd declare B “meaningfully better.” This prevents chasing statistical noise.
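
Here's a sketch of that check, sampling the lift $\theta_B - \theta_A$ directly and reusing the posterior parameters from the earlier example; the helper name and the +0.01 cutoff are illustrative:

def lift_credible_interval(alpha_a, beta_a, alpha_b, beta_b, samples=100000):
    """95% credible interval for theta_B - theta_A via posterior sampling."""
    lift = (np.random.beta(alpha_b, beta_b, samples)
            - np.random.beta(alpha_a, beta_a, samples))
    return np.percentile(lift, [2.5, 97.5])

lower, upper = lift_credible_interval(alpha_a, beta_a, alpha_b, beta_b)
rope = 0.01  # practical-significance threshold
print(f"95% CI for lift: [{lower:.4f}, {upper:.4f}]")
print("Meaningfully better" if lower > rope else "Not conclusive yet")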

When Bayesian A/B Testing Breaks Down

I’ve seen teams misuse this in two ways.

First: ignoring sample size entirely. Yes, Beta-Binomial updates from day one. But with n=10 per variant, your posterior is still wide. $P(B > A) = 0.83$ sounds confident until you realize the 95% credible intervals overlap by 80%. If decisions are expensive (e.g., redesigning a checkout flow), wait for narrower posteriors.

A practical threshold: I won’t act unless the expected loss of the worse variant exceeds 0.5 percentage points and sample size per variant exceeds 100. Below that, I treat it as exploratory.
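
Encoded as a simple gate (the 0.5-point and n=100 thresholds mirror my rule of thumb above; they are not universal constants):

def ready_to_decide(loss_a, loss_b, trials_a, trials_b,
                    min_loss=0.005, min_trials=100):
    """Act only when the apparent loser's expected loss is material and both arms have enough data."""
    return max(loss_a, loss_b) >= min_loss and min(trials_a, trials_b) >= min_trials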

Second: optional stopping without adjustment. Bayesian methods tolerate peeking better than frequentist tests (no p-value inflation), but they're not immune. If you check the test every hour and stop the instant $P(B > A) > 0.95$, you'll overestimate effects. The prior should reflect your peeking behavior—use a skeptical prior (more concentrated around the null) if you plan to stop early often.

Chris Stucchio’s 2015 blog post on this is worth reading if you’re running high-frequency tests. The fix: either commit to a sample size upfront or use a stronger prior to penalize early stopping.

Handling Non-Binary Conversions: Revenue Per Visitor

Conversions are binary, but revenue isn't. Visitor A spends $0. Visitor B spends $47. Visitor C spends $12. How do you model that?

The Beta-Binomial doesn’t apply. Revenue is continuous, often right-skewed (many zeros, a few whales). The Bayesian approach: model revenue with a Normal or Gamma likelihood, update a conjugate prior.

For Normal likelihood with known variance σ2\sigma^2, the conjugate prior is another Normal. For unknown variance, use a Normal-Inverse-Gamma prior (conjugate to Normal likelihood with both mean and variance unknown). The math gets messier, but the logic is the same: prior + data → posterior.
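
For reference, here's a sketch of that conjugate update using the standard Normal-Inverse-Gamma hyperparameters ($\mu_0, \kappa_0, \alpha_0, \beta_0$); the function name and default values are mine, not something from the sections above:

import numpy as np

def normal_inverse_gamma_update(x, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Posterior hyperparameters for a Normal-Inverse-Gamma prior given data x."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    ss = ((x - xbar) ** 2).sum()  # sum of squared deviations from the sample mean

    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n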

In practice, I’ve had better luck with bootstrap approximations for revenue A/B tests. Draw 10k bootstrap samples from each variant’s revenue distribution, compute the difference, and build an empirical posterior. Not strictly Bayesian, but way more robust to outliers than assuming Normality.

# Revenue data (zeros included)
revenue_a = np.array([0, 0, 0, 47, 0, 12, 0, 0, 8, 0, 0, 23, ...])  # n=312
revenue_b = np.array([0, 15, 0, 0, 31, 0, 0, 0, 0, 19, ...])  # n=289

bootstrap_diffs = []
for _ in range(10000):
    sample_a = np.random.choice(revenue_a, size=len(revenue_a), replace=True)
    sample_b = np.random.choice(revenue_b, size=len(revenue_b), replace=True)
    bootstrap_diffs.append(sample_b.mean() - sample_a.mean())

bootstrap_diffs = np.array(bootstrap_diffs)
prob_b_better = (bootstrap_diffs > 0).mean()
lower, upper = np.percentile(bootstrap_diffs, [2.5, 97.5])

print(f"P(revenue_B > revenue_A): {prob_b_better:.3f}")
print(f"95% CI for difference: [{lower:.2f}, {upper:.2f}]")

This sidesteps distributional assumptions. If your revenue data has long tails or weird bimodality (common in freemium models), bootstrap inference is more trustworthy than a Normal approximation.

Building a Reusable A/B Test Class

Here’s a production-ready Beta-Binomial A/B test implementation. I’ve used this for product experiments at n=50 per variant and n=10,000 per variant—it scales fine.

import numpy as np
from scipy.stats import beta
from typing import Tuple

class BayesianABTest:
    def __init__(self, prior_alpha: float = 1.0, prior_beta: float = 1.0):
        """
        Beta-Binomial A/B test with conjugate updates.

        Args:
            prior_alpha: Alpha of the Beta prior (prior pseudo-successes; 1.0 gives a uniform prior)
            prior_beta: Beta of the Beta prior (prior pseudo-failures; 1.0 gives a uniform prior)
        """
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
        self.results = {}

    def add_variant(self, name: str, conversions: int, trials: int):
        """Record data for a variant."""
        alpha = self.prior_alpha + conversions
        beta_param = self.prior_beta + (trials - conversions)
        self.results[name] = {
            'alpha': alpha,
            'beta': beta_param,
            'conversions': conversions,
            'trials': trials,
            'posterior': beta(alpha, beta_param)
        }

    def prob_b_beats_a(self, variant_a: str, variant_b: str, samples: int = 100000) -> float:
        """Estimate P(B > A) via Monte Carlo."""
        a = self.results[variant_a]['posterior']
        b = self.results[variant_b]['posterior']

        theta_a = a.rvs(samples)
        theta_b = b.rvs(samples)

        return (theta_b > theta_a).mean()

    def expected_loss(self, variant_a: str, variant_b: str, samples: int = 100000) -> Tuple[float, float]:
        """Expected loss for choosing each variant."""
        a = self.results[variant_a]['posterior']
        b = self.results[variant_b]['posterior']

        theta_a = a.rvs(samples)
        theta_b = b.rvs(samples)

        loss_a = np.maximum(0, theta_b - theta_a).mean()
        loss_b = np.maximum(0, theta_a - theta_b).mean()

        return loss_a, loss_b

    def credible_interval(self, variant: str, alpha: float = 0.05) -> Tuple[float, float]:
        """Posterior credible interval for conversion rate."""
        posterior = self.results[variant]['posterior']
        return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)

    def summary(self, variant_a: str, variant_b: str):
        """Print decision summary."""
        prob = self.prob_b_beats_a(variant_a, variant_b)
        loss_a, loss_b = self.expected_loss(variant_a, variant_b)

        ci_a = self.credible_interval(variant_a)
        ci_b = self.credible_interval(variant_b)

        print(f"Variant {variant_a}: {self.results[variant_a]['conversions']}/{self.results[variant_a]['trials']}")
        print(f"  Posterior mean: {self.results[variant_a]['posterior'].mean():.4f}")
        print(f"  95% CI: [{ci_a[0]:.4f}, {ci_a[1]:.4f}]\n")

        print(f"Variant {variant_b}: {self.results[variant_b]['conversions']}/{self.results[variant_b]['trials']}")
        print(f"  Posterior mean: {self.results[variant_b]['posterior'].mean():.4f}")
        print(f"  95% CI: [{ci_b[0]:.4f}, {ci_b[1]:.4f}]\n")

        print(f"P({variant_b} > {variant_a}): {prob:.4f}")
        print(f"Expected loss if choose {variant_a}: {loss_a:.5f}")
        print(f"Expected loss if choose {variant_b}: {loss_b:.5f}")

        if loss_a < 0.005 and loss_b < 0.005:
            print("\n⚠ Variants are practically equivalent (both losses < 0.5%).")
        elif loss_a < loss_b:
            print(f"\n✓ Choose {variant_a} (lower expected loss).")
        else:
            print(f"\n✓ Choose {variant_b} (lower expected loss).")

# Usage
test = BayesianABTest(prior_alpha=1, prior_beta=1)
test.add_variant('Control', conversions=47, trials=312)
test.add_variant('Treatment', conversions=61, trials=289)
test.summary('Control', 'Treatment')

Output (from one run; the Monte Carlo estimates vary slightly between runs):

Variant Control: 47/312
  Posterior mean: 0.1529
  95% CI: [0.1145, 0.1939]

Variant Treatment: 61/289
  Posterior mean: 0.2131
  95% CI: [0.1684, 0.2639]

P(Treatment > Control): 0.9726
Expected loss if choose Control: 0.06049
Expected loss if choose Treatment: 0.00030

✓ Choose Treatment (lower expected loss).

This gives you everything: point estimates, uncertainty, decision rationale. I’ve extended this class to handle multi-armed bandits (Thompson sampling) by adding a sample_best() method that draws one sample from each posterior and returns the argmax.
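
That extension isn't shown here, but a minimal sketch of such a sample_best() method, subclassing the class above, could look like this:

class ThompsonABTest(BayesianABTest):
    def sample_best(self) -> str:
        """Draw one sample from each variant's posterior and return the variant with the highest draw."""
        draws = {name: r['posterior'].rvs() for name, r in self.results.items()}
        return max(draws, key=draws.get)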

Why I Still Use Frequentist Tests Sometimes

Bayesian A/B testing isn’t always the right tool. If your org has regulatory requirements (FDA trials, financial compliance), frequentist methods have decades of legal precedent. Auditors understand “p < 0.05” more readily than “95% probability of superiority.”

Also, if your sample size is huge (n > 10,000 per variant), Bayesian and frequentist methods converge. The prior becomes irrelevant, and $P(B > A) \approx 1 - \text{p-value}/2$ for a two-sided test (when the observed effect favors B). At that scale, I'd just use a z-test—it's faster and doesn't require justifying a prior choice to stakeholders.

But for small-sample product experiments (n=50-500), Bayesian methods give you actionable probabilities from day one. The conjugacy of Beta-Binomial makes it computationally trivial. And expected loss provides a natural cost-benefit framing that p-values never do.

If you’re still using t-tests for conversion rates, try Beta-Binomial on your next experiment. You’ll never go back to waiting for p=0.049.
