Paper Review: Chinchilla (Training Compute-Optimal Large Language Models) — Why Most LLMs Were Too Big and Trained on Too Few Tokens

⚡ Key Takeaways
  • Chinchilla showed that parameters and training data should scale equally — roughly 20 tokens per parameter — overturning the previous belief that bigger models always win.
  • A 70B parameter Chinchilla model outperformed the 280B Gopher on 65 of 67 benchmarks using the same compute budget, proving data scaling was massively undervalued.
  • The earlier Kaplan et al. scaling laws were biased by not training smaller models to convergence, leading the field to build oversized, data-starved models for years.
  • The practical impact extends beyond training: smaller compute-optimal models are 4x cheaper to serve at inference, directly enabling models like LLaMA that run on consumer hardware.
  • The 20:1 ratio is a useful starting point but not universal — data quality, domain, and inference cost considerations can shift the optimal allocation significantly.

The Punchline That Rewrote LLM Training Budgets

Before Hoffmann et al. dropped this paper in March 2022, the prevailing wisdom was simple: bigger model, better results. GPT-3 had 175B parameters trained on 300B tokens. Gopher pushed to 280B parameters on 300B tokens. The scaling laws from Kaplan et al. (2020) at OpenAI suggested that when you have more compute, you should primarily scale model size while keeping data roughly constant.

Chinchilla proved that strategy was leaving massive performance on the table.

You can read the full paper on arXiv: arXiv:2203.15556.

The core finding: a 70B parameter model trained on 1.4 trillion tokens — roughly 4x fewer parameters than Gopher but 4.6x more data — outperformed Gopher on basically every benchmark. Same compute budget. Dramatically better results. That’s not a marginal improvement; it’s a “we’ve been doing this wrong” moment for the entire field.


What the Paper Actually Claims

The central thesis is deceptively simple. For a given compute budget C, there's an optimal ratio between model parameters N and training tokens D. The relationship the authors found is approximately:

N_{opt} \propto C^{0.50}, \quad D_{opt} \propto C^{0.50}

In plain terms: parameters and data should scale equally. If you double your compute, you should increase both model size and dataset size by roughly √2, not dump everything into parameters.

This directly contradicts the earlier Kaplan et al. scaling laws, which suggested N_opt ∝ C^0.73 — meaning compute should be allocated overwhelmingly toward bigger models. The practical gap between these two exponents is enormous. Under the Kaplan prescription, a 10x increase in compute means ~5.4x bigger model. Under Chinchilla's prescription, it means ~3.2x bigger model but also ~3.2x more data.
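A quick back-of-envelope check of that gap (the exponents are the approximate published values; the exact fits vary by approach):

kaplan_exponent = 0.73      # N_opt ~ C^0.73 (Kaplan et al., 2020)
chinchilla_exponent = 0.50  # N_opt ~ C^0.50 (Hoffmann et al., 2022)

compute_multiplier = 10
print(f"Kaplan:     model grows ~{compute_multiplier ** kaplan_exponent:.1f}x")        # ~5.4x
print(f"Chinchilla: model grows ~{compute_multiplier ** chinchilla_exponent:.1f}x, "
      f"data grows ~{compute_multiplier ** chinchilla_exponent:.1f}x")                 # ~3.2x each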

The loss prediction follows a power law:

L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E

where E is the irreducible loss (entropy of natural language), and A, B, α, β are fitted constants. The authors estimated α ≈ 0.34 and β ≈ 0.28, though these varied across their three estimation approaches. This functional form says that loss decreases as a power law in both model size and data size, with diminishing returns in each, and the optimal allocation minimizes L subject to the constraint C ≈ 6ND (the standard FLOPs approximation for transformer training).
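As a minimal sketch of what this functional form implies, here is the predicted loss for the Gopher-style and Chinchilla-style allocations, using the approximate constants from the paper's parametric fit (A ≈ 406.4, B ≈ 410.7, E ≈ 1.69). The exact constants differ across the estimation approaches, so treat the numbers as illustrative:

def chinchilla_loss(n_params, n_tokens,
                    A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28):
    """Predicted training loss L(N, D); constants are the paper's approximate fits."""
    return A / n_params**alpha + B / n_tokens**beta + E

# Two ways to spend roughly the same compute budget (C ~ 6*N*D):
print(chinchilla_loss(280e9, 300e9))   # Gopher-style:     280B params, 300B tokens -> ~1.99
print(chinchilla_loss(70e9, 1.4e12))   # Chinchilla-style:  70B params, 1.4T tokens -> ~1.94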

Three Approaches, One Answer

What makes this paper methodologically interesting — and honestly a bit unusual — is that the authors used three completely independent approaches to estimate optimal scaling, and they all converged on roughly the same answer.

Approach 1 fixed a range of model sizes and varied how long each one trained, with the learning rate schedule matched to each run's length. For every total FLOP count, they took the lowest loss achieved by any (model size, training duration) pair and fit power laws through that envelope of optimal points. Across their experiments they trained over 400 models, which makes this the most direct approach but also an expensive one.

Approach 2 fixed a set of compute budgets (from 6 × 10^18 to 3 × 10^21 FLOPs) and trained models of varying sizes at each budget. Plotting final loss against model size for a fixed budget traces out a parabola-shaped "IsoFLOP profile"; the minimum of each parabola is the compute-optimal model size for that budget, and fitting power laws through those minima gives the scaling prescription.

Approach 3 fit the full parametric function L(N, D) above to the losses from the first two approaches, then derived the optimal allocation analytically: minimizing L(N, D) subject to FLOPs(N, D) = C with a Lagrange multiplier gives N_opt and D_opt in closed form as functions of C.
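The exponents that fall out of that constrained minimization depend only on α and β. A small sketch of the algebra (substitute D = C/(6N) and set the derivative to zero, or use a Lagrange multiplier); with the fitted α ≈ 0.34 and β ≈ 0.28 the exponents land near 0.45 and 0.55, while the other two approaches put both closer to 0.50:

def allocation_exponents(alpha=0.34, beta=0.28):
    """Exponents a, b in N_opt ~ C^a and D_opt ~ C^b that come out of minimizing
    A/N^alpha + B/D^beta + E subject to C = 6*N*D."""
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    return a, b

a, b = allocation_exponents()
print(f"N_opt ~ C^{a:.2f}, D_opt ~ C^{b:.2f}")   # C^0.45 and C^0.55 with these fitted values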

All three approaches gave a tokens-to-parameters ratio somewhere around 20:1. The Chinchilla model itself used exactly this ratio: 70B parameters, 1.4T tokens, so D/N = 20.

Compare that to Gopher: 280B parameters, 300B tokens, giving D/N ≈ 1.07. Relative to its data budget, that model was roughly 20x too large; equivalently, it was undertrained by about 20x on data.
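Put side by side, using the parameter and token counts quoted above:

models = {
    "GPT-3":      (175e9, 300e9),
    "Gopher":     (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}
for name, (params, tokens) in models.items():
    print(f"{name:>10}: {tokens / params:5.1f} tokens per parameter")
# GPT-3 ~1.7, Gopher ~1.1, Chinchilla 20.0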

Why Kaplan Got It Wrong (Sort Of)

This is the part that I find most interesting, and where the paper gets a little tricky.

Kaplan et al. (2020) didn't make a mathematical error. Their scaling laws were fit to real data and the fits were good. The issue is more subtle: they used a fixed learning rate schedule regardless of how many tokens each run would actually see, so the decay was never matched to the training duration. A model whose schedule hasn't finished decaying looks worse than it would if the schedule had been tuned to end at that point, and this penalty falls hardest on configurations that spend compute on data rather than parameters, which biases the fitted scaling law toward preferring larger models.

The Chinchilla team used a cosine learning rate schedule tuned per run, with the learning rate decay aligned to the total training duration. This is a seemingly minor methodological choice that completely changes the optimal scaling prescription. It’s a good reminder that scaling laws aren’t fundamental physics — they’re empirical fits that depend heavily on training details.

I’m not entirely sure how sensitive the exact 20:1 ratio is to other hyperparameter choices like batch size scheduling, warmup length, or optimizer (they used AdamW throughout). The paper doesn’t extensively ablate these factors, and my best guess is that the ratio could shift meaningfully — maybe anywhere from 15:1 to 30:1 — depending on the training recipe. Later work from different groups has indeed suggested somewhat different optimal ratios.

The Benchmark Results That Matter

Here’s where Chinchilla flexes. Despite being 4x smaller than Gopher (70B vs 280B), it wins on 65 out of 67 evaluation tasks.

A few numbers that stand out:

Benchmark        Gopher (280B)   Chinchilla (70B)
MMLU (5-shot)    60.0%           67.6%
HellaSwag        79.2%           80.8%
LAMBADA          74.5%           77.4%
BoolQ            79.3%           83.7%

The MMLU jump from 60.0% to 67.6% is significant — that's 7.6 percentage points from a model that's 4x smaller and cheaper to serve at inference time. At publication, Chinchilla's 67.6% was the best reported MMLU result by a clear margin (comparisons against human performance on MMLU are contentious and depend heavily on which raters you sample).

But here’s the thing that should catch a practitioner’s eye: Chinchilla’s advantage is most pronounced on knowledge-heavy tasks (MMLU, BoolQ) and less dramatic on pure language modeling metrics. On raw bits-per-byte over held-out text, the gap narrows. This suggests that additional training data mostly helps with absorbing factual knowledge, which makes intuitive sense.


What This Means For Practitioners (The Compute Cost Angle)

The most immediate practical implication: inference cost.

A Chinchilla-optimal 70B model gives you better quality than a 280B model, and serving a 70B model costs roughly 4x less in memory and FLOPs per token. That’s a massive win for anyone deploying these models in production. The training compute is the same either way, but you’re buying cheaper inference forever after.

This is probably why the paper had such an outsized impact on the field. It wasn’t just a theoretical observation — it directly told every lab “your deployed models are bigger than they need to be, and you could get better results with smaller, longer-trained models.”

The LLaMA family from Meta (Touvron et al., 2023) arguably took this lesson to heart, training 7B/13B/33B/65B models on 1.0–1.4T tokens. LLaMA-13B, a model you can run on a single GPU, outperformed GPT-3 (175B) on most benchmarks. Without the Chinchilla paper establishing that data scaling matters as much as parameter scaling, I doubt we’d have seen that particular set of design choices.

The Data Wall Problem Nobody Wanted to Talk About

There’s an uncomfortable corollary to Chinchilla’s findings. If the optimal ratio is ~20 tokens per parameter, then training a 1 trillion parameter model “optimally” requires 20 trillion tokens of training data. For context, the total size of the publicly crawlable English web is estimated at somewhere around 5-10 trillion tokens (depending on how you count and filter).

We’re already running into data constraints for the largest models. This has driven a lot of the interest in synthetic data generation, multi-epoch training (training on the same data multiple times), and data quality filtering as a way to get more effective tokens out of the same raw data. The paper briefly acknowledges this but doesn’t really engage with it — which is fair, since it wasn’t their research question, but it’s the elephant in the room that their result creates.

Some more recent work (Muennighoff et al., 2023) has looked at multi-epoch training and found that repeating data up to ~4 epochs barely hurts, but beyond that, returns diminish. That’s a partial relief but doesn’t fundamentally solve the problem for frontier-scale models.

Implementation Details That Would Trip You Up

If you actually wanted to reproduce or apply Chinchilla scaling, here are a few things worth knowing.

The FLOPs ≈ 6ND approximation is just that — an approximation. It counts the forward and backward pass through the dense layers but ignores embedding layers, attention computation costs (which scale as O(s²·d) for sequence length s and model dimension d), and overhead from things like layer norm and residual connections. For large enough models with reasonable sequence lengths (~2048 tokens as used in the paper), the approximation is pretty good. But if you're working with long-context models (32k+ tokens), the attention FLOPs become non-negligible and the 6ND estimate breaks down.
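Here's a rough sketch of where that breakdown starts, using the common per-token estimate of ~2N forward FLOPs for the parameter matmuls plus ~2·n_layers·s·d_model for the attention score/value products, tripled for forward plus backward. The layer count and model dimension below are assumed to be roughly Chinchilla-70B-shaped, and all the constants are approximate:

def training_flops_per_token(n_params, n_layers, d_model, seq_len):
    """Per-token training FLOPs: dense matmul term (~6N) plus an attention term
    (~6 * n_layers * seq_len * d_model). Both constants are approximate."""
    return 6 * n_params, 6 * n_layers * seq_len * d_model

for seq_len in (2048, 32768):
    dense, attn = training_flops_per_token(70e9, n_layers=80, d_model=8192, seq_len=seq_len)
    print(f"s = {seq_len:>5}: attention adds ~{100 * attn / dense:.0f}% on top of 6N")
# ~2% extra at 2k context, ~31% at 32k -- where the plain 6ND estimate stops being a good one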

The learning rate schedule matters enormously. The paper uses a cosine schedule with warmup, and they tune the peak learning rate per model size. Here’s the kind of thing you’d have to get right:

import math

def chinchilla_lr_schedule(step, total_steps, peak_lr, warmup_steps=2000):
    """Cosine decay LR schedule similar to what Chinchilla used."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # cosine decay to 10% of peak (not zero)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = peak_lr * 0.1  # they decayed to ~10x lower
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# For a 70B model, the peak LR was around 1e-4
# For smaller models (1B range), it was closer to 2e-4
# Getting this wrong by 2x can meaningfully shift your scaling curves

for step in [0, 1000, 50000, 100000, 149000]:
    lr = chinchilla_lr_schedule(step, total_steps=150000, peak_lr=1e-4)
    print(f"Step {step:>7d}: LR = {lr:.2e}")

Output:

Step       0: LR = 0.00e+00
Step    1000: LR = 5.00e-05
Step   50000: LR = 7.86e-05
Step  100000: LR = 3.31e-05
Step  149000: LR = 1.00e-05

The final learning rate decays to about 10% of peak, not all the way to zero. This is a detail that’s easy to miss if you’re just reading the abstract.

Another practical consideration: the optimal ratio of ~20 tokens per parameter assumes you’re training from scratch with random initialization. If you’re doing continued pre-training or fine-tuning, the picture changes completely. LoRA and similar parameter-efficient methods (as I covered in a previous review on this blog) operate in a totally different regime where you might train on only a few billion tokens with adapters that are 0.1% of the model size.

The Ablation I Wish They’d Done

The paper’s three estimation approaches are thorough, but there’s one ablation I really wanted to see: how does the optimal ratio change across different data distributions?

All their experiments used MassiveText, DeepMind’s curated web corpus. But what if you’re training on code-heavy data? Or scientific papers? Or multilingual text? The information density per token varies hugely across these domains. Code tokens are arguably more information-dense than conversational web text, which might mean you need fewer tokens per parameter for a code model. Conversely, multilingual data might require more tokens because the model needs to learn multiple languages’ worth of patterns.

I haven't seen definitive answers on this, though the StarCoder models (Li et al., 2023) and Code LLaMA suggest that the token-to-parameter ratios that work well for code don't line up neatly with what Chinchilla would predict for general text. Take that with a grain of salt — those models also have different tokenizers, vocabulary sizes, and training objectives that complicate any direct comparison.

Where the Field Has Gone Since

Chinchilla was published in early 2022, and the field has already moved in directions the paper didn’t fully anticipate.

First, the "inference cost matters more than training cost" argument has grown stronger, not weaker. For a model that serves millions of users, inference FLOPs dominate total lifetime compute by orders of magnitude. This actually pushes the optimal size even smaller than Chinchilla predicts, because you want to "overtrain" a smaller model (train it for longer than compute-optimal) to get a model that's cheap to serve. LLaMA 2 explicitly does this — Meta trained their 70B model on 2T tokens (D/N ≈ 29), intentionally overshooting the Chinchilla-optimal data amount because the resulting model is better per inference FLOP.

\text{Total lifetime cost} = C_{\text{train}} + C_{\text{inference}} \times Q

where Q is the total number of queries served. As Q → ∞, minimizing C_inference dominates, and you want the smallest model that hits your quality bar — trained on as much data as possible.
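A back-of-envelope version of that trade-off, assuming the usual ~2N FLOPs per token generated at inference and ~6ND for training (both rough approximations):

def lifetime_flops(n_params, train_tokens, served_tokens):
    """Rough lifetime compute: training (~6*N*D) plus inference (~2*N per served token)."""
    return 6 * n_params * train_tokens, 2 * n_params * served_tokens

# Chinchilla-style 70B model, 1.4T training tokens
train, serve = lifetime_flops(70e9, 1.4e12, served_tokens=5e12)
print(f"training: {train:.1e} FLOPs, serving 5T tokens: {serve:.1e} FLOPs")
# Training costs ~5.9e23 FLOPs; inference passes that after ~4.2T served tokens (3x the
# training set) and keeps growing linearly with N, hence the pull toward smaller models.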

Second, mixture-of-experts models (like the Switch Transformer approach covered in a previous review on this blog) partially sidestep the Chinchilla scaling question. MoE models have more total parameters but only activate a fraction per token, which means the effective parameter count for scaling purposes is ambiguous. How do you count N for a model with 1.6T total parameters but only 128B active per forward pass?

And third, data quality has emerged as potentially more important than raw data quantity. The Phi series from Microsoft showed that carefully curated "textbook-quality" data can make small models (1.3B parameters) punch dramatically above their weight class. This doesn't invalidate Chinchilla — it just says that not all tokens are created equal, whereas the scaling laws implicitly treat every token as equivalent.

Should You Care About This Paper in 2026?

Absolutely, but with caveats.

The directional insight — data scaling matters as much as parameter scaling — has held up completely and reshaped how every major lab trains models. If you’re training your own models (even small ones), the 20:1 ratio is a solid starting heuristic for your data budget. Training a 1B model? Aim for at least 20B tokens. Training a 7B model? You want 100B+ tokens.
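As a starting-point calculator only (both the 20:1 ratio and the 6ND FLOPs estimate are approximations):

def chinchilla_data_budget(n_params, tokens_per_param=20):
    """Back-of-envelope data and compute budget from the ~20 tokens-per-parameter heuristic."""
    tokens = tokens_per_param * n_params
    return tokens, 6 * n_params * tokens

for n in (1e9, 7e9, 70e9):
    tokens, flops = chinchilla_data_budget(n)
    print(f"{n / 1e9:>4.0f}B params -> ~{tokens / 1e9:,.0f}B tokens, ~{flops:.1e} training FLOPs")
# 1B -> 20B tokens, 7B -> 140B tokens, 70B -> 1.4T tokens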

But don’t treat the exact numbers as gospel. The 20:1 ratio was derived under specific conditions (single-epoch training, MassiveText distribution, cosine LR, specific optimizer settings) and the true optimal ratio for your setup could be anywhere from 15:1 to 50:1 depending on data quality, domain, and whether you prioritize training efficiency or inference efficiency.

What I’m still curious about: how do these scaling laws interact with post-training alignment? A model that’s compute-optimal for raw language modeling loss might not be compute-optimal for downstream helpfulness after RLHF or DPO. Nobody has convincingly shown whether Chinchilla-optimal pre-training is also Chinchilla-optimal for the final aligned model. My gut says the answer is “close enough,” but I’d love to see the actual data.

References

  • Hoffmann, J., Borgeaud, S., Mensch, A., et al. “Training Compute-Optimal Large Language Models.” NeurIPS 2022. arXiv:2203.15556
  • Kaplan, J., McCandlish, S., Henighan, T., et al. “Scaling Laws for Neural Language Models.” 2020. arXiv:2001.08361
  • Touvron, H., Lavril, T., Izacard, G., et al. "LLaMA: Open and Efficient Foundation Language Models." 2023. arXiv:2302.13971
  • Muennighoff, N., Rush, A., Barak, B., et al. “Scaling Data-Constrained Language Models.” NeurIPS 2023. arXiv:2305.16264
