"Bigger model" is not a scaling strategy. Scaling is a 3-knob system: parameters, tokens, and compute. If you crank only one knob, you hit a bottleneck and scaling looks dead.

1. Why This Matters

There are two types of scaling discourse. The first says: "just add parameters, bigger model equals bigger brain." The second says: "scaling is dead, gains are slowing, we hit the wall." Both are partially right, and both miss the same boring truth.

Scaling works when you scale the right things together. If you scale the wrong thing in isolation, you hit a bottleneck and it feels like the laws of physics stopped working.

This post is an intuition-first map of scaling laws for LLMs: the stuff that lets you predict returns from scale, and the stuff that explains why a giant model can still be weirdly mediocre.

2. The Power Law Picture

If you plot "how surprised the model is by text" (cross-entropy loss) against "scale" on a log-log plot, you often see something close to a straight line.

A straight line on a log-log plot means a power law. Each doubling of scale gives you a predictable improvement, but the improvement shrinks as you go. Diminishing returns are not a bug; they are the shape of the curve.
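
To see what "straight line on a log-log plot" buys you, here is a minimal sketch of fitting a power-law exponent. The loss numbers are invented for illustration, not from any real run.

```python
import numpy as np

# Hypothetical (scale, loss) measurements -- illustrative numbers only.
scale = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # e.g. parameter count
loss  = np.array([4.2, 3.5, 2.9, 2.4, 2.0])    # cross-entropy loss

# A power law L = a * scale^(-k) is a straight line in log-log space:
# log L = log a - k * log scale. Fitting the slope recovers the exponent k.
slope, intercept = np.polyfit(np.log(scale), np.log(loss), 1)
print(f"power-law exponent k ≈ {-slope:.3f}")
# Each 10x of scale multiplies loss by 10^(-k): predictable, but shrinking, gains.
```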

So yes, scaling works. But here is the part people forget: scale is not one number. There are three distinct axes you can push, and pushing only one leads to a wall.

3. The 3 Knobs

During pretraining, you have exactly three practical knobs:

| Knob | What it is | Analogy |
|---|---|---|
| Parameters ($N$) | Number of trainable weights in the model | Brain size |
| Tokens ($D$) | Amount of text the model reads during training | Books read |
| Compute ($C$) | Total budget: roughly $C \approx 6ND$ FLOPs | Time and hardware |

These three are not independent. Compute is roughly the product of how big the model is and how much data it processes: $C \approx 6ND$. So a fixed compute budget forces a tradeoff: bigger model with fewer tokens, or smaller model with more tokens.
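
A few lines make the accounting concrete. This is a minimal sketch assuming the standard ~6 FLOPs per parameter per token estimate for a forward-plus-backward pass; the budget and model sizes are made up.

```python
C = 1e23                        # total training FLOPs you can afford (made up)

for N in (7e9, 70e9, 280e9):    # candidate model sizes, in parameters
    D = C / (6 * N)             # tokens affordable at that size, from C ≈ 6ND
    print(f"N = {N/1e9:>4.0f}B params -> D ≈ {D/1e9:>5.0f}B tokens "
          f"({D/N:.2f} tokens/param)")
```

Same budget, three very different models: the 280B option gets so few tokens per parameter that it is almost guaranteed to be undertrained.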

[Figure: Pretraining is a 3-knob system. Parameters ($N$): model capacity, "brain size", risk: undertrained. Tokens ($D$): training data volume, "books read", risk: data exhaustion. Compute ($C$): total budget, $C \approx 6ND$ FLOPs, risk: misallocated. Loss ↓ / capability ↑ improves smoothly with balanced scale; crank only one knob → bottleneck → "scaling is dead".]

The classic trap: scaling parameters without scaling tokens

A big model with not enough tokens is like hiring a PhD and giving them two blog posts to read. They might memorize those posts perfectly, but they will not develop taste or generalization.

When people say "scaling slowed," a lot of the time they mean: we scaled parameters faster than we scaled tokens. The model became data-limited, not scale-limited.

Intuition: Think of $N$ (parameters) as the size of a notebook and $D$ (tokens) as the lectures you attend. A huge notebook is useless if you only go to two lectures. A small notebook fills up fast even with great lectures. Optimal learning requires matching notebook size to lecture count.

4. The Math (Kept Simple)

Kaplan et al. (2020) first characterized systematic power-law relationships between loss and scale. They showed that if you vary one axis at a time, keeping the other comfortably large, loss follows a clean power law in $N$ alone or $D$ alone.

A useful joint form, introduced by Hoffmann et al. (2022) (Chinchilla), decomposes loss into additive terms:

Scaling law (additive decomposition, Chinchilla):
$$L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}$$

Where:

- $E$ is an irreducible floor: the entropy of the text itself, which no model beats,
- $A/N^{\alpha}$ is the penalty for having too few parameters,
- $B/D^{\beta}$ is the penalty for seeing too few tokens,
- $A$, $B$, $\alpha$, $\beta$ are constants fitted empirically to a family of training runs.

Read the equation like this: there is a floor you cannot beat. More parameters help, but with diminishing returns. More tokens help, but with diminishing returns. If you fix one and scale the other, you eventually get stuck against one of the two power-law walls.

Intuition: The equation says loss is made of three parts: an unbeatable floor, a "model too small" tax, and a "not enough data" tax. Scaling is about reducing both taxes simultaneously. If you only reduce one, the other dominates and progress stalls.
Warning: The exponents $\alpha$ and $\beta$ are not universal constants. Kaplan's one-variable power laws report $\alpha_N \approx 0.076$ and $\alpha_D \approx 0.095$ for parameter-limited and data-limited scaling respectively. Chinchilla's fit of the additive decomposition gives $\alpha \approx 0.34$ and $\beta \approx 0.28$; the often-quoted exponents near 0.5 come from a different relationship in the same paper, namely how the optimal $N$ and $D$ grow with compute ($N_{\text{opt}} \propto C^{0.49}$, $D_{\text{opt}} \propto C^{0.51}$). These numbers differ because they belong to different functional forms fitted with different procedures. The qualitative story (two diminishing-returns terms plus a floor) is robust; the exact numbers are recipe-specific.
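
To play with the equation, here is a minimal sketch using the constants reported in Hoffmann et al.'s parametric fit ($E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$); remember these are recipe-specific, not universal.

```python
def chinchilla_loss(N: float, D: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Additive scaling law L(N, D) = E + A/N^alpha + B/D^beta.

    Defaults are the parametric fit reported in Hoffmann et al. (2022);
    treat them as recipe-specific, not universal.
    """
    return E + A / N**alpha + B / D**beta

# Fix the data at 300B tokens and grow the model 1000x: the "not enough
# data" tax (B/D^beta ≈ 0.25 here) never moves, so progress stalls.
for N in (1e9, 1e10, 1e11, 1e12):
    print(f"N = {N:.0e}, D = 3e11 -> L ≈ {chinchilla_loss(N, 3e11):.3f}")
```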

5. The Chinchilla Insight

Given a fixed compute budget $C$, how should you split it between parameters and tokens? This is the question Hoffmann et al. (2022) (the "Chinchilla" paper) answered.

Their headline finding, stated as a rule of thumb: if you double parameters, you should roughly double training tokens too. More precisely, for compute-optimal training, the number of tokens should scale linearly with parameters: $D_{\text{opt}} \propto N$.

The memorable demonstration: Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) at similar compute. A model 4x smaller, trained on ~5x more data, won.

| Model | Parameters | Tokens | Tokens/param | Result |
|---|---|---|---|---|
| Gopher | 280B | 300B | ~1 | Undertrained |
| Chinchilla | 70B | 1.4T | ~20 | Better loss at similar compute |

The lesson: many large models before Chinchilla were undertrained. They spent compute on parameters without feeding the model enough experience. You can waste compute by over-investing in model size.

Intuition: Chinchilla said: stop buying brains, buy books too. A moderately-sized model that reads a lot outperforms a giant model that reads a little, at the same total cost.
Key limitation: Chinchilla-optimal ratios minimize training compute for a given loss. But in practice, inference cost matters too. A smaller model is cheaper to serve. So production models are often intentionally "overtrained" relative to Chinchilla (trained on more tokens than compute-optimal) because the savings at inference time outweigh the extra training cost. Llama models are a prominent example of this strategy.
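
To make both points concrete, here is a minimal sketch, assuming the ~20 tokens-per-parameter rule of thumb and the $C \approx 6ND$ approximation. The budget is set near Chinchilla's own (~$5.9 \times 10^{23}$ FLOPs), and raising the tokens-per-parameter ratio models the deliberate overtraining described above.

```python
import math

def split_budget(C_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget under C ≈ 6ND with D = r*N, so C = 6r*N^2."""
    N = math.sqrt(C_flops / (6 * tokens_per_param))
    return N, tokens_per_param * N

C = 5.9e23   # roughly Chinchilla's training budget: 6 * 70e9 * 1.4e12
for r in (20, 100):   # 20 ~ compute-optimal; 100 ~ deliberately overtrained
    N, D = split_budget(C, tokens_per_param=r)
    print(f"r = {r:>3} tokens/param -> N ≈ {N/1e9:.0f}B params, "
          f"D ≈ {D/1e12:.1f}T tokens")
```

At $r = 20$ this recovers roughly Chinchilla's 70B/1.4T split; at $r = 100$ the same budget buys a ~31B model trained on ~3.1T tokens: worse training loss per FLOP, but much cheaper to serve.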

6. Beyond Pretraining Loss

The original scaling laws focused on pretraining loss. But the field has moved on. There are now scaling results for several other dimensions.

Data mixture scaling: which tokens matter

Not all tokens are equal. The mix of domains in your training data (code, math, web text, multilingual) shifts the loss curve. More code might help reasoning. More math text changes behavior. "More data" becomes "more data in the right proportions."

Apple (2025) showed that you can derive scaling laws for optimal data mixtures: given a fixed token budget, the optimal proportion of each domain depends on the model size and the target task distribution.
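
As a toy illustration of "same budget, different proportions" (the weights below are invented, not taken from the paper):

```python
# Two hypothetical mixtures over the same fixed token budget. The weights
# are invented for illustration; real optimal proportions depend on model
# size and the target task distribution.
mixtures = {
    "web-heavy":  {"web": 0.80, "code": 0.10, "math": 0.05, "multilingual": 0.05},
    "code-heavy": {"web": 0.50, "code": 0.35, "math": 0.10, "multilingual": 0.05},
}

budget = 1_000_000_000_000  # 1T tokens (stand-in number)
for name, mix in mixtures.items():
    tokens = {domain: int(budget * w) for domain, w in mix.items()}
    print(name, tokens)   # same total, very different training diets
```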

Multilingual scaling: transfer is not free

Adding languages is not just adding data. Languages interfere with and transfer to each other. Some pairs help (Spanish and Portuguese share structure), some fight (unrelated scripts can compete for capacity).

Google's ATLAS work (2026) provides practical scaling laws for multilingual models, showing that the return from adding a language depends on how much capacity the model has and how related the new language is to existing ones.

Test-time scaling: spend compute at inference

Sometimes you do not train a larger model. Instead, you let the model "think longer" at inference: sample multiple solutions, verify each, pick the best, or run a search process.

This is still scaling. It is scaling inference compute instead of training compute. And it follows its own power-law-like curves: more test-time compute yields better results, with diminishing returns.
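
Here is a toy best-of-$n$ sketch of the idea. Everything in it is a stand-in: candidate "quality" is a random draw and the verifier is assumed perfect, which is exactly what real systems do not have; the point is only the shape of the curve.

```python
import random

def generate_candidate(prompt: str) -> float:
    # Stand-in for sampling a solution from a model: returns a fake
    # "solution quality" drawn from a standard normal.
    return random.gauss(0.0, 1.0)

def best_of_n(prompt: str, n: int) -> float:
    # Sample n candidates, keep the one the (here: perfect) verifier likes best.
    return max(generate_candidate(prompt) for _ in range(n))

random.seed(0)
for n in (1, 4, 16, 64, 256):
    trials = [best_of_n("2+2=?", n) for _ in range(2000)]
    print(f"n = {n:>3}: mean best quality ≈ {sum(trials)/len(trials):.2f}")
# Quality climbs with n but flattens: the max of n Gaussian draws grows
# only like sqrt(2 ln n) -- diminishing returns on inference compute.
```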

[Figure: Compute budget, two places to spend. Training compute: bigger $N$, more $D$, longer training; moves the base model curve; one-time cost, amortized over queries. Inference compute: sample more, verify, search ("think longer"); per-query cost, applied selectively. Performance = f(train compute, test compute); both follow power-law-like diminishing returns.]

Reference: Chen et al. (2024), "Provable Scaling Laws for the Test-Time Compute of Large Language Models."

7. What Scaling Laws Cannot Promise

Scaling laws are good at predicting smooth, average quantities like cross-entropy loss. They are less reliable for the things people actually care about: accuracy on specific benchmarks, and the moment a particular capability shows up.

You can observe threshold-like jumps on some benchmarks as models scale. Wei et al. (2022) called these emergent abilities: capabilities that appear abruptly at a certain scale. But Schaeffer et al. (2023) showed that many apparent emergence effects are artifacts of the metric. Switch from a sharp metric (exact-match accuracy) to a smooth one (log-likelihood), and the "sudden jump" often becomes a gradual curve that was always there.
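
Schaeffer et al.'s point is easy to reproduce on paper. Suppose per-token accuracy improves smoothly with scale (the curve below is invented); a sharp metric over a 30-token answer then jumps "suddenly," while a smooth metric never does.

```python
import numpy as np

scale = np.logspace(6, 12, 7)          # 1e6 .. 1e12 "parameters" (invented)
p = 1 - scale ** -0.08                 # per-token accuracy: smooth power law

k = 30                                 # answer length in tokens
exact_match = p ** k                   # sharp metric: all k tokens must be right
answer_log_lik = k * np.log(p)         # smooth metric: answer log-likelihood

for s, em, ll in zip(scale, exact_match, answer_log_lik):
    print(f"scale {s:.0e}: exact-match = {em:8.5f}, log-likelihood = {ll:6.2f}")
# Exact-match sits near zero for small models, then "emerges"; the
# log-likelihood improves gradually the whole way. Same model, same data,
# different metric.
```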

Key limitation: Treat "emergent abilities" as an interesting observation, not a product roadmap. You cannot reliably predict when a specific capability will appear by extrapolating a loss curve. Scaling laws predict loss well. They predict specific task behavior poorly.

The other important caveat: scaling laws are local to your recipe. Change the data cleaning pipeline, the training objective, the architecture, the tokenizer, or the learning rate schedule, and you move the entire curve. The laws describe what happens within a fixed recipe, not across recipes.

8. Cheatsheet

If you are compute-limited (fixed GPU budget)

- Split the budget Chinchilla-style: roughly 20 training tokens per parameter ($D_{\text{opt}} \propto N$).
- Remember $C \approx 6ND$: at fixed compute, doubling parameters halves your tokens.
- Don't buy a giant, undertrained model; Gopher vs. Chinchilla is the cautionary tale.

If you are inference-limited (serving cost matters)

- Deliberately overtrain a smaller model past the Chinchilla-optimal point; serving savings usually repay the extra training compute (the Llama strategy).
- Consider spending compute at test time (sample, verify, search) instead of training a bigger model.

The one question that matters

The question is not "should we scale?" The question is:

Which bottleneck are we hitting, and which knob removes it cheapest?

References

  1. Kaplan et al. (2020). Scaling Laws for Neural Language Models.
  2. Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla).
  3. Apple ML Research (2025). Scaling Laws for Optimal Data Mixtures.
  4. Google Research (2026). ATLAS: Practical Scaling Laws for Multilingual Models.
  5. Chen et al. (2024). Provable Scaling Laws for the Test-Time Compute of Large Language Models.
  6. Wei et al. (2022). Emergent Abilities of Large Language Models.
  7. Schaeffer et al. (2023). Are Emergent Abilities of LLMs a Mirage?