Scaling Laws for LLMs: The 3 Knobs You Actually Have
"Bigger model" is not a scaling strategy. Scaling is a 3-knob system: parameters, tokens, and compute. If you crank only one knob, you hit a bottleneck and scaling looks dead.
1. Why This Matters
There are two types of scaling discourse. The first says: "just add parameters, bigger model equals bigger brain." The second says: "scaling is dead, gains are slowing, we hit the wall." Both are partially right, and both miss the same boring truth.
Scaling works when you scale the right things together. If you scale the wrong thing in isolation, you hit a bottleneck and it feels like the laws of physics stopped working.
This post is an intuition-first map of scaling laws for LLMs: the stuff that lets you predict returns from scale, and the stuff that explains why a giant model can still be weirdly mediocre.
2. The Power Law Picture
If you plot "how surprised the model is by text" (cross-entropy loss) against "scale" on a log-log plot, you often see something close to a straight line.
A straight line on a log-log plot means a power law. Each doubling of scale gives you a predictable improvement, but the improvement shrinks as you go. Diminishing returns are not a bug; they are the shape of the curve.
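A toy example makes the log-log picture concrete. The constants below are invented for illustration, not fitted to any real model:

```python
import math

# A toy power law: loss(x) = a * x**(-b).
# In log space, log(loss) = log(a) - b*log(x): a straight line with slope -b.

def toy_loss(x, a=10.0, b=0.3):
    return a * x ** (-b)

# Each doubling multiplies loss by 2**(-b) ~ 0.81, so the *absolute*
# improvement shrinks as x grows: diminishing returns by construction.
for x in [1e6, 2e6, 4e6, 8e6]:
    print(f"scale={x:.0e}  loss={toy_loss(x):.4f}")

# Check linearity in log space: the slope between successive doublings
# is constant and equal to -b.
slopes = [
    (math.log(toy_loss(2 * x)) - math.log(toy_loss(x))) / math.log(2)
    for x in [1e6, 2e6, 4e6]
]
```

Every entry in `slopes` comes out to exactly $-b$, which is what "straight line on a log-log plot" means operationally.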
So yes, scaling works. But here is the part people forget: scale is not one number. There are three axes you can push, and pushing only one leads to a wall.
3. The 3 Knobs
During pretraining, you have exactly three practical knobs:
| Knob | What it is | Analogy |
|---|---|---|
| Parameters ($N$) | Number of trainable weights in the model | Brain size |
| Tokens ($D$) | Amount of text the model reads during training | Books read |
| Compute ($C$) | Total budget: roughly $C \approx 6ND$ FLOPs | Time and hardware |
These three are not independent. Compute is roughly the product of how big the model is and how much data it processes: $C \approx 6ND$. So a fixed compute budget forces a tradeoff: bigger model with fewer tokens, or smaller model with more tokens.
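The arithmetic of that tradeoff is worth seeing once. A minimal sketch using the $C \approx 6ND$ approximation, with a hypothetical FLOP budget:

```python
# With C ~= 6*N*D, fixing the compute budget forces a tradeoff:
# choose a parameter count N, and the affordable token count D follows.

def affordable_tokens(compute_flops: float, n_params: float) -> float:
    """Tokens you can afford at a fixed budget, from C ~= 6*N*D."""
    return compute_flops / (6 * n_params)

budget = 1e23  # FLOPs; an illustrative number, not any real training run

for n in [10e9, 70e9, 280e9]:
    d = affordable_tokens(budget, n)
    print(f"N={n/1e9:>5.0f}B params -> D={d/1e12:.2f}T tokens")
# Quadrupling parameters at fixed compute cuts training tokens to a quarter.
```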
The classic trap: scaling parameters without scaling tokens
A big model with not enough tokens is like hiring a PhD and giving them two blog posts to read. They might memorize those posts perfectly, but they will not develop taste or generalization.
When people say "scaling slowed," a lot of the time they mean: we scaled parameters faster than we scaled tokens. The model became data-limited, not scale-limited.
4. The Math (Kept Simple)
Kaplan et al. (2020) first characterized systematic power-law relationships between loss and scale. They showed that when you vary one axis at a time (holding the others large enough not to be the bottleneck), loss follows a clean power law in $N$ alone or in $D$ alone.
A useful joint form, introduced by Hoffmann et al. (2022) (Chinchilla), decomposes loss into additive terms:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$
Where:
- $E$ is the irreducible loss: the floor you cannot beat with any amount of scale (entropy of the data itself).
- $N$ is the number of parameters and $\alpha$ controls how fast bigger models help.
- $D$ is the number of training tokens and $\beta$ controls how fast more data helps.
- $A, B$ are constants that depend on the architecture, data distribution, and training recipe.
Read the equation like this: there is a floor you cannot beat. More parameters help, but with diminishing returns. More tokens help, but with diminishing returns. If you fix one and scale the other, you eventually get stuck against one of the two power-law walls.
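The equation is easy to play with directly. The constants below are roughly the fit reported in the Chinchilla paper, but treat them as illustrative; real values depend on the data and training recipe:

```python
# The additive form: L(N, D) = E + A/N**alpha + B/D**beta.
# Constants roughly follow the Hoffmann et al. (2022) fit; illustrative only.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling only parameters while tokens stay fixed runs into the data wall:
print(f"1B params,   1B tokens: {loss(1e9, 1e9):.2f}")
print(f"100B params, 1B tokens: {loss(1e11, 1e9):.2f}   # barely better")
print(f"100B params, 1T tokens: {loss(1e11, 1e12):.2f}  # scale both")
```

The 100x jump in parameters barely moves the loss while the data term dominates; scaling tokens alongside parameters is what unlocks the gain.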
5. The Chinchilla Insight
Given a fixed compute budget $C$, how should you split it between parameters and tokens? This is the question Hoffmann et al. (2022) (the "Chinchilla" paper) answered.
Their headline finding, stated as a rule of thumb: if you double parameters, you should roughly double training tokens too. More precisely, for compute-optimal training, the number of tokens should scale linearly with parameters: $D_{\text{opt}} \propto N$.
The memorable demonstration: Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) at similar compute. A model 4x smaller, trained on ~5x more data, won.
| Model | Parameters | Tokens | Tokens / Param ratio | Result |
|---|---|---|---|---|
| Gopher | 280B | 300B | ~1 | Undertrained |
| Chinchilla | 70B | 1.4T | ~20 | Better loss at similar compute |
The lesson: many large models before Chinchilla were undertrained. They spent compute on parameters without feeding the model enough experience. You can waste compute by over-investing in model size.
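The $D \propto N$ rule turns into a back-of-envelope sizing formula: substituting $D = 20N$ into $C \approx 6ND$ gives $C \approx 120N^2$. A sketch, assuming the rough 20 tokens-per-parameter ratio:

```python
import math

# Compute-optimal sizing under the D ~= 20*N rule of thumb.
# From C ~= 6*N*D and D = r*N, we get C ~= 6*r*N**2, so N = sqrt(C / (6*r)).

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    n = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n, tokens_per_param * n

# Chinchilla's budget was roughly 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs.
n_opt, d_opt = chinchilla_optimal(5.9e23)
print(f"N ~= {n_opt/1e9:.0f}B params, D ~= {d_opt/1e12:.2f}T tokens")
```

Plugging in Chinchilla's approximate budget recovers numbers close to its actual configuration (~70B parameters, ~1.4T tokens), which is a useful sanity check on the rule of thumb.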
6. Beyond Pretraining Loss
The original scaling laws focused on pretraining loss. But the field has moved on. There are now scaling results for several other dimensions.
Data mixture scaling: which tokens matter
Not all tokens are equal. The mix of domains in your training data (code, math, web text, multilingual) shifts the loss curve. More code might help reasoning. More math text changes behavior. "More data" becomes "more data in the right proportions."
Apple (2025) showed that you can derive scaling laws for optimal data mixtures: given a fixed token budget, the optimal proportion of each domain depends on the model size and the target task distribution.
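To make the idea concrete, here is a deliberately simplified sketch of mixture selection: model each domain's contribution as its own power law in allocated tokens, then search over proportions. The per-domain constants and task weights are invented for illustration; this is not Apple's fitting procedure, just the shape of the problem:

```python
import itertools

# Hypothetical per-domain power-law constants (B, beta) and task weights.
# All numbers invented for illustration.
DOMAINS = {
    "web":  (400.0, 0.28, 0.5),
    "code": (300.0, 0.30, 0.3),
    "math": (250.0, 0.32, 0.2),
}

def expected_loss(props, total_tokens=1e12):
    """Task-weighted loss when domain i gets props[i] of the token budget."""
    total = 0.0
    for (b, beta, weight), p in zip(DOMAINS.values(), props):
        tokens = max(p, 1e-9) * total_tokens  # avoid a zero allocation
        total += weight * b / tokens**beta
    return total

# Grid-search mixtures summing to 1 in steps of 0.1.
grid = [g / 10 for g in range(11)]
best = min(
    (p for p in itertools.product(grid, repeat=3) if abs(sum(p) - 1) < 1e-9),
    key=expected_loss,
)
print("best mixture (web, code, math):", best)
```

The point of the exercise: the optimal proportions fall out of the interaction between budget, curve steepness, and task weights, which is why they shift with model size and target distribution.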
Multilingual scaling: transfer is not free
Adding languages is not just adding data. Languages interfere with and transfer to each other. Some pairs help (Spanish and Portuguese share structure), some fight (unrelated scripts can compete for capacity).
Google's ATLAS work (2026) provides practical scaling laws for multilingual models, showing that the return from adding a language depends on how much capacity the model has and how related the new language is to existing ones.
Test-time scaling: spend compute at inference
Sometimes you do not train a larger model. Instead, you let the model "think longer" at inference: sample multiple solutions, verify each, pick the best, or run a search process.
This is still scaling. It is scaling inference compute instead of training compute. And it follows its own power-law-like curves: more test-time compute yields better results, with diminishing returns.
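The simplest form is best-of-n sampling. The sketch below uses stand-ins for both the model and the verifier (uniform random "answers", identity scoring), just to show the mechanism and its diminishing returns:

```python
import random

# Best-of-n: sample n candidate answers, score each with a verifier,
# keep the best. Sampler and verifier here are stand-ins, not a real model.

random.seed(0)

def sample_answer() -> float:
    """Stand-in for one stochastic model generation (higher = better)."""
    return random.random()

def verify(answer: float) -> float:
    """Stand-in verifier score; trivially the answer itself here."""
    return answer

def best_of_n(n: int) -> float:
    return max((sample_answer() for _ in range(n)), key=verify)

# With uniform scores, E[best of n] = n / (n + 1): each doubling of
# inference compute buys a smaller improvement than the last.
for n in [1, 2, 4, 8, 16]:
    print(n, round(best_of_n(n), 3))
```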
Reference: Chen et al. (2024), "Provable Scaling Laws for the Test-Time Compute of Large Language Models."
7. What Scaling Laws Cannot Promise
Scaling laws are good at predicting smooth, average quantities like cross-entropy loss. They are less reliable for the things people actually care about:
- "Will it suddenly learn to do task X?" (emergence)
- "Will it stop hallucinating?" (factuality)
- "Will it become aligned?" (safety)
You can observe threshold-like jumps on some benchmarks as models scale. Wei et al. (2022) called these emergent abilities: capabilities that appear abruptly at a certain scale. But Schaeffer et al. (2023) showed that many apparent emergence effects are artifacts of the metric. Switch from a sharp metric (exact-match accuracy) to a smooth one (log-likelihood), and the "sudden jump" often becomes a gradual curve that was always there.
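A toy model makes the metric-artifact argument tangible. Suppose per-token accuracy $p$ improves smoothly with scale (the curve below is invented). Exact match on a $k$-token answer is $p^k$, which hugs zero and then "jumps", while per-token log-likelihood improves gradually the whole time:

```python
import math

K = 20  # tokens that must ALL be correct for an exact match

def per_token_acc(scale: float) -> float:
    # A smooth, saturating curve in scale; purely illustrative.
    return 1.0 - 0.5 * scale ** -0.15

for scale in [1e2, 1e4, 1e6, 1e8]:
    p = per_token_acc(scale)
    print(f"scale={scale:.0e}  p={p:.3f}  "
          f"exact_match={p**K:.3f}  log_lik={math.log(p):.4f}")
```

Same underlying improvement, two very different-looking curves: the "emergent" jump lives in the metric, not the model.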
The other important caveat: scaling laws are local to your recipe. Change the data cleaning pipeline, the training objective, the architecture, the tokenizer, or the learning rate schedule, and you move the entire curve. The laws describe what happens within a fixed recipe, not across recipes.
8. Cheatsheet
If you are compute-limited (fixed GPU budget)
- Do not overspend on parameters if you cannot afford enough tokens to train them properly.
- Invest in data quality and diversity earlier than your instincts want.
- Expect diminishing returns. Plan improvements as a curve, not a step function.
- Use the Chinchilla ratio ($D \approx 20N$) as a starting point, then adjust for your serving constraints.
If you are inference-limited (serving cost matters)
- A smaller, well-trained model often wins on price-performance versus a larger, undertrained one.
- Consider overtraining: spend more on training compute now to get a smaller model that is cheaper to serve forever.
- Apply test-time scaling selectively. Spend extra inference compute on hard queries, not easy ones.
The one question that matters
The question is not "should we scale?" The question is: which of the three knobs is your current bottleneck, and can you afford to turn it?
References
- Kaplan et al. (2020). Scaling Laws for Neural Language Models.
- Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla).
- Apple ML Research (2025). Scaling Laws for Optimal Data Mixtures.
- Google Research (2026). ATLAS: Practical Scaling Laws for Multilingual Models.
- Chen et al. (2024). Provable Scaling Laws for the Test-Time Compute of Large Language Models.
- Wei et al. (2022). Emergent Abilities of Large Language Models.
- Schaeffer et al. (2023). Are Emergent Abilities of LLMs a Mirage?