Test-Time Scaling: Spend Compute When It Matters
Training scaling has a well-known playbook: more parameters, more tokens, more compute. But there is a second scaling axis that works at inference time. You generate more, verify harder, and think longer. This post explains what that looks like, when it works, and where it breaks.
1. Why Test-Time Scaling
Traditional scaling invests compute at training time. You spend once, then serve the resulting model cheaply forever. The scaling laws post covered this: parameters, tokens, and compute form a 3-knob system with predictable power-law returns.
But training compute has two problems. First, it is a one-time decision. Once you have trained a 70B model, every query gets the same amount of "thought," whether it is "what is 2+2" or "prove this theorem." Second, training compute is amortized: the cost is justified only if you serve enough queries.
Test-time scaling flips this. Instead of a fixed model answering every question with the same effort, you spend more compute on harder questions and less on easy ones. The compute is per-query, applied selectively.
The headline result from Snell et al. (2024): with compute-optimal test-time allocation, a small model can match a 14x larger model under the same FLOP budget, on problems where the small model has non-trivial baseline accuracy.
2. The Big Picture
There are two paradigms for spending compute at inference. They are complementary, not competing.
Parallel scaling generates many independent solutions and picks the best one. Sequential scaling generates a single, longer solution with extended reasoning. In practice, you can combine both: generate multiple long-thinking attempts, then select the best.
3. Parallel Scaling: Sample and Select
The simplest form of test-time scaling: run the model $N$ times on the same prompt, collect $N$ candidate answers, pick the best one. This is best-of-N sampling.
The question is how you pick. There are several strategies, ranging from cheap to expensive.
Selection strategies
| Strategy | How it works | Requires |
|---|---|---|
| Majority voting | Take the most common final answer | Nothing extra |
| Outcome reward model (ORM) | Score each final answer with a learned verifier | Trained reward model |
| Process reward model (PRM) | Score each reasoning step, select the solution with the highest step-level scores | Step-level reward model |
| Tournament (pairwise) | Have the LLM compare pairs of answers in rounds, like a knockout bracket | Nothing extra (just the LLM) |
| Execution-based | Run code against test cases, check formal proofs | External verifier (test suite, proof checker) |
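The cheapest strategy, majority voting, fits in a few lines. Here is a minimal sketch, where `generate(prompt)` is a hypothetical sampling function (temperature above zero, so repeated calls differ) and the toy model stands in for a real LLM:

```python
import random
from collections import Counter

def best_of_n_majority(generate, prompt, n=16):
    """Best-of-N via majority voting: sample n answers independently
    and return the most common final answer plus its vote share."""
    answers = [generate(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Toy stand-in for a stochastic model: answers "4" 70% of the time.
rng = random.Random(0)
toy_model = lambda prompt: rng.choices(["4", "5"], weights=[0.7, 0.3])[0]
answer, share = best_of_n_majority(toy_model, "What is 2+2?", n=50)
```

With 70% per-sample accuracy, the majority over 50 samples is correct with overwhelming probability; that aggregation effect is the entire mechanism behind majority voting.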
How coverage scales
Brown et al. (2024) ("Large Language Monkeys") studied this systematically. They define coverage as the fraction of problems for which at least one of $k$ samples is correct. This is pass@k: the probability that you solve a problem if you get $k$ attempts and a perfect verifier.
Their key finding: coverage increases roughly log-linearly with the number of samples, and the curve can be modeled as an exponentiated power law. The relationship holds across four orders of magnitude in sample count.
The concrete example is striking. On SWE-bench Lite (a coding benchmark with automatic unit-test verification), DeepSeek-Coder-V2-Instruct went from 15.9% with a single sample to 56% with 250 samples. That is a 3.5x improvement from pure repeated sampling, beating the then-best single-sample systems.
But there is a catch: that 56% relies on benchmark unit tests to identify correct solutions. While these are automatic verifiers, they are imperfect: Brown et al. document flaky tests and false positives/negatives. Real-world selection methods like majority voting and reward models plateau even sooner, after several hundred samples. The gap between what automatic verification can extract and what a truly perfect oracle would extract is the verification gap.
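Given a larger pool of $n$ samples of which $c$ pass the verifier, pass@k can be estimated without bias using the standard combinatorial estimator from code-generation evaluations (this is the general estimator, not code from Brown et al.; the example numbers are illustrative):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k: the probability that at least one of
    k samples drawn without replacement from n generations is correct,
    given that c of the n are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(250, 40, 1))   # single-sample accuracy: 40/250 = 0.16
print(pass_at_k(250, 40, 10))  # coverage with 10 attempts is far higher
```

The estimator averages over all size-$k$ subsets of the $n$ samples, which is why it is preferred over naively splitting the pool into $n/k$ batches.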
4. Sequential Scaling: Think Longer
Instead of generating many short answers, generate one long answer with extended reasoning. This is what OpenAI's o1/o3 and DeepSeek-R1 do: the model produces a chain of internal "thinking" tokens before giving its final answer.
There are three ways to get a model to think longer.
4.1 RL-trained reasoning (DeepSeek-R1, OpenAI o1)
DeepSeek-R1-Zero demonstrated that pure reinforcement learning, without human-labeled reasoning trajectories, can teach a model to produce extended reasoning chains. The model learns to self-reflect, verify intermediate steps, and adapt its strategy during generation. The full DeepSeek-R1 builds on this by adding cold-start data and multiple SFT/RL stages to improve readability and consistency.
The training works like this: the model generates a reasoning trace plus a final answer. A reward signal (correct/incorrect) propagates back through the trace via RL. Over time, the model discovers that longer, more careful reasoning leads to higher rewards on hard problems. Behaviors like "wait, let me check that step" and "I made an error, let me reconsider" emerge naturally.
DeepSeek-R1 achieves near-parity with OpenAI o1 on AIME and MATH benchmarks, though results are mixed on other domains (R1 trails o1-1217 on GPQA and some code benchmarks like Aider). Notably, its reasoning patterns can be distilled to smaller models, making extended thinking accessible without full-scale RL training.
4.2 Budget forcing (s1)
Muennighoff et al. (2025) showed a remarkably simple approach. Take a base model (Qwen2.5-32B-Instruct), fine-tune it on just 1,000 curated reasoning examples (the s1K dataset), then control thinking at inference time with two tricks:
- Truncation: forcefully stop the model's thinking when it tries to conclude too early.
- Extension: append the word "Wait" to the generation, forcing the model to reconsider and often double-check its answer.
This "budget forcing" approach improved AIME24 scores from 50% to 57% and exceeded o1-preview by up to 27% on competition math. The training cost was minimal: 1,000 examples, standard supervised fine-tuning, no RL.
4.3 Process reward models + tree search
Instead of generating one long chain, you can search over partial reasoning chains. At each step, generate several candidate continuations, score them with a process reward model (PRM), and expand only the most promising paths.
Snell et al. (2024) showed this is more compute-efficient than pure best-of-N because it prunes bad reasoning paths early rather than completing them fully. Separately, Lightman et al. (2023) demonstrated that process supervision (step-level feedback) significantly outperforms outcome supervision (final-answer-only feedback) for math reasoning, achieving 78% on MATH with a process-supervised verifier.
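A minimal version of this search can be sketched as beam search over partial chains. Both `expand` (sampling candidate next steps from the LLM) and `prm_score` (a trained process reward model) are hypothetical stand-ins:

```python
import heapq

def prm_beam_search(expand, prm_score, prompt, beam_width=4, branch=4, depth=8):
    """PRM-guided beam search over partial reasoning chains: propose
    `branch` continuations per beam, score each with the PRM, and keep
    only the `beam_width` highest-scoring paths at every depth."""
    beams = [(0.0, prompt)]  # (negative cumulative score, partial chain)
    for _ in range(depth):
        candidates = []
        for neg_score, chain in beams:
            for step in expand(chain, branch):
                new_chain = chain + step
                candidates.append((neg_score - prm_score(new_chain), new_chain))
        beams = heapq.nsmallest(beam_width, candidates)  # prune weak paths early
    return min(beams)[1]  # highest-scoring completed chain

# Toy stand-ins: three possible steps, and a PRM that prefers step " s0".
expand = lambda chain, k: [f" s{i}" for i in range(k)]
prm_score = lambda chain: 1.0 if chain.endswith("s0") else 0.1
best = prm_beam_search(expand, prm_score, "Q:", beam_width=2, branch=3, depth=3)
```

The pruning at each depth is what makes this cheaper than best-of-N: bad paths are abandoned after a few steps instead of being decoded to completion.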
A key development: automated methods can now generate process supervision labels without human annotations. Math-Shepherd (Wang et al., 2024) trains step-level verifiers automatically. OmegaPRM (Luo et al., 2024) uses Monte Carlo Tree Search to identify errors in reasoning chains via binary search, generating over 1.5M annotations without human involvement. Whether these automated labels fully substitute for human annotations in all settings remains an open question.
5. The Verification Bottleneck
Across all the research, one theme dominates: the bottleneck is not generation, it is verification.
Models can already solve many problems if given enough attempts. The challenge is knowing which attempt is correct. Balachandran et al. (2025) found that with perfect verifiers, all models show dramatic gains from test-time scaling. The gap between what models achieve with real verifiers versus perfect verifiers represents the main frontier for improvement.
This explains why test-time scaling works spectacularly for code (run the tests) and math (check the answer), but struggles for open-ended tasks like creative writing or summarization where "correct" is hard to define.
The PRM advantage
Process reward models help close the verification gap for reasoning tasks. Instead of only checking the final answer, a PRM scores each intermediate step. This catches solutions that arrive at a correct answer through flawed reasoning (which would fool an ORM) and identifies where reasoning goes wrong (which helps with tree search).
| Verifier type | What it checks | Advantage | Limitation |
|---|---|---|---|
| ORM | Final answer only | Simple to train | Reward hacking on long chains |
| PRM | Each reasoning step | Catches flawed reasoning, enables search | Harder to train (but automatable) |
6. The Scaling Laws
Like training-time scaling, test-time scaling follows predictable mathematical relationships.
Coverage: exponentiated power law
Brown et al. (2024) found that coverage follows an exponentiated power law in the sample count:

$$c(k) \approx \exp\left(a \, k^{-b}\right)$$

where $k$ is the number of samples and $a < 0$, $b > 0$ are task-dependent fitted constants, so coverage rises toward 1 as $k$ grows. Coverage often follows this pattern, though Brown et al. note exceptions. The general implication is diminishing returns: each additional increment in coverage requires substantially more samples, at a rate set by the fitted parameters for the task.
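To see the diminishing returns concretely, here is an exponentiated power-law curve $c(k) \approx \exp(a\,k^{-b})$ evaluated with made-up constants (illustrative only, not fitted values from Brown et al.):

```python
import math

# Illustrative constants only; real values of a and b are fit per task.
a, b = -2.0, 0.35

def coverage(k):
    """Exponentiated power law: coverage approaches 1 as samples grow."""
    return math.exp(a * k ** -b)

for k in [1, 10, 100, 1000, 10000]:
    print(f"k={k:>5}  coverage={coverage(k):.3f}")
```

Each tenfold increase in samples buys a smaller coverage gain (here roughly +0.27, +0.26, +0.17, then +0.09), which is the diminishing-returns behavior described above.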
Compute-optimal allocation
Snell et al. (2024) showed that the right strategy depends on problem difficulty:
- Easy problems: one attempt is enough. Extra compute is wasted.
- Medium problems: the sweet spot. More samples or longer thinking yields large gains.
- Very hard problems: if the model's base accuracy is near zero, no amount of test-time compute helps. You need a better model.
Their compute-optimal strategy allocates test-time compute adaptively per prompt based on estimated difficulty. This achieves 4x better efficiency than a uniform best-of-N approach that spends the same compute on every query. (Caveat: their difficulty estimation uses 2,048 samples per problem, and the authors note this extra inference cost is not counted in the headline analysis.)
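The shape of that policy can be caricatured in a few lines. This is a hand-written heuristic in the spirit of the result, not the paper's actual per-prompt allocation, and the thresholds are invented:

```python
def estimate_difficulty(successes, attempts):
    """Crude difficulty estimate: 1 minus the empirical pass rate from a
    handful of pilot samples (Snell et al. use far more per problem)."""
    return 1.0 - successes / attempts

def adaptive_budget(difficulty, max_samples=64):
    """Spend test-time compute where it pays: little on easy prompts,
    a lot on medium ones, and little on near-impossible ones."""
    if difficulty < 0.2:
        return 1             # easy: one attempt is enough
    if difficulty < 0.9:
        return max_samples   # medium: the sweet spot
    return 1                 # very hard: extra samples rarely help

print(adaptive_budget(estimate_difficulty(7, 8)))  # easy prompt
print(adaptive_budget(estimate_difficulty(4, 8)))  # medium prompt
print(adaptive_budget(estimate_difficulty(0, 8)))  # near-impossible prompt
```

The key design choice is the non-monotone middle bucket: budget peaks at medium difficulty rather than growing with difficulty, matching the three regimes above.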
Provable guarantees without verifiers
Chen et al. (2024) proved that failure probability can decay rapidly without any external verifier, under two assumptions: the model has a non-zero chance of generating a correct answer, and it performs better than random at pairwise comparison. Their approach uses the LLM itself to compare pairs of candidate answers in a knockout tournament. Depending on the scaling regime, failure decays exponentially or by a power law with the number of rounds.
This is significant because it removes the dependency on a separate trained reward model. The LLM serves as both generator and judge.
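The tournament itself is straightforward; the theory is the interesting part. In this sketch, `judge(x, y)` is a hypothetical stand-in for the LLM's pairwise comparison, assumed better than random per the theorem's conditions:

```python
import random

def knockout(candidates, judge, rng=random.Random(0)):
    """Knockout selection: pair up candidates, keep each comparison's
    winner, and repeat rounds until one answer remains. Odd pools give
    one candidate a bye into the next round."""
    pool = list(candidates)
    rng.shuffle(pool)
    while len(pool) > 1:
        winners = [judge(pool[i], pool[i + 1])
                   for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:
            winners.append(pool[-1])  # bye for the odd one out
        pool = winners
    return pool[0]

# Toy judge that always prefers the larger number: the max must survive.
result = knockout(range(11), judge=max)
```

With a noisy judge, Chen et al. repeat comparisons and aggregate them, which is what drives the provable decay of failure probability with the number of rounds; this sketch shows only the bracket structure.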
7. When to Use What
The choice between training-time and test-time scaling, and between parallel and sequential test-time scaling, depends on your constraints.
| Scenario | Recommendation | Why |
|---|---|---|
| Have a strong verifier (tests, proofs) | Parallel scaling (best-of-N) | Coverage scales log-linearly; verification is reliable |
| No verifier, need reasoning | Sequential scaling (thinking models) | Self-correction within one attempt, no external verifier needed |
| High-value, rare queries | Both (many thinking attempts) | Cost per query is justified; maximize correctness |
| High-throughput, low-stakes | Bigger model, single pass | Per-query cost matters more than marginal accuracy |
| Task is trivially easy for the model | No extra compute | Already near-perfect; gains are negligible |
| Task is far beyond model capability | Bigger/better model | Test-time compute cannot create missing capabilities |
Cost comparison
Training compute is a one-time cost, amortized over all queries. Test-time compute is a per-query cost.
This creates a clear economic logic. If you serve millions of queries, training a larger model is cheaper per query. If you serve a few thousand high-value queries (competition math, complex code generation, medical diagnosis), spending 10-100x on inference per query is often cost-effective.
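A back-of-envelope comparison makes the logic concrete. All dollar figures below are invented for illustration:

```python
# Hypothetical costs: a large model served in one pass vs. a small model
# with 25x test-time compute per query.
train_large = 5_000_000       # $ to train the larger model (made up)
train_small = 500_000         # $ to train the smaller model (made up)
infer_large = 0.01            # $ per query, large model, single pass
infer_small_scaled = 0.25     # $ per query, small model + test-time scaling

def total_cost(train, per_query, n_queries):
    """Amortized total: one-time training cost plus per-query inference."""
    return train + per_query * n_queries

# Small-model-plus-test-time wins at low volume, loses at high volume.
for n in [10_000, 1_000_000, 100_000_000]:
    small = total_cost(train_small, infer_small_scaled, n)
    large = total_cost(train_large, infer_large, n)
    print(n, "small+TTS" if small < large else "large model")
```

The crossover point moves with every assumed number, but the structure is robust: amortization favors training compute at high query volume, and per-query spending favors test-time compute at low volume.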
The s1 result is instructive: fine-tuning on 1,000 examples (negligible training cost) plus budget forcing at inference (moderate per-query cost) exceeded o1-preview on competition math. The SFT setup was extremely lightweight, though a direct training-compute comparison with o1 is not possible since OpenAI has not disclosed o1's total training cost.
8. What Test-Time Scaling Cannot Do
Balachandran et al. (2025) systematically tested test-time scaling across nine models and eight task domains. The findings are sobering and important.
- Gains are not uniform. Math, STEM, and code benefit significantly. Navigation, spatial reasoning, and planning show inconsistent improvements.
- Gains diminish with problem complexity. As problems get harder (within a domain), the marginal benefit of more compute shrinks faster.
- More tokens is not always better. Simply generating longer reasoning traces does not guarantee improvement. The quality of reasoning matters, not just the quantity.
- Significant gaps persist. On some tasks, even very high scaling regimes leave a large performance gap compared to what would be achievable with perfect reasoning.
The honest assessment: test-time scaling is a powerful tool with clear domains of applicability, not a universal solution. It works best when (1) the model has some baseline competence on the task, (2) verification is possible (automated or via reward models), and (3) the task benefits from deliberate reasoning rather than pattern matching.
References
- Snell et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters Alone.
- Brown et al. (2024). Large Language Monkeys: Scaling Inference Compute with Repeated Sampling.
- Chen et al. (2024). Simple and Provable Scaling Laws for the Test-Time Compute of LLMs.
- Muennighoff et al. (2025). s1: Simple Test-Time Scaling.
- DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
- Lightman et al. (2023). Let's Verify Step by Step.
- Wang et al. (2024). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations.
- Luo et al. (2024). Improve Mathematical Reasoning in Language Models by Automated Process Supervision.
- Balachandran et al. (2025). Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead.