Test-Time Compute
Test-time compute is the computation a model expends during inference rather than during training. The term also names a paradigm in which large language models improve their reasoning by generating and evaluating more intermediate steps before answering.
Overview
Test-time compute describes the amount of computation a model uses when generating an answer, as opposed to the computation used to train it. The practice of deliberately increasing it is also called inference-time scaling or test-time scaling. The concept rose to prominence in 2024 and 2025 with the emergence of reasoning models that deliberately think for longer before responding. Rather than relying solely on larger models trained on more data, this approach improves performance by allocating additional compute at the moment a question is asked.
The central insight is that for difficult problems, allowing a model to generate, explore and evaluate many intermediate reasoning steps can yield accuracy gains that scaling up model size and training alone does not deliver. Empirical results, such as those of Snell et al. (2024), suggest that on some problems a smaller model given substantially more inference compute can rival a much larger model using standard inference.
Approaches
Test-time scaling techniques fall into several broad categories.
Sequential scaling
The model produces an extended chain of thought, working through a problem step by step and sometimes revising earlier steps. Reasoning models such as OpenAI's o1 and o3 series, DeepSeek-R1 and reasoning-tuned versions of Gemini are trained to generate long internal reasoning traces before committing to a final answer.
Parallel scaling
The model generates many independent candidate answers and selects among them. Self-consistency samples multiple reasoning paths and takes a majority vote, while best-of-n sampling uses a verifier or reward model to pick the strongest candidate.
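The two selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not any particular system's implementation: `noisy_model` and `toy_verifier` are hypothetical stand-ins for a sampled language model and a learned verifier or reward model, and the 60% success rate is an arbitrary assumption.

```python
import random
from collections import Counter
from typing import Callable


def self_consistency(sample: Callable[[], str], n: int) -> str:
    """Draw n independent answers and return the most common one."""
    answers = [sample() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def best_of_n(sample: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Draw n independent answers and return the one the scorer rates highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-in for a model: answers "42" about 60% of the time.
rng = random.Random(0)


def noisy_model() -> str:
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])


# Hypothetical stand-in for a verifier: prefers answers close to 42.
def toy_verifier(answer: str) -> float:
    return -abs(int(answer) - 42)


voted = self_consistency(noisy_model, n=25)
picked = best_of_n(noisy_model, toy_verifier, n=25)
print(voted, picked)
```

Majority voting needs no extra model but only works when answers can be compared for equality, whereas best-of-n can rank free-form outputs but is only as good as its scorer.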
Search-based scaling
Techniques borrowed from classical search, including tree-of-thoughts and variants of Monte Carlo tree search, let the model branch into multiple lines of reasoning, evaluate them and prune weak paths. These methods trade additional compute for more thorough exploration of the solution space.
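The branch, evaluate and prune loop can be illustrated with a plain beam search over a toy arithmetic puzzle. This is a simplified sketch in the same spirit, not tree-of-thoughts or Monte Carlo tree search themselves. The puzzle, the `expand` function and the distance-based score are assumptions chosen for illustration; in a reasoning system, a model would propose the next steps and a model or verifier would score them.

```python
from typing import List, Optional, Tuple

State = Tuple[int, Tuple[str, ...]]  # (current value, steps taken so far)


def expand(state: State) -> List[State]:
    """Branch: propose every possible next step from a partial solution."""
    value, steps = state
    return [
        (value + 3, steps + ("+3",)),
        (value * 2, steps + ("*2",)),
        (value - 1, steps + ("-1",)),
    ]


def beam_search(
    start: int, target: int, beam_width: int, max_depth: int
) -> Optional[Tuple[str, ...]]:
    """Keep only the beam_width most promising partial paths at each depth."""
    frontier: List[State] = [(start, ())]
    for _ in range(max_depth):
        children = [child for state in frontier for child in expand(state)]
        for value, steps in children:
            if value == target:
                return steps
        # Evaluate and prune: a smaller distance to the target ranks higher.
        children.sort(key=lambda s: abs(s[0] - target))
        frontier = children[:beam_width]
    return None  # compute budget exhausted without reaching the target


print(beam_search(1, 10, beam_width=3, max_depth=5))
```

Here `beam_width` and `max_depth` are the compute knobs: widening the beam or deepening the search explores more of the space at a proportionally higher cost.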
Trade-offs
Test-time compute exposes a tunable trade-off between cost, latency and quality. Spending more compute tends to improve results on hard reasoning, mathematics and coding tasks, but it also increases response time and expense, so systems may allocate effort adaptively based on estimated difficulty. Research in 2025 also documented an over-reasoning effect, often called overthinking, in which excessive deliberation on easy questions wastes resources and can even degrade calibration. Designing models that decide for themselves how much to think remains an active research area.
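One simple way to allocate effort adaptively is to keep sampling only while the answers disagree, treating agreement as a rough proxy for difficulty. The sketch below assumes that proxy, and its thresholds (`min_samples`, `max_samples`, `agreement`) are hypothetical defaults rather than recommended values. Other difficulty signals are possible.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_vote(
    sample: Callable[[], str],
    min_samples: int = 3,
    max_samples: int = 20,
    agreement: float = 0.8,
) -> Tuple[str, int]:
    """Sample until one answer holds a large enough share of the votes.

    Returns the leading answer and the number of samples actually spent.
    """
    votes: Counter = Counter()
    for used in range(1, max_samples + 1):
        votes[sample()] += 1
        answer, count = votes.most_common(1)[0]
        if used >= min_samples and count / used >= agreement:
            return answer, used  # answers agree: stop spending compute
    # Budget exhausted: fall back to the plurality answer.
    return votes.most_common(1)[0][0], max_samples


# An "easy" question whose samples always agree stops at the minimum.
print(adaptive_vote(lambda: "7"))
```

Under these assumptions, a question with unanimous samples costs three calls, while one with split samples runs to the full budget of twenty.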
Significance
The shift toward test-time compute has changed how the field measures and pursues progress. For much of the previous decade, gains came largely from scaling training: larger models, more data and more training compute. Inference-time scaling adds a complementary axis, with implications for hardware demand, since serving reasoning models requires more compute per query, and for the economics of deploying AI at scale.
References
- OpenAI. (2024). Learning to Reason with LLMs. o1 system documentation.
- Snell, C., Lee, J., Xu, K. and Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv.
- DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv.