Reasoning Models
Reasoning models are large language models trained to generate extended internal deliberation before producing a final answer, using test-time compute to improve accuracy on complex tasks such as mathematics, coding, and multi-step logic.
Reasoning models are a class of large language models designed to improve performance on complex tasks by spending additional computation at inference time to deliberate before producing an answer. Rather than generating a response immediately from the prompt, a reasoning model first produces an extended sequence of intermediate thoughts (exploring approaches, checking assumptions, and self-correcting) before emitting its final output. This process is sometimes called a chain-of-thought or internal scratchpad, though in most commercial reasoning models it is a behaviour learned through reinforcement learning rather than one elicited by explicit instruction in the prompt.
The Test-Time Compute Paradigm
The dominant scaling strategy for AI models through 2023 was training-time scaling: building larger models, using more training data, and running longer training runs. This approach, whose returns are characterised by empirical scaling laws such as Chinchilla, produced substantial capability gains but ran into practical limits of cost and data availability.
Reasoning models represent a complementary scaling axis: test-time compute scaling. The insight is that spending more computation during inference, by generating thousands of tokens of deliberation rather than a short direct response, can substantially improve performance on tasks that benefit from iterative refinement. A model that is allowed to "think" for longer on a hard mathematics problem tends to produce more accurate results than the same model answering directly, even without any change to its weights.
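The effect of extra inference-time compute can be illustrated with a simpler technique than learned deliberation: sampling several independent answers and taking a majority vote. The sketch below is not how reasoning models work internally. It assumes a two-outcome task, independent samples, and a hypothetical per-sample accuracy `p`, and the function name is illustrative only.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes each sample is right with probability p and wrong otherwise,
    and that n is odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    needed = n // 2 + 1  # smallest number of correct samples that forms a majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    # More samples means more inference compute; the weights (p) never change.
    for n in (1, 5, 25):
        print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions, accuracy rises with the number of samples whenever `p` is above 0.5, which mirrors the claim that more inference compute can help without any change to the model.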
This observation was not new — prompt-based chain-of-thought techniques had demonstrated similar effects since 2022. What changed with reasoning models was making the deliberation process trainable rather than prompt-dependent, and scaling it to a degree that produced significant benchmark improvements.
Training Methodology
Reasoning models are typically trained using reinforcement learning (RL) with outcome-based rewards. The model receives reward when its final answer is correct (as judged by a verifier) and no reward otherwise. The intermediate reasoning trace is not directly supervised; instead, the RL process discovers reasoning strategies that tend to produce correct final answers.
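A minimal sketch of an outcome-based reward follows. It assumes, purely for illustration, that completions end with a line of the form `Answer: ...` and that the verifier is an exact string match. Real verifiers (unit tests, symbolic checkers) are more involved, and the function names here are hypothetical.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the final answer equals the reference, else 0.0.

    The reasoning text before the marker is never inspected, so any trace
    that ends in the right answer earns the same reward.
    """
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


if __name__ == "__main__":
    short = "Answer: 42"
    long_trace = "Try 6 * 7.\nCheck: 7 * 6 is also 42.\nAnswer: 42"
    print(outcome_reward(short, "42"), outcome_reward(long_trace, "42"))
```

Because only the final answer is scored, the reward gives no direct signal about which intermediate steps were useful; the RL process has to discover that indirectly.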
The DeepSeek-R1 technical report, released in January 2025, demonstrated with the DeepSeek-R1-Zero variant that a model could learn extended multi-step reasoning through pure RL with outcome rewards, without any supervised fine-tuning on reasoning traces. The released DeepSeek-R1 model adds a small amount of supervised "cold-start" data before RL, mainly to improve the readability of its reasoning. It matched the performance of OpenAI o1 on several benchmarks, reportedly at approximately 70% lower inference cost, and DeepSeek released both the model weights and a detailed technical report, catalysing extensive follow-on research.
OpenAI has not published full training details for its o1 and o3 models. They are reported to use a similar paradigm, possibly with additional techniques such as process reward models (PRMs), which score the quality of intermediate reasoning steps rather than only the final answer. o3 achieved 75.7% accuracy on the ARC-AGI benchmark, a test of abstract reasoning designed to be easy for humans but difficult for AI systems.
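The difference between scoring steps and scoring only the outcome can be sketched as follows. This is an illustration under stated assumptions, not a description of any provider's system: `toy_step_scorer` is a hypothetical stand-in for a learned PRM that merely checks simple integer equations, and aggregating by the minimum step score is one common choice assumed here.

```python
import re
from typing import Callable, List

# Matches simple integer equations such as "5 * 4 = 20".
_EQUATION = re.compile(r"(-?\d+)\s*([+*-])\s*(-?\d+)\s*=\s*(-?\d+)")

_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def toy_step_scorer(step: str) -> float:
    """Hypothetical stand-in for a PRM: 0.0 if any equation in the step is false."""
    for a, op, b, c in _EQUATION.findall(step):
        if _OPS[op](int(a), int(b)) != int(c):
            return 0.0
    return 1.0  # steps with no checkable equation are given the benefit of the doubt


def process_reward(steps: List[str], score_step: Callable[[str], float]) -> float:
    """Aggregate per-step scores with min, so one bad step sinks the whole trace."""
    if not steps:
        return 0.0
    return min(score_step(step) for step in steps)


if __name__ == "__main__":
    trace = ["2 + 3 = 5", "5 * 4 = 21", "So the result is 21"]
    print(process_reward(trace, toy_step_scorer))
```

Unlike an outcome reward, this kind of signal can penalise a trace for a faulty intermediate step regardless of what the final answer turns out to be.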
Characteristics and Trade-offs
Reasoning models generate substantially more tokens than standard language models. A standard model might answer a mathematics question in 50 tokens; a reasoning model might generate 2,000 tokens of deliberation for the same question before giving its final answer, a 40-fold increase in output. Because inference cost and latency grow with the number of generated tokens, both are significantly higher per query. Some analysts project that reasoning workloads will account for 75% of total AI inference compute by 2030.
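The arithmetic behind the 50-token and 2,000-token figures above can be made explicit. The price used below is a hypothetical placeholder, not any provider's actual rate, and the sketch assumes a flat per-token price for output tokens only.

```python
def output_cost_usd(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of the generated tokens for one query at a flat per-token price."""
    return output_tokens * usd_per_million_tokens / 1_000_000


def token_ratio(reasoning_tokens: int, standard_tokens: int) -> float:
    """How many times more tokens the reasoning model generates."""
    return reasoning_tokens / standard_tokens


if __name__ == "__main__":
    price = 10.0  # hypothetical: USD per million output tokens
    standard, reasoning = 50, 2_000  # token counts from the example in the text
    print("token ratio:", token_ratio(reasoning, standard))
    print("standard cost per query:", output_cost_usd(standard, price))
    print("reasoning cost per query:", output_cost_usd(reasoning, price))
```

At a flat price the cost ratio equals the token ratio, so the per-query cost scales directly with the length of the deliberation.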
Reasoning models perform best on tasks with verifiable answers (mathematics, formal logic, code correctness), where outcome-based training has a clear reward signal, and gain less on open-ended tasks where there is no ground truth. They can still hallucinate, and longer deliberation does not guarantee correctness; some research has documented cases where additional thinking degrades accuracy on simpler tasks.
Reasoning models are also less efficient for tasks that do not benefit from deliberation, such as simple retrieval, format conversion, or conversational responses. Many AI providers therefore offer a tiered product: a fast standard model for everyday tasks and a reasoning model for tasks requiring deep analysis.
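A tiered setup of this kind implies some routing rule that decides which model handles a request. The sketch below is a deliberately simple illustration: the task labels and tier names are hypothetical, and real systems may route on very different signals.

```python
# Hypothetical task labels assumed to benefit from extended deliberation.
DELIBERATION_TASKS = {"math", "code", "logic"}


def choose_tier(task_type: str) -> str:
    """Return 'reasoning' for tasks that benefit from deliberation, else 'standard'."""
    return "reasoning" if task_type in DELIBERATION_TASKS else "standard"


if __name__ == "__main__":
    for task in ("math", "retrieval", "format_conversion", "chat"):
        print(task, "->", choose_tier(task))
```

Defaulting to the standard tier keeps cost and latency low for the common case and reserves the slower model for tasks where deliberation is likely to pay off.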
Notable Models
| Model | Provider | Released | Key benchmark |
|---|---|---|---|
| o1 | OpenAI | September 2024 | 83.3% on AIME 2024 |
| o3 | OpenAI | December 2024 | 75.7% on ARC-AGI |
| DeepSeek-R1 | DeepSeek | January 2025 | Comparable to o1 |
| Gemini 2.0 Flash Thinking | Google | January 2025 | Competitive on MATH |
| Claude 3.7 Sonnet | Anthropic | February 2025 | Strong on SWE-bench |
References
- OpenAI. (2024). Learning to reason with LLMs. OpenAI Research Blog. https://openai.com/index/learning-to-reason-with-llms/
- DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948.
- Wei, J. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35.
- Zylos Research. (2026). AI reasoning models 2026: From OpenAI o3 to DeepSeek-R1 and the test-time compute revolution. Zylos.ai.
- Snell, C. et al. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv:2408.03314.