Chain-of-Thought Prompting
A prompt engineering technique that improves large language model reasoning on complex tasks by instructing the model to generate explicit intermediate reasoning steps before arriving at a final answer.
Chain-of-thought (CoT) prompting is a technique in prompt engineering that elicits step-by-step reasoning from a large language model by including intermediate reasoning steps—the "chain of thought"—in the prompt or by instructing the model to generate such steps before producing a final answer. Rather than asking a model to jump directly from a question to a conclusion, CoT prompting encourages it to decompose the problem and work through it sequentially, mirroring the way a human might reason through a mathematics problem or a logical puzzle.
The technique was formalised and popularised by Jason Wei and colleagues at Google Brain in a 2022 paper demonstrating that simply providing a few examples that included reasoning steps significantly improved the performance of large language models on arithmetic, commonsense, and symbolic reasoning benchmarks.[^1] The improvement was particularly pronounced in larger models, suggesting an emergent capability that scales with model size.
How Chain-of-Thought Prompting Works
In a standard prompt, a model receives a question and is expected to produce a direct answer. In a chain-of-thought prompt, the model is additionally shown, or instructed to produce, a sequence of reasoning steps that lead logically to the answer. For example, a demonstration in the prompt might read: "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now? A: Roger started with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls." With this worked reasoning in its context, the model picks up the expected format and applies analogous reasoning to the new question that follows. This happens in context: the model's weights are not updated.
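The difference between the two prompt styles can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the "The answer is" phrasing are assumptions introduced here, and only the Roger demonstration comes from the text above.

```python
# Demonstration taken from the example above. The function names and the
# "The answer is ..." suffix are illustrative assumptions.
DEMO_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 balls. How many tennis balls does he have now?"
)
DEMO_REASONING = (
    "Roger started with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11 balls."
)
DEMO_ANSWER = "11"


def standard_prompt(new_question: str) -> str:
    """Build a prompt whose demonstration shows only the final answer."""
    return (
        f"Q: {DEMO_QUESTION}\nA: The answer is {DEMO_ANSWER}.\n\n"
        f"Q: {new_question}\nA:"
    )


def cot_prompt(new_question: str) -> str:
    """Build a prompt whose demonstration shows the reasoning, then the answer."""
    return (
        f"Q: {DEMO_QUESTION}\n"
        f"A: {DEMO_REASONING} The answer is {DEMO_ANSWER}.\n\n"
        f"Q: {new_question}\nA:"
    )
```

Both functions end the prompt with an open "A:" so that the model continues from there. Only the chain-of-thought version gives it a worked pattern of intermediate steps to imitate.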
Variants
Zero-Shot CoT
The simplest variant requires no examples at all. Appending the phrase "Let's think step by step" to a question prompts many large language models to produce intermediate reasoning before answering. This zero-shot approach was identified by Kojima et al. (2022).[^2] A commonly offered explanation is that models have seen so many step-by-step explanations during training that the phrase steers generation toward that style.
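As a minimal sketch, a zero-shot CoT prompt is the question followed by the trigger phrase. The "Q:"/"A:" framing and the function name are assumptions made for illustration; only the trigger phrase itself comes from the text above.

```python
TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str, trigger: str = TRIGGER) -> str:
    """Return the question with the trigger phrase opening the answer slot."""
    return f"Q: {question.strip()}\nA: {trigger}"
```

For example, `zero_shot_cot_prompt("What is 17 + 25?")` yields a two-line prompt whose answer line begins with the trigger, leaving the model to continue with its reasoning.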
Few-Shot CoT
In few-shot CoT, the prompt includes several complete worked examples, each showing both the reasoning chain and the final answer. This is especially useful when the desired reasoning style is specific—for instance, structured mathematical proofs, debugging traces, or multi-step logical deductions—and when the task is sufficiently complex that zero-shot prompting fails to produce the right structure.
Auto-CoT
Auto-CoT automates the construction of few-shot demonstrations. It samples questions from a task dataset, clusters them by semantic similarity, selects a representative question from each cluster, and generates a reasoning chain for that question using zero-shot CoT. Because the demonstrations come from different clusters, they are more diverse, which reduces the chance that they all share the same kind of error.[^3]
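The pipeline described above can be sketched as follows. This is a simplified stand-in, not the published method: it groups questions greedily by word overlap (Jaccard similarity) instead of clustering sentence embeddings, it takes the first question in each group as the representative, and `generate` is a hypothetical callable standing in for whatever model call is available.

```python
from typing import Callable, List, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased bag of words with basic punctuation removed."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_questions(questions: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Greedy grouping: join the first cluster whose seed question is similar enough."""
    clusters: List[List[str]] = []
    for question in questions:
        for cluster in clusters:
            if jaccard(tokens(question), tokens(cluster[0])) >= threshold:
                cluster.append(question)
                break
        else:
            clusters.append([question])
    return clusters


def build_demos(
    questions: List[str],
    generate: Callable[[str], str],  # hypothetical model call: prompt -> completion
    threshold: float = 0.3,
) -> List[str]:
    """One demonstration per cluster, with the chain produced via zero-shot CoT."""
    demos = []
    for cluster in cluster_questions(questions, threshold):
        representative = cluster[0]
        prompt = f"Q: {representative}\nA: Let's think step by step."
        demos.append(f"{prompt} {generate(prompt)}")
    return demos
```

The resulting strings can be concatenated ahead of a new question to form a few-shot CoT prompt. A fuller version would replace the word-overlap grouping with embedding-based clustering and would filter out overly long or malformed chains.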
Multimodal CoT
Introduced by Zhang et al. (2023), multimodal CoT extends the technique to settings where reasoning draws on both language and visual inputs—for example, solving a science question that requires interpreting a diagram alongside textual information. The model first generates a rationale from both modalities and then produces its answer conditioned on that rationale.[^3]
Relationship to Reasoning Models
Chain-of-thought prompting has directly influenced the design of reasoning models, a class of models that generate extended reasoning before producing their final output. OpenAI's o1 and o3 series, Anthropic's Claude "extended thinking" mode, and DeepSeek-R1 all build chain-of-thought reasoning into the model at training time, largely through reinforcement learning, so the model produces reasoning traces as an intrinsic behaviour rather than requiring explicit prompting.[^4] DeepSeek-R1, released in January 2025, was trained with large-scale reinforcement learning to produce and refine its reasoning chains, and it achieves strong performance on mathematical and coding benchmarks.
Limitations
Chain-of-thought prompting improves reasoning accuracy but does not eliminate errors. Models can produce plausible-sounding but incorrect reasoning chains, a form of reasoning hallucination in which each step sounds locally coherent but the overall conclusion is wrong. CoT also lengthens model outputs, which increases inference cost and latency. In very long reasoning chains, models may lose track of intermediate results, especially when the context window is short.
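One narrow class of faulty chain, an arithmetic slip inside a step, can be caught mechanically. The sketch below is an illustration under assumptions, not an established tool: the regular expression and function name are invented here, and it only recognises steps of the form "a op b = c" with an optional one-word unit after each operand. It cannot detect a chain in which every calculation is correct but the reasoning is wrong.

```python
import re
from typing import List, Tuple

# Matches steps such as "5 + 6 = 11" or "2 cans × 3 balls = 6".
STEP = re.compile(
    r"(\d+)(?:\s+[A-Za-z]+)?\s*([+\-×*])\s*(\d+)(?:\s+[A-Za-z]+)?\s*=\s*(\d+)"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "×": lambda a, b: a * b,
    "*": lambda a, b: a * b,
}


def check_arithmetic_steps(chain: str) -> List[Tuple[str, bool]]:
    """Return each recognised arithmetic step with whether it computes correctly."""
    results = []
    for match in STEP.finditer(chain):
        left, op, right, claimed = match.groups()
        is_correct = OPS[op](int(left), int(right)) == int(claimed)
        results.append((match.group(0), is_correct))
    return results
```

Run on the Roger chain from earlier in the article, both steps pass. If the last step is changed to "5 + 6 = 12", that step is flagged as incorrect while the first still passes.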
References
- Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35. arXiv:2201.11903.
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916.
- Zhang, Z., Zhang, A., Li, M., et al. (2023). Multimodal Chain-of-Thought Reasoning in Language Models. arXiv:2302.00923.
- Promptingguide.ai. (2024). Chain-of-Thought Prompting. https://www.promptingguide.ai/techniques/cot