Chain of Thought Prompting: A Developer Guide
Chain of thought prompting makes LLMs reason through problems step by step. Here's when it works, when it doesn't, and how to implement it with practical patterns.
Ask a model “What is 47 * 83?” and it might get it wrong. Ask it “What is 47 * 83? Think step by step.” and it’s far more likely to get it right. That single instruction changes how the model processes the problem, forcing it to generate intermediate reasoning steps rather than jumping to an answer.
This is chain of thought (CoT) prompting: getting the model to show its work. It’s one of the most reliably effective prompting techniques, and understanding when and how to use it will make you better at every prompt engineering task you encounter.
What Chain of Thought Does
When you prompt a model normally, it generates the answer in one forward pass. The “thinking” is compressed into the model’s internal activations, and the output is the final answer. For simple tasks, this works fine. For tasks that require multiple logical steps, the model often skips steps and produces errors.
CoT forces the model to externalize its reasoning. Each step is generated as text, and subsequent steps can attend to (and build on) the earlier steps. The model essentially gets to use its own output as a scratchpad.
This works because transformer models process the entire context (including their own generated text) at each step. When the model writes “First, 47 * 80 = 3,760”, that text becomes part of the context for the next token, helping the model stay on track for “Then, 47 * 3 = 141, so 47 * 83 = 3,901.”
Zero-Shot CoT
The simplest version: append “Think step by step” or “Let’s work through this” to your prompt. No examples needed.
Prompt: A farmer has 17 sheep. All but 9 die. How many sheep are left?
Think step by step.
Model: The phrase "all but 9" means every sheep except 9 died.
So 9 sheep are left.
Without CoT, models frequently answer “8” (computing 17 - 9 without understanding the phrasing). With CoT, the model parses the language first, then computes.
Zero-shot CoT is surprisingly effective for math, logic puzzles, multi-step reasoning, code debugging, and any task where the model needs to decompose a problem. It costs more output tokens (the reasoning steps are part of the output), but the accuracy improvement usually justifies it.
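In code, zero-shot CoT is nothing more than appending the trigger phrase before sending the prompt to your model client. A minimal sketch (the helper name is illustrative, not part of any real SDK):

```python
def add_cot(prompt: str, trigger: str = "Think step by step.") -> str:
    """Append a zero-shot CoT trigger phrase to an existing prompt."""
    return f"{prompt.rstrip()}\n\n{trigger}"

prompt = add_cot("A farmer has 17 sheep. All but 9 die. How many sheep are left?")
# Pass `prompt` to whatever LLM client you use.
```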
Few-Shot CoT
For more control over the reasoning format, provide examples that demonstrate the step-by-step process. This is few-shot prompting combined with chain of thought:
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans of 3 balls each,
so he bought 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:
The model follows the demonstrated reasoning pattern: state what’s known, work through the math, arrive at the answer. Few-shot CoT gives you control over the reasoning format and depth, which matters when you need consistent output structure.
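Assembling a few-shot CoT prompt from worked examples can be mechanized. The `Example` structure and Q/A formatting below are one reasonable convention, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    reasoning: str  # step-by-step work, ending in "The answer is N."

def build_few_shot_cot(examples: list[Example], new_question: str) -> str:
    """Format worked examples as Q/A pairs, then append the new question."""
    blocks = [f"Q: {ex.question}\nA: {ex.reasoning}" for ex in examples]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)

roger = Example(
    question="Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
             "How many tennis balls does he have now?",
    reasoning="Roger started with 5 balls. He bought 2 * 3 = 6 balls. "
              "5 + 6 = 11. The answer is 11.",
)
prompt = build_few_shot_cot(
    [roger],
    "The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?",
)
```

Ending the prompt with a bare `A:` cues the model to continue in the demonstrated reasoning format.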
When CoT Helps
CoT produces the largest improvements on tasks that require multi-step reasoning:
Math and arithmetic. Problems involving multiple operations, word problems, or unit conversions. CoT reduces errors by forcing the model to compute intermediate values explicitly.
Logical reasoning. Deduction, conditional logic (“If A then B, if B then C, therefore…”), and constraint satisfaction. Without CoT, models often reach conclusions that violate stated constraints.
Code debugging. “Trace through this code step by step and find the bug” outperforms “What’s wrong with this code?” because the model walks through execution states rather than guessing.
Complex analysis. Tasks like “Compare these three options and recommend one” benefit from CoT because the model evaluates each option explicitly rather than pattern-matching to a conclusion.
When CoT Hurts
CoT is not universally beneficial. It can hurt performance on:
Simple classification. “Is this email spam?” doesn’t need step-by-step reasoning. CoT can introduce overthinking where the model talks itself into a wrong answer by finding irrelevant patterns.
Factual retrieval. “What is the capital of France?” is a recall task, not a reasoning task. CoT adds tokens without improving accuracy.
Tasks with strict output formats. If you need a single word or number as output, CoT requires a separate extraction step to pull the final answer from the reasoning. For structured output use cases, the reasoning steps can interfere with JSON formatting unless you structure the prompt carefully.
Latency-sensitive applications. CoT generates more tokens, which means longer response times. If you’re optimizing for speed, the reasoning overhead may not be acceptable.
Implementation Patterns
Separate reasoning from output. Ask the model to reason in one section and provide the final answer in another. This makes extraction easy:
Analyze this customer support ticket. First, think through the issue step by step
in a <reasoning> section. Then provide your classification in a <result> section
with just the category name.
Control reasoning depth. “Think step by step” is open-ended. For production use, be more specific: “List the 3 key factors, evaluate each one, then decide.” This prevents the model from generating excessive reasoning tokens.
Chain of thought with tools. For agents that use function calling, CoT improves tool selection accuracy. Instruct the model to reason about which tool is appropriate before making the call: “First, determine what information is needed. Then, identify which available tool can provide it. Finally, call the tool.”
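One way to encode that reason-before-calling instruction is a reusable system-prompt suffix. The wording below is illustrative, not a prescribed format:

```python
TOOL_COT_INSTRUCTIONS = """\
Before calling any tool, reason explicitly:
1. State what information is needed to answer the user's request.
2. Identify which available tool (if any) can provide it.
3. Only then emit the tool call with its arguments.
If no tool fits, say so instead of guessing."""

def with_tool_cot(system_prompt: str) -> str:
    """Append reason-then-call instructions to an agent's system prompt."""
    return f"{system_prompt.rstrip()}\n\n{TOOL_COT_INSTRUCTIONS}"
```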
Measuring CoT Impact
Don’t assume CoT helps. Measure it. Run the same test set with and without CoT and compare accuracy:
- Take 50-100 representative inputs for your task.
- Run each input twice: once with a standard prompt, once with CoT.
- Score both outputs on your success criteria.
- Compare accuracy, output token count, and latency.
If CoT improves accuracy by 5% or more, the extra tokens are worth it. If accuracy is flat or drops, you’re paying for reasoning that doesn’t help. The right answer depends on your specific task, and the only way to know is to test.
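The measurement loop above can be sketched as a small harness. Here `call_model` is a stand-in for your actual LLM client (a stub is used for illustration), and `score` is whatever success criterion your task uses:

```python
import time

def evaluate(inputs, call_model, score, cot=False):
    """Run each input through the model; aggregate accuracy, tokens, latency."""
    correct = tokens = elapsed = 0
    for prompt, expected in inputs:
        if cot:
            prompt += "\n\nThink step by step."
        start = time.perf_counter()
        output = call_model(prompt)
        elapsed += time.perf_counter() - start
        tokens += len(output.split())  # rough proxy for output token count
        correct += score(output, expected)
    n = len(inputs)
    return {
        "accuracy": correct / n,
        "avg_tokens": tokens / n,
        "avg_latency": elapsed / n,
    }

# Stub model for illustration: pretend CoT prompts are answered correctly.
def fake_model(prompt):
    return "Step 1... The answer is 11." if "step by step" in prompt else "10"

inputs = [("Roger has 5 balls and buys 6 more. How many?", "11")]
score = lambda output, expected: int(expected in output)
baseline = evaluate(inputs, fake_model, score)
with_cot = evaluate(inputs, fake_model, score, cot=True)
```

Swap in your real client and test set, and the two result dicts give you the accuracy, token, and latency comparison directly.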
For tasks where CoT helps but you don’t want the reasoning in your output, you have two options. First, use the separation pattern above (reason in one section, answer in another) and extract the answer programmatically. Second, use a two-pass approach: generate with CoT for accuracy, then make a second call that takes the first response and returns only the formatted answer. The two-pass approach doubles your API cost but gives you maximum control over the output format.
Self-Consistency
A more advanced CoT technique: generate multiple reasoning chains for the same input and take the majority answer. If you ask the model to solve a math problem five times with CoT, and four of the five chains arrive at the same answer, that answer is more likely correct than any single chain.
This is called self-consistency. It’s expensive (you’re making N times more API calls) but it measurably improves accuracy on tasks where the model sometimes makes reasoning errors. It works best when the errors are random (the model takes a wrong turn in different places each time) rather than systematic (the model always makes the same mistake).
In practice, 3-5 samples give most of the benefit. Beyond that, the accuracy gains diminish while the cost scales linearly.
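Self-consistency reduces to sampling N chains and taking a majority vote over the extracted final answers. This sketch assumes you already have a function that pulls the answer out of each chain:

```python
from collections import Counter

def self_consistency(chains: list[str], extract_answer) -> str:
    """Majority-vote over the final answers of several reasoning chains."""
    answers = [extract_answer(chain) for chain in chains]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

# Five sampled chains for "What is 47 * 83?"; four agree.
chains = [
    "47 * 80 = 3760, 47 * 3 = 141, so the answer is 3901.",
    "40 * 83 = 3320, 7 * 83 = 581, so the answer is 3901.",
    "47 * 83 is roughly 50 * 80, so the answer is 4000.",
    "47 * 83 = 3901. The answer is 3901.",
    "83 * 47 = 3901, so the answer is 3901.",
]
extract = lambda chain: chain.rstrip(".").rsplit(" ", 1)[-1]
print(self_consistency(chains, extract))  # → 3901
```

In production you would generate the chains with a nonzero temperature so the samples actually diverge; identical chains give the vote nothing to average over.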
CoT and Reasoning Models
Models like OpenAI’s o-series and DeepSeek R1 are trained to produce chain of thought by default. They generate internal reasoning traces automatically before producing a response, so you don’t need to prompt for CoT with these models.
For standard models (GPT-5.4, Claude, Gemini), explicit CoT prompting remains valuable. The technique requires no special tooling (it just costs output tokens) and improves accuracy on any task involving multi-step reasoning. Start with “Think step by step” and refine from there based on the output quality you observe. If you tune temperature alongside CoT, lower values (0.0-0.3) tend to produce more focused reasoning chains.