
Chain of Thought Prompting: A Developer's Guide

Chain of thought prompting makes LLMs reason through problems step by step. Here's when it works, when it doesn't, and how to implement it with practical patterns.

Ask a model “What is 47 * 83?” and it might get it wrong. Ask it “What is 47 * 83? Think step by step.” and it’s far more likely to get it right. That single instruction changes how the model processes the problem, forcing it to generate intermediate reasoning steps rather than jumping to an answer.

This is chain of thought (CoT) prompting: getting the model to show its work. It’s one of the most reliably effective prompting techniques, and understanding when and how to use it will make you better at every prompt engineering task you encounter.

What Chain of Thought Does

When you prompt a model normally, it generates the answer in one forward pass. The “thinking” is compressed into the model’s internal activations, and the output is the final answer. For simple tasks, this works fine. For tasks that require multiple logical steps, the model often skips steps and produces errors.

CoT forces the model to externalize its reasoning. Each step is generated as text, and subsequent steps can attend to (and build on) the earlier steps. The model essentially gets to use its own output as a scratchpad.

This works because transformer models process the entire context (including their own generated text) at each step. When the model writes “First, 47 * 80 = 3,760”, that text becomes part of the context for the next token, helping the model stay on track for “Then, 47 * 3 = 141, so 47 * 83 = 3,901.”

Zero-Shot CoT

The simplest version: append “Think step by step” or “Let’s work through this” to your prompt. No examples needed.

Prompt: A farmer has 17 sheep. All but 9 die. How many sheep are left?
Think step by step.

Model: The phrase "all but 9" means every sheep except 9 died.
So 9 sheep are left.

Without CoT, models frequently answer “8” (computing 17 - 9 without understanding the phrasing). With CoT, the model parses the language first, then computes.

Zero-shot CoT is surprisingly effective for math, logic puzzles, multi-step reasoning, code debugging, and any task where the model needs to decompose a problem. It costs more output tokens (the reasoning steps are part of the output), but the accuracy improvement usually justifies it.
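In code, zero-shot CoT is just string concatenation. A minimal sketch (the helper name and trigger constant are illustrative, not a library API):

```python
COT_TRIGGER = "Think step by step."

def with_cot(prompt: str, trigger: str = COT_TRIGGER) -> str:
    """Append a zero-shot CoT trigger phrase on its own line."""
    return f"{prompt.rstrip()}\n{trigger}"

# The result is the original question plus the trigger,
# ready to send to any chat-completion API.
prompt = with_cot("A farmer has 17 sheep. All but 9 die. How many sheep are left?")
```

Keeping the trigger in one place makes it easy to A/B test different phrasings ("Let's work through this", "Reason carefully before answering") later.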

Few-Shot CoT

For more control over the reasoning format, provide examples that demonstrate the step-by-step process. This is few-shot prompting combined with chain of thought:

Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans of 3 balls each,
so he bought 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
How many apples do they have?
A:

The model follows the demonstrated reasoning pattern: state what’s known, work through the math, arrive at the answer. Few-shot CoT gives you control over the reasoning format and depth, which matters when you need consistent output structure.
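One way to assemble a few-shot CoT prompt programmatically is to keep (question, worked answer) pairs in a list and join them, ending with the unanswered question. A sketch, using the tennis-ball example above:

```python
def build_few_shot_cot(examples: list[tuple[str, str]], question: str) -> str:
    """Build a few-shot CoT prompt from (question, worked-answer) pairs."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")  # trailing "A:" cues the model to answer
    return "\n\n".join(blocks)

examples = [(
    "Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?",
    "Roger started with 5 balls. He bought 2 cans of 3 balls each, "
    "so he bought 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11.",
)]
prompt = build_few_shot_cot(
    examples,
    "The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?",
)
```

Because every worked answer ends with "The answer is N.", the model tends to copy that convention, which simplifies extraction later.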

When CoT Helps

CoT produces the largest improvements on tasks that require multi-step reasoning:

Math and arithmetic. Problems involving multiple operations, word problems, or unit conversions. CoT reduces errors by forcing the model to compute intermediate values explicitly.

Logical reasoning. Deduction, conditional logic (“If A then B, if B then C, therefore…”), and constraint satisfaction. Without CoT, models often reach conclusions that violate stated constraints.

Code debugging. “Trace through this code step by step and find the bug” outperforms “What’s wrong with this code?” because the model walks through execution states rather than guessing.

Complex analysis. Tasks like “Compare these three options and recommend one” benefit from CoT because the model evaluates each option explicitly rather than pattern-matching to a conclusion.

When CoT Hurts

CoT is not universally beneficial. It can hurt performance on:

Simple classification. “Is this email spam?” doesn’t need step-by-step reasoning. CoT can introduce overthinking where the model talks itself into a wrong answer by finding irrelevant patterns.

Factual retrieval. “What is the capital of France?” is a recall task, not a reasoning task. CoT adds tokens without improving accuracy.

Tasks with strict output formats. If you need a single word or number as output, CoT requires a separate extraction step to pull the final answer from the reasoning. For structured output use cases, the reasoning steps can interfere with JSON formatting unless you structure the prompt carefully.

Latency-sensitive applications. CoT generates more tokens, which means longer response times. If you’re optimizing for speed, the reasoning overhead may not be acceptable.

Implementation Patterns

Separate reasoning from output. Ask the model to reason in one section and provide the final answer in another. This makes extraction easy:

Analyze this customer support ticket. First, think through the issue step by step
in a <reasoning> section. Then provide your classification in a <result> section
with just the category name.

Control reasoning depth. “Think step by step” is open-ended. For production use, be more specific: “List the 3 key factors, evaluate each one, then decide.” This prevents the model from generating excessive reasoning tokens.

Chain of thought with tools. For agents that use function calling, CoT improves tool selection accuracy. Instruct the model to reason about which tool is appropriate before making the call: “First, determine what information is needed. Then, identify which available tool can provide it. Finally, call the tool.”

Measuring CoT Impact

Don’t assume CoT helps. Measure it. Run the same test set with and without CoT and compare accuracy:

  1. Take 50-100 representative inputs for your task.
  2. Run each input twice: once with a standard prompt, once with CoT.
  3. Score both outputs on your success criteria.
  4. Compare accuracy, output token count, and latency.

If CoT improves accuracy by five percentage points or more, the extra tokens are usually worth it. If accuracy is flat or drops, you're paying for reasoning that doesn't help. The right answer depends on your specific task, and the only way to know is to test.
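The comparison loop above can be sketched as follows, with `call_model` and `score` as placeholders for your API client and your success criterion:

```python
def compare_cot(inputs, call_model, score, trigger="Think step by step."):
    """Run each (prompt, expected) pair with and without a CoT trigger.

    `call_model` maps a prompt string to a completion string; `score` maps
    (completion, expected) to 0 or 1. Returns accuracy for both variants.
    """
    plain_hits = cot_hits = 0
    for prompt, expected in inputs:
        plain_hits += score(call_model(prompt), expected)
        cot_hits += score(call_model(f"{prompt}\n{trigger}"), expected)
    n = len(inputs)
    return {"plain_accuracy": plain_hits / n, "cot_accuracy": cot_hits / n}
```

In a real harness you would also record output token counts and latency per call, since those are the costs you're weighing against the accuracy gain.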

For tasks where CoT helps but you don’t want the reasoning in your output, you have two options. First, use the separation pattern above (reason in one section, answer in another) and extract the answer programmatically. Second, use a two-pass approach: generate with CoT for accuracy, then make a second call that reformats the reasoned output into the final answer format. The two-pass approach doubles your API cost but gives you maximum control over the output format.
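A sketch of the two-pass approach, with `call_model` again standing in for your API client:

```python
def two_pass_answer(question: str, call_model) -> str:
    """Pass 1: reason with CoT. Pass 2: reformat into a bare final answer."""
    reasoning = call_model(f"{question}\nThink step by step.")
    return call_model(
        "Given this reasoning, reply with only the final answer, "
        f"no explanation:\n\n{reasoning}"
    )
```

The second pass is a cheap formatting task, so it can often run on a smaller, faster model than the first.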

Self-Consistency

A more advanced CoT technique: generate multiple reasoning chains for the same input and take the majority answer. If you ask the model to solve a math problem five times with CoT, and four of the five chains arrive at the same answer, that answer is more likely correct than any single chain.

This is called self-consistency. It’s expensive (you’re making N times more API calls) but it measurably improves accuracy on tasks where the model sometimes makes reasoning errors. It works best when the errors are random (the model takes a wrong turn in different places each time) rather than systematic (the model always makes the same mistake).

In practice, 3-5 samples give most of the benefit. Beyond that, the accuracy gains diminish while the cost scales linearly.
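Self-consistency is a sampling loop plus a majority vote. A sketch, where `sample_answer` stands in for one CoT call followed by answer extraction:

```python
from collections import Counter

def self_consistent_answer(question: str, sample_answer, n: int = 5) -> str:
    """Sample n CoT answers for one question and return the majority answer."""
    votes = Counter(sample_answer(question) for _ in range(n))
    return votes.most_common(1)[0][0]
```

For this to work, `sample_answer` must call the model at a nonzero temperature; at temperature 0 every chain is (nearly) identical and the vote adds nothing.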

CoT and Reasoning Models

Models like OpenAI’s o-series and DeepSeek R1 are trained to perform chain of thought by default: they generate internal reasoning traces automatically before producing a response. You don’t need to prompt for CoT with these models.

For standard models (GPT-5.4, Claude, Gemini), explicit CoT prompting remains valuable. The technique requires no special API support (it just costs output tokens) and improves accuracy on any task involving multi-step reasoning. Start with “Think step by step” and refine from there based on the output quality you observe. When tuning temperature alongside CoT, lower values (0.0-0.3) tend to produce more focused reasoning chains.

Get Insanely Good at AI


The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
