Prompt Engineering Guide: How to Write Better AI Prompts
Prompting isn't about magic phrases. It's structured thinking that determines output quality. Here's how to write prompts that actually work, from frameworks to chain-of-thought to system prompts.
Prompting gets dismissed as “just asking nicely.” That’s wrong. Good prompting is structured thinking. You’re designing the input that shapes the model’s output. The difference between a vague request and a well-structured prompt isn’t politeness. It’s information architecture.
Why Prompting Is a Real Skill
LLMs predict the next token. They don’t retrieve facts or reason in the human sense. They generate text that fits the patterns in their training data. Your prompt is the primary signal that steers those patterns. What is an LLM covers the mechanics: the model has no concept of truth, only probability distributions over tokens. Your job is to narrow that distribution toward outputs you want.
That requires understanding what information the model needs, in what order, and in what format. It requires knowing when to add examples, when to ask for step-by-step reasoning, and when to constrain the output structure. None of that is obvious. It’s a learnable skill with measurable impact on output quality.
The Core Principle: Context Quality Determines Output Quality
The single most important rule: garbage in, garbage out. The model can only work with what you give it. Every token in the context window influences the output. Context windows explained covers the limits: you have a fixed budget, and tokenization determines exactly how much fits. The quality of your instructions, the relevance of your examples, and the structure of your request all feed into what the model produces.
This is why “be more specific” is the default advice. Vague prompts produce vague outputs. Specific prompts produce specific outputs. The model doesn’t infer what you meant. It generates text that continues the pattern you established. If your pattern is fuzzy, the continuation will be too.
Structured Frameworks: RISEN and CO-STAR
Frameworks force you to think through the components of a good prompt. Two widely used ones are RISEN and CO-STAR.
RISEN: Role, Instructions, Steps, End Goal, Narrowing
RISEN structures prompts into five components:
Role. Who is the model acting as? “You are a senior Python developer” sets expertise and perspective. The model will draw on patterns associated with that role.
Instructions. What exactly should it do? The core task, stated clearly. “Refactor this function to use type hints and add docstrings.”
Steps. How should it proceed? Break the task into ordered stages. “First, add type annotations to all parameters. Second, add a docstring in Google style. Third, ensure the return type is explicit.”
End Goal. What does success look like? “The function should be production-ready and pass a type checker.”
Narrowing. What constraints apply? “Keep it under 20 lines. Do not change the function’s behavior.”
Example: Instead of “write a blog post about golf,” a RISEN prompt specifies the role (sports journalist), instructions (blog post about youth golf in the UK), steps (include player profiles and statistics), end goal (inform and engage readers), and narrowing (plain English, under 800 words, no jargon). The framework prevents the common failure mode of vague prompts: the model has explicit criteria for what to include and what to avoid.
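The five components can be kept explicit with a small builder. This is an illustrative sketch, not a standard library or API; the class name, field names, and section wording are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class RisenPrompt:
    """Holds the five RISEN components and renders them into one prompt string."""
    role: str
    instructions: str
    steps: list[str]
    end_goal: str
    narrowing: str

    def render(self) -> str:
        # Number the steps so the model follows them in order.
        step_lines = "\n".join(f"{i}. {s}" for i, s in enumerate(self.steps, 1))
        return (
            f"You are {self.role}.\n\n"
            f"Task: {self.instructions}\n\n"
            f"Follow these steps:\n{step_lines}\n\n"
            f"Success criteria: {self.end_goal}\n"
            f"Constraints: {self.narrowing}"
        )

prompt = RisenPrompt(
    role="a sports journalist",
    instructions="Write a blog post about youth golf in the UK.",
    steps=[
        "Profile two rising players.",
        "Include participation statistics.",
        "End with a call to action.",
    ],
    end_goal="Readers are informed and engaged.",
    narrowing="Plain English, under 800 words, no jargon.",
)
print(prompt.render())
```

The dataclass forces you to fill in every component before you can render anything, which is the whole point of the framework: you can't accidentally skip the narrowing.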
CO-STAR: Context, Objective, Style, Tone, Audience, Response Format
CO-STAR has six components:
Context. Background the model needs. “We’re building a SaaS product for small accounting firms. The user is a bookkeeper with no technical background.”
Objective. The specific task. “Explain how to export data to Excel.”
Style. How it should be written. “Use the style of Stripe’s documentation: clear, scannable, with concrete examples.”
Tone. The emotional register. “Friendly and reassuring, not condescending.”
Audience. Who will read it. “Bookkeepers who use Excel daily but are not power users.”
Response format. Structure of the output. “Numbered steps with screenshots described in brackets. Include a troubleshooting section at the end.”
CO-STAR separates style (how it’s written) from tone (how it feels), which gives finer control than frameworks that collapse them. It also forces you to specify the response format explicitly, which many prompts omit. Developed by data scientist Sheila Teo, CO-STAR carried her entry to victory in Singapore’s first GPT-4 prompt engineering competition. Its strength is reducing hallucinations by grounding the model in specific context and preventing generic, one-size-fits-all outputs.
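A CO-STAR prompt is often laid out as labeled sections. The headers below follow a common convention rather than a fixed spec, and the function name is an assumption for this sketch:

```python
def co_star(context: str, objective: str, style: str, tone: str,
            audience: str, response_format: str) -> str:
    """Assemble the six CO-STAR components into one sectioned prompt."""
    sections = [
        ("# CONTEXT", context),
        ("# OBJECTIVE", objective),
        ("# STYLE", style),
        ("# TONE", tone),
        ("# AUDIENCE", audience),
        ("# RESPONSE FORMAT", response_format),
    ]
    return "\n\n".join(f"{header}\n{body}" for header, body in sections)

print(co_star(
    context="We're building a SaaS product for small accounting firms.",
    objective="Explain how to export data to Excel.",
    style="Clear and scannable, like Stripe's documentation.",
    tone="Friendly and reassuring, not condescending.",
    audience="Bookkeepers who use Excel daily but are not power users.",
    response_format="Numbered steps, plus a troubleshooting section at the end.",
))
```

Keyword arguments make the six slots explicit at the call site, so a missing component is a visible error rather than a silently vaguer prompt.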
Few-Shot Prompting
Few-shot prompting means including examples of the output format you want. The model learns the pattern from your examples and reproduces it.
For structured extraction: “Extract the following fields from each paragraph: name, date, amount. Example: ‘John paid $50 on March 1’ becomes {name: ‘John’, date: ‘2024-03-01’, amount: 50}. Now extract from: [your text].”
For classification: “Classify each support ticket as bug, feature request, or question. Example: ‘The login button does nothing’ = bug. ‘Can we add dark mode?’ = feature request. Now classify: [ticket].”
For format: “Convert these notes into meeting minutes. Format: Date, Attendees, Key Decisions, Action Items. Example: [show one full example]. Now convert: [notes].”
One to three examples usually suffice. More examples can help for complex formats, but they consume context. Put the most representative example first. The model weights early context more heavily. Few-shot works because you’re showing the model the exact input-output mapping you want. It doesn’t have to infer the format from your description. It can copy the structure. This is especially valuable for outputs with non-obvious structure: custom JSON schemas, specific markdown layouts, or classification schemes that aren’t standard.
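The classification case above can be sketched as a prompt assembled from labeled examples. The labels come from the article; the function name and the trailing `Label:` cue (ending the prompt where you want the model to continue) are illustrative conventions:

```python
# Labeled examples: (ticket text, expected label).
EXAMPLES = [
    ("The login button does nothing", "bug"),
    ("Can we add dark mode?", "feature request"),
    ("How do I reset my password?", "question"),
]

def few_shot_prompt(ticket: str) -> str:
    """Build a few-shot classification prompt that ends mid-pattern,
    so the model's most likely continuation is the label itself."""
    shots = "\n".join(f"Ticket: {text}\nLabel: {label}" for text, label in EXAMPLES)
    return (
        "Classify each support ticket as bug, feature request, or question.\n\n"
        f"{shots}\n\n"
        f"Ticket: {ticket}\nLabel:"
    )

print(few_shot_prompt("The export fails on files over 10 MB"))
```

Keeping the examples in a list makes it cheap to iterate: swap in a more representative first example, or trim the set when context is tight.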
Chain-of-Thought Prompting
For reasoning tasks (math, logic, multi-step analysis), asking the model to “think step by step” often dramatically improves accuracy. This is chain-of-thought prompting.
Wei et al. (2022) introduced chain-of-thought with few-shot examples: they gave the model several math problems with step-by-step solutions, then a new problem. The model learned to show its work and produced more correct answers. On the GSM8K math benchmark, the 540B-parameter PaLM model with CoT exemplars achieved state-of-the-art results.
Kojima et al. (2022) showed that a single phrase works without examples: “Let’s think step by step.” Adding this before the answer improved performance on arithmetic and reasoning tasks. On MultiArith, accuracy jumped from 17.7% to 78.7%. On GSM8K, from 10.4% to 40.7%. Zero-shot, no hand-crafted demonstrations.
Why it works: the model generates intermediate reasoning tokens before the final answer. Those tokens constrain the probability distribution for the answer. Wrong reasoning often leads to wrong answers; forcing explicit steps surfaces errors and steers toward correct reasoning paths. For tasks where the model would otherwise jump to a wrong conclusion, CoT slows it down and improves accuracy.
Use chain-of-thought for: math, logic puzzles, multi-step planning, and any task where the answer depends on intermediate reasoning. Skip it for: simple lookups, format conversion, or when you need concise output and don’t care about the reasoning process. The tradeoff is token cost. CoT produces longer outputs because the model writes out its reasoning. If you’re paying per token or have strict length limits, factor that in. For accuracy-critical tasks, the extra tokens are usually worth it.
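Zero-shot CoT is just two prompt templates: one that elicits reasoning, and one that appends that reasoning and asks for the final answer (the two-stage setup Kojima et al. describe). The function names here are illustrative:

```python
def reasoning_prompt(question: str) -> str:
    """Stage 1: elicit step-by-step reasoning before any answer is produced."""
    return f"Q: {question}\nA: Let's think step by step."

def answer_prompt(question: str, reasoning: str) -> str:
    """Stage 2: feed the model's own reasoning back and ask for just the answer."""
    return f"{reasoning_prompt(question)}\n{reasoning}\nTherefore, the answer is"

# Stage 1 output (the reasoning) would come from a model call; shown inline here.
print(reasoning_prompt("A pen costs $2 and a notebook costs 3x as much. Total for one of each?"))
```

In practice you send the stage-1 prompt to the model, capture its reasoning text, then send the stage-2 prompt for a concise final answer. That second call is what keeps CoT usable when you need short outputs downstream.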
System Prompts
System prompts are instructions sent separately from the user message, typically with higher priority in the model’s context. They define the model’s behavior, persona, and constraints before the conversation starts.
Use system prompts for:
Role and persona. “You are a code review assistant. You focus on security, performance, and readability. You do not suggest stylistic changes unless they affect maintainability.”
Constraints. “Never include personal opinions. Cite sources for factual claims. If unsure, say so.”
Output format. “Always respond in JSON with keys: summary, issues, suggestions.”
Guardrails. “Do not execute code. Do not access external URLs. Do not pretend to have capabilities you don’t have.”
System prompts are powerful because they persist across turns and sit at the beginning of the context, where models attend most strongly. But they still consume tokens. Keep them focused. Put the most critical constraints first. And remember: a long system prompt plus a long user message can push important details into the “lost in the middle” zone where attention drops. Front-load what matters.
When building applications, the system prompt is often the only place users can’t directly edit. Use it for behavior that should be consistent across all interactions: safety rules, output format defaults, and persona. Let the user message handle the variable parts of each request.
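The split between persistent and per-request instructions maps directly onto the message structure most chat APIs use. This sketch follows the OpenAI-style role-tagged list; note that Anthropic’s API instead takes the system prompt as a separate top-level parameter, so the exact shape varies by provider:

```python
# Persistent behavior lives in the system message; it applies to every turn.
SYSTEM = (
    "You are a code review assistant. Focus on security, performance, "
    "and readability. Always respond in JSON with keys: summary, issues, suggestions."
)

def build_messages(user_request: str) -> list[dict]:
    """Pair the fixed system prompt with the variable user request."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_request},
    ]

print(build_messages("Review this function for SQL injection risks."))
```

Because `SYSTEM` is a constant the user never touches, format defaults and guardrails stay consistent no matter what arrives in the user message.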
Common Mistakes
Vague instructions. “Make it better” gives the model nothing to optimize for. “Make it more concise, remove jargon, and add a one-sentence summary at the top” gives clear criteria.
Overloading context. Pasting a 50-page document when the answer is in one section wastes tokens and dilutes attention. Retrieve the relevant parts. Summarize the rest. Every token competes for the model’s focus.
Not specifying format. If you want JSON, say so. If you want bullet points, say so. If you want a table, describe the columns. The model will guess otherwise, and you’ll get inconsistent structure.
Not iterating. First prompts rarely nail it. Run the prompt, inspect the output, identify what’s wrong, and refine. “Add a step to verify the date format” or “Exclude examples shorter than 10 words.” Prompting is iterative. Treat it that way. Keep a log of what you tried and what changed. When a prompt works in testing but fails in production, the difference is often subtle: a different input distribution, a longer context, or a model variant with slightly different behavior. Version your prompts and document the conditions under which each works.
Ignoring temperature. For factual tasks, code generation, and extraction, use low temperature (0 to 0.3). For brainstorming and creative writing, higher is fine. What is AI temperature covers when and why to adjust it. Wrong temperature for the task causes inconsistency or excessive creativity where you need precision. Many developers never touch the default and wonder why the same prompt gives different results. Temperature is the first parameter to check when debugging non-deterministic behavior.
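One way to stop defaulting is to make the temperature choice explicit per task type. The task names and exact values below are illustrative judgment calls within the ranges above, not fixed rules:

```python
# Illustrative per-task defaults: low for precision work, high for divergent work.
TEMPERATURE_BY_TASK = {
    "extraction": 0.0,
    "code_generation": 0.2,
    "factual_qa": 0.2,
    "summarization": 0.3,
    "brainstorming": 0.9,
}

def temperature_for(task: str) -> float:
    """Look up a temperature for the task; default low, since precision
    errors are usually costlier than bland output."""
    return TEMPERATURE_BY_TASK.get(task, 0.2)

print(temperature_for("extraction"))      # precision task: deterministic
print(temperature_for("brainstorming"))   # divergent task: sampled broadly
```

Routing every model call through a lookup like this also gives you one place to audit when debugging inconsistent outputs.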
The Shift Toward Context Engineering
Prompting is the visible part. Underneath it, a larger discipline is emerging: context engineering. The prompt is one input to the model. So are retrieved documents, conversation history, tool results, and memory. In production systems, 80 to 90% of the context window is often filled by these other sources. Optimizing only the prompt misses most of the picture.
Context engineering: the most important AI skill in 2026 covers this shift. Retrieval quality, memory design, state management, and information routing all determine what the model sees. A perfect prompt with bad retrieval still produces bad output. The best prompt can’t recover from a conversation history that dropped the user’s key constraint.
Prompt engineering remains essential. It’s the foundation. But as systems get more complex, context engineering becomes the bottleneck. Understanding both is how you build systems that actually work. The skills complement each other: a well-designed prompt is useless if the retrieved documents are wrong, and perfect retrieval is wasted if the prompt doesn’t tell the model how to use the information. Production systems need both layers working together.
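The context-assembly side can be sketched as a greedy packing step: the system prompt and user message are always kept, and retrieved chunks, then history, fill whatever budget remains. Character counts stand in for real token counts here, and the function name and priority order are assumptions for illustration:

```python
def assemble_context(system: str, retrieved: list[str], history: list[str],
                     user: str, budget_chars: int = 8000) -> str:
    """Greedily pack the context window under a fixed budget.
    System prompt and user message are non-negotiable; retrieved chunks
    get priority over older conversation history for the rest."""
    remaining = budget_chars - len(system) - len(user)
    parts = [system]
    for chunk in retrieved + history:
        if len(chunk) > remaining:
            continue  # skip chunks that would blow the budget
        parts.append(chunk)
        remaining -= len(chunk)
    parts.append(user)
    return "\n\n".join(parts)
```

A real implementation would count tokens with the model’s tokenizer and rank chunks by relevance rather than order, but the shape is the same: the prompt is one slot among several competing for a fixed budget.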
Putting It Together
Start with structure. Use RISEN or CO-STAR to force yourself to specify role, task, steps, and format. Add few-shot examples when the output format matters. Use chain-of-thought for reasoning tasks. Put critical constraints in the system prompt. Specify format explicitly. Iterate.
The model doesn’t read your mind. It continues the pattern you establish. The more deliberate you are about that pattern, the better the output. That’s prompt engineering: not magic phrases, but structured thinking that shapes what the model produces.
For a complete treatment of prompting, context design, and production AI systems, Get Insanely Good at AI covers these mechanics in depth.