Few-Shot Prompting: How to Guide LLMs with Examples
Few-shot prompting teaches LLMs by example instead of instruction. Here's how to choose examples, format them, and know when few-shot is the right approach vs. fine-tuning.
You can tell a model what to do with instructions. Or you can show it. Few-shot prompting gives the model examples of the input-output pattern you want, and the model generalizes from those examples to handle new inputs. No training, no fine-tuning, no parameter updates. The model learns the pattern in context, at inference time, purely from the examples in the prompt.
This is called in-context learning, and it’s one of the most practically useful capabilities of large language models.
Zero-Shot, One-Shot, Few-Shot
The terminology refers to how many examples you provide:
Zero-shot: No examples. Just instructions. “Classify this email as spam or not spam.”
One-shot: One example. “Here’s a spam email and its label. Now classify this one.”
Few-shot: Two or more examples. Typically 3-8 examples covering the range of expected inputs and outputs.
Zero-shot works well for tasks the model already understands from its training data. Few-shot becomes necessary when you need the model to follow a specific format, adopt a non-obvious classification scheme, or handle a domain-specific pattern it hasn’t seen before.
How to Structure Few-Shot Prompts
The format is straightforward: show the model a series of input-output pairs, then present the new input and let the model complete the pattern.
Classify the customer feedback as positive, negative, or neutral.
Feedback: "The shipping was fast and the product works great."
Classification: positive
Feedback: "It arrived broken. Worst purchase I've made."
Classification: negative
Feedback: "It's okay. Does what it says."
Classification: neutral
Feedback: "Love the design but the battery life is disappointing."
Classification:
The model sees the pattern and completes it with one of the three labels, most likely “negative” or “neutral” depending on how it weighs the positive and negative signals. The examples establish what “positive,” “negative,” and “neutral” mean in your specific context, which might differ from the model’s default interpretation.
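In practice, a prompt like this is assembled programmatically from labeled pairs rather than written by hand. A minimal sketch in Python (the function and variable names are illustrative, not from any particular SDK):

```python
# Assemble a few-shot classification prompt from labeled example pairs.
EXAMPLES = [
    ("The shipping was fast and the product works great.", "positive"),
    ("It arrived broken. Worst purchase I've made.", "negative"),
    ("It's okay. Does what it says.", "neutral"),
]

INSTRUCTION = "Classify the customer feedback as positive, negative, or neutral."

def build_few_shot_prompt(examples, new_input, instruction=INSTRUCTION):
    parts = [instruction, ""]
    for text, label in examples:
        parts.append(f'Feedback: "{text}"')
        parts.append(f"Classification: {label}")
        parts.append("")  # blank line delimits examples
    parts.append(f'Feedback: "{new_input}"')
    parts.append("Classification:")  # the model completes this line
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    EXAMPLES, "Love the design but the battery life is disappointing."
)
```

The resulting string ends with a bare `Classification:` line, so the model's completion is the label itself, which keeps parsing trivial.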
How Many Examples to Use
More examples generally improve performance, but with diminishing returns. Research and practical experience suggest:
- 3-5 examples cover most use cases well. This gives the model enough signal to generalize without eating too much context.
- 1-2 examples work for simple format demonstrations (“I want the output to look like this”).
- 8-15 examples help for nuanced tasks with many edge cases, where you need to demonstrate the boundary between categories.
Beyond 15 examples, you’re usually better off fine-tuning the model rather than stuffing more examples into the prompt. The context window cost of few-shot examples adds up, both in token usage and in API costs.
Example Selection Matters
The examples you choose have more impact on performance than the number of examples. Three principles:
Cover the output space. If you have five categories, include at least one example per category. Omitting a category makes the model less likely to predict it. If one category is rare but important, include it anyway.
Show edge cases. The easy cases don’t need demonstration. Show the ambiguous ones. If the boundary between “positive” and “neutral” is where your model struggles, include examples that sit on that boundary.
Balance, don’t mirror, the input distribution. If 70% of your real inputs are a certain type, don’t make 70% of your examples that type. Over-representing the common case wastes examples. Instead, distribute examples evenly across categories and use the edge cases to teach the model where the boundaries are.
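One way to apply these principles is to sample the same number of examples per label, regardless of how skewed the labels are in your pool. A sketch, assuming a pool of `(text, label)` pairs (the helper name and pool format are illustrative):

```python
import random
from collections import defaultdict

def select_balanced(labeled_pool, per_label=2, seed=0):
    """Pick the same number of examples per label, however skewed
    the label distribution in the pool is."""
    rng = random.Random(seed)  # fixed seed so prompts are reproducible
    by_label = defaultdict(list)
    for text, label in labeled_pool:
        by_label[label].append((text, label))
    selected = []
    for label in sorted(by_label):
        k = min(per_label, len(by_label[label]))
        selected.extend(rng.sample(by_label[label], k))
    return selected

# A skewed pool: the common case dominates, but selection stays even.
pool = [(f"positive example {i}", "positive") for i in range(7)]
pool += [(f"negative example {i}", "negative") for i in range(2)]
pool += [(f"neutral example {i}", "neutral") for i in range(2)]
picked = select_balanced(pool, per_label=2)
```

The edge-case principle still applies on top of this: curate which examples go into the pool, then let balanced sampling handle the distribution.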
Example Ordering
Order matters more than you might expect. Models have a recency bias: examples near the end of the prompt have more influence on the output than examples at the beginning. Put your most representative and important examples last.
For classification tasks, avoid clustering all examples of the same label together. Interleave them. A prompt with five positive examples followed by five negative examples can bias the model toward the label of the last cluster it saw.
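Interleaving can be automated with a round-robin pass over the labels. A sketch (the helper name is illustrative):

```python
from collections import defaultdict
from itertools import zip_longest

def interleave_by_label(examples):
    """Reorder (text, label) pairs round-robin across labels so no
    single label clusters at the start or end of the prompt."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    interleaved = []
    # zip_longest pads shorter label groups with None, which we skip.
    for row in zip_longest(*by_label.values()):
        interleaved.extend(ex for ex in row if ex is not None)
    return interleaved

clustered = [("p1", "pos"), ("p2", "pos"), ("p3", "pos"),
             ("n1", "neg"), ("n2", "neg"), ("n3", "neg")]
mixed = interleave_by_label(clustered)
# mixed alternates: p1, n1, p2, n2, p3, n3
```

The relative order within each label is preserved, so you can still put your most important example of each label last.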
Combining Few-Shot with Chain of Thought
Few-shot and chain of thought are complementary. You can demonstrate not just the input-output mapping but the reasoning process:
Q: A store has 15 red balls and 10 blue balls. If 3 red and 5 blue balls are sold,
what fraction of remaining balls are red?
A: Starting balls: 15 red, 10 blue (25 total).
After sales: 15-3=12 red, 10-5=5 blue (17 total).
Fraction red: 12/17.
Q: A class has 20 boys and 15 girls. If 5 boys and 3 girls are absent,
what fraction of present students are girls?
A:
The example teaches the model both the answer format and the reasoning approach. This is called few-shot chain of thought, and it produces the strongest results for reasoning-heavy tasks.
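Few-shot chain-of-thought prompts follow the same assembly pattern as plain few-shot prompts; the only difference is that each example stores the full worked reasoning alongside the question. A sketch (names are illustrative):

```python
# Each example pairs a question with its worked reasoning, so the
# model imitates the reasoning style, not just the final answer.
COT_EXAMPLES = [
    (
        "A store has 15 red balls and 10 blue balls. If 3 red and 5 blue "
        "balls are sold, what fraction of remaining balls are red?",
        "Starting balls: 15 red, 10 blue (25 total).\n"
        "After sales: 15-3=12 red, 10-5=5 blue (17 total).\n"
        "Fraction red: 12/17.",
    ),
]

def build_cot_prompt(examples, question):
    parts = []
    for q, a in examples:
        parts.append(f"Q: {q}")
        parts.append(f"A: {a}")
        parts.append("")
    parts.append(f"Q: {question}")
    parts.append("A:")  # the model supplies reasoning plus answer
    return "\n".join(parts)

prompt = build_cot_prompt(
    COT_EXAMPLES,
    "A class has 20 boys and 15 girls. If 5 boys and 3 girls are absent, "
    "what fraction of present students are girls?",
)
```

Because the demonstrated answers contain intermediate steps, the model's completion will too, so you may need to extract the final answer from the last line of its output.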
When to Switch to Fine-Tuning
Few-shot prompting is fast to set up and easy to iterate. But it has limits:
Context cost. Every example consumes tokens. With 10 examples at 100 tokens each, you’re spending 1,000 tokens on examples before the actual input. At scale, this adds up.
Complexity ceiling. Some tasks are too nuanced to demonstrate in a handful of examples. If you need 50+ examples to cover the pattern, the prompt becomes unwieldy and the model’s in-context learning starts to plateau.
Consistency. Few-shot prompting produces variable results across runs, especially at higher temperature settings. Fine-tuned models tend to be more consistent because the pattern is encoded in the weights rather than the prompt.
The switching point: if you have a stable task with hundreds or thousands of labeled examples, and you need consistent, high-quality results at volume, fine-tuning will outperform few-shot prompting and cost less per request (since you no longer need the examples in every prompt).
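The break-even point is simple arithmetic. The prices below are placeholders, not real provider rates; substitute your own:

```python
def cost_per_request(prompt_tokens, output_tokens,
                     input_price_per_1k, output_price_per_1k):
    """Per-request cost given token counts and per-1K-token prices."""
    return (prompt_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# Hypothetical numbers: 1,000 tokens of few-shot examples on top of a
# 200-token input, vs. a fine-tuned model that needs no examples but
# charges a higher (also made-up) per-token rate.
few_shot   = cost_per_request(1200, 50, 0.003, 0.015)
fine_tuned = cost_per_request(200, 50, 0.006, 0.024)
```

Even with a doubled per-token rate, dropping 1,000 example tokens from every request makes the fine-tuned call cheaper in this sketch; at high volume that difference compounds.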
For tasks that change frequently, where you’re still iterating on the output format, or where you have limited labeled data, few-shot prompting remains the better choice. It’s the fastest path from “I want the model to do X” to a working implementation.
Practical Tips
Label format consistency. Use the exact same format in every example. If one example uses “Positive” and another uses “positive” and another uses “POSITIVE”, you’re teaching the model inconsistency.
Delimiters. Clearly separate examples from each other and from the actual input. Use consistent markers (blank lines, "---", numbered examples) so the model doesn't confuse example content with instructions.
Test with the actual model. Few-shot effectiveness varies by model. An example set that works well with GPT-5.4 might need adjustment for Claude or Gemini. Test your prompts with the model you’ll deploy, and revisit them when you switch providers.
Few-shot prompting is one technique in a larger toolkit; it works best combined with well-structured system prompts and explicit output formatting.