
Few-Shot Prompting: How to Guide LLMs with Examples

Few-shot prompting teaches LLMs by example instead of instruction. Here's how to choose examples, format them, and know when few-shot is the right approach vs. fine-tuning.

You can tell a model what to do with instructions. Or you can show it. Few-shot prompting gives the model examples of the input-output pattern you want, and the model generalizes from those examples to handle new inputs. No training, no fine-tuning, no parameter updates. The model learns the pattern in context, at inference time, purely from the examples in the prompt.

This is called in-context learning, and it’s one of the most practically useful capabilities of large language models.

Zero-Shot, One-Shot, Few-Shot

The terminology refers to how many examples you provide:

Zero-shot: No examples. Just instructions. “Classify this email as spam or not spam.”

One-shot: One example. “Here’s a spam email and its label. Now classify this one.”

Few-shot: Two or more examples. Typically 3-8 examples covering the range of expected inputs and outputs.

Zero-shot works well for tasks the model already understands from its training data. Few-shot becomes necessary when you need the model to follow a specific format, adopt a non-obvious classification scheme, or handle a domain-specific pattern it hasn’t seen before.

How to Structure Few-Shot Prompts

The format is straightforward: show the model a series of input-output pairs, then present the new input and let the model complete the pattern.

Classify the customer feedback as positive, negative, or neutral.

Feedback: "The shipping was fast and the product works great."
Classification: positive

Feedback: "It arrived broken. Worst purchase I've made."
Classification: negative

Feedback: "It's okay. Does what it says."
Classification: neutral

Feedback: "Love the design but the battery life is disappointing."
Classification:

The model sees the pattern and produces one of the three labels, most likely “negative” or “neutral” depending on how it weighs the positive and negative signals. The examples establish what “positive,” “negative,” and “neutral” mean in your specific context, which might differ from the model’s default interpretation.
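This kind of prompt is easy to assemble programmatically, which makes swapping examples in and out cheap. A minimal sketch using the feedback examples above (the function name `build_few_shot_prompt` is ours, not a library API):

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble a few-shot classification prompt from (text, label) pairs."""
    blocks = [instruction, ""]
    for text, label in examples:
        blocks.append(f'Feedback: "{text}"')
        blocks.append(f"Classification: {label}")
        blocks.append("")  # blank line separates examples
    # Present the new input and leave the label blank for the model to complete.
    blocks.append(f'Feedback: "{new_input}"')
    blocks.append("Classification:")
    return "\n".join(blocks)

examples = [
    ("The shipping was fast and the product works great.", "positive"),
    ("It arrived broken. Worst purchase I've made.", "negative"),
    ("It's okay. Does what it says.", "neutral"),
]
prompt = build_few_shot_prompt(
    "Classify the customer feedback as positive, negative, or neutral.",
    examples,
    "Love the design but the battery life is disappointing.",
)
```

The resulting string ends with a bare `Classification:` line, so whatever the model generates next is the label itself.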

How Many Examples to Use

More examples generally improve performance, but with diminishing returns. Research and practical experience suggest:

  • 3-5 examples cover most use cases well. This gives the model enough signal to generalize without eating too much context.
  • 1-2 examples work for simple format demonstrations (“I want the output to look like this”).
  • 8-15 examples help for nuanced tasks with many edge cases, where you need to demonstrate the boundary between categories.

Beyond 15 examples, you’re usually better off fine-tuning the model rather than stuffing more examples into the prompt. The context window cost of few-shot examples adds up, both in token usage and in API costs.
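That overhead is worth quantifying before you commit to a large example set. A rough sketch using the common heuristic of roughly 4 characters per token (a real tokenizer such as tiktoken gives exact counts; the function names here are ours):

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def example_overhead(examples: list[str], requests_per_day: int) -> int:
    """Total tokens spent on few-shot examples across a day of requests."""
    per_request = sum(estimate_tokens(e) for e in examples)
    return per_request * requests_per_day

# Ten ~400-character examples (~100 tokens each) at 50,000 requests/day:
daily = example_overhead(["x" * 400] * 10, 50_000)
print(daily)  # 50000000 -- 50 million tokens/day on examples alone
```

Fifty million example tokens a day is exactly the recurring cost that fine-tuning eliminates.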

Example Selection Matters

The examples you choose have more impact on performance than the number of examples. Three principles:

Cover the output space. If you have five categories, include at least one example per category. Omitting a category makes the model less likely to predict it. If one category is rare but important, include it anyway.

Show edge cases. The easy cases don’t need demonstration. Show the ambiguous ones. If the boundary between “positive” and “neutral” is where your model struggles, include examples that sit on that boundary.

Don’t mirror the input distribution. If 70% of your real inputs are a certain type, don’t make 70% of your examples that type. Over-representing the common case wastes examples. Instead, distribute examples evenly across categories and use the edge cases to teach the model where the boundaries are.
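These three principles can be operationalized as a simple selection pass: guarantee per-category coverage first, then spend the remaining budget on boundary cases. A sketch, assuming each candidate is tagged with a `label` and an `is_edge_case` flag (both field names are ours):

```python
def select_examples(candidates, budget):
    """Pick examples that cover every label, then fill with edge cases.

    candidates: list of dicts with "text", "label", and "is_edge_case" keys.
    budget: maximum number of examples to include in the prompt.
    """
    selected, seen_labels = [], set()
    # First pass: one example per label, so no category is omitted.
    for c in candidates:
        if c["label"] not in seen_labels:
            selected.append(c)
            seen_labels.add(c["label"])
    # Second pass: spend remaining slots on boundary cases.
    for c in candidates:
        if len(selected) >= budget:
            break
        if c not in selected and c["is_edge_case"]:
            selected.append(c)
    return selected[:budget]
```

In practice you would also sort the edge cases by how often production traffic hits them, but the coverage-first ordering is the part that matters.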

Example Ordering

Order matters more than you might expect. Models have a recency bias: examples near the end of the prompt have more influence on the output than examples at the beginning. Put your most representative and important examples last.

For classification tasks, avoid clustering all examples of the same label together. Interleave them. A prompt with five positive examples followed by five negative examples can bias the model toward the label of the last cluster it saw.
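A round-robin pass over per-label buckets is enough to break up those clusters. A sketch (the function name is ours; examples are dicts with a `label` key as in the selection sketch above would assume):

```python
from collections import defaultdict
from itertools import chain, zip_longest

def interleave_by_label(examples):
    """Round-robin examples across labels so no label forms a long run."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["label"]].append(ex)
    # Take one example from each bucket per round; zip_longest pads
    # exhausted buckets with None, which we drop.
    rounds = zip_longest(*buckets.values())
    return [ex for ex in chain.from_iterable(rounds) if ex is not None]
```

Five positives followed by five negatives come out as alternating positive/negative pairs, with leftovers from the larger class trailing at the end.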

Combining Few-Shot with Chain of Thought

Few-shot and chain of thought are complementary. You can demonstrate not just the input-output mapping but the reasoning process:

Q: A store has 15 red balls and 10 blue balls. If 3 red and 5 blue balls are sold,
what fraction of remaining balls are red?
A: Starting balls: 15 red, 10 blue (25 total).
After sales: 15-3=12 red, 10-5=5 blue (17 total).
Fraction red: 12/17.

Q: A class has 20 boys and 15 girls. If 5 boys and 3 girls are absent,
what fraction of present students are girls?
A:

The example teaches the model both the answer format and the reasoning approach. This is called few-shot chain of thought, and it produces the strongest results for reasoning-heavy tasks.
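Since the demonstration is teaching arithmetic, it pays to verify the demonstrated answer before shipping the prompt; a wrong worked example teaches a wrong procedure. The numbers above check out, and the correct completion for the second question follows the same steps (using Python’s `fractions` module):

```python
from fractions import Fraction

# Worked example: 15 red and 10 blue balls; 3 red and 5 blue sold.
red, blue = 15 - 3, 10 - 5
assert Fraction(red, red + blue) == Fraction(12, 17)

# New question: 20 boys and 15 girls; 5 boys and 3 girls absent.
boys, girls = 20 - 5, 15 - 3
print(Fraction(girls, boys + girls))  # 4/9
```

A correct completion should arrive at 12/27, which reduces to 4/9.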

When to Switch to Fine-Tuning

Few-shot prompting is fast to set up and easy to iterate. But it has limits:

Context cost. Every example consumes tokens. With 10 examples at 100 tokens each, you’re spending 1,000 tokens on examples before the actual input. At scale, this adds up.

Complexity ceiling. Some tasks are too nuanced to demonstrate in a handful of examples. If you need 50+ examples to cover the pattern, the prompt becomes unwieldy and the model’s in-context learning starts to plateau.

Consistency. Few-shot prompting produces variable results across runs, especially at higher temperature settings. Fine-tuned models tend to be more consistent because the pattern is encoded in the weights rather than the prompt.

The switching point: if you have a stable task with hundreds or thousands of labeled examples, and you need consistent, high-quality results at volume, fine-tuning will outperform few-shot prompting and cost less per request (since you no longer need the examples in every prompt).
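The break-even point is back-of-the-envelope math. A sketch with illustrative numbers (the prices and token counts below are placeholders, not any provider’s actual rates, and it ignores any per-token price difference between base and fine-tuned models):

```python
def breakeven_requests(example_tokens: int,
                       price_per_1k_input: float,
                       finetune_cost: float) -> float:
    """Requests after which a one-time fine-tune pays for itself,
    by eliminating the few-shot example tokens from every prompt."""
    saved_per_request = example_tokens / 1000 * price_per_1k_input
    return finetune_cost / saved_per_request

# Placeholders: 1,000 example tokens per prompt, $0.01 per 1k input
# tokens, $500 one-time fine-tuning cost.
print(breakeven_requests(1000, 0.01, 500.0))  # 50000.0
```

Under those assumptions, fine-tuning wins after 50,000 requests; with real traffic in the millions, the recurring example cost dominates quickly.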

For tasks that change frequently, where you’re still iterating on the output format, or where you have limited labeled data, few-shot prompting remains the better choice. It’s the fastest path from “I want the model to do X” to a working implementation.

Practical Tips

Label format consistency. Use the exact same format in every example. If one example uses “Positive” and another uses “positive” and another uses “POSITIVE”, you’re teaching the model inconsistency.
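A quick lint pass over your example set catches this before the model ever sees it. A sketch (the function name is ours):

```python
def check_label_consistency(examples):
    """Flag label spellings that differ only in case or whitespace.

    examples: list of (text, label) pairs.
    Returns a list of (first_seen_spelling, conflicting_spelling) pairs.
    """
    canonical = {}
    problems = []
    for _, label in examples:
        key = label.strip().lower()
        if key in canonical and canonical[key] != label:
            problems.append((canonical[key], label))
        canonical.setdefault(key, label)
    return problems

# ("Positive" vs "positive") is flagged; identical spellings pass clean.
```

Run it whenever the example set changes, and fail the build on a non-empty result.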

Delimiters. Clearly separate examples from each other and from the actual input. Use consistent markers (blank lines, "---", numbered examples) so the model doesn’t confuse example content with instructions.

Test with the actual model. Few-shot effectiveness varies by model. An example set that works well with GPT-5.4 might need adjustment for Claude or Gemini. Test your prompts with the model you’ll deploy, and revisit them when you switch providers.

Few-shot prompting is one technique among several: it works best alongside system prompts and structured output formatting as part of the broader prompt engineering toolkit.

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
