Few-Shot Learning
Few-shot learning is the prompt engineering technique of including 2–5 "input → desired output" examples in the prompt so the LLM imitates the pattern. Because it requires no additional training, it is one of the most practical ways to steer model behavior through prompt design alone.
Why It Matters
Systematically introduced in the 2020 GPT-3 paper "Language Models are Few-Shot Learners," the technique demonstrated that large LLMs could perform tasks they had never been explicitly trained on after seeing just a few examples in the prompt. Gains vary by task and model, but few-shot prompting routinely beats zero-shot on the same task, often by a wide margin on classification benchmarks. It's the cheapest meaningful quality improvement available without fine-tuning.
Zero-Shot vs Few-Shot vs Fine-Tuning
Zero-Shot: Instructions only, no examples.
"Classify the sentiment of this sentence as positive/negative/neutral: [sentence]"
Few-Shot: 2–5 example pairs included.
"Classify as positive, negative, or neutral.
Example 1: 'It was really great' → positive
Example 2: 'Not for me' → negative
Example 3: 'It was okay' → neutral
Sentence to classify: [new sentence]"
Fine-Tuning: Update model weights with hundreds to thousands of examples.
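The few-shot variant above can be assembled programmatically rather than written by hand. A minimal sketch, assuming an illustrative sentiment task and prompt template (the example pairs and wording are not a prescribed format):

```python
# Illustrative labeled pairs; in practice these come from your own task data.
EXAMPLES = [
    ("It was really great", "positive"),
    ("Not for me", "negative"),
    ("It was okay", "neutral"),
]

def build_few_shot_prompt(sentence: str) -> str:
    """Assemble a classification prompt with labeled examples ahead of the query."""
    lines = ["Classify the sentiment as positive, negative, or neutral.", ""]
    for i, (text, label) in enumerate(EXAMPLES, start=1):
        lines.append(f"Example {i}: '{text}' -> {label}")
    lines.append("")
    # End on the same pattern the examples follow, so the model completes the label.
    lines.append(f"Sentence to classify: '{sentence}' ->")
    return "\n".join(lines)
```

The resulting string is sent as the prompt; ending it mid-pattern (`... ->`) nudges the model to complete with just the label.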
| Aspect | Zero-Shot | Few-Shot | Fine-Tuning |
|---|---|---|---|
| Setup cost | None | Minutes | Hours to days |
| Accuracy | Low | Medium | High |
| Token consumption | Low | Medium (examples inflate prompt) | Low (post-training) |
| Flexibility | Change instantly | Change instantly | Requires retraining |
Few-shot sits between the two extremes and is the sweet spot for most production tasks that need a quick quality boost.
Designing Effective Few-Shot Examples
Cover diverse cases: Include positives, negatives, and edge cases so the model infers the distribution.
Consistent format: Every example must follow the same input → output format. Inconsistent formats hurt accuracy.
Hard boundary cases: If every example is easy, the model stays uncertain near the decision boundary. Include subtle cases such as "looks positive but is actually neutral."
Example ordering: Research shows ordering affects results. A common heuristic is clearest examples first, then harder ones.
Number of examples: 3–5 works well for most tasks. Beyond that, token cost usually grows faster than accuracy.
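The guidelines above can be partially enforced in code before a prompt ever ships. A minimal sketch, assuming each example is a dict with hypothetical `input`, `label`, and `difficulty` fields (lower difficulty meaning a clearer case):

```python
def prepare_examples(examples: list[dict]) -> list[dict]:
    """Check format consistency and label diversity, then order clearest-first.

    Each example is assumed to be {"input": str, "label": str, "difficulty": int};
    these field names are illustrative, not a standard schema.
    """
    # Consistent format: every example must carry exactly the same keys.
    for ex in examples:
        if set(ex) != {"input", "label", "difficulty"}:
            raise ValueError(f"inconsistent example format: {ex}")
    # Cover diverse cases: reject sets that demonstrate only one label.
    if len({ex["label"] for ex in examples}) < 2:
        raise ValueError("examples cover a single label; add diverse cases")
    # Ordering heuristic from above: clearest examples first, then harder ones.
    return sorted(examples, key=lambda ex: ex["difficulty"])
```

Such a check is cheap insurance: inconsistent formats and one-sided example sets are the most common silent quality killers.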
Good Use Cases
Classification: Auto-labeling customer inquiries by category.
Format conversion: JSON to Markdown, unstructured text to structured data.
Style imitation: Learning a brand voice or author's prose from a handful of examples.
Domain-specific extraction: Pulling specific fields out of contracts or papers.
Translation tuning: Customizing translation to include your glossary.
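As a concrete instance of the format-conversion use case, a few-shot prompt can teach a JSON-to-Markdown mapping by demonstration. A minimal sketch with made-up record fields (`name`, `role`) and an illustrative output format:

```python
import json

# Demonstration pairs: flat JSON record -> Markdown bullet. Fields are invented
# for illustration; real prompts would use your own schema.
CONVERSION_EXAMPLES = [
    ('{"name": "Ada", "role": "engineer"}', "- **Ada** (engineer)"),
    ('{"name": "Lin", "role": "designer"}', "- **Lin** (designer)"),
]

def conversion_prompt(record: dict) -> str:
    """Build a few-shot format-conversion prompt ending at the open slot."""
    parts = ["Convert each JSON record to a Markdown bullet:"]
    for src, dst in CONVERSION_EXAMPLES:
        parts.append(f"Input: {src}\nOutput: {dst}")
    # Leave the final Output: blank for the model to fill in.
    parts.append(f"Input: {json.dumps(record)}\nOutput:")
    return "\n\n".join(parts)
```

Two examples are often enough for mechanical conversions like this, since the mapping rule is unambiguous once demonstrated.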
Limitations
Context waste: Long examples eat tokens and shrink the effective context window.
Less consistent than fine-tuning: High-volume repetitive tasks still favor fine-tuning.
Modern models are better at zero-shot: Claude Opus 4.6, GPT-5, and similar frontier models close much of the zero-shot gap, so the few-shot advantage is smaller than it was. Often zero-shot suffices.
Example quality determines output: Bad examples → bad outputs. Example design is the core quality lever.