Model Distillation
Model distillation is a training technique where a small "student" model learns to mimic a much larger "teacher" model — by training on the teacher's outputs (or its internal probability distributions) instead of raw labels. The result is a model with most of the teacher's capability at a fraction of the size, latency, and cost.
Why It Matters
The frontier-vs-cheap tradeoff used to be brutal: pay 10× for a 5% smarter model, or settle. Distillation collapses that gap. GPT-4o-mini, Claude Haiku, Gemini Flash, Llama 3 8B Instruct — every "small fast cheap" tier from a major lab is widely understood to be a distilled descendant of a flagship model. Distillation is also the preferred way to specialize: a 7B model distilled from GPT-4 on customer-support transcripts can beat the original on that one task while costing 1/100th as much to serve. For builders, distillation reframes "which model do I use" from "biggest I can afford" to "what's the smallest model that still does my job."
How It Works
1. Pick a teacher: Usually a large, capable model (GPT-4, Claude Opus, Llama 70B).
2. Generate training data: Either:
- Output distillation: Run the teacher on a large set of inputs and save its responses. Train the student on those (input, teacher-response) pairs.
- Logit distillation: Capture the teacher's full probability distribution over the vocabulary at each token (the "soft targets"), and train the student to match.
3. Train the student: Standard supervised fine-tuning, but using teacher outputs as labels. The student's loss is its divergence from the teacher's output, not from a human-labeled gold answer.
4. Optional task focus: Distill on data from a specific domain (code, chat, math, customer support) for a specialized small model.
5. Evaluate: Compare student vs teacher on held-out benchmarks. Aim for 80–95% of teacher quality at <10% of cost.
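The data-generation step (2) can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `teacher()` is a placeholder for a real teacher call (e.g. a frontier-model API), and the prompts are invented examples.

```python
import json

def teacher(prompt: str) -> str:
    """Placeholder for a real teacher-model call (e.g. a frontier API)."""
    return f"Teacher answer to: {prompt}"

def build_distillation_set(prompts):
    """Output distillation, step 2: collect (input, teacher-response) pairs.

    Each pair becomes one supervised fine-tuning example for the student,
    with the teacher's response standing in for a human-labeled answer.
    """
    return [{"input": p, "label": teacher(p)} for p in prompts]

prompts = [
    "Classify this ticket: 'My card was charged twice.'",
    "Extract order ID and quantity from: 'Order #123, qty 2.'",
]
dataset = build_distillation_set(prompts)

# Serialize as JSONL, the format most fine-tuning tooling expects.
jsonl = "\n".join(json.dumps(row) for row in dataset)
```

Step 3 is then ordinary supervised fine-tuning on this JSONL, with the teacher's text in the label position.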
Output vs Logit Distillation
| Aspect | Output (response) distillation | Logit (soft-target) distillation |
|---|---|---|
| Data | Just teacher's text outputs | Teacher's full token probabilities |
| Access required | API only | Logit-level access (usually open weights) |
| Quality | Good | Better (more signal per token) |
| Cost | Cheap | More expensive (capture cost) |
| Use case | Distill from closed APIs | Distill from open or own models |
Output distillation is what most teams do because they don't have weight-level access to GPT-4 or Claude. Logit distillation is the academic standard but requires open models.
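The soft-target objective behind logit distillation can be sketched directly. This is a toy, dependency-free version of the classic temperature-scaled KL loss (in the style of Hinton et al.); real training code would compute it per token over the whole vocabulary with an autodiff framework, and typically mix it with a hard-label cross-entropy term.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The student is pushed toward the teacher's full distribution over
    tokens — the "soft targets" — not just its single argmax token,
    which is where the extra signal per token comes from.
    """
    p = softmax(teacher_logits, T)  # teacher's soft targets
    q = softmax(student_logits, T)  # student's current distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

When the student's logits match the teacher's exactly, the loss is zero; any mismatch in the distribution, even with the same top token, is penalized.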
Famous Distilled Models
DistilBERT (Hugging Face, 2019): The original. 60% of BERT's size, 97% of its performance, 60% faster.
Alpaca / Vicuna (Stanford / LMSYS, 2023): Llama fine-tuned on GPT-3.5-era outputs (Alpaca on text-davinci-003 completions, Vicuna on shared ChatGPT conversations). Made small instruction-following models cheap.
GPT-4o-mini, Claude Haiku, Gemini Flash: Reportedly distilled from their respective flagships, though details aren't public.
Llama 3.2 1B / 3B: Meta's small models distilled from larger Llama variants for on-device use.
DeepSeek-R1-Distill (2025): Open models that distill DeepSeek-R1's reasoning into smaller Llama and Qwen bases.
TinyLlama, Phi-3: Small models trained with distillation-style techniques to punch above their parameter weight.
When to Use Distillation
Cost-driven product: You need most of the quality but can't afford GPT-4 or Claude Opus on every request.
Latency-sensitive UX: Chat assistants where responses must be sub-second.
Specialization: A narrow task (intent classification, JSON extraction, code completion) where a small fine-tuned model beats the general flagship.
On-device or air-gapped: Where running a 70B model is impossible.
High-volume batch processing: Document classification at millions per day — flagship models are too expensive.
When Not To Use It
You don't have enough teacher data: You typically need thousands of high-quality (input, teacher-output) pairs at minimum.
Open-ended creative tasks: Distilled models often lose nuance and creativity.
Frontier reasoning: Math, coding, and complex reasoning still benefit from running the actual frontier model.
Rapidly changing domains: A distilled model is a snapshot. If the domain changes weekly, the distillation lags.
Trade-offs
Quality ceiling: Student can't exceed teacher. Distillation transfers, doesn't create.
Brittleness on unfamiliar inputs: Small models generalize less. Out-of-distribution inputs degrade fast.
Bias inheritance: Teacher's biases (and errors, and hallucinations) are baked into the student.
API cost upfront: Distilling from a closed API requires paying for thousands of teacher inferences during data generation.
Compliance risk: Some closed-API ToS forbid using outputs to train competing models. Read the terms.
Common Mistakes
Distilling without evaluation: Without held-out benchmarks, you can't tell if the student matches the teacher.
Tiny student, complex teacher: A 1B student can't capture all of a 175B teacher's behavior. Match scale to ambition.
Skipping data quality: Bad teacher outputs (hallucinated, off-task) become baked-in bad student behavior.
No specialization: Distilling a general model from a general model often produces a worse general model. Distill for a task.
Compliance blind spots: Quietly training on competitors' API outputs is a legal time bomb. Confirm ToS.