
Model Distillation

Model distillation is a training technique where a small "student" model learns to mimic a much larger "teacher" model — by training on the teacher's outputs (or its internal probability distributions) instead of raw labels. The result is a model with most of the teacher's capability at a fraction of the size, latency, and cost.

Why It Matters

The frontier-vs-cheap tradeoff used to be brutal: pay 10× for a 5% smarter model, or settle. Distillation collapses that gap. GPT-4o-mini, Claude Haiku, Gemini Flash, Llama 3 8B Instruct — every "small, fast, cheap" tier from a major lab is widely believed to be a distilled descendant of a flagship model. Distillation is also the preferred way to specialize: a 7B model distilled from GPT-4 on customer-support transcripts can beat the original on that narrow task while costing a hundredth as much to serve. For builders, distillation reframes "which model do I use?" from "the biggest I can afford" to "the smallest model that still does my job."

How It Works

1. Pick a teacher: Usually a large, capable model (GPT-4, Claude Opus, Llama 70B).

2. Generate training data, in one of two ways:

  • Output distillation: Run the teacher on a large set of inputs and save its responses. Train the student on those (input, teacher-response) pairs.
  • Logit distillation: Capture the teacher's full probability distribution over the vocabulary at each token (the "soft targets"), and train the student to match.

3. Train the student: Standard supervised fine-tuning, but using teacher outputs as labels. The student's loss is its divergence from the teacher's output, not from a human-labeled gold answer.

4. Optional task focus: Distill on data from a specific domain (code, chat, math, customer support) for a specialized small model.

5. Evaluate: Compare student vs teacher on held-out benchmarks. Aim for 80–95% of teacher quality at <10% of cost.
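The output-distillation path through steps 2–3 can be sketched in a few lines. This is a minimal illustration, not a real pipeline: `query_teacher` is a hypothetical stand-in for a call to an actual teacher API, and the JSONL record shape is just one common fine-tuning format.

```python
import json

def query_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a real teacher API call
    (e.g. a flagship chat model). Here it just returns a canned answer."""
    return f"Teacher answer for: {prompt}"

def build_distillation_set(prompts, path="distill.jsonl"):
    """Step 2 (output distillation): collect (input, teacher-response)
    pairs and write them in a JSONL shape most fine-tuning stacks accept."""
    records = [{"prompt": p, "completion": query_teacher(p)} for p in prompts]
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    return records

pairs = build_distillation_set(["What is 2+2?", "Define recursion."])
# Step 3 would then fine-tune the student on `pairs` with an ordinary
# supervised loss, treating the teacher's completion as the label.
```

The key design point is in step 3: the teacher's response replaces the human-labeled gold answer, so data quality is bounded by teacher quality.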

Output vs Logit Distillation

| Aspect | Output (response) distillation | Logit (soft-target) distillation |
| --- | --- | --- |
| Data | Just teacher's text outputs | Teacher's full token probabilities |
| Access required | API only | Need raw model weights |
| Quality | Good | Better (more signal per token) |
| Cost | Cheap | More expensive (capture cost) |
| Use case | Distill from closed APIs | Distill from open or own models |

Output distillation is what most teams do because they don't have weight-level access to GPT-4 or Claude. Logit distillation is the academic standard but requires open models.
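The extra "signal per token" in logit distillation comes from the soft-target objective: temperature-scaled softmax over teacher and student logits, then a KL-divergence penalty on the mismatch. A pure-Python sketch of that loss, with made-up logit values for illustration (the T² scaling follows the original Hinton et al. formulation):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 flattens the distribution, exposing the teacher's
    # preferences among near-miss tokens, not just its top pick.
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): how far the student distribution q is from the teacher's p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, T=2.0):
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    return (T * T) * kl_divergence(p, q)

teacher = [4.0, 1.5, 0.2]   # hypothetical per-token vocabulary logits
student = [3.5, 1.0, 0.5]
loss = distillation_loss(teacher, student)
```

In output distillation, by contrast, the student only ever sees the teacher's single sampled token, so the entire rest of the distribution is lost.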

Famous Distilled Models

DistilBERT (Hugging Face, 2019): The original. 60% of BERT's size, 97% of its language-understanding performance, 60% faster.

Alpaca / Vicuna (Stanford / LMSYS, 2023): Llama fine-tuned on OpenAI model outputs (Alpaca on text-davinci-003 generations, Vicuna on shared ChatGPT conversations). Made small instruction-following models cheap.

GPT-4o-mini, Claude Haiku, Gemini Flash: Reportedly distilled from their respective flagships, though details aren't public.

Llama 3.2 1B / 3B: Meta's small models distilled from larger Llama variants for on-device use.

DeepSeek-R1-Distill (2025): DeepSeek-R1's reasoning distilled into smaller Llama and Qwen bases, released openly.

TinyLlama, Phi-3: Small models trained with distillation-style techniques to punch above their parameter weight.

When to Use Distillation

Cost-driven product: You need most of the quality but can't afford GPT-4 or Claude Opus on every request.

Latency-sensitive UX: Chat assistants where responses must be sub-second.

Specialization: A narrow task (intent classification, JSON extraction, code completion) where a small fine-tuned model beats the general flagship.

On-device or air-gapped: Where running a 70B model is impossible.

High-volume batch processing: Document classification at millions per day — flagship models are too expensive.

When Not To Use It

You don't have enough teacher data: You typically need thousands of high-quality (input, teacher-output) pairs at a minimum.

Open-ended creative tasks: Distilled models often lose nuance and creativity.

Frontier reasoning: Math, coding, and complex reasoning still benefit from running the actual frontier model.

Rapidly changing domains: A distilled model is a snapshot. If the domain changes weekly, the distillation lags.

Trade-offs

Quality ceiling: Outside narrow specializations, the student rarely exceeds the teacher. Distillation transfers capability; it doesn't create it.

Brittleness on unfamiliar inputs: Small models generalize less. Out-of-distribution inputs degrade fast.

Bias inheritance: Teacher's biases (and errors, and hallucinations) are baked into the student.

API cost upfront: Distilling from a closed API requires paying for thousands of teacher inferences during data generation.

Compliance risk: Some closed-API ToS forbid using outputs to train competing models. Read the terms.

Common Mistakes

Distilling without evaluation: Without held-out benchmarks, you can't tell if the student matches the teacher.

Tiny student, complex teacher: A 1B student can't capture all of a 175B teacher's behavior. Match scale to ambition.

Skipping data quality: Bad teacher outputs (hallucinated, off-task) become baked-in bad student behavior.

No specialization: Distilling a general model from a general model often produces a worse general model. Distill for a task.

Compliance blind spots: Quietly training on competitors' API outputs is a legal time bomb. Confirm ToS.
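The evaluation mistake has a cheap fix: score teacher and student on the same held-out set and track the retention ratio against the 80–95% target from step 5. A sketch with hypothetical accuracy numbers (real values come from your own benchmark):

```python
def retention(student_score: float, teacher_score: float) -> float:
    """Fraction of teacher quality the student retains on a held-out benchmark."""
    return student_score / teacher_score

# Hypothetical held-out accuracies for illustration only.
teacher_acc = 0.92
student_acc = 0.84
r = retention(student_acc, teacher_acc)
ships = r >= 0.80   # inside the 80-95% retention band from step 5
```

Even this trivial check catches the worst failure mode: shipping a student that silently lost a quarter of the teacher's quality.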
