
Instruction Tuning

Instruction tuning is the post-training process of fine-tuning a base LLM on thousands of (instruction, desired response) pairs so it learns to follow natural-language instructions rather than simply continue text. It's the step that turns a raw language model — good at predicting the next word — into an assistant that understands "summarize this," "translate to Korean," or "write a SQL query."

Why It Matters

A base model trained only on next-token prediction is surprisingly hard to use. Ask a raw GPT-3 base model "What is the capital of France?" and it might respond with "What is the capital of Italy? What is the capital of Spain?" — continuing the pattern of similar questions rather than answering. Instruction tuning changed this. Google's FLAN (2021), OpenAI's InstructGPT (2022), and Anthropic's Claude used instruction tuning to create models that actually answer. Every modern chat-oriented LLM — GPT-4, Claude, Gemini, Llama Instruct, Mistral Instruct — has been instruction-tuned. Understanding this step explains why two models with similar base capabilities can feel dramatically different to use.

How It Works

1. Collect instruction data: Humans write (or curate) thousands of instruction-response pairs across diverse tasks — summarization, Q&A, coding, translation, math, creative writing, reasoning.

2. Format consistently: Each example follows a structure like:

### Instruction:
Summarize the following article in 3 bullets.
### Input:
[article text]
### Response:
- point 1
- point 2
- point 3

3. Supervised fine-tuning (SFT): Train the base model with standard next-token prediction loss over these formatted pairs. The model learns that, after seeing "### Instruction: ... ### Response:", it should generate the desired response.

4. Optional multi-task mixing: Collections like FLAN and P3 (the mixture used to train T0) combine many task types so the model generalizes to unseen instructions.

5. Evaluate on held-out instructions: Measure whether the model follows new instructions it never saw during tuning.
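Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal illustration, not the recipe of any particular model: `Example`, `build_prompt`, `toy_tokenize`, and `build_training_pair` are hypothetical names, the whitespace tokenizer stands in for a real subword tokenizer, and masking the prompt tokens out of the loss (so only the response is scored) is an assumption about one common variant of SFT rather than something the steps above require.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption: a sentinel label that the training loop skips when computing loss.
IGNORE_INDEX = -100


@dataclass
class Example:
    instruction: str
    response: str
    input_text: str = ""  # optional context, e.g. the article to summarize


def build_prompt(ex: Example) -> str:
    """Render the instruction (and optional input) in the template from step 2."""
    parts = ["### Instruction:", ex.instruction]
    if ex.input_text:
        parts += ["### Input:", ex.input_text]
    parts.append("### Response:")
    return "\n".join(parts) + "\n"


def toy_tokenize(text: str) -> list[str]:
    """Whitespace tokenizer, a stand-in for a real subword tokenizer."""
    return text.split()


def build_training_pair(ex: Example) -> tuple[list[str], list]:
    """Return (tokens, labels) with prompt positions masked out of the loss."""
    prompt_tokens = toy_tokenize(build_prompt(ex))
    response_tokens = toy_tokenize(ex.response)
    tokens = prompt_tokens + response_tokens
    # Only response positions keep a real label; prompt positions are ignored.
    labels = [IGNORE_INDEX] * len(prompt_tokens) + response_tokens
    return tokens, labels
```

With this layout the model still reads the whole formatted example, but the next-token loss is computed only on the response, which is what teaches it to answer after "### Response:" rather than to reproduce the prompt.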

Instruction Tuning vs Fine-Tuning vs RLHF

| Aspect | Fine-Tuning | Instruction Tuning | RLHF |
| --- | --- | --- | --- |
| Data | Task-specific examples | Diverse (instruction, response) pairs | Human preference comparisons |
| Loss | Next-token prediction | Next-token prediction | Reward model + PPO |
| Goal | Specialize on one task | General instruction-following | Align with human preferences |
| Example | A model fine-tuned only on legal contracts | FLAN, Alpaca, Dolly | ChatGPT, Claude |
| Difficulty | Easy | Medium | Hard |

In practice, modern chat models go through three stages: base pretraining → instruction tuning (SFT) → RLHF (or an alternative such as DPO or Constitutional AI). Instruction tuning is the middle stage: the point where a model becomes usable but isn't yet aligned on preferences like helpfulness, safety, and honesty.

Famous Instruction-Tuned Models

FLAN-T5 (Google, 2022): One of the first openly released instruction-tuned models. Showed that a 3B-parameter model with instruction tuning could beat a 175B-parameter model without it on some benchmarks.

Alpaca (Stanford, 2023): Fine-tuned LLaMA 7B on 52K instruction examples generated by OpenAI's text-davinci-003 (a GPT-3.5-series model). Demonstrated that instruction tuning is cheap and effective even for small models.

Dolly (Databricks, 2023): Fine-tuned on 15K human-written instructions. Suggested that a small, high-quality dataset can stand in for a much larger auto-generated one.

Llama Instruct / Mistral Instruct: Open-weight instruction-tuned versions released alongside their base models.

Tulu and Open-Instruct (AI2): Tulu is a family of research-focused instruction-tuned models emphasizing transparency; Open-Instruct is the open-source codebase used to train them.

Trade-offs

Data quality > quantity: A small set of carefully written examples can rival a much larger auto-generated one. Dolly (15K human-written examples) versus Alpaca (52K generated examples) is a commonly cited comparison.

Narrow vs broad coverage: Covering more task types improves generalization but can hurt performance on any single task.

Format sensitivity: Instruction-tuned models expect a specific prompt format. Using the wrong one degrades performance noticeably.

Hallucination risk: If instruction data contains ungrounded answers, the model learns to fabricate confidently.

Cost: A few hundred to a few thousand dollars of GPU time for small models; much more for frontier-scale.
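The format-sensitivity point above can be made concrete with a small template registry. This is a sketch under stated assumptions: `TEMPLATES` and `render_prompt` are hypothetical names, the "alpaca-style" entry mirrors the template shown under How It Works, and the "bracket-style" entry is an illustrative second format, not the documented format of any specific model. The real format for a given model should come from that model's own documentation.

```python
# Hypothetical registry mapping a format name to its prompt template.
TEMPLATES = {
    # Mirrors the "### Instruction / ### Response" example in this article.
    "alpaca-style": "### Instruction:\n{instruction}\n### Response:\n",
    # Illustrative second format; check the model's docs for the real one.
    "bracket-style": "[INST] {instruction} [/INST]",
}


def render_prompt(template_name: str, instruction: str) -> str:
    """Render an instruction with the named template, failing loudly if unknown."""
    try:
        template = TEMPLATES[template_name]
    except KeyError:
        raise ValueError(
            f"No prompt template registered for {template_name!r}"
        ) from None
    # Only the template is parsed for placeholders, so braces inside the
    # instruction text are passed through unchanged.
    return template.format(instruction=instruction)
```

Raising on an unknown name, instead of silently falling back to a default template, is a deliberate choice here: a wrong format tends to degrade output quietly, so an explicit error is easier to notice.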

Common Mistakes

Confusing it with RLHF: They're different steps. A model can be instruction-tuned without RLHF (and many open models are) but will miss the preference alignment.

Using a raw base model as a chat model: Base models don't follow instructions reliably. Always use the instruction-tuned or chat variant for assistant tasks.

Mixing prompt formats across models: Each instruction-tuned model has its own expected format. Llama's isn't Mistral's isn't OpenAI's.

Training on your own domain and losing general capability: Narrow fine-tuning on top of an instruction-tuned model can erode instruction-following. Parameter-efficient methods such as LoRA can limit the damage, but evaluate broadly either way.

Forgetting evaluation: Without human judgment or LLM-as-a-judge scoring on held-out prompts, there is no direct evidence that instruction tuning actually worked.
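A held-out evaluation of the kind described above might look like the following sketch. It assumes each example is a dict with hypothetical "task" and "instruction" keys, and that `generate` (the model under test) and `judge` (a human rating or an LLM-as-a-judge call) are supplied by the caller; neither is tied to a real library. Holding out whole task types, rather than random examples, is one possible design and stricter than simply holding out individual prompts.

```python
import random
from collections import defaultdict


def split_by_task(examples, held_out_fraction=0.25, seed=0):
    """Hold out whole task types so evaluation covers unseen kinds of instructions."""
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task"]].append(ex)

    tasks = sorted(by_task)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(tasks)
    n_held = max(1, round(len(tasks) * held_out_fraction))
    held_tasks = set(tasks[:n_held])

    train = [ex for t in tasks if t not in held_tasks for ex in by_task[t]]
    held_out = [ex for t in tasks if t in held_tasks for ex in by_task[t]]
    return train, held_out


def pass_rate(examples, generate, judge):
    """Fraction of prompts whose generated output the judge accepts."""
    if not examples:
        return 0.0
    passed = sum(
        bool(judge(ex["instruction"], generate(ex["instruction"])))
        for ex in examples
    )
    return passed / len(examples)
```

Running `pass_rate` on the held-out split before and after tuning gives a simple comparison; a broader set of prompts from outside the tuning domain can be scored the same way to check that general instruction-following has not regressed.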
