Fine-Tuning
Fine-tuning is the technique of further training a pre-trained LLM on domain- or task-specific data to shape its style, knowledge, and behavior. It's how you turn a general-purpose model into a brand-specific or industry-specific "custom GPT."
Why It Matters
Prompt engineering has limits: it repeats the same instructions on every request, consumes context window, and can't fully lock in a consistent style. Fine-tuning updates the model's weights, so the learned behavior is baked in without explicit instructions. OpenAI has reported accuracy gains on the order of 20–30% for fine-tuned GPT-4o on specialized tasks compared to prompting alone.
Types of Fine-Tuning
Full fine-tuning: Updates every parameter. Highest performance, but the most expensive in compute and storage.
LoRA (Low-Rank Adaptation): Keeps original weights frozen and trains small adapter layers. About 1/100 the training cost, and you can swap LoRA adapters as needed. The most widely used approach in 2026.
PEFT (Parameter-Efficient Fine-Tuning): Umbrella term for LoRA, Adapters, Prefix-Tuning, and similar methods — training only a small subset of parameters.
RLHF / DPO: Tunes response quality using human feedback or preference comparisons. The core alignment technique behind ChatGPT and Claude.
SFT (Supervised Fine-Tuning): The most basic form — training on labeled input-output pairs. Effective for teaching specific formats or tones.
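The parameter savings behind LoRA can be made concrete with a minimal sketch: the pre-trained weight `W` stays frozen, and only a low-rank pair `A`, `B` is trained, with the adapted layer computing `x @ W + (x @ A @ B) * (alpha / rank)`. The dimensions and scaling value below are illustrative, not from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 768, 768, 8

W = rng.normal(size=(d_in, d_out))        # pre-trained weight, frozen
A = rng.normal(size=(d_in, rank)) * 0.01  # trainable down-projection
B = np.zeros((rank, d_out))               # trainable up-projection (zero-init,
                                          # so the adapter starts as a no-op)
alpha = 16                                # LoRA scaling hyperparameter

def lora_forward(x):
    # Base path plus scaled low-rank path.
    return x @ W + (x @ A @ B) * (alpha / rank)

x = rng.normal(size=(1, d_in))
y = lora_forward(x)

full_params = W.size                      # what full fine-tuning would train
lora_params = A.size + B.size             # what LoRA trains
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.2%}")
```

At rank 8 the adapter is roughly 2% of the layer's parameters here; the exact ratio depends on layer width and chosen rank. Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen base layer, which is why adapters can be swapped in and out safely.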
Fine-Tuning vs Prompting vs RAG
These approaches are complementary, not competitive.
| Goal | Best approach |
|---|---|
| Consistent style/tone | Fine-tuning |
| Format or language adherence | Fine-tuning or prompting |
| Real-time fresh info | RAG |
| Company internal docs | RAG |
| Deep domain knowledge (medical, legal) | Fine-tuning + RAG |
| One-off or changing tasks | Prompting |
Rule of thumb: If prompting solves it, fine-tuning is overkill. Reach for fine-tuning only when you're repeating the same instructions constantly or can't get consistent tone.
Practical Tips
Data quality is everything: 1,000 high-quality examples beat 10,000 noisy ones. Label consistency and example diversity determine final performance.
Minimum data size: OpenAI recommends at least 50–100 examples; 500–1,000 is typical in practice. LoRA works with less.
Hold out a validation set: Reserve 10–20% of data to detect overfitting.
Start from the smallest capable base model: A well-tuned small model often beats a prompted large model in both speed and cost.
Define evaluation metrics first: Decide how you'll measure accuracy, style consistency, and factuality before training so you can track improvement.