Fine-Tuning
Fine-tuning is the technique of further training a pre-trained LLM on domain- or task-specific data to shape its style, knowledge, and behavior. It's how you turn a general-purpose model into a brand-specific or industry-specific "custom GPT."
Why It Matters
Prompt engineering has limits: it repeats the same instructions on every request, consumes context window, and can't fully lock in a consistent style. Fine-tuning updates the model's weights, so the learned behavior is baked in without explicit instructions. OpenAI has reported accuracy gains on the order of 20–30% for fine-tuned GPT-4o on specialized tasks compared to prompting alone.
Types of Fine-Tuning
Full fine-tuning: Updates every parameter. Highest performance, but the most expensive in compute and storage.
LoRA (Low-Rank Adaptation): Keeps original weights frozen and trains small adapter layers. About 1/100 the training cost, and you can swap LoRA adapters as needed. The most widely used approach in 2026.
PEFT (Parameter-Efficient Fine-Tuning): Umbrella term for LoRA, Adapters, Prefix-Tuning, and similar methods — training only a small subset of parameters.
RLHF / DPO: Tunes response quality using human feedback or preference comparisons. The core alignment technique behind ChatGPT and Claude.
SFT (Supervised Fine-Tuning): The most basic form — training on labeled input-output pairs. Effective for teaching specific formats or tones.
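The parameter savings behind LoRA can be made concrete with a minimal sketch: the pre-trained weight `W` stays frozen, and only a low-rank pair `A`, `B` is trained, with the adapted layer computing `x @ W + (x @ A @ B) * (alpha / rank)`. The dimensions and scaling value below are illustrative, not from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 768, 768, 8

W = rng.normal(size=(d_in, d_out))        # pre-trained weight, frozen
A = rng.normal(size=(d_in, rank)) * 0.01  # trainable down-projection
B = np.zeros((rank, d_out))               # trainable up-projection (zero-init,
                                          # so the adapter starts as a no-op)
alpha = 16                                # LoRA scaling hyperparameter

def lora_forward(x):
    # Base path plus scaled low-rank path.
    return x @ W + (x @ A @ B) * (alpha / rank)

x = rng.normal(size=(1, d_in))
y = lora_forward(x)

full_params = W.size                      # what full fine-tuning would train
lora_params = A.size + B.size             # what LoRA trains
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.2%}")
```

At rank 8 the adapter is roughly 2% of the layer's parameters here; the exact ratio depends on layer width and chosen rank. Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen base layer, which is why adapters can be swapped in and out safely.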
Fine-Tuning vs Prompting vs RAG
These approaches are complementary, not competitive.
| Goal | Best approach |
|---|---|
| Consistent style/tone | Fine-tuning |
| Format or language adherence | Fine-tuning or prompting |
| Real-time fresh info | RAG |
| Company internal docs | RAG |
| Deep domain knowledge (medical, legal) | Fine-tuning + RAG |
| One-off or changing tasks | Prompting |
Rule of thumb: If prompting solves it, fine-tuning is overkill. Reach for fine-tuning only when you're repeating the same instructions constantly or can't get consistent tone.
Practical Tips
Data quality is everything: 1,000 high-quality examples beat 10,000 noisy ones. Label consistency and example diversity determine final performance.
Minimum data size: OpenAI recommends at least 50–100 examples; 500–1,000 is typical in practice. LoRA works with less.
Hold out a validation set: Reserve 10–20% of data to detect overfitting.
Start from the smallest capable base model: A well-tuned small model often beats a prompted large model in both speed and cost.
Define evaluation metrics first: Decide how you'll measure accuracy, style consistency, and factuality before training so you can track improvement.