LLM-as-a-Judge
LLM-as-a-Judge is an evaluation technique in which one language model scores or compares the outputs of another model (or its own earlier outputs) against a rubric. It replaces expensive human grading for tasks like open-ended QA, summarization, and chatbot responses.
Why It Matters
Evaluating generative output is the hardest part of shipping LLM features. Human review doesn't scale — grading 10,000 responses per week is unaffordable, and inter-rater agreement is often poor. The 2023 paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" showed that GPT-4 as a judge agrees with human experts roughly 85% of the time — about the same rate at which humans agree with each other. That's good enough to substitute for human review in most evaluation loops, unlocking continuous testing at a fraction of the cost.
How It Works
1. Define a rubric: Criteria such as accuracy, completeness, tone, and safety, each with a scale (1–5) or a binary pass/fail.
2. Prompt the judge: Give the judge model the input, the output to evaluate, and the rubric. Ask it to score and explain.
3. Pairwise or pointwise:
- Pointwise: Score a single output on the rubric. Easier but more prone to scale drift.
- Pairwise: Compare two outputs and pick a winner. More reliable because relative judgment is more stable than absolute scoring.
4. Aggregate: Average scores across many examples and track them over time as you iterate.
Where It Works Well
A/B testing prompts: "Does v2 produce better answers than v1?" is a pairwise question LLM judges handle well.
RAG quality monitoring: Check that the answer actually uses the retrieved context and stays factually grounded in it.
Regression testing: Run the judge over a fixed eval set after every prompt change.
Red-teaming: A judge LLM scans for policy violations at scale.
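The regression-testing use case reduces to a small gate. A hedged sketch, assuming a hypothetical `judge` callable that returns 1–5 rubric scores; the tolerance threshold is an illustrative choice, not a standard.

```python
# Run the judge over a fixed eval set after a prompt change and fail
# the check if the mean score drops below the baseline by more than
# an allowed tolerance.
from statistics import mean
from typing import Callable

def regression_check(
    eval_set: list[tuple[str, str]],    # fixed (question, answer) pairs
    judge: Callable[[str, str], int],   # hypothetical: returns a 1-5 score
    baseline: float,                    # mean score of the previous version
    tolerance: float = 0.2,             # allowed drop before flagging
) -> tuple[float, bool]:
    scores = [judge(q, a) for q, a in eval_set]
    avg = mean(scores)
    return avg, avg >= baseline - tolerance
```

Keeping the eval set fixed is what makes the mean comparable across prompt versions; changing the set and the prompt at the same time makes a score drop unattributable.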
Known Biases
Position bias: In pairwise comparisons, judges tend to favor the first response. Mitigate by swapping positions and averaging.
Verbosity bias: Longer responses are rated higher even when not better. Control for length in the rubric.
Self-preference: Models slightly prefer their own outputs. Use a different model as judge when possible.
Scale miscalibration: Judges compress scores toward the middle. Pairwise evaluation sidesteps this.
Prompt sensitivity: Small rubric wording changes flip results. Lock the judge prompt once it's validated.
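The position-bias mitigation above (swap and average) can be sketched directly. `pairwise_judge` is a hypothetical callable that returns "A" or "B" for whichever response it prefers in the order presented; the tie-on-disagreement rule is one common convention, not the only one.

```python
# Judge the pair in both orders and only count a win when the two
# verdicts agree; a verdict that flips with position is scored a tie.
from typing import Callable

def debiased_compare(
    prompt: str,
    resp_1: str,
    resp_2: str,
    pairwise_judge: Callable[[str, str, str], str],  # returns "A" or "B"
) -> str:
    first = pairwise_judge(prompt, resp_1, resp_2)   # resp_1 in slot A
    second = pairwise_judge(prompt, resp_2, resp_1)  # order swapped
    if first == "A" and second == "B":
        return "resp_1"   # preferred in both orders
    if first == "B" and second == "A":
        return "resp_2"
    return "tie"          # preference flipped with position
```

A judge that always picks the first slot produces nothing but ties under this scheme, which is exactly the failure mode being filtered out.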
Best Practices
Use a stronger model than the one being judged when possible.
Validate against human labels on a small seed set before trusting judge scores at scale.
Show the judge the rubric explicitly — don't assume it knows what "good" means.
Ask for reasoning first, then score (chain-of-thought) — judges score more reliably when forced to explain.
Prefer pairwise for high-stakes decisions, pointwise for cheap monitoring.