LLM-as-a-Judge

LLM-as-a-Judge is an evaluation technique in which one language model scores or compares the outputs of another model (or its own earlier outputs) against a rubric. It replaces expensive human grading for tasks like open-ended QA, summarization, and chatbot responses.

Why It Matters

Evaluating generative output is the hardest part of shipping LLM features. Human review doesn't scale — grading 10,000 responses per week is unaffordable, and inter-rater agreement is often poor. The 2023 paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" showed that GPT-4 as a judge agrees with human experts about 85% of the time — roughly the same rate at which humans agree with each other. That's good enough to replace humans for most evaluation loops, unlocking continuous testing at a fraction of the cost.

How It Works

1. Define a rubric: Criteria such as accuracy, completeness, tone, and safety, each scored on a scale (1–5) or as a binary pass/fail.

2. Prompt the judge: Give the judge model the input, the output to evaluate, and the rubric. Ask it to score and explain.

3. Pairwise or pointwise:

  • Pointwise: Score a single output on the rubric. Easier but more prone to scale drift.
  • Pairwise: Compare two outputs and pick a winner. More reliable because relative judgment is more stable than absolute scoring.

4. Aggregate: Average scores across many examples, track over time as you iterate.
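The steps above can be sketched as a minimal pointwise evaluation loop. Everything here is a hypothetical illustration: `RUBRIC` is a toy rubric, and `judge` stands in for any callable that wraps a real LLM API call and returns the judge model's text reply.

```python
import re
from statistics import mean

# Toy rubric for illustration; a real rubric should be validated first.
RUBRIC = (
    "Rate the answer from 1 (poor) to 5 (excellent) for accuracy, "
    "completeness, and tone. Explain your reasoning briefly, then end "
    "with a line of the form 'Score: N'."
)

def build_judge_prompt(question, answer):
    """Step 2: give the judge the input, the output to evaluate, and the rubric."""
    return f"{RUBRIC}\n\nQuestion:\n{question}\n\nAnswer to evaluate:\n{answer}"

def parse_score(judge_reply):
    """Extract the 1-5 score from the judge's free-text reply."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(match.group(1)) if match else None

def evaluate(examples, judge):
    """Steps 3-4 (pointwise): score each (question, answer) pair and average."""
    scores = [parse_score(judge(build_judge_prompt(q, a))) for q, a in examples]
    return mean(s for s in scores if s is not None)
```

Note the prompt asks for reasoning before the score, and the parser tolerates replies where the judge omits a parsable score rather than crashing the whole run.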

Where It Works Well

A/B testing prompts: "Does v2 produce better answers than v1?" is a pairwise question LLM judges handle well.

RAG quality monitoring: Check that retrieved context is actually used and factually grounded.

Regression testing: Run the judge over a fixed eval set after every prompt change.

Red-teaming: A judge LLM scans for policy violations at scale.

Known Biases

Position bias: In pairwise comparisons, judges tend to favor the first response. Mitigate by swapping positions and averaging.
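The swap-and-average mitigation can be sketched as follows. This is an assumed interface, not a standard API: `judge` is any callable that sees two answers and returns "A" (first shown wins) or "B" (second shown wins).

```python
def debiased_winner(judge, answer_a, answer_b):
    """Run the pairwise comparison twice with positions swapped; only accept
    a verdict that survives the swap, otherwise report a tie."""
    verdict_1 = judge(answer_a, answer_b)
    verdict_2 = judge(answer_b, answer_a)          # positions swapped
    verdict_2 = {"A": "B", "B": "A"}[verdict_2]    # map back to original labels
    return verdict_1 if verdict_1 == verdict_2 else "tie"
```

A maximally position-biased judge (one that always picks the first response) produces contradictory verdicts across the swap and is reported as a tie, while a judge with a genuine preference keeps its verdict.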

Verbosity bias: Longer responses are rated higher even when not better. Control for length in the rubric.

Self-preference: Models slightly prefer their own outputs. Use a different model as judge when possible.

Scale miscalibration: Judges compress scores toward the middle. Pairwise evaluation sidesteps this.

Prompt sensitivity: Small rubric wording changes flip results. Lock the judge prompt once it's validated.

Best Practices

Use a stronger model than the one being judged when possible.

Validate against human labels on a small seed set before trusting judge scores at scale.
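The validation step amounts to measuring label agreement on the seed set; a minimal sketch, where `judge_labels` and `human_labels` are assumed to be parallel lists of verdicts for the same examples:

```python
def judge_human_agreement(judge_labels, human_labels):
    """Fraction of seed-set examples where the judge's verdict matches the
    human label. (For reference, MT-Bench reports ~85% for GPT-4 vs. experts.)"""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

If agreement on the seed set falls well below the human inter-rater rate, fix the rubric or the judge prompt before trusting judge scores at scale.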

Show the judge the rubric explicitly — don't assume it knows what "good" means.

Ask for reasoning first, then score (chain-of-thought) — judges score more reliably when forced to explain.

Prefer pairwise for high-stakes decisions, pointwise for cheap monitoring.
