Speculative Decoding

Speculative decoding is an inference optimization in which a small, fast "draft" model predicts several tokens ahead, and the large target model then verifies them in a single parallel forward pass — accepting the ones that match what it would have generated and rejecting the rest. The user gets the exact same output as plain decoding, but 2–4× faster.

Why It Matters

LLM generation is bound by sequential dependencies: each token has to wait for the previous one. Big models are bottlenecked not by raw compute but by memory bandwidth, since their full weights must be streamed from GPU memory to the compute units once per token. Speculative decoding breaks the sequential chain by batching multiple speculative tokens into one forward pass, dramatically reducing the number of expensive big-model calls. Google researchers first published the technique in 2022; by 2024–2025 every major inference stack (vLLM, TensorRT-LLM, llama.cpp, Together AI) shipped speculative decoding as a standard optimization, cutting serving costs by 30–70% at the same output quality.

How It Works

1. Draft model proposes: A small, cheap model (say, a 1B-parameter twin of the 70B target) generates the next k tokens autoregressively. This is fast because the draft model's weights are small.

2. Target model verifies: The target model runs one forward pass over the k draft tokens in parallel, computing what it would have generated at each position.

3. Accept or reject: Starting from the first draft token, the target model accepts it if its own top choice (or a probability-matched sample) agrees, and keeps going until it disagrees.

4. Correct and continue: At the first disagreement, the target's token replaces the draft's. The process restarts from there.

5. Net effect: If the draft's tokens are accepted ~70% of the time, the target emits roughly 3× as many tokens per forward pass, cutting per-token latency almost proportionally (minus the draft's own cost).
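The loop above can be sketched in a few lines of Python. This is a toy with hypothetical stand-in predictors (`draft_next`, `target_next`) and greedy decoding only; in a real engine, step 2 is a single batched forward pass of the target, not a Python loop.

```python
def speculative_decode(prompt, draft_next, target_next, n_tokens, k=4):
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Target verifies: its own prediction at each of the k draft
        #    positions, plus one bonus position at the end. Shown as a
        #    loop for clarity; a real engine gets all k+1 predictions
        #    from ONE parallel forward pass.
        target = [target_next(seq + draft[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix where draft and target agree.
        n_ok = 0
        while n_ok < k and draft[n_ok] == target[n_ok]:
            n_ok += 1
        # 4. Append the accepted tokens plus the target's own token at
        #    the first mismatch (or the bonus token if all matched), so
        #    every cycle makes at least one token of progress.
        seq += draft[:n_ok] + [target[n_ok]]
    return seq[:len(prompt) + n_tokens]

# Toy stand-ins: the "target" counts upward; the "draft" agrees except
# after multiples of 5, where it skips a number.
target = lambda s: s[-1] + 1
draft = lambda s: s[-1] + 2 if s[-1] % 5 == 0 else s[-1] + 1

out = speculative_decode([0], draft, target, 10)
print(out)  # identical to plain greedy target decoding: [0, 1, ..., 10]
```

Note that the output is exactly what the target alone would have produced; the draft only changes how many target forward passes were needed to get there.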

Why It's Lossless

Done correctly, speculative decoding produces the exact same output distribution as plain decoding. The guarantee comes from the acceptance test: a draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target's distribution and q the draft's, and a rejected token is resampled from the normalized residual distribution max(p − q, 0). This rejection-sampling scheme provably yields samples from p itself, so the final sequence is distributed exactly as if the target had generated every token alone. (Under greedy decoding, the test degenerates to an exact match against the target's top token.) There is no quality trade-off, only a speed gain.
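The losslessness claim can be checked numerically. The sketch below implements one step of the standard rejection-sampling scheme over a made-up three-token vocabulary; the distributions p and q are illustrative assumptions, deliberately mismatched.

```python
import random

def spec_sample(p, q, rng):
    """One speculative sampling step: draw x ~ q (the draft), accept with
    probability min(1, p[x]/q[x]); on rejection, resample from the
    normalized residual distribution max(p - q, 0)."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    resid = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(resid)
    return rng.choices(range(len(p)), weights=[r / z for r in resid])[0]

p = [0.6, 0.3, 0.1]   # target distribution
q = [0.2, 0.5, 0.3]   # mismatched draft distribution
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(100_000):
    counts[spec_sample(p, q, rng)] += 1
# Empirical frequencies track p, not q, even though every sample
# started from a draw out of q.
print([round(c / 100_000, 2) for c in counts])  # close to [0.6, 0.3, 0.1]
```

The draft's mismatch only lowers the acceptance rate (and thus the speedup); it never shifts the output distribution away from the target's.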

Variants

Vanilla speculative decoding (Google 2022): One draft model, one target. The original formulation.

Medusa: Adds multiple "heads" to the target model itself that predict several tokens ahead, eliminating the need for a separate draft model. Simpler deployment.

EAGLE: A more accurate variant that uses the target model's own internal representations to draft, achieving higher acceptance rates than external drafts.

Tree speculative decoding: Draft multiple candidate token trees in parallel. Higher acceptance probability, more verification compute.

Self-speculative: Skip layers of the target model to form a cheap "draft" from the same weights.

When It Helps Most

Batch-of-one inference: Single-user interactive chat is memory-bound. Speculative decoding shines here.

Long outputs: The more tokens the model generates, the more the cumulative savings add up.

Repetitive structure: When output follows predictable patterns (code, JSON), draft acceptance rates are very high.

Cold hardware utilization: On GPUs that would otherwise idle while waiting for memory, speculation fills the compute gap.

When It Helps Less

Large batch serving: High-throughput workloads are already compute-bound, not memory-bound. Speculation adds overhead without saving much.

Very creative / random outputs: Low draft acceptance rates limit the speedup.

Tiny models: A 1B draft over a 3B target doesn't save much because the target is already cheap.

Short prompts with short answers: Overhead of setting up speculation dominates the gain.

Trade-offs

Extra model in memory: You now serve both the target and the draft, so the memory footprint grows unless you use a self-speculative variant.

Implementation complexity: Managing the verification loop, rejection sampling, and KV-cache rollback is non-trivial. Use a library.

Acceptance rate sensitivity: A poorly matched draft can actually slow things down if rejections dominate.

Cold start: The first tokens see little benefit, since the prompt must be prefilled through both models before speculation can begin.
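The acceptance-rate sensitivity can be made concrete with a back-of-envelope model: expected tokens per verification cycle is a geometric series in the acceptance rate, while cost grows linearly with draft length. The formula follows the standard analysis of vanilla speculative decoding; the constants plugged in below are illustrative assumptions, not measurements.

```python
def expected_speedup(alpha, k, c):
    """alpha: per-token draft acceptance rate (0..1),
    k: number of draft tokens per cycle,
    c: cost of one draft step relative to one target forward pass."""
    # Expected tokens emitted per cycle, including the one guaranteed
    # token the target contributes itself: (1 - alpha^(k+1)) / (1 - alpha)
    tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost = k * c + 1          # k draft steps + 1 target verification pass
    return tokens / cost

print(expected_speedup(0.7, 4, 0.05))  # well-matched, cheap draft: ~2.3x
print(expected_speedup(0.2, 4, 0.30))  # poorly matched, pricey draft: <1x, a net slowdown
```

The second case is the failure mode named above: when rejections dominate and the draft is not cheap enough, speculation costs more than it saves.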

Common Mistakes

Using a draft model from a different family: A Llama draft under a Mistral target rarely gets its tokens accepted. The draft must share the target's tokenizer and be distributionally aligned with it.

Too large a draft: A 7B draft under a 70B target has a great acceptance rate but is too expensive per proposed token. The draft should be an order of magnitude or more smaller than the target, typically a few percent of its parameter count.

Ignoring KV cache rollback: Rejected tokens must roll back the target's KV cache. Forgetting this corrupts state.

Applying it to already-fast models: Haiku/Flash-tier models are memory-light. Speculation saves less.

Not measuring end-to-end: Benchmark the whole request path. Naive token-per-second gains sometimes disappear under load or when network latency dominates.
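The rollback mistake can be illustrated with a minimal sketch, using a hypothetical `KVCache` that stores one entry per token position (a real engine truncates per-layer key/value tensors instead):

```python
class KVCache:
    """Toy stand-in for a per-position key/value cache."""
    def __init__(self):
        self.entries = []          # one (key, value) pair per token position

    def append(self, kv):
        self.entries.append(kv)

    def rollback(self, n_keep):
        # Drop cached state for rejected draft tokens. Skipping this step
        # leaves stale entries that corrupt every later attention read.
        del self.entries[n_keep:]

cache = KVCache()
for pos in range(8):               # e.g. 4 committed tokens + 4 speculative
    cache.append((f"k{pos}", f"v{pos}"))
cache.rollback(5)                  # only 5 positions survived verification
print(len(cache.entries))          # -> 5
```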