Test-Time Compute
Test-time compute (also called inference-time compute) is the practice of letting an LLM "think" longer at inference — generating more reasoning tokens, running multiple chains, or sampling many candidates and picking the best — to improve answer quality without retraining the model. Popularized by OpenAI's o1 and DeepSeek-R1 in 2024–2025, it moved reasoning from a training problem to a runtime dial.
Why It Matters
For most of the LLM era, the only way to make a model smarter was to train a bigger one with more data. Test-time compute broke that dependency. OpenAI's o1 showed that the same base model, given 10–30× more tokens to reason before answering, matches or beats much larger non-reasoning models on math, coding, and logic benchmarks. This reframes inference budgets: instead of "use the biggest model you can afford," teams now ask "how much thinking do I want to pay for on this query?" The economics of reasoning shifted — and so did product design, because reasoning quality is now tunable at the request level.
How It Works
Longer chain-of-thought: The model outputs hundreds or thousands of internal reasoning tokens before the visible answer. More thinking → better answers.
Multiple samples (self-consistency): Generate N different answers, pick the one the model reaches most often. Simple and effective on math.
Tree search / beam search: Explore multiple reasoning branches in parallel, prune the bad ones, extend the promising ones.
Process reward models: A second model scores each reasoning step and steers the primary model toward better paths. Used in OpenAI's process supervision.
Verifier-guided search: Generate many candidates, run an external verifier (unit tests, math checker, LLM judge), return the best.
Best-of-N + rerank: Simpler variant. Generate 16–64 candidates, rerank with a reward model, return the top one.
The Trade-off
Every test-time compute technique buys accuracy with latency and cost:
Latency: A response that takes 500ms without reasoning can take 5–30 seconds with heavy test-time compute.
Cost: Reasoning tokens cost as much as any other output tokens. An o1 answer with 10,000 thinking tokens costs ~30–50× a simple GPT-4o answer.
Diminishing returns: The accuracy-vs-compute curve flattens. Going from 1,000 to 10,000 reasoning tokens helps more than 10,000 to 100,000.
Not always helpful: Simple factual lookups and friendly chitchat don't benefit from reasoning. Forcing o1 on "what's the weather" wastes money.
When to Use It
Math and formal logic: Test-time compute helps hugely. Reasoning models beat base models by 20–40 points on GSM8K, MATH, AIME.
Code generation with tests: Generate, run tests, iterate. Verifier-guided search shines.
Multi-step planning: Agent decisions, complex instructions, multi-constraint optimization.
High-stakes single queries: Medical, legal, financial — where paying 5 seconds and $0.30 for a correct answer is cheap compared to the cost of a wrong one.
When Not To Use It
Chat UX under 1-second budgets: Latency tanks user experience.
Volume workloads: A 20–50× inflation in token usage makes high-volume endpoints uneconomic.
Simple retrieval or summarization: One-shot answers are fine; thinking longer doesn't help.
Open-ended creative writing: More deliberation makes outputs feel stiff.
Reasoning Models vs Regular Models
| Aspect | Regular (GPT-4o, Claude 3.5) | Reasoning (o1, R1, Claude Opus 4.6 thinking) |
|---|---|---|
| Response speed | Fast | Slow |
| Token cost | Low | High |
| Math / logic | Decent | Excellent |
| Creative writing | Strong | Sometimes stilted |
| Chat UX | Ideal | Overkill |
| Best use | Most requests | Hard queries |
Model routing — sending simple queries to a fast model and hard queries to a reasoning model — is the standard production pattern.
Common Mistakes
Using reasoning models everywhere: Rapidly inflates cost and latency without improving most answers.
No budget limit on thinking tokens: An unbounded reasoning trace can eat thousands of dollars on one query.
Ignoring caching: Reasoning traces are often repetitive. Prompt caching can reduce cost substantially.
Skipping evaluation: Teams assume reasoning = better. For their specific domain, it may not — benchmark before committing.
Confusing thinking tokens with output: Users shouldn't see the reasoning trace unless they ask. It's internal monologue.