Test-Time Compute

Test-time compute (also called inference-time compute) is the practice of letting an LLM "think" longer at inference — generating more reasoning tokens, running multiple chains, or sampling many candidates and picking the best — to improve answer quality without retraining the model. Popularized by OpenAI's o1 and DeepSeek-R1 in 2024–2025, it moved reasoning from a training problem to a runtime dial.

Why It Matters

For most of the LLM era, the only way to make a model smarter was to train a bigger one with more data. Test-time compute broke that dependency. OpenAI's o1 showed that the same base model, given 10–30× more tokens to reason before answering, matches or beats much larger non-reasoning models on math, coding, and logic benchmarks. This reframes inference budgets: instead of "use the biggest model you can afford," teams now ask "how much thinking do I want to pay for on this query?" The economics of reasoning shifted — and so did product design, because reasoning quality is now tunable at the request level.

How It Works

Longer chain-of-thought: The model outputs hundreds or thousands of internal reasoning tokens before the visible answer. More thinking → better answers.

Multiple samples (self-consistency): Generate N different answers, pick the one the model reaches most often. Simple and effective on math.

Tree search / beam search: Explore multiple reasoning branches in parallel, prune the bad ones, extend the promising ones.

Process reward models: A second model scores each reasoning step and steers the primary model toward better paths. Used in OpenAI's process supervision.

Verifier-guided search: Generate many candidates, run an external verifier (unit tests, math checker, LLM judge), return the best.

Best-of-N + rerank: Simpler variant. Generate 16–64 candidates, rerank with a reward model, return the top one.
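The two sampling strategies above — self-consistency and best-of-N with a reranker — can be sketched in a few lines. Here `sample_answer` and `reward_score` are hypothetical stand-ins for a model call and a reward-model call; real implementations would batch these requests.

```python
from collections import Counter

def self_consistency(sample_answer, n=16):
    """Sample n answers independently and return the most frequent one.

    sample_answer: zero-arg callable that draws one answer from the model
    (hypothetical stand-in for an actual API call with temperature > 0).
    """
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(sample_answer, reward_score, n=16):
    """Sample n candidates and return the one the reward model scores highest."""
    candidates = [sample_answer() for _ in range(n)]
    return max(candidates, key=reward_score)
```

Self-consistency needs answers that can be compared for equality (numbers, short strings), which is why it works best on math; best-of-N only needs a scoring function, so it generalizes to free-form outputs.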

The Trade-off

Every test-time compute technique buys accuracy with latency and cost:

Latency: A response that takes 500ms without reasoning can take 5–30 seconds with heavy test-time compute.

Cost: Reasoning tokens cost as much as any other output tokens. An o1 answer with 10,000 thinking tokens costs ~30–50× a simple GPT-4o answer.

Diminishing returns: The accuracy-vs-compute curve flattens. Going from 1,000 to 10,000 reasoning tokens helps more than 10,000 to 100,000.

Not always helpful: Simple factual lookups and friendly chitchat don't benefit from reasoning. Forcing o1 on "what's the weather" wastes money.
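The cost multiplier above follows directly from token counts. A back-of-the-envelope check, with a hypothetical flat output price (the document notes thinking tokens are billed like any other output tokens):

```python
PRICE_PER_MTOK = 10.0  # hypothetical output price in $/million tokens,
                       # the same for thinking and answer tokens

def query_cost(tokens, price_per_mtok=PRICE_PER_MTOK):
    """Dollar cost of one response given its billable output tokens."""
    return tokens * price_per_mtok / 1e6

simple = query_cost(300)             # 300-token direct answer
reasoned = query_cost(10_000 + 300)  # same answer plus 10k thinking tokens
ratio = reasoned / simple            # ~34x: thinking tokens dominate the bill
```

At identical per-token prices the multiplier is just the token ratio, which is why reasoning-model quotes land in the 30–50× range for typical thinking-trace lengths.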

When to Use It

Math and formal logic: Test-time compute helps hugely. Reasoning models beat base models by 20–40 points on GSM8K, MATH, AIME.

Code generation with tests: Generate, run tests, iterate. Verifier-guided search shines.

Multi-step planning: Agent decisions, complex instructions, multi-constraint optimization.

High-stakes single queries: Medical, legal, financial — where paying 5 seconds and $0.30 for a correct answer is cheap compared to the cost of a wrong one.
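The code-generation case above reduces to a generate-and-verify loop: keep sampling until a candidate passes the tests or the attempt budget runs out. `generate` and `verify` are hypothetical callables — in practice a model call and a sandboxed test runner.

```python
def generate_and_verify(generate, verify, max_attempts=8):
    """Sample candidate solutions until one passes the verifier.

    generate: zero-arg callable returning a candidate (e.g. a code string)
    verify:   callable returning True if the candidate passes (e.g. unit tests)
    Returns the first passing candidate, or None if the budget is exhausted.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None  # no candidate passed within the attempt budget
```

The verifier is what makes this loop pay off: test-time compute is cheapest to exploit where correctness is mechanically checkable.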

When Not To Use It

Chat UX under 1-second budgets: Latency tanks user experience.

Volume workloads: A 20–50× inflation in output tokens makes any high-volume endpoint uneconomic.

Simple retrieval or summarization: One-shot answers are fine; thinking longer doesn't help.

Open-ended creative writing: More deliberation makes outputs feel stiff.

Reasoning Models vs Regular Models

| Aspect           | Regular (GPT-4o, Claude 3.5) | Reasoning (o1, R1, Claude Opus 4.6 thinking) |
| ---------------- | ---------------------------- | -------------------------------------------- |
| Response speed   | Fast                         | Slow                                         |
| Token cost       | Low                          | High                                         |
| Math / logic     | Decent                       | Excellent                                    |
| Creative writing | Strong                       | Sometimes stilted                            |
| Chat UX          | Ideal                        | Overkill                                     |
| Best use         | Most requests                | Hard queries                                 |

Model routing — sending simple queries to a fast model and hard queries to a reasoning model — is the standard production pattern.
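A minimal router is a difficulty score plus a threshold. The keyword heuristic below is a toy; production routers typically use a small, cheap classifier model. All names here are hypothetical.

```python
def keyword_difficulty(query):
    """Toy heuristic: treat math/code/planning keywords as hard queries."""
    hard = ("prove", "optimize", "debug", "plan", "calculate")
    return 1.0 if any(k in query.lower() for k in hard) else 0.0

def route(query, classify_difficulty, fast_model, reasoning_model, threshold=0.5):
    """Send easy queries to the fast model, hard ones to the reasoning model."""
    if classify_difficulty(query) >= threshold:
        return reasoning_model(query)
    return fast_model(query)
```

The threshold is the cost dial: raising it keeps more traffic on the fast model, lowering it buys accuracy on borderline queries.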

Common Mistakes

Using reasoning models everywhere: Rapidly inflates cost and latency without improving most answers.

No budget limit on thinking tokens: An unbounded reasoning trace can eat thousands of dollars on one query.

Ignoring caching: Reasoning traces are often repetitive. Prompt caching can reduce cost substantially.

Skipping evaluation: Teams assume reasoning = better. For their specific domain, it may not — benchmark before committing.

Confusing thinking tokens with output: Users shouldn't see the reasoning trace unless they ask. It's internal monologue.
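The budget-limit mistake above has a simple fix: enforce a hard cap on thinking tokens. Most reasoning APIs expose a budget parameter for this; the sketch below shows the idea against a hypothetical streaming interface that yields `(kind, token)` pairs.

```python
def answer_with_budget(token_stream, max_thinking_tokens=4096):
    """Consume a token stream, capping hidden reasoning tokens at a hard budget.

    token_stream yields (kind, token) pairs where kind is "thinking" or
    "output" — a hypothetical interface, not a real API.
    Returns the visible answer text and the thinking tokens consumed.
    """
    thinking_used = 0
    output = []
    for kind, token in token_stream:
        if kind == "thinking":
            thinking_used += 1
            if thinking_used > max_thinking_tokens:
                break  # stop paying for reasoning; keep whatever answer exists
        else:
            output.append(token)
    return "".join(output), thinking_used
```

Note the cap also enforces the last point in the list: only `"output"` tokens reach the user; the reasoning trace stays internal.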
