Test-Time Compute
Test-time compute (also called inference-time compute) is the practice of letting an LLM "think" longer at inference — generating more reasoning tokens, running multiple chains, or sampling many candidates and picking the best — to improve answer quality without retraining the model. Popularized by OpenAI's o1 and DeepSeek-R1 in 2024–2025, it moved reasoning from a training problem to a runtime dial.
Why It Matters
For most of the LLM era, the only way to make a model smarter was to train a bigger one with more data. Test-time compute broke that dependency. OpenAI's o1 showed that the same base model, given 10–30× more tokens to reason before answering, matches or beats much larger non-reasoning models on math, coding, and logic benchmarks. This reframes inference budgets: instead of "use the biggest model you can afford," teams now ask "how much thinking do I want to pay for on this query?" The economics of reasoning shifted — and so did product design, because reasoning quality is now tunable at the request level.
How It Works
Longer chain-of-thought: The model outputs hundreds or thousands of internal reasoning tokens before the visible answer. More thinking → better answers.
Multiple samples (self-consistency): Generate N different answers, pick the one the model reaches most often. Simple and effective on math.
Tree search / beam search: Explore multiple reasoning branches in parallel, prune the bad ones, extend the promising ones.
Process reward models: A second model scores each reasoning step and steers the primary model toward better paths. Used in OpenAI's process supervision.
Verifier-guided search: Generate many candidates, run an external verifier (unit tests, math checker, LLM judge), return the best.
Best-of-N + rerank: Simpler variant. Generate 16–64 candidates, rerank with a reward model, return the top one.
The Trade-off
Every test-time compute technique buys accuracy with latency and cost:
Latency: A response that takes 500ms without reasoning can take 5–30 seconds with heavy test-time compute.
Cost: Reasoning tokens cost as much as any other output tokens. An o1 answer with 10,000 thinking tokens costs ~30–50× a simple GPT-4o answer.
Diminishing returns: The accuracy-vs-compute curve flattens. Going from 1,000 to 10,000 reasoning tokens helps more than 10,000 to 100,000.
Not always helpful: Simple factual lookups and friendly chitchat don't benefit from reasoning. Forcing o1 on "what's the weather" wastes money.
When to Use It
Math and formal logic: Test-time compute helps hugely. Reasoning models beat base models by 20–40 points on GSM8K, MATH, AIME.
Code generation with tests: Generate, run tests, iterate. Verifier-guided search shines.
Multi-step planning: Agent decisions, complex instructions, multi-constraint optimization.
High-stakes single queries: Medical, legal, financial — where paying 5 seconds and $0.30 for a correct answer is cheap compared to the cost of a wrong one.
When Not To Use It
Chat UX under 1-second budgets: Latency tanks user experience.
Volume workloads: A 20–50× inflation in token usage makes high-volume endpoints uneconomic.
Simple retrieval or summarization: One-shot answers are fine; thinking longer doesn't help.
Open-ended creative writing: More deliberation makes outputs feel stiff.
Reasoning Models vs Regular Models
| Aspect | Regular (GPT-4o, Claude 3.5) | Reasoning (o1, R1, Claude Opus 4.6 thinking) |
|---|---|---|
| Response speed | Fast | Slow |
| Token cost | Low | High |
| Math / logic | Decent | Excellent |
| Creative writing | Strong | Sometimes stilted |
| Chat UX | Ideal | Overkill |
| Best use | Most requests | Hard queries |
Model routing — sending simple queries to a fast model and hard queries to a reasoning model — is the standard production pattern.
Common Mistakes
Using reasoning models everywhere: Rapidly inflates cost and latency without improving most answers.
No budget limit on thinking tokens: An unbounded reasoning trace can eat thousands of dollars on one query.
Ignoring caching: Reasoning traces are often repetitive. Prompt caching can reduce cost substantially.
Skipping evaluation: Teams assume reasoning = better. For their specific domain, it may not — benchmark before committing.
Confusing thinking tokens with output: Users shouldn't see the reasoning trace unless they ask. It's internal monologue.