LLM Observability

LLM observability is the practice of instrumenting production LLM applications so teams can see what the model is doing, debug failures, measure cost and latency, detect quality drift, and evaluate outputs over time. It's the LLM-era equivalent of traditional app observability — logs, traces, and metrics — adapted to probabilistic systems where the same input can produce different outputs.

Why It Matters

A traditional web app either works or throws an error. An LLM app can "work" (return a well-formatted response) while the answer is wrong, off-topic, hallucinated, biased, or just worse than yesterday. Without observability, those failures stay invisible until users complain — by which point trust is already damaged. 2024–2025 saw LLM observability become a distinct category, with tools like Langfuse, LangSmith, Helicone, Arize Phoenix, Weights & Biases Weave, and Braintrust each carving out a slice. For any team running LLMs in production, observability is now table stakes, not a nice-to-have.

What to Instrument

Traces: The full execution path — every prompt, retrieval call, tool invocation, and response in a single request. Lets you replay what the agent actually did.

Input/output pairs: The exact prompt sent and the exact completion received, versioned by prompt template.

Cost per request: Token count × price for input and output, per model. Aggregated by feature, user, or tenant.

Latency: Time to first token, total completion time, and time spent in each sub-step.

Errors and retries: Rate-limit errors, timeouts, tool-call failures, parse errors.

Quality signals: User thumbs up/down, implicit signals (copied output, ran code, sent message), and LLM-as-judge scores on recent outputs.

Drift: Changes over time in output distribution, answer quality, or tool-call rate — often the first signal that a model update or prompt change broke something.
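Cost per request is the most mechanical of these signals to compute: tokens in each direction multiplied by that direction's price for the model that served the request. A minimal sketch in Python, with a hypothetical model name and made-up prices (real prices vary by model and vendor, and are usually quoted per million tokens):

```python
# Hypothetical per-1M-token prices; look up real values for your vendor.
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request: token count x per-token price, per direction."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("example-model", input_tokens=1_200, output_tokens=400)
print(f"${cost:.4f}")  # (1200*3.00 + 400*15.00) / 1e6 -> $0.0096
```

Attaching a feature, user, or tenant ID to each logged cost record is what makes the aggregation step possible later.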

Why It's Different from Traditional Observability

Outputs aren't deterministic: Same input, different output. Metrics have to handle variance as a first-class concept.

Costs are per-token, not per-request: Traditional APM doesn't know what tokens are. LLM observability must.

Quality is subjective: You can't assert "output correct" with a simple test. Evaluation needs human review, LLM judges, or ground-truth comparisons.

Prompts are code: A prompt change is a deploy. Without prompt versioning, you can't tell which version produced yesterday's bug.

Multi-step chains matter: Most LLM apps are pipelines. You need nested traces that mirror the call graph, not flat logs.
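The nested-trace point can be made concrete with a toy span tree (names like "vector_search" are illustrative, not from any particular tool): each sub-step becomes a child span of the request's root span, so the trace mirrors the call graph rather than producing interleaved flat log lines.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    """A toy trace span; real tools (or OpenTelemetry) add IDs, attributes, etc."""
    name: str
    parent: "Span | None" = None
    children: list = field(default_factory=list)
    start: float = field(default_factory=time.monotonic)
    end: "float | None" = None

    def child(self, name: str) -> "Span":
        s = Span(name, parent=self)
        self.children.append(s)
        return s

    def finish(self) -> None:
        self.end = time.monotonic()

# One RAG-style request: a root span with nested steps under it.
root = Span("chat_request")
retrieval = root.child("vector_search"); retrieval.finish()
generation = root.child("llm_completion"); generation.finish()
root.finish()

print([c.name for c in root.children])  # ['vector_search', 'llm_completion']
```

Replaying "what the agent actually did" is then just walking this tree from the root.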

The Tooling Landscape (2026)

Langfuse (open source): Trace-first observability with eval, prompt management, and user feedback. Popular with self-hosting teams.

LangSmith (LangChain): Tightly integrated with LangChain. Strong for teams already on that stack.

Helicone: Lightweight proxy-based observability. One-line integration, easy to adopt.

Arize Phoenix / Arize AX: Comes from the ML observability world; strong on drift, embeddings, and eval science.

Braintrust: Eval-first platform, useful for teams that want to treat LLM development as an experimentation workflow.

Weave (Weights & Biases): Extends WandB's ML experiment tracking into LLM territory.

Datadog / New Relic LLM monitoring: Classic APM vendors adding LLM-specific dashboards.

OpenTelemetry GenAI semantic conventions: A cross-vendor standard for LLM tracing, gaining adoption in 2025–2026.
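To give a flavor of what the OpenTelemetry GenAI conventions standardize: they define shared span attribute names for model, provider, and token usage. The sketch below uses attribute keys from the (still-evolving) convention; the values are illustrative placeholders, and the exact set of names should be checked against the current spec before relying on it.

```python
# Attribute keys follow the OpenTelemetry GenAI semantic conventions
# (draft/stabilizing; verify names against the current spec).
# Values here are illustrative placeholders.
span_attributes = {
    "gen_ai.system": "openai",                    # provider identifier
    "gen_ai.request.model": "example-model",      # model the app asked for
    "gen_ai.response.model": "example-model-v2",  # model the API actually served
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 400,
}

# Instrumenting once with these keys means any convention-aware
# backend can ingest the same spans.
print(sorted(k for k in span_attributes if k.startswith("gen_ai.usage")))
```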

What to Watch

Cost per user session: Sudden spikes often mean a bug (retry loop, runaway agent) before they mean growth.

Latency p95/p99: Long tails kill UX. Worst-case matters more than average.

Eval score drift: A weekly LLM-as-judge score on representative prompts catches silent regressions after prompt or model changes.

Top failure modes: Categorize errors — refused, hallucinated, off-topic, bad-format — so you know where to invest.

Prompt version performance: Compare eval scores across prompt versions to know if the latest change helped or hurt.

Token distribution: Long responses drive cost. Unexpectedly long tails often indicate prompt drift or broken stop tokens.
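The p95/p99 point is easy to see numerically: a handful of slow outliers barely move the mean but dominate the tail. A small nearest-rank percentile helper (no external dependencies; sample latencies are made up):

```python
import math

def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile over a list of samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

# Mostly-fast traffic with two slow outliers (illustrative values, ms).
latencies_ms = [120, 130, 140, 150, 160, 900, 170, 180, 190, 4000]

print(percentile(latencies_ms, 50))  # 160  -- median looks healthy
print(percentile(latencies_ms, 95))  # 4000 -- the tail tells the real story
```

The mean here is 614 ms, which describes no actual user's experience; that is why the tail percentiles, not the average, belong on the dashboard.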

Common Mistakes

Only logging errors: LLMs fail silently. Log successes too, with enough metadata to evaluate quality.

No sampling strategy: Logging 100% of requests at scale is expensive. Sample intelligently by user segment, cost tier, or recent change.

Not connecting traces to user feedback: Thumbs-down needs to link back to the exact trace that produced the output.

Siloed by team: Product, ML, and infra each build their own dashboards. Unified observability is the win.

Ignoring regression testing: "It looks fine" isn't enough. Build a regression eval set and run it before every prompt change.

Chasing vendor lock-in: OpenTelemetry GenAI conventions let you instrument once and swap observability vendors later.
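One common shape for an intelligent sampling strategy is deterministic hash-based sampling: hash the trace ID into [0, 1) and compare against a rate, so the same trace always gets the same decision, while specific cohorts (say, a canary group after a prompt change) are kept at 100%. A minimal sketch, with the cohort logic as an assumption rather than any tool's built-in feature:

```python
import hashlib

def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: hash the trace ID into [0, 1) and compare
    against the rate, so a given trace is consistently kept or dropped."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return (h / 0x1_0000_0000) < rate

def sample_request(trace_id: str, is_canary: bool) -> bool:
    # Keep everything from the canary cohort; sample ~10% of the rest.
    return True if is_canary else should_sample(trace_id, 0.10)
```

Hashing (rather than random sampling) also means a user's whole session can be kept or dropped together by hashing the session ID instead of the trace ID.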
