LLM Observability: How to Monitor AI Applications
Traditional monitoring doesn't cover LLM applications. Here's what to log, how to trace multi-step chains, and how to detect quality regressions before users do.
Traditional application monitoring tracks latency, error rates, and uptime. That’s necessary but insufficient for LLM applications. A model can return a 200 status code, respond in 400ms, and still produce a completely wrong answer. The response was fast, the API call succeeded, and the output is garbage. Your existing monitoring wouldn’t flag it.
LLM observability extends traditional monitoring to cover the unique failure modes of generative AI: hallucinations, quality regressions, prompt drift, context retrieval failures, and cost overruns. If you’re running LLMs in production, you need all of these.
What to Log
Every LLM request should produce a structured log entry with at minimum:
| Field | Why it matters |
|---|---|
| Timestamp | Correlation and debugging |
| Model and version | Regression tracking when models update |
| Full prompt (system + user + context) | Reproducing issues, evaluating quality |
| Full completion | Quality analysis, compliance |
| Input tokens / output tokens | Cost tracking, anomaly detection |
| Latency (TTFT and total) | Performance monitoring |
| Temperature and other params | Reproducibility |
| Request ID | Tracing through multi-step chains |
| User/session ID | Per-user debugging |
Logging full prompts and completions raises storage and privacy concerns. For sensitive applications, log hashes or truncated versions in the main pipeline and store full content in a separate, access-controlled store with retention policies.
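The split described above can be sketched in a few lines. This is a minimal illustration, not a prescribed schema — the field names (`prompt_sha256`, `prompt_preview`) are hypothetical:

```python
import hashlib

def redact_for_main_log(prompt: str, completion: str, preview_len: int = 80) -> dict:
    """Produce a privacy-safe entry for the main logging pipeline:
    hashes for correlation, short previews for quick triage.
    Full content goes to a separate access-controlled store (not shown)."""
    return {
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "completion_sha256": hashlib.sha256(completion.encode()).hexdigest(),
        "prompt_preview": prompt[:preview_len],
        "completion_preview": completion[:preview_len],
    }
```

The hash lets you detect duplicate or replayed prompts and join against the full-content store without the sensitive text ever entering your main pipeline.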
A structured log entry in practice:
```json
{
  "trace_id": "tr_8f3a2b1c",
  "span_id": "sp_4d2e1f",
  "timestamp": "2026-03-18T14:32:01Z",
  "model": "gpt-5.4",
  "input_tokens": 1847,
  "output_tokens": 312,
  "latency_ms": 2340,
  "ttft_ms": 180,
  "temperature": 0.3,
  "cost_usd": 0.0043,
  "user_id": "usr_abc123",
  "task_type": "support_response",
  "eval_score": null
}
```
The `eval_score` field starts null and gets backfilled when your async evaluation pipeline runs. This separation keeps the logging path fast (no evaluation latency in the hot path) while still connecting quality scores to the original request.
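The log-then-backfill pattern can be sketched with an in-memory store standing in for your log backend (the store and function names are illustrative, not from any particular library):

```python
# Stand-in for your real log backend, keyed by trace_id.
LOG_STORE: dict[str, dict] = {}

def log_llm_request(trace_id: str, **fields) -> None:
    """Hot path: write the entry immediately, leaving eval_score null."""
    LOG_STORE[trace_id] = {"trace_id": trace_id, "eval_score": None, **fields}

def backfill_eval_score(trace_id: str, score: float) -> None:
    """Async eval pipeline: attach the quality score to the original entry."""
    LOG_STORE[trace_id]["eval_score"] = score

log_llm_request("tr_8f3a2b1c", model="gpt-5.4", input_tokens=1847, latency_ms=2340)
backfill_eval_score("tr_8f3a2b1c", 0.87)  # runs minutes later, off the hot path
```

The key design choice is that the request path never waits on evaluation; the trace ID is the join key that reconnects the score later.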
Tracing Multi-Step Chains
A simple chat completion is one LLM call. A RAG pipeline involves embedding the query, searching the vector database, re-ranking results, constructing a prompt with retrieved context, and generating a response. An AI agent might chain 5-10 LLM calls with tool executions between them.
When something goes wrong in a multi-step chain, you need to know which step failed and why. Tracing connects the steps:
- Assign a trace ID to each user request.
- Each step (embedding, retrieval, LLM call, tool execution) creates a span within that trace.
- Each span records its inputs, outputs, duration, and any errors.
- The trace shows the full execution path with timing for every step.
This is the same concept as distributed tracing in microservices (OpenTelemetry, Jaeger), applied to LLM workflows. When a user reports a bad response, you pull up the trace and see that the retrieval step returned irrelevant documents, or that the model ignored the retrieved context, or that a tool call failed silently.
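The span mechanics above can be sketched with a context manager. In production you would use an OpenTelemetry SDK or one of the tools listed later; this hand-rolled version just shows the shape of the data:

```python
import time
import uuid
from contextlib import contextmanager

# trace_id -> ordered list of span records
TRACES: dict[str, list] = {}

@contextmanager
def span(trace_id: str, name: str):
    """Record one step of the chain: name, duration, and any error."""
    record = {"span_id": f"sp_{uuid.uuid4().hex[:6]}", "name": name, "error": None}
    start = time.monotonic()
    try:
        yield record  # caller attaches inputs/outputs to the record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.monotonic() - start) * 1000
        TRACES.setdefault(trace_id, []).append(record)

trace_id = "tr_demo"
with span(trace_id, "retrieval") as s:
    s["output"] = ["doc_12", "doc_47"]  # pretend vector-search result
with span(trace_id, "llm_call") as s:
    s["output"] = "Refunds are issued within 30 days."
```

Because every span lands in the same trace, pulling up `TRACES[trace_id]` gives you the full execution path with per-step timing and errors.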
Eval-Driven Monitoring
The hardest part of LLM monitoring is measuring quality, not just availability. A response can be syntactically valid, confidently stated, and factually wrong. You can’t catch this with status codes.
Evaluation-driven monitoring runs automated quality checks on a sample of production traffic:
LLM-as-judge. Use a separate model to score responses on criteria like relevance, accuracy, helpfulness, and safety. This is the approach covered in How to Evaluate AI Output. Run it on a random sample (5-10% of traffic) to keep costs manageable.
Reference-based checks. For RAG applications, verify that the response is grounded in the retrieved context. If the model generates claims not supported by the provided documents, that’s a hallucination you can detect automatically.
Format validation. For structured output applications, validate that responses match the expected schema. Schema violations are easy to detect and often indicate model degradation.
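A format check can be as simple as verifying required fields and types. This stdlib-only sketch (a real system might use JSON Schema or Pydantic instead) returns violations rather than raising, so they can be counted as a metric:

```python
def validate_structured_output(response: dict, required: dict[str, type]) -> list[str]:
    """Return a list of schema violations; an empty list means valid."""
    problems = []
    for field, expected_type in required.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for {field}: {type(response[field]).__name__}")
    return problems

# Hypothetical schema for a ticket-classification task.
SCHEMA = {"category": str, "confidence": float}
```

Feeding the violation count into your metrics pipeline turns "the model stopped following the schema" into an alertable signal.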
Regression tests. Maintain a set of known input-output pairs. Run them periodically (daily or on every prompt change) and compare against expected outputs. This catches regressions from model updates, prompt changes, or infrastructure shifts.
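A regression suite can be sketched as a list of input/expected pairs run against your generation function. The stub model and substring check here are deliberately simple; a real suite would use your eval criteria:

```python
def run_regression_suite(cases, generate):
    """cases: list of (prompt, expected_substring) pairs.
    generate: your LLM call. Returns failing cases so a scheduler
    can alert whenever the result is non-empty."""
    failures = []
    for prompt, expected in cases:
        output = generate(prompt)
        if expected.lower() not in output.lower():
            failures.append({"prompt": prompt, "expected": expected, "got": output})
    return failures

# Stub model for illustration; swap in your real API call.
def fake_generate(prompt):
    return "Our refund window is 30 days." if "refund" in prompt else "I can't help."

cases = [("What is the refund policy?", "30 days"), ("Reset my password", "password")]
failures = run_regression_suite(cases, fake_generate)
```

Run on a schedule, an unexplained jump in `failures` with no code deploy is your signal that the model or its infrastructure changed underneath you.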
Drift Detection
LLM behavior changes over time for reasons outside your control: provider model updates, changes in input patterns, seasonal shifts in user behavior, and gradual accumulation of conversation context.
Monitor these signals for drift:
Output length distribution. If your model suddenly starts generating 2x longer responses, something changed. This also directly impacts cost.
Topic distribution. Track the semantic clustering of queries over time. A sudden shift in what users are asking about might require prompt or retrieval adjustments.
Confidence scores. If you use logprobs or model-reported confidence, track the distribution. A drop in average confidence across requests can indicate the model is encountering inputs outside its comfort zone.
Refusal rate. Track how often the model declines to answer. A spike in refusals after a model update might mean the safety filters became more aggressive, affecting legitimate use cases.
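The output-length signal above lends itself to a simple statistical check: compare the recent mean against a baseline and flag deviations beyond a few standard deviations. The threshold of three sigma is an illustrative default, not a recommendation:

```python
from statistics import mean, stdev

def length_drift(baseline_lengths, recent_lengths, threshold_sigma=3.0):
    """Flag drift when the recent mean output length deviates from the
    baseline mean by more than threshold_sigma baseline std deviations."""
    mu, sigma = mean(baseline_lengths), stdev(baseline_lengths)
    z = abs(mean(recent_lengths) - mu) / sigma
    return z > threshold_sigma, round(z, 2)

# Output token counts per response (illustrative numbers).
baseline = [310, 295, 330, 305, 320, 290, 315, 300]
normal_week = [305, 325, 298, 312]
after_update = [640, 605, 660, 612]  # responses suddenly ~2x longer
```

The same z-score pattern applies to the other drift signals: confidence distributions, refusal rates, or per-request cost.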
Alerting
Not every metric needs an alert. Focus on signals that indicate user-facing degradation:
- Latency P95 above threshold: the model or provider is slow, users are waiting
- Error rate above baseline: API failures, rate limits, or model errors
- Cost per request spike: a prompt change or context issue is inflating token counts
- Eval score drop: quality is degrading based on automated judges
- Hallucination rate increase: for RAG applications, the grounding check is failing more often
Set alerts on rolling windows (15-minute or hourly averages), not individual requests. Individual requests are noisy. Sustained shifts are meaningful.
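A rolling-window alert can be sketched with a fixed-size deque: a single outlier barely moves the average, while a sustained shift pushes it over the threshold. Window size and threshold here are placeholders for your own SLOs:

```python
from collections import deque

class RollingAlert:
    """Fire when the rolling-window average of a metric crosses a threshold.
    Individual spikes don't fire; a sustained shift does."""

    def __init__(self, window_size: int, threshold: float):
        self.values = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data yet
        return sum(self.values) / len(self.values) > self.threshold

latency_alert = RollingAlert(window_size=5, threshold=2000.0)  # ms
```

One 4,000ms outlier in a window of normal requests stays below the threshold; five consecutive slow requests do not.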
Tools
The LLM observability space has matured quickly. Current options range from hosted platforms to open-source libraries:
| Tool | Type | Strength |
|---|---|---|
| Langfuse | Open-source, self-hostable | Tracing, evals, prompt management |
| Helicone | Hosted proxy | Zero-code integration, cost tracking |
| Arize Phoenix | Open-source | Traces, evals, retrieval analysis |
| Braintrust | Hosted platform | Evals, logging, prompt playground |
| OpenTelemetry + custom | DIY | Full control, fits existing infra |
If you already have an observability stack (Datadog, Grafana, New Relic), start by shipping LLM metrics there rather than adding a new tool. Custom spans for LLM calls, token counts as metrics, and prompt/completion logs as structured events integrate naturally into existing dashboards.
Handling Model Version Changes
Model providers update models without warning. OpenAI might update the weights behind a model alias, Anthropic might adjust safety tuning, and Google might change how Gemini handles edge cases. These silent updates are one of the most common causes of unexplained quality regressions.
Defend against this:
- Pin model versions where possible. Use `gpt-5.4-2026-03-05` instead of `gpt-5.4` to avoid surprise changes.
- Log the exact model version returned in the API response, not just what you requested. Some providers return the resolved version in response headers or metadata.
- Run your regression suite weekly even when you haven’t changed your code. If scores drop and you haven’t deployed, the model changed.
- Maintain a rollback path. If the latest model version regresses on your use case, you should be able to switch to a previous version or an alternative provider quickly.
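Pinning plus a rollback path can be expressed as an ordered model chain. The model names below reuse the examples from this section; the alternative-provider entry is purely illustrative:

```python
# Ordered preference: pinned current version, previous known-good
# snapshot, then an alternative-provider fallback (hypothetical names).
MODEL_CHAIN = [
    "gpt-5.4-2026-03-05",
    "gpt-5.4-2026-01-14",
    "claude-sonnet-4-5",
]

def pick_model(regressed: set[str]) -> str:
    """Return the first model in the chain not flagged by the regression suite."""
    for model in MODEL_CHAIN:
        if model not in regressed:
            return model
    raise RuntimeError("no healthy model available")
```

When the weekly regression run flags the pinned version, routing falls through to the previous snapshot automatically instead of requiring an emergency deploy.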
Start Simple
You don’t need every metric on day one. Start with:
- Log every request (model, tokens, latency, cost).
- Add trace IDs to connect multi-step workflows.
- Run a basic eval (LLM-as-judge) on 5% of traffic.
- Alert on latency, error rate, and cost anomalies.
That covers the most common production failures. Add drift detection, regression tests, and deeper eval coverage as your application matures and your traffic patterns stabilize.