
LLM Observability: How to Monitor AI Applications

Traditional monitoring doesn't cover LLM applications. Here's what to log, how to trace multi-step chains, and how to detect quality regressions before users do.

Traditional application monitoring tracks latency, error rates, and uptime. That's necessary but insufficient for LLM applications. A model can return a 200 status code, respond in 400ms, and still produce a completely wrong answer. The response was fast, the API call succeeded, and the output was garbage. Your existing monitoring wouldn't flag it.

LLM observability extends traditional monitoring to cover the unique failure modes of generative AI: hallucinations, quality regressions, prompt drift, context retrieval failures, and cost overruns. If you’re running LLMs in production, you need all of these.

What to Log

Every LLM request should produce a structured log entry with at minimum:

| Field | Why it matters |
| --- | --- |
| Timestamp | Correlation and debugging |
| Model and version | Regression tracking when models update |
| Full prompt (system + user + context) | Reproducing issues, evaluating quality |
| Full completion | Quality analysis, compliance |
| Input tokens / output tokens | Cost tracking, anomaly detection |
| Latency (TTFT and total) | Performance monitoring |
| Temperature and other params | Reproducibility |
| Request ID | Tracing through multi-step chains |
| User/session ID | Per-user debugging |

Logging full prompts and completions raises storage and privacy concerns. For sensitive applications, log hashes or truncated versions in the main pipeline and store full content in a separate, access-controlled store with retention policies.
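One way to implement that split, sketched in Python (the field names are illustrative): the main pipeline keeps only hashes and short previews, while full content lives elsewhere, keyed by the hash.

```python
import hashlib

def redact_for_log(prompt: str, completion: str, preview_chars: int = 200) -> dict:
    """Keep only hashes and short previews in the main log pipeline.
    Full content goes to a separate, access-controlled store keyed by the hash."""
    def digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "prompt_sha256": digest(prompt),
        "prompt_preview": prompt[:preview_chars],
        "completion_sha256": digest(completion),
        "completion_preview": completion[:preview_chars],
    }
```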

A structured log entry in practice:

```json
{
  "trace_id": "tr_8f3a2b1c",
  "span_id": "sp_4d2e1f",
  "timestamp": "2026-03-18T14:32:01Z",
  "model": "gpt-5.4",
  "input_tokens": 1847,
  "output_tokens": 312,
  "latency_ms": 2340,
  "ttft_ms": 180,
  "temperature": 0.3,
  "cost_usd": 0.0043,
  "user_id": "usr_abc123",
  "task_type": "support_response",
  "eval_score": null
}
```

The eval_score field starts null and gets backfilled when your async evaluation pipeline runs. This separation keeps the logging path fast (no evaluation latency in the hot path) while still connecting quality scores to the original request.

Tracing Multi-Step Chains

A simple chat completion is one LLM call. A RAG pipeline involves embedding the query, searching the vector database, re-ranking results, constructing a prompt with retrieved context, and generating a response. An AI agent might chain 5-10 LLM calls with tool executions between them.

When something goes wrong in a multi-step chain, you need to know which step failed and why. Tracing connects the steps:

  1. Assign a trace ID to each user request.
  2. Each step (embedding, retrieval, LLM call, tool execution) creates a span within that trace.
  3. Each span records its inputs, outputs, duration, and any errors.
  4. The trace shows the full execution path with timing for every step.

This is the same concept as distributed tracing in microservices (OpenTelemetry, Jaeger), applied to LLM workflows. When a user reports a bad response, you pull up the trace and see that the retrieval step returned irrelevant documents, or that the model ignored the retrieved context, or that a tool call failed silently.
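The four steps above can be sketched as a minimal in-process tracer (production systems would typically use OpenTelemetry instead; the class and field names here are illustrative):

```python
import time
import uuid
from contextlib import contextmanager

class Trace:
    """Collects spans for one user request under a single trace ID."""
    def __init__(self):
        self.trace_id = f"tr_{uuid.uuid4().hex[:8]}"
        self.spans = []

    @contextmanager
    def span(self, name: str):
        record = {"span_id": f"sp_{uuid.uuid4().hex[:6]}", "name": name, "error": None}
        start = time.perf_counter()
        try:
            yield record           # callers attach inputs/outputs to the record
        except Exception as exc:
            record["error"] = repr(exc)
            raise                  # record the failure, then let it propagate
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(record)

trace = Trace()
with trace.span("retrieval") as s:
    s["output"] = ["doc_1", "doc_2"]     # placeholder for retrieved documents
with trace.span("llm_call") as s:
    s["output"] = "drafted response"     # placeholder for the completion
```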

Eval-Driven Monitoring

The hardest part of LLM monitoring is measuring quality, not just availability. A response can be syntactically valid, confidently stated, and factually wrong. You can’t catch this with status codes.

Evaluation-driven monitoring runs automated quality checks on a sample of production traffic:

LLM-as-judge. Use a separate model to score responses on criteria like relevance, accuracy, helpfulness, and safety. This is the approach covered in How to Evaluate AI Output. Run it on a random sample (5-10% of traffic) to keep costs manageable.
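For the sampling itself, hashing the trace ID gives a deterministic decision that stays stable across retries and replays. A sketch, using the 5% rate mentioned above:

```python
import hashlib

def should_evaluate(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Map the trace ID to a stable bucket in [0, 1) and sample below the rate."""
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16) / 2**32
    return bucket < sample_rate
```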

Reference-based checks. For RAG applications, verify that the response is grounded in the retrieved context. If the model generates claims not supported by the provided documents, that’s a hallucination you can detect automatically.
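A crude lexical version of that grounding check is sketched below; it only catches blatant cases, and a real pipeline would use an NLI model or an LLM judge instead:

```python
import re

def ungrounded_sentences(response: str, context: str, min_overlap: float = 0.5) -> list[str]:
    """Flag response sentences whose content words mostly don't appear in the
    retrieved context. A lexical proxy only: paraphrases will be false positives."""
    context_words = set(re.findall(r"[a-z']+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = [w for w in re.findall(r"[a-z']+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged
```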

Format validation. For structured output applications, validate that responses match the expected schema. Schema violations are easy to detect and often indicate model degradation.
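A minimal stdlib version of that validation (the schema here is hypothetical; libraries like pydantic or jsonschema do this more thoroughly):

```python
import json

# Hypothetical expected schema: field name -> required Python type
EXPECTED_FIELDS = {"answer": str, "confidence": float, "sources": list}

def validate_output(raw: str) -> list[str]:
    """Return a list of schema violations; an empty list means the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    return [f"missing or wrong-typed field: {name}"
            for name, expected_type in EXPECTED_FIELDS.items()
            if not isinstance(data.get(name), expected_type)]
```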

Regression tests. Maintain a set of known input-output pairs. Run them periodically (daily or on every prompt change) and compare against expected outputs. This catches regressions from model updates, prompt changes, or infrastructure shifts.
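The simplest form of such a suite checks for required substrings in deterministic tasks. A sketch, with a hypothetical golden set (open-ended tasks would need an LLM judge instead of substring checks):

```python
# Hypothetical golden set: known inputs and substrings the output must contain
GOLDEN_SET = [
    {"input": "What is the refund window?", "must_contain": ["30 days"]},
    {"input": "Do you ship internationally?", "must_contain": ["yes", "shipping"]},
]

def run_regression(call_model, golden=GOLDEN_SET) -> list[dict]:
    """Run the golden set through `call_model` (your LLM wrapper) and return
    one failure record per case that is missing expected content."""
    failures = []
    for case in golden:
        output = call_model(case["input"])
        missing = [s for s in case["must_contain"] if s not in output]
        if missing:
            failures.append({"input": case["input"], "missing": missing})
    return failures
```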

Drift Detection

LLM behavior changes over time for reasons outside your control: provider model updates, changes in input patterns, seasonal shifts in user behavior, and gradual accumulation of conversation context.

Monitor these signals for drift:

Output length distribution. If your model suddenly starts generating 2x longer responses, something changed. This also directly impacts cost.

Topic distribution. Track the semantic clustering of queries over time. A sudden shift in what users are asking about might require prompt or retrieval adjustments.

Confidence scores. If you use logprobs or model-reported confidence, track the distribution. A drop in average confidence across requests can indicate the model is encountering inputs outside its comfort zone.

Refusal rate. Track how often the model declines to answer. A spike in refusals after a model update might mean the safety filters became more aggressive, affecting legitimate use cases.
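The first of these signals, output length, can be tracked with a pair of rolling windows; the window sizes and 1.5x alert ratio below are illustrative:

```python
import statistics
from collections import deque

class LengthDriftMonitor:
    """Flag drift when the recent mean output length outgrows the baseline mean."""
    def __init__(self, baseline_size: int = 1000, recent_size: int = 100,
                 alert_ratio: float = 1.5):
        self.baseline = deque(maxlen=baseline_size)
        self.recent = deque(maxlen=recent_size)
        self.alert_ratio = alert_ratio

    def observe(self, output_tokens: int) -> bool:
        self.recent.append(output_tokens)
        drifted = False
        warmed_up = (len(self.baseline) >= self.baseline.maxlen // 2
                     and len(self.recent) == self.recent.maxlen)
        if warmed_up:
            ratio = statistics.mean(self.recent) / statistics.mean(self.baseline)
            drifted = ratio > self.alert_ratio
        self.baseline.append(output_tokens)  # recent values join the baseline afterwards
        return drifted
```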

Alerting

Not every metric needs an alert. Focus on signals that indicate user-facing degradation:

  • Latency P95 above threshold: the model or provider is slow, users are waiting
  • Error rate above baseline: API failures, rate limits, or model errors
  • Cost per request spike: a prompt change or context issue is inflating token counts
  • Eval score drop: quality is degrading based on automated judges
  • Hallucination rate increase: for RAG applications, the grounding check is failing more often

Set alerts on rolling windows (15-minute or hourly averages), not individual requests. Individual requests are noisy. Sustained shifts are meaningful.

Tools

The LLM observability space has matured quickly. Current options range from hosted platforms to open-source libraries:

| Tool | Type | Strength |
| --- | --- | --- |
| Langfuse | Open-source, self-hostable | Tracing, evals, prompt management |
| Helicone | Hosted proxy | Zero-code integration, cost tracking |
| Arize Phoenix | Open-source | Traces, evals, retrieval analysis |
| Braintrust | Hosted platform | Evals, logging, prompt playground |
| OpenTelemetry + custom | DIY | Full control, fits existing infra |

If you already have an observability stack (Datadog, Grafana, New Relic), start by shipping LLM metrics there rather than adding a new tool. Custom spans for LLM calls, token counts as metrics, and prompt/completion logs as structured events integrate naturally into existing dashboards.

Handling Model Version Changes

Model providers update models without warning. OpenAI might update the weights behind a model alias, Anthropic might adjust safety tuning, and Google might change how Gemini handles edge cases. These silent updates are one of the most common causes of unexplained quality regressions.

Defend against this:

  • Pin model versions where possible. Use gpt-5.4-2026-03-05 instead of gpt-5.4 to avoid surprise changes.
  • Log the exact model version returned in the API response, not just what you requested. Some providers return the resolved version in response headers or metadata.
  • Run your regression suite weekly even when you haven’t changed your code. If scores drop and you haven’t deployed, the model changed.
  • Maintain a rollback path. If the latest model version regresses on your use case, you should be able to switch to a previous version or an alternative provider quickly.

Start Simple

You don’t need every metric on day one. Start with:

  1. Log every request (model, tokens, latency, cost).
  2. Add trace IDs to connect multi-step workflows.
  3. Run a basic eval (LLM-as-judge) on 5% of traffic.
  4. Alert on latency, error rate, and cost anomalies.

That covers the most common production failures. Add drift detection, regression tests, and deeper eval coverage as your application matures and your traffic patterns stabilize.

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
