Four signals your APM does not catch: an LLM observability playbook for 2026

Standard observability tells you when a service is down. It does not tell you when refusal rate doubles, response length drifts, retrieval precision collapses, or a tool starts getting called twice as often. Those are the failure modes that actually ship to users.

The incident that does not page anyone

A retrieval index gets quietly re-built with a different chunking strategy. The service stays up. p95 latency is unchanged. Error rate is zero. The dashboard is green for eleven days. On day twelve a support engineer pastes a screenshot into the team channel: the AI assistant has been confidently citing the wrong document for a week.

Nothing in a standard APM caught it because nothing in a standard APM was looking at the right thing. Datadog, New Relic, and Honeycomb were built for services where "did the response come back" and "how fast" are the two questions that matter. For LLM features, "did it come back fast" is necessary and nowhere near sufficient.

Four LLM-specific signals catch the incidents an HTTP-200-and-fast service can still produce. None of them require a new vendor; all four can be added to whatever you already run.

Signal 1: Refusal rate

Frontier models refuse a fraction of requests — sometimes by design (safety RLHF), sometimes as a side effect of a prompt change you just shipped. A refusal looks like a normal 200 OK to your APM. To the user it is the feature not working.

What to instrument:

  • Tag every response with a refused boolean flag. The simplest detector matches the response text against a fixed phrase set ("I cannot", "I am unable", "I will not"); a small classifier prompt is the sturdier upgrade. Anthropic's API also exposes stop_reason: "refusal" on Claude 4.x models; read it directly when you have it.
  • Plot the daily refusal rate per feature. A jump from 0.4% to 3% inside 24 hours is almost always a regression you introduced.
  • Alert on a 7-day rolling rate that crosses 2× the baseline.

Most teams discover their first refusal-rate spike retroactively, from a screenshot. The instrumentation above turns that into a page within minutes.
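
A minimal sketch of that detector in Python, trusting the provider signal when it exists and falling back to a phrase match; the phrase set and the 200-character window are illustrative choices, not canon:

```python
# Refusal detector: trust the provider's stop_reason when present,
# fall back to a cheap phrase match on the opening of the response.
REFUSAL_PHRASES = (
    "i cannot", "i can't", "i am unable", "i'm unable", "i will not", "i won't",
)

def is_refusal(text: str, stop_reason: str | None = None) -> bool:
    """True when the response looks like a refusal; tag the trace with this."""
    if stop_reason == "refusal":       # authoritative where the API provides it
        return True
    head = text[:200].lower()          # refusals announce themselves early
    return any(phrase in head for phrase in REFUSAL_PHRASES)
```

Once every trace carries that boolean, the daily plot and the 2× alert are queries over a single column.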

Signal 2: Response length distribution

Length distribution is the cheapest leading indicator of model-behavior drift. When a model is suddenly more verbose, you are paying for it on the output-token line. When it is suddenly terser, the eval set will catch the quality drop hours or days later — but the length histogram catches the shift the same hour.

What to instrument:

  • Log output_tokens per request, per feature.
  • Compute the p50 and p95 on a rolling 1-hour window.
  • Alert when either moves by more than 25% versus the 7-day median.

In practice, the two events this catches most often are (a) a system-prompt change that quietly removed a "be concise" instruction, and (b) an upstream model version bump on a non-pinned alias. Both produce a clean step-function in the length histogram that no other signal will flag.
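
A sketch of that check, assuming you can pull the raw output_tokens values for the 1-hour window and the 7-day baseline from your trace store; the 25% threshold is the one from the list above:

```python
import statistics

def length_drift_alerts(window: list[int], baseline: list[int],
                        threshold: float = 0.25) -> list[str]:
    """Compare rolling 1h p50/p95 of output_tokens to the 7-day baseline."""
    alerts = []
    for name, idx in (("p50", 49), ("p95", 94)):   # cut-point indices in quantiles(n=100)
        cur = statistics.quantiles(window, n=100)[idx]
        base = statistics.quantiles(baseline, n=100)[idx]
        if base and abs(cur - base) / base > threshold:
            alerts.append(f"output_tokens {name}: {base:.0f} -> {cur:.0f} "
                          f"({100 * (cur - base) / base:+.0f}% vs 7-day)")
    return alerts
```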

Signal 3: Retrieval precision

For any RAG-shaped feature, the model's correctness is upper-bounded by the relevance of what you retrieved. If retrieval precision falls, the model has nothing correct to draw from, and it will confabulate without raising a single error.

The hard version of this signal is offline: a labelled eval set with known-relevant documents, computed nightly. The cheap version is online and gets you 70% of the value:

  • Sample ~1% of production queries.
  • For each, run a small LLM-as-judge prompt: given this query and these k retrieved chunks, what fraction are relevant?
  • Average across the sample. Plot the daily mean.

A 10-point drop in the daily mean is the signature of a retrieval regression — a new embedding model, a botched re-index, a chunk-size change. None of these would surface in HTTP-200 latency dashboards. Retrieval precision turns them into pageable events.
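
A sketch of the cheap online version, with `call_judge` standing in for whatever small model you point at it; the prompt wording, the 500-character chunk truncation, and the integer-parsing are all illustrative:

```python
import random
import re

JUDGE_PROMPT = (
    "Query: {query}\n\nRetrieved chunks:\n{chunks}\n\n"
    "How many of these {k} chunks are relevant to the query? "
    "Answer with a single integer."
)

def judged_precision(query: str, chunks: list[str], call_judge,
                     sample_rate: float = 0.01) -> float | None:
    """Judge ~1% of production queries; returns fraction relevant, None if unsampled."""
    if random.random() > sample_rate:
        return None
    numbered = "\n".join(f"{i + 1}. {c[:500]}" for i, c in enumerate(chunks))
    answer = call_judge(JUDGE_PROMPT.format(query=query, chunks=numbered, k=len(chunks)))
    match = re.search(r"\d+", answer)            # first integer in the judge's answer
    relevant = int(match.group()) if match else 0
    return min(relevant, len(chunks)) / max(len(chunks), 1)
```

Write the result back onto the trace and the daily mean is a one-line aggregation.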

Signal 4: Tool-call distribution

Agents that call tools have a fourth signal that single-shot LLM features do not: the distribution of which tools get called and how often.

A healthy agent has a roughly stable tool-call mix for a given workload — say 45% search, 30% fetch_record, 20% send_message, 5% escalate. A bad prompt change can shift that to 80% search overnight because the model started doubting its first result; a buggy tool description can collapse send_message to near-zero. Both cost real money in extra tokens and real product damage in missed actions.

What to instrument:

  • Tag every model invocation with the tool name (or none).
  • Compute per-tool call rate as a fraction of total invocations on a rolling 1-hour window.
  • Alert when any tool's share moves by more than 30% relative to the 7-day median.

This signal also catches the most expensive class of agent bug: silent infinite loops, where the model keeps calling the same tool with slightly varied arguments. The call-rate ratio for that tool will spike obviously; the cost dashboard will follow.
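
A sketch of the share check, with window and baseline as flat lists of tool names pulled from the trace store; the 30% threshold is from the list above, and the same comparison flags the loop case because the looping tool's share balloons:

```python
from collections import Counter

def tool_mix_alerts(window: list[str], baseline: list[str],
                    threshold: float = 0.30) -> list[str]:
    """Flag any tool whose share of calls moved >30% relative to its baseline share."""
    win, base = Counter(window), Counter(baseline)
    win_total = sum(win.values()) or 1
    base_total = sum(base.values()) or 1
    alerts = []
    for tool in set(win) | set(base):
        w_share = win[tool] / win_total
        b_share = base[tool] / base_total
        if b_share and abs(w_share - b_share) / b_share > threshold:
            alerts.append(f"{tool}: {b_share:.0%} of calls -> {w_share:.0%}")
        elif not b_share and w_share > 0.05:   # a tool appearing from nowhere is also news
            alerts.append(f"{tool}: new at {w_share:.0%} of calls")
    return alerts
```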

The minimum trace schema

The four signals above all collapse onto one trace schema. Capture this per request and the rest follows:

```
trace_id           # uuid, propagated through agent steps
feature            # "ticket_summary", "doc_search", ...
model              # "claude-sonnet-4-6", "gpt-5", ...
input_tokens       # int
output_tokens      # int
cached_tokens      # int (prompt caching hits)
latency_ms         # int
stop_reason        # "end_turn" | "max_tokens" | "refusal" | "tool_use"
refused            # bool (derived from stop_reason or classifier)
retrieved_chunks   # list of {id, score} (RAG features only)
tool_calls         # list of {name, args_hash} (agent features only)
user_id_hash       # for cohort analysis without PII
```

Two principles, both load-bearing. First, all four signals are queries over this schema — refusal rate is avg(refused), length is p95(output_tokens), retrieval precision is computed from retrieved_chunks, tool-call distribution is grouped on tool_calls.name. Second, never log raw prompt or response text by default. Log content hashes, redacted samples on a 1% reservoir, and a switch to enable full-payload logging for a specific user when debugging a ticket. The 2023 [OpenAI ChatGPT trace leak](https://openai.com/index/march-20-chatgpt-outage/) is the canonical reminder that "store everything" is a regulatory and reputational hazard.
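
To make the first principle concrete, here is a pandas sketch with traces loaded one row per request; `judge_precision` is assumed to be the derived column the Signal 3 sampler writes back, since retrieved_chunks alone carries no relevance labels:

```python
import pandas as pd

def four_signals(traces: pd.DataFrame) -> dict[str, pd.Series]:
    """All four signals as plain groupby queries over the trace schema."""
    per_tool = (traces.explode("tool_calls")
                      .dropna(subset=["tool_calls"])
                      .assign(tool=lambda d: d["tool_calls"].apply(lambda c: c["name"])))
    return {
        "refusal_rate": traces.groupby("feature")["refused"].mean(),
        "length_p95": traces.groupby("feature")["output_tokens"].quantile(0.95),
        "retrieval_precision": traces.groupby("feature")["judge_precision"].mean(),
        "tool_mix": per_tool.groupby("feature")["tool"].value_counts(normalize=True),
    }
```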

Where each signal bites

```
                +-------------------+
   user query   |  LLM feature      |   green dashboards,
   ---------->  |  (HTTP-200, fast) |   silent quality decay
                +-------------------+
                          |
       +------------------+------------------+------------------+
       |                  |                  |                  |
   refusal rate     output length      retrieval precision   tool-call mix
   (safety / prompt  (verbosity drift,  (RAG features,       (agents,
    regression)       quiet model bump)  re-index decay)      loops + bugs)
```

What we built into the curriculum

The [Production LLM Observability course](https://nextgenailearn.com/paths/observability) walks through all four signals with runnable instrumentation, the trace schema above, hourly probe sets, and a four-step on-call playbook that contains incidents in minutes rather than hours. Lesson 4 is the on-call playbook; lesson 6 is the privacy-aware trace design that keeps the system audit-clean.

If "production LLM observability" is on the JD for a role you are interviewing for, the [Production LLM Observability cert pack on CertQuests](https://certquests.com/packs/production-llm-observability) has a focused question bank built around these four signals and the trace schema.

How to start tomorrow

Pick your highest-traffic LLM endpoint. Today, before lunch:

  1. Log the schema above for every call. Most of these fields are already in the provider response.
  2. Plot the four signals on whatever dashboarding tool you already use. No new vendor.
  3. Set the four alerts at the thresholds above (2× refusal, 25% length drift, 10-point precision drop, 30% tool-mix shift); the config sketch after this list collects them in one place. They will all be wrong on day one — tune for a week, then trust them.
  4. Run an incident drill. Force a refusal-rate spike on staging by removing a safety carve-out from the system prompt. Time how long until the on-call sees it.
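
One way to keep step 3 honest is to hold all four thresholds in a single config; these are the starting values from this post, not gospel:

```python
# Day-one thresholds from this post. Expect every one of them to be wrong
# in some direction; tune for a week before trusting the pages.
ALERT_THRESHOLDS = {
    "refusal_rate":        {"relative": 2.00, "baseline": "7d_rolling_rate"},
    "output_length":       {"relative": 0.25, "baseline": "7d_median_p50_p95"},
    "retrieval_precision": {"absolute": 0.10, "baseline": "daily_mean"},
    "tool_mix_share":      {"relative": 0.30, "baseline": "7d_median_share"},
}
```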

Observability for LLM features is not a different discipline from regular observability. It is the regular discipline plus four signals your APM is not opinionated about. Add them and the next time the index gets quietly re-built with a different chunking strategy, you find out from a page on day zero, not a screenshot on day twelve.

Try it.

The first lesson takes 8 minutes. No signup needed.

Start the first lesson