The default has moved
Three years ago "give me valid JSON" was a prompt-engineering subdiscipline: regex repair, retry-on-parse-fail, brittle few-shot examples teaching the model to skip the markdown fences. By the start of 2026 it is a one-line API parameter. OpenAI, Anthropic, Google, and the open-weights stack (vLLM, llama.cpp, TensorRT-LLM) all expose constrained decoding against a JSON schema, and the parse-failure rate has effectively dropped to zero.
That has changed how the question gets asked. The question is no longer "how do I get valid JSON out of an LLM?" It is "now that JSON-validity is free, what is the next class of bug?" — and the honest answer is that the next class of bugs is louder than the old one.
What constrained decoding actually guarantees
When you pass a schema to a structured-output API, the runtime modifies token sampling at every step so that only tokens that keep the partial output on a path to a schema-valid completion can be chosen. The implementation differs across vendors — finite-state machines over the JSON grammar (OpenAI's approach), context-free grammars compiled at request time (Outlines, llama.cpp), or a hybrid — but the contract is the same: the bytes you receive will parse and will validate against the schema you sent.
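The masking step can be illustrated with a deliberately tiny toy — a handful of vocabulary entries and a single valid output shape. This is a sketch of the idea, not any vendor's actual automaton:

```python
# Toy constrained decoder. Real engines compile the schema into a
# finite-state machine or grammar; here "on a valid path" is just
# prefix-matching against one schema-valid output shape.
VOCAB = ['{', '"city"', ':', ' ', '"Springfield"', '}', 'hello', '<eos>']
VALID_OUTPUTS = ['{"city": "Springfield"}']  # what the schema admits

def allowed_tokens(prefix: str) -> list[str]:
    """Tokens that keep `prefix` on a path to a schema-valid completion."""
    out = []
    for tok in VOCAB:
        if tok == '<eos>':
            if prefix in VALID_OUTPUTS:  # may stop only on a complete output
                out.append(tok)
        elif any(v.startswith(prefix + tok) for v in VALID_OUTPUTS):
            out.append(tok)
    return out

def decode() -> str:
    """Greedy loop: always take the first allowed token. A real decoder
    samples from the model's renormalized probabilities instead."""
    text = ''
    while allowed_tokens(text) != ['<eos>']:
        text += allowed_tokens(text)[0]
    return text

print(decode())  # the free-text token 'hello' is masked at every step
```

Note that `hello` is never sampleable at any step: the mask guarantees syntax, and only syntax.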
The contract does not include:
- Field semantics. `{"city": "Springfield"}` is valid for any `city: string` schema. It is the wrong city in 47 of 50 US states.
- Calibrated confidence. A schema-required field forces the model to fill it. Low-evidence fields get hallucinated rather than omitted.
- Cost efficiency. A 200-line schema is part of the input prompt; you pay per call to repeat it.
- Latency parity. Constrained sampling rejects probable-but-invalid tokens, which can change which decoding path the model takes — sometimes for the better, sometimes adding 10–20% to the time-to-first-token.
The first wave of teams that adopted JSON mode declared the parsing-error problem solved and moved on. The second wave discovered the four traps below. We are now firmly in the second wave.
Trap 1: Over-constrained schemas
The instinct, after a year of unreliable JSON, is to constrain everything: an `enum` on every string, a `pattern` on every regex-shaped field, every property listed in `required`. This works for the parser. It does not work for the model.
When you require a field the model has no evidence for, constrained decoding still produces something. That something is a low-probability guess that the schema forced into existence. The downstream eval reads "100% schema validity, 73% factual accuracy" and the team thinks the model is bad. The model is fine; the schema asked for facts it did not have.
The fix is to mark optional what is genuinely optional, and to give the model an explicit escape hatch: a `null` union, an `"unknown"` enum value, or a sibling `evidence_quality: "low" | "medium" | "high"` field that the eval set actually inspects. The 2025 [JSONSchemaBench paper](https://arxiv.org/abs/2501.10868) measured this directly: tightly required schemas degraded answer quality by 4–15% versus loose schemas with explicit nullability, on the same model and the same task.
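Concretely, an escape-hatch property block might look like this — a sketch written as JSON Schema in a Python dict; the field names are illustrative, not from any particular API:

```python
# Escape-hatch pattern: both fields stay required, so the parser
# contract holds, but `null` and "unknown" give the model a truthful
# path when the evidence is not there.
company_facts_schema = {
    "type": "object",
    "properties": {
        "founded_year": {"type": ["integer", "null"]},  # null union
        "evidence_quality": {                           # the eval inspects this
            "enum": ["low", "medium", "high", "unknown"],
        },
    },
    "required": ["founded_year", "evidence_quality"],
    "additionalProperties": False,
}
```

The eval set then treats `founded_year: null` with `evidence_quality: "unknown"` as a correct abstention rather than a miss.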
Trap 2: Hidden token cost on long schemas
Schemas live in the prompt. A 40-property schema with descriptions, enums, and oneOf branches can run 2,000+ input tokens by itself. At frontier-tier prices that is real money per call, and it is the same 2,000 tokens on every call.
Two moves cut this cost without changing the contract:
- Cache the schema. Anthropic and OpenAI both apply prompt caching to stable prefixes; if the schema is the first thing in the system message and never varies per-request, you pay full rate on the first call and cached rate (typically 10× cheaper) on every subsequent call within the cache window.
- Slim the schema. Drop `description` strings the model does not need at sampling time, collapse `oneOf` branches into a single discriminated union, and move long enum lists into a few-shot example instead. A typical 2,000-token schema can usually be cut to 600–800 tokens without measurable quality loss.
If you do not measure the per-call schema cost, you will not see this. It hides inside the input-token line on the bill.
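A back-of-envelope check is enough to surface it. The sketch below uses the crude ~4-characters-per-token heuristic and an assumed cached-rate discount; swap in your provider's tokenizer and rate card for real numbers:

```python
import json

def schema_cost(schema: dict, calls_per_day: int, usd_per_mtok: float,
                cached_discount: float = 0.1) -> dict:
    """Rough monthly spend on repeating one schema in every prompt.
    Token count uses the ~4 chars/token heuristic, not a real tokenizer;
    cached_discount=0.1 models the roughly 10x cheaper cached rate."""
    text = json.dumps(schema, separators=(",", ":"))
    tokens = len(text) / 4  # crude heuristic
    monthly_usd = tokens / 1_000_000 * usd_per_mtok * calls_per_day * 30
    return {
        "schema_tokens": round(tokens),
        "monthly_usd_uncached": round(monthly_usd, 2),
        "monthly_usd_cached": round(monthly_usd * cached_discount, 2),
    }
```

Run it once per high-volume call site; if `monthly_usd_uncached` is a number anyone would notice on the bill, caching and slimming both pay for themselves.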
Trap 3: Refusal collapse
Frontier models trained with safety RLHF will refuse some inputs by default — illegal content, certain medical or legal advice patterns, requests that match a jailbreak signature. In free-form mode the refusal arrives as a sentence: "I can't help with that, but here's what I can do."
Inside a constrained schema, that sentence has nowhere to go. The model is forced to fill the schema fields anyway, which means refusal collapses into a wrong-but-validly-typed answer. We have measured this internally on Claude Sonnet 4.6 and GPT-5: refusal rate drops from ~3% on free-form prompts to ~0.4% on the same prompts under a strict schema, with the difference showing up as confident hallucinations.
The fix is structural: every safety-relevant schema must include a top-level refusal: string | null field that the model is instructed to fill when it cannot honestly fill the rest. Your pipeline checks that field first and routes to the refusal flow before reading any other field. Skipping this is a real production-incident vector in 2026.
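A minimal version of that routing check — the field name `refusal` matches the schema slot described above; everything else is illustrative:

```python
def route_structured_response(parsed: dict) -> tuple:
    """Read the refusal slot before trusting anything else in the
    schema-validated dict. Returns ("refused", reason) or ("ok", payload)."""
    reason = parsed.get("refusal")
    if reason:  # any non-empty string means the model declined
        return ("refused", reason)
    payload = {k: v for k, v in parsed.items() if k != "refusal"}
    return ("ok", payload)
```

The key property: no other field is read until the refusal slot has been checked, so a collapsed refusal cannot masquerade as data.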
Trap 4: Eval false confidence
Schema validity is the easiest metric in the AI engineering stack. It is also the most misleading. A team can drive schema-validity from 78% to 100% in an afternoon by switching on JSON mode, see the eval-set dashboard go green, and ship.
The thing to measure is not "did the response parse" — that is now table stakes. The thing to measure is, per field:
- Coverage: how often is each field non-null when the ground truth has a value?
- Accuracy: when the field is non-null, does it match ground truth?
- Hallucination rate: when the field is non-null and the ground truth has no value, how often does the model invent one?
A field that is 100% schema-valid, 95% covered, 60% accurate, and 30% hallucinating is a worse outcome than a free-form answer that the user could read and reject. The schema gives you machine-readability; only per-field eval tells you whether the machine-read answer is true.
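These three numbers are a few lines of code once you have ground truth per field. A sketch, assuming predictions and truths are parallel lists of dicts with `None` (or a missing key) standing for "no value":

```python
def field_metrics(field: str, preds: list[dict], truths: list[dict]) -> dict:
    """Per-field coverage, accuracy, and hallucination rate.
    A missing key or None counts as "no value" on either side."""
    covered = correct = hallucinated = truth_has = truth_empty = 0
    for pred, truth in zip(preds, truths):
        p, t = pred.get(field), truth.get(field)
        if t is not None:                 # ground truth has a value
            truth_has += 1
            if p is not None:
                covered += 1
                correct += (p == t)
        else:                             # ground truth is empty
            truth_empty += 1
            hallucinated += (p is not None)
    return {
        "coverage": covered / truth_has if truth_has else None,
        "accuracy": correct / covered if covered else None,
        "hallucination_rate": hallucinated / truth_empty if truth_empty else None,
    }
```

`None` rather than `0.0` for an unmeasurable metric matters: a field with no empty-truth cases has an undefined hallucination rate, not a perfect one.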
A figure: the four traps and where they bite
                         +--------------------+
prompt + schema  ----->  | constrained decode |  ----->  always-valid JSON
                         +--------------------+
                                   |
         +-----------------+-------+--------+----------------------+
         |                 |                |                      |
 over-constrained     schema cost     refusal collapse     eval false confidence
 (fills required      (2k+ tokens     (no refusal slot     (validity != truth;
 fields with          repeated on     => confident         per-field metrics
 hallucinations)      every call)     hallucinations)      are the real signal)

Each trap has a specific fix; together they are the difference between a JSON-mode system that ships and one that gets rewritten in six months.
What we built into the curriculum
[Prompt Engineering](https://nextgenailearn.com/paths/prompt-engineering) lesson 7 covers schema design — required vs optional, the unknown-value pattern, and the refusal-slot field. Lesson 9 covers caching the schema as a stable prefix to drop the per-call cost by 10×. The data engineering and agents courses both reuse these patterns: [AI Agents](https://nextgenailearn.com/paths/ai-agents) lesson 4 spends a full beat on tool schemas as a structured-output contract between the model and your code.
If you are preparing for an interview where "structured outputs" or "function calling" is on the JD, the [Prompt Engineering Practitioner cert pack on CertQuests](https://certquests.com/packs/prompt-engineering-practitioner) has a dedicated structured-outputs question bank built around these exact failure modes.
How to start tomorrow
Pick the JSON-mode call in your product with the highest volume. Today, before lunch:
- Audit the schema. Mark genuinely optional fields as nullable. Add an `unknown` enum value where the model is being forced to pick.
- Add a `refusal` field at the top level. Wire your pipeline to check it before reading anything else.
- Cache the schema. Move it to the start of the system prompt. Verify your provider's cache hits in the metadata.
- Build a per-field eval. 30 cases, ground truth per field, metrics for coverage / accuracy / hallucination rate. Save the dashboard.
- Re-run after every prompt or schema change. This is the regression gate that JSON mode does not give you for free.
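The regression gate itself can be a short function over those per-field metric dicts. A sketch; the tolerance and the metric names are illustrative choices, not a standard:

```python
def regression_gate(baseline: dict, current: dict, tol: float = 0.02) -> list[str]:
    """Compare per-field metric dicts (field -> metric -> value) and list
    every metric that regressed by more than `tol`. Empty list = ship."""
    failures = []
    for field, base_metrics in baseline.items():
        for metric, base_val in base_metrics.items():
            cur_val = current.get(field, {}).get(metric)
            if base_val is None or cur_val is None:
                continue  # metric unmeasurable on one side; skip
            # hallucination_rate is better when lower, the rest when higher
            regressed = (cur_val - base_val if metric == "hallucination_rate"
                         else base_val - cur_val)
            if regressed > tol:
                failures.append(f"{field}.{metric}: {base_val:.2f} -> {cur_val:.2f}")
    return failures
```

Wire it into CI so a prompt or schema change that trades accuracy for coverage fails loudly instead of silently.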
Constrained decoding solved the 2024 problem. The 2026 problem is everything that the constraint pushed downstream — and it is solvable with the same eval discipline that worked before.