The shift nobody announced
Two years ago, the question was "how do I phrase the prompt?". Today, on every system that actually ships, the question is "what goes into the context window, in what order, with what weight?"
That's not prompting. That's context engineering — and it's the single highest-leverage skill on a modern AI engineering team.
The model is increasingly a commodity. The context you assemble around the request is not.
Five things you control in the context window
Every production LLM call assembles its context from at least five sources. Most failures come from getting one of them wrong.
1. The system prompt
Stable across requests. Defines voice, refusal posture, output schema, and tool-use rules. Caches well: providers discount cached prompt prefixes heavily (up to 90% on some platforms) as long as the prefix stays byte-identical.
Fail mode: stuffing per-user data here. Now your cache hit rate is zero and you're paying full price every call.
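A minimal sketch of cache-friendly assembly (the `buildMessages` helper and the Acme persona are hypothetical; the point is which side of the message boundary the per-user data lands on):

```ts
// Keep the system prompt byte-identical across calls so the provider can
// cache it; per-user data rides in the user message instead.

type Message = { role: "system" | "user"; content: string };

// Stable prefix: this is the part a provider can cache.
const SYSTEM_PROMPT = [
  "You are a support assistant for Acme.",
  "Answer as JSON matching the response schema.",
  "Refuse anything outside billing questions.",
].join("\n");

function buildMessages(userName: string, plan: string, query: string): Message[] {
  return [
    { role: "system", content: SYSTEM_PROMPT }, // identical every call: cache hit
    // Per-user data goes AFTER the stable prefix, never inside it.
    { role: "user", content: `User: ${userName} (plan: ${plan})\n\n${query}` },
  ];
}

// Anti-pattern: interpolating `${userName}` into the system prompt makes
// every prefix unique and drops the cache hit rate to zero.
```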
2. Retrieved knowledge
Documents pulled from a vector DB, web search, or a structured store. The shape of this matters more than the embedding model: chunk size, overlap, deduplication, and rerank, rerank, rerank.
Fail mode: retrieving 20 chunks because "more is better". The model attends to the first three and the last one; the middle 16 are wasted tokens that also dilute precision.
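A sketch of the over-fetch, dedupe, rerank shape. The vector store and cross-encoder are passed in as hypothetical functions; swap in whatever your stack provides:

```ts
// Over-fetch, dedupe, rerank, keep a small k. `vectorSearch` and `rerankScore`
// are hypothetical stand-ins for your vector store and a cheap cross-encoder.

type Chunk = { id: string; text: string };

async function retrieve(
  query: string,
  vectorSearch: (q: string, k: number) => Promise<Chunk[]>,
  rerankScore: (q: string, text: string) => Promise<number>,
  finalK = 4
): Promise<Chunk[]> {
  // Over-fetch so the reranker has real candidates to choose from.
  const candidates = await vectorSearch(query, 20);

  // Crude dedup: drop chunks whose normalized prefix we've already seen.
  const seen = new Set<string>();
  const unique = candidates.filter((c) => {
    const key = c.text.trim().toLowerCase().slice(0, 200);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });

  // Cosine similarity found the candidates; the cross-encoder decides
  // which ones actually earn a place in the context window.
  const scored = await Promise.all(
    unique.map(async (c) => ({ c, s: await rerankScore(query, c.text) }))
  );
  return scored.sort((a, b) => b.s - a.s).slice(0, finalK).map((x) => x.c);
}
```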
3. Conversation history
The user's prior turns, the assistant's prior responses. Linear growth = linear cost growth. Summarization is a real engineering decision, not a flag you flip.
Fail mode: keeping every turn verbatim. By turn 30 you're sending 40k tokens for a "yes, please" reply.
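One workable policy, sketched with `summarize` (usually a cheap model call) left as a hypothetical parameter and illustrative budget numbers:

```ts
// Keep the most recent turns verbatim; fold everything older into a summary
// once history exceeds its token budget.

type Turn = { role: "user" | "assistant"; content: string };

const HISTORY_BUDGET = 4000; // tokens history is allowed to spend
const KEEP_VERBATIM = 6;     // most recent turns kept word-for-word

// Rough estimate: ~4 characters per token for English prose.
const estimateTokens = (s: string) => Math.ceil(s.length / 4);

async function compactHistory(
  turns: Turn[],
  summarize: (older: Turn[]) => Promise<string>
): Promise<Turn[]> {
  const total = turns.reduce((n, t) => n + estimateTokens(t.content), 0);
  if (total <= HISTORY_BUDGET || turns.length <= KEEP_VERBATIM) return turns;

  const older = turns.slice(0, -KEEP_VERBATIM);
  const recent = turns.slice(-KEEP_VERBATIM);
  const summary = await summarize(older);
  return [
    { role: "assistant", content: `Summary of earlier turns: ${summary}` },
    ...recent,
  ];
}
```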
4. Tool definitions and recent tool results
For agents: which tools are listed, in what order, with what arity. Recent tool results — especially errors — are gold for the model's next decision.
Fail mode: listing all 47 tools. The model attends to the top 5 and the bottom 2. The middle 40 are noise that drives up hallucinated tool calls.
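A sketch of per-request tool curation, assuming each tool carries a few routing tags (the tag scheme is hypothetical; embedding-based matching works just as well):

```ts
// Send a curated subset of tools per request instead of all 47.

type Tool = { name: string; description: string; tags: string[] };

function selectTools(allTools: Tool[], query: string, maxTools = 8): Tool[] {
  const q = query.toLowerCase();

  // A small always-on core set, plus tools whose tags match the query.
  const core = allTools.filter((t) => t.tags.includes("core"));
  const matched = allTools
    .map((tool) => ({
      tool,
      score: tool.tags.filter((tag) => q.includes(tag)).length,
    }))
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((x) => x.tool);

  // Dedupe (core tools may also match) and cap the list.
  return [...new Set([...core, ...matched])].slice(0, maxTools);
}
```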
5. The actual user query
The thing the user typed. Surprisingly often, the smallest fraction of the context.
Fail mode: burying it under 30k tokens of "context" so the model loses the actual question.
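One small defence is assembly order: put the query last, under its own delimiter, so it is the freshest thing the model reads. A minimal sketch (the delimiters are illustrative):

```ts
// Reference material first, the question last and clearly marked.

function buildUserContent(chunks: string[], query: string): string {
  return [
    "## Reference material",
    ...chunks.map((c, i) => `[${i + 1}] ${c}`),
    "## Question",
    query, // the last thing the model reads before it answers
  ].join("\n\n");
}
```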
The 80/20 of context engineering
Most teams I've audited get the same five things wrong:
- Too much retrieval. 12 chunks instead of 4. Top-k is tuned by gut, not by eval.
- No reranking. Cosine similarity is necessary but not sufficient; a cheap cross-encoder rerank often halves your hallucination rate.
- Stale system prompt. Last edited 6 months ago. Half the rules don't apply to the current product.
- Conversation summarization absent. The first 5 turns work great. By turn 20, latency triples and quality degrades.
- Tool list bloat. Every feature added a tool. Nobody removed any. The agent now picks the wrong tool 30% of the time.
Fix two of these and most apps go from "demo-grade" to "ship-grade" overnight.
Context as a budget, not a buffer
The right mental model isn't "fill the context window." It's "spend a budget."
You have N tokens. Every token has an opportunity cost — both in dollars and in attention. Ask of every span in your prompt:
- What does this token do for the answer?
- If I removed it, would the eval score drop?
- Could a shorter form do the same job?
This is the same discipline as performance engineering: measure, find the hot path, cut. The eval set tells you which spans matter.
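Budgets work best when they're explicit in code. A sketch with illustrative numbers and the same rough 4-chars-per-token estimate as above:

```ts
// Give every context source an explicit token budget and flag overruns.

type Source = "system" | "retrieved" | "history" | "tools" | "query";

const BUDGET: Record<Source, number> = {
  system: 1500,
  retrieved: 3000,
  history: 4000,
  tools: 1000,
  query: 500,
};

const estimateTokens = (s: string) => Math.ceil(s.length / 4); // rough

// Returns how far over budget each source is, so you know where to cut.
function auditBudget(spans: Record<Source, string>): Partial<Record<Source, number>> {
  const overruns: Partial<Record<Source, number>> = {};
  for (const source of Object.keys(spans) as Source[]) {
    const spent = estimateTokens(spans[source]);
    if (spent > BUDGET[source]) overruns[source] = spent - BUDGET[source];
  }
  return overruns;
}
```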
The new evaluation question
It used to be "does this prompt work?". Now it's:
"Across the last 200 production traces, what was the average number of tokens spent on retrieval vs system prompt vs history? What was the precision@k of the retrieved chunks? What fraction of tool calls were on tools the agent actually needed?"
If you can't answer those, you don't have context engineering — you have hope.
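Precision@k is the easiest of those to compute once chunks are labeled, by hand on a sample or by a judge model. A minimal sketch over a hypothetical trace shape:

```ts
// Average precision@k across traces: of the k chunks retrieved per call,
// what fraction were actually relevant to the query?

type RetrievalTrace = { chunks: { relevant: boolean }[] };

function precisionAtK(traces: RetrievalTrace[]): number {
  if (traces.length === 0) return 0;
  const perTrace = traces.map(
    (t) => t.chunks.filter((c) => c.relevant).length / Math.max(1, t.chunks.length)
  );
  return perTrace.reduce((a, b) => a + b, 0) / perTrace.length;
}
```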
What we built into the curriculum
We restructured three courses around this shift:
- [Prompt Engineering](https://nextgenailearn.com/paths/prompt-engineering) lessons 8–11 now treat the prompt as one of five context sources, not the whole game. Lesson 11 walks through token-budget allocation.
- [RAG & Vector Databases](https://nextgenailearn.com/paths/rag-vector-dbs) lessons 5–9 cover top-k tuning, reranking strategy, deduplication, and the precision/recall tradeoff with runnable JS.
- [AI Agents](https://nextgenailearn.com/paths/ai-agents) lesson 4 is entirely on tool-list curation and the order-effects on tool selection.
Each one ends with a code-run beat where you measure your own context usage on a tiny eval set. Reading about token budgets is forgetting; computing your own retrieval precision and watching the hallucination rate drop is remembering.
How to start tomorrow
Pick one production LLM call. Today, before lunch:
- Log the full context for the last 100 invocations. Real production traces, not test inputs.
- Bucket the tokens by source: system, retrieved, history, tools, query. (A sketch of this bucketing follows the list.)
- Compute the share for each. If retrieval is >50% and you haven't reranked, that's your hot path.
- Cut the lowest-value bucket by 30%. Re-run your eval set. Did quality drop? If not, you just saved money.
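A sketch of steps 2 and 3, assuming each logged invocation records a token count per source:

```ts
// Sum per-source tokens across invocations and report each source's share.

type Buckets = Record<"system" | "retrieved" | "history" | "tools" | "query", number>;

function tokenShares(invocations: Buckets[]): Buckets {
  const totals: Buckets = { system: 0, retrieved: 0, history: 0, tools: 0, query: 0 };
  for (const inv of invocations) {
    for (const k of Object.keys(totals) as (keyof Buckets)[]) totals[k] += inv[k];
  }
  const grand = Object.values(totals).reduce((a, b) => a + b, 0) || 1;
  const shares = { ...totals };
  for (const k of Object.keys(shares) as (keyof Buckets)[]) shares[k] = totals[k] / grand;
  return shares; // e.g. retrieved > 0.5 with no reranking -> start there
}
```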
Repeat weekly. This is the workflow real GenAI teams have. It's not hard. It's just disciplined.
The model will keep getting cheaper. The context you put around it is your moat.