
Lesson 4 · 11 min

State and memory architecture

An LLM API is stateless — the model retains nothing between requests, so everything it needs to know must be sent with each one. Choosing where to store what (context window vs. cache vs. database vs. vector store) determines your application's cost, latency, and correctness.
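A minimal sketch of what statelessness means in practice: each turn's request must replay the entire conversation. The role/content message shape below is illustrative (most chat APIs use something similar); the helper name is ours, not from any SDK.

```python
def build_request(system_prompt, history, user_message):
    """Assemble the full payload for one turn.

    Nothing survives on the server between calls, so the entire
    prior conversation must be resent every single turn.
    """
    return (
        [{"role": "system", "content": system_prompt}]
        + history  # every prior turn, replayed verbatim
        + [{"role": "user", "content": user_message}]
    )

history = [
    {"role": "user", "content": "What's our refund policy?"},
    {"role": "assistant", "content": "30 days, no questions asked."},
]
request = build_request("You are a support bot.", history, "And for sale items?")
# The model sees all four messages; drop one and it "forgets" that turn.
```

If your app answers coherently on turn one but loses the thread on turn two, the usual culprit is a `history` that was never persisted and replayed.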

The five storage tiers

AI applications have five places to keep state, each with different trade-offs:

| Tier | What lives here | Latency | Cost | Persistence |
|---|---|---|---|---|
| Context window | Current turn, retrieved docs, instructions | 0 ms | Per-token | Ephemeral |
| Prompt cache | Stable prefix, system docs | 0 ms (hit) | ~10% of normal | 5 min TTL |
| In-memory / Redis | Session state, rate-limit counters, job queue | <1 ms | Low | Hours/days |
| Database | User profile, conversation history, preferences | 1–5 ms | Lowest | Permanent |
| Vector store | Semantic knowledge base | 5–50 ms | Low | Permanent |
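The tiers compose in a single turn: stable content goes first (so the prompt-cache prefix stays identical across requests), and volatile per-turn content goes last. A sketch with the stores stubbed as dicts and functions — in production these would be Redis, a relational database, and a vector database; all names here are illustrative.

```python
# Stable system prefix -> eligible for the prompt cache
SYSTEM_DOCS = "You are a support bot. Policy: refunds within 30 days."

db = {"user:42": {"name": "Ada", "plan": "pro"}}   # database tier
redis = {"session:42": {"turns": 3}}               # in-memory tier

def vector_search(query):
    # Stand-in for semantic retrieval against a vector store (5-50 ms).
    return ["Doc: sale items are refundable within 14 days."]

def build_context(user_id, query):
    profile = db[f"user:{user_id}"]        # 1-5 ms lookup, permanent
    session = redis[f"session:{user_id}"]  # <1 ms lookup, short-lived
    docs = vector_search(query)
    # Stable-first ordering keeps the cached prefix byte-identical
    # across requests; anything per-user or per-turn comes after it.
    return "\n".join(
        [SYSTEM_DOCS]
        + docs
        + [f"User {profile['name']} ({profile['plan']}), turn {session['turns']}",
           f"Question: {query}"]
    )

prompt = build_context(42, "Can I return a sale item?")
```

Note the ordering constraint: putting the user's name or turn counter before the system docs would change the prefix on every request and defeat the prompt cache entirely.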

Most bugs in production AI apps come from putting the wrong thing in the wrong tier.