Glossary
The AI vocabulary, in plain English.
51 terms you'll meet in real AI work. Each one links back to the lesson that covers it in depth.
Agent
An LLM in a loop with tools, a goal, and the ability to act.
[Production] Patterns: ReAct (reason → act), planner+executor, multi-agent. The hard parts are tool definitions, error recovery, plan re-evaluation, and not getting stuck in loops. Most production 'agents' are 2-3 step pipelines, not autonomous workers.
ANN (Approximate Nearest Neighbor)
Algorithms that find approximately-closest vectors orders of magnitude faster than exhaustive search.
[RAG] Common variants: HNSW (graph-based, default in pgvector / Qdrant / FAISS), IVF (cluster-based), ScaNN (Google's hybrid). Trade ~1-5% recall for 10-1000x speedup.
Attention
The mechanism that lets a transformer "look at" every previous token when computing each new token.
[Foundations] For each token, the model computes Query/Key/Value projections and uses softmax-weighted sums of values to produce a context-aware representation. Self-attention is the architectural innovation behind transformers.
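A minimal single-head self-attention sketch in NumPy (toy dimensions, no causal mask, no multi-head split):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    # Project each token vector into query, key, and value spaces
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Scaled dot-product scores: how strongly each token attends to every other token
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # context-aware representation per token

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                          # a toy sequence of 5 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)           # (5, 8)
```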
Bi-encoder
A model that encodes query and document independently, then compares with a similarity metric.
[RAG] Fast (you can pre-compute doc embeddings) but limited — it can't model interactions between query and doc. Used for first-stage retrieval. Compare with cross-encoder (reranker).
BM25
A classic keyword-based scoring function used by Elasticsearch and most full-text search.
[RAG] Strong baseline for retrieval. Particularly good at exact identifiers (error codes, model numbers) where embeddings struggle. Combined with vector search → hybrid search.
Chain-of-thought (CoT)
Asking the model to reason out loud before answering.
[Prompting] Sometimes helps a lot (multi-step logic, math). Sometimes does nothing (lookup, classification). On modern reasoning models with extended thinking, forcing visible CoT can hurt — let them think internally.
Chunking
Splitting documents into retrieval-friendly fragments before embedding them.
[RAG] The most important boring decision in RAG. Defaults: 400-800 tokens per chunk, 10-20% overlap, structure-aware splits (on headers/sentences/function definitions, not raw character counts).
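A minimal sketch of fixed-size chunking with overlap (word counts stand in for token counts; real pipelines split on structure and count tokens with the embedding model's tokenizer):

```python
def chunk_text(text, chunk_size=600, overlap=100):
    # Naive word-based chunker: pack words up to chunk_size, then step back by
    # `overlap` words so context carries across chunk boundaries.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap
    return chunks

print(len(chunk_text("lorem ipsum dolor sit amet " * 500)))  # toy input
```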
Context window
The maximum number of tokens (input + output) a model can process in a single call.
[Foundations] Modern models range from 4k to 1M+ tokens. Larger windows enable long documents and rich agent traces, but cost grows roughly linearly with token count, and quality often degrades in the middle of very long contexts.
Cosine similarity
A metric for how similar two vectors are by angle (ignoring magnitude).
[RAG] Defined as dot(a, b) / (|a| × |b|). Range is [-1, 1]. For typical sentence embeddings, paraphrases land around 0.85–0.97; unrelated content around 0.05–0.30.
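The same formula in a few lines of NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): compares direction only, ignores magnitude
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ~1.0: same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0: orthogonal
```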
Cross-encoder
A model that scores query/document pairs together, much more accurately than a bi-encoder.
[RAG] Used as a reranker on top of bi-encoder retrieval. Too slow to run on every doc in your corpus, but perfect for re-scoring the top 30-50 candidates.
DPO (Direct Preference Optimization)
An alternative to RLHF that optimizes directly on preference pairs without an explicit reward model.
[Training] Simpler and often more stable than PPO-RLHF. Increasingly the default for open-source alignment.
Drift
When production data starts looking different from training data, degrading model quality.
[Production] Two flavors: covariate drift (the input distribution changes) and concept drift (the right answer for the same input changes). Detected via monitoring and alerting; remediated by retraining on fresh data.
Embedding
A vector representation of a token or text fragment in a learned semantic space.
[Foundations, RAG] Trained such that semantically related items land nearby. Embedding similarity (usually cosine) is the basis of retrieval, clustering, and many lightweight classifiers.
Embedding model
A specialized model whose output is a single vector representing input text.
[RAG] Different from the LLM you generate with. Examples: text-embedding-3-small (OpenAI), voyage-3 (Voyage), BGE (open-source). For RAG, the embedding model choice often matters more than the generation model.
Eval / Evaluation
Automatically measuring how well a prompt or model performs against a labelled test set.
[Production] Skip eval and you're shipping vibes. Minimum: 20-50 inputs with expected outputs and a scoring function (exact match, regex, JSON-validity, LLM-as-judge). Track regressions per case, not just averages.
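A minimal eval-harness sketch; `ask_model` is a hypothetical stand-in for your real LLM call:

```python
def ask_model(question):
    # Hypothetical stub: replace with your actual LLM call
    return "Refunds are accepted within 30 days of delivery."

cases = [
    {"input": "Refund policy for damaged items?", "expect": "30 days"},
    {"input": "Do you ship to Canada?", "expect": "yes"},
]

def score(output, expect):
    # Substring scoring; swap in regex, JSON validation, or LLM-as-judge as needed
    return expect.lower() in output.lower()

results = {c["input"]: score(ask_model(c["input"]), c["expect"]) for c in cases}
print(f"{sum(results.values())}/{len(results)} passed")
for case, ok in results.items():
    if not ok:
        print("REGRESSION:", case)   # track failures per case, not just the average
```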
Faithfulness
A RAG metric: does the answer use only what's in the retrieved context?
[RAG, Production] An unfaithful RAG system politely hallucinates from training data instead of grounding in the retrieved chunks. Measured via LLM-as-judge by checking each claim against the context.
Few-shot prompting
Including 3–5 input/output examples in the prompt to teach the model your format.
[Prompting] Often outperforms paragraphs of instructions. Choose examples adversarially — include edge cases, not just easy ones. The model is a pattern-matcher; pattern-match it.
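A sketch of what that looks like in practice; the ticket-triage task and labels are made up:

```python
# Three input/output examples teach the format; the third is a deliberate edge case.
FEW_SHOT_PROMPT = """Classify the support ticket as BUG, BILLING, or OTHER. Reply with the label only.

Ticket: The export button crashes the app.
Label: BUG

Ticket: I was charged twice this month.
Label: BILLING

Ticket: The invoice PDF renders blank (started after the last update).
Label: BUG

Ticket: {ticket}
Label:"""

prompt = FEW_SHOT_PROMPT.format(ticket="How do I change my password?")
```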
Fine-tuning
Continuing a model's training on your own labelled data.
[Training] Worth it when you need a capability or style the base model lacks (formatting, vocabulary, niche tone). Rarely worth it for adding knowledge — RAG is usually cheaper and stays fresh.
Function calling / Tool calling
Native API mode that lets an LLM emit structured calls to functions you declare.
[Production] You declare tools with JSON schemas; the model emits {name, args}; you run them; you feed the results back. This is the basis of every modern agent. Anthropic, OpenAI, and Google all support it.
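A provider-agnostic sketch of that loop; `call_llm` is a hypothetical stand-in for whichever SDK you use:

```python
import json

def call_llm(messages, tools):
    # Hypothetical stub: first call emits a tool call, second call answers in text
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "get_weather", "args": {"city": "Lisbon"}}
    return {"type": "text", "content": "It's about 18 °C in Lisbon right now."}

TOOL_SCHEMAS = [{
    "name": "get_weather",
    "description": "Current weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}]
TOOL_IMPLS = {"get_weather": lambda args: json.dumps({"city": args["city"], "temp_c": 18})}

messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]
while True:
    reply = call_llm(messages, tools=TOOL_SCHEMAS)
    if reply["type"] != "tool_call":
        break                                          # final text answer
    result = TOOL_IMPLS[reply["name"]](reply["args"])  # you execute the tool, never the model
    messages.append({"role": "tool", "name": reply["name"], "content": result})
print(reply["content"])
```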
Guardrails
Extra checks around an LLM that filter unsafe inputs or outputs.
[Production, Security] Inputs: PII redaction, prompt-injection detection. Outputs: toxicity filter, schema validation, fact-check against retrieved context. Layered defense — no single guardrail is sufficient.
Hallucination
When the model produces confident-sounding text that is factually wrong.
[Production, RAG] Caused by: distribution mismatch, missing context, ambiguous prompt, training-cutoff knowledge gaps. Mitigations: RAG with strict 'use only context' instructions, eval, citations, output guardrails.
Hybrid search
Combining keyword (BM25) and vector search, then merging results.
[RAG] Pure vector misses exact identifiers (error codes, names). Pure keyword misses paraphrases. Hybrid plus a reranker is the modern default for production RAG. Reciprocal Rank Fusion (RRF) is the standard combiner.
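A sketch of Reciprocal Rank Fusion over two ranked lists of document ids:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Each document scores sum(1 / (k + rank)) across lists; k=60 is the usual constant
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],   # BM25 ranking
    ["doc1", "doc9", "doc3"],   # vector-search ranking
])
print(merged)                   # doc1 and doc3 rise to the top: found by both retrievers
```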
Inference
Running a trained model to produce outputs (vs *training*, which updates weights).
[Production] Where most production cost lives. Optimization stack: serving framework (vLLM, TGI, Triton, SGLang), batching, KV-cache management, quantization, speculative decoding.
KV cache
Stored Key/Value vectors from previous tokens, reused so the model doesn't recompute them.
[Foundations, Production] During inference, the cache grows linearly with context length — and so does memory and compute. Provider-side prompt caching lets multiple requests share the same KV state for stable prefixes.
LLM
Large Language Model — a transformer trained on text to predict the next token.
[Foundations] Generative model that maps a sequence of input tokens to a distribution over the next token. Modern LLMs (Claude, GPT, Gemini, Llama) range from a few billion to hundreds of billions of parameters and excel at language tasks because next-token prediction at scale turns out to require modelling syntax, semantics, and a surprising amount of world knowledge.
LLM-as-judge
Using a (usually stronger) LLM to score the output of another LLM against a rubric.
[Production] Practical for tasks where regex/exact-match doesn't work (summaries, rewrites, open-ended answers). Cache aggressively. Validate the judge against a small human-labelled subset to catch judge-specific biases.
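A sketch of the judge call; `call_judge_model` is a hypothetical stand-in for your stronger model, and the rubric is only an example:

```python
import json

def call_judge_model(prompt, temperature=0):
    # Hypothetical stub: replace with a real call to the judge model
    return '{"score": 4, "reason": "Covers the key points but compresses the timeline."}'

JUDGE_PROMPT = """Score the summary against the rubric, 1-5.
Rubric: covers every key point of the source; adds nothing that is not in the source.

Source:
{source}

Summary:
{summary}

Reply with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge(source, summary):
    raw = call_judge_model(JUDGE_PROMPT.format(source=source, summary=summary), temperature=0)
    return json.loads(raw)      # cache results keyed by (source, summary) to avoid re-scoring

print(judge("Long incident report text...", "Short summary text..."))
```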
LoRA
Low-Rank Adaptation. Fine-tune efficiently by updating small "adapter" matrices instead of the full model.
[Training] Trains in a fraction of the time and memory. QLoRA combines LoRA with 4-bit quantization, putting 7–13B fine-tunes within reach of consumer GPUs.
Lost in the middle
Empirical finding that LLMs attend less to information in the middle of long contexts.
[Production, RAG] Accuracy follows a U-shape: high at the start and end of the context, low in the middle. Mitigations: rerank before stuffing; deduplicate; place the best chunks at the top and bottom of the context.
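A sketch of that last mitigation: order reranked chunks so the strongest sit at the edges of the context:

```python
def order_for_long_context(chunks_best_first):
    # Alternate chunks between the head and the tail so the weakest end up in the middle
    head, tail = [], []
    for i, chunk in enumerate(chunks_best_first):
        (head if i % 2 == 0 else tail).append(chunk)
    return head + tail[::-1]

print(order_for_long_context(["c1", "c2", "c3", "c4", "c5"]))
# ['c1', 'c3', 'c5', 'c4', 'c2']: best chunk first, second-best last, weakest mid-context
```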
MLOps
The discipline of running ML systems in production.
[Production] Inference serving, monitoring (latency, drift, accuracy), CI/CD for models, cost tracking, A/B testing, rollback strategy. Closer to SRE than to data science.
MoE (Mixture of Experts)
Architecture where only a subset of model parameters activates per token, via a learned router.
[Foundations] Mixtral and DeepSeek are openly MoE; GPT-4 is widely believed to be. Trades static parameter count for compute efficiency — the model can be huge but only a fraction runs per token.
Multimodal
A model (or system) that processes and generates across more than one modality — text, image, audio, or video.
[Foundations] Modern frontier models (GPT-4o, Gemini 2.0, Claude with vision) accept images interleaved with text inside the same context window. A modality encoder — a ViT for images, a Whisper-style encoder for audio — converts raw pixels or waveforms into tokens the LLM backbone already understands.
The four most production-ready multimodal workloads in 2026: document understanding, chart/diagram Q&A, screenshot analysis, and video-frame captioning. For purely text tasks, use a text-only model — vision tokens cost 2–5× more per equivalent token at most providers.
Example: sending a screenshot of an error traceback to a vision-capable model and asking it to explain the failure is faster than manually transcribing the text.
Prompt caching
Reusing a model's computed KV state across requests for the stable prefix of a prompt.
[Production] Anthropic, OpenAI, and Google all support this. Putting the system prompt and few-shot examples first (variable user input last) makes the prefix stable — typical savings: 80–90% on the cached portion at a 5-minute TTL.
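A sketch of cache-friendly message assembly (the exact cache API differs per provider; the point is the ordering):

```python
def build_messages(system_prompt, few_shot_examples, user_input):
    # Stable, expensive parts first so the provider can reuse the cached prefix;
    # only the final user message changes between requests.
    return (
        [{"role": "system", "content": system_prompt}]
        + few_shot_examples
        + [{"role": "user", "content": user_input}]
    )
```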
Prompt injection
An attack where untrusted input causes the model to abandon its intended task.
[Security] Direct: 'Ignore previous instructions...'. Indirect: hidden instructions in fetched content (web pages, emails). Mitigations: tagged delimiters, instruction-level guards, output filtering, privilege containment, red-teaming. There is no perfect defense yet — defense in depth.
Quantization
Storing model weights at lower precision (8-bit, 4-bit) to shrink size and speed up inference.
[Production] Modern quantization (GPTQ, AWQ, GGUF) can hit 4-bit with surprisingly little quality loss. Critical for self-hosting big models on consumer GPUs.
RAG
Retrieval-Augmented Generation. Fetch relevant snippets, stuff into prompt, generate.
[RAG] A pattern, not a product. Two halves: an offline ingest (chunk → embed → store) and an online query (embed → retrieve → stuff → generate). Most failures come from chunking and retrieval, not the LLM.
Reasoning model
A model trained to produce internal reasoning before its final answer.
[Foundations] Examples: Claude Opus with extended thinking, OpenAI's o-series. Burns tokens internally to reason, then returns just the answer. Helpful for multi-step tasks; overkill for simple lookups.
Recall@k / Precision@k
Retrieval metrics: of the top-k results, how many are relevant (precision); of all relevant items, how many appear in top-k (recall).
[RAG, Production] Both matter. High precision + low recall = you missed answers. High recall + low precision = your LLM has to wade through noise. Track both per query.
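In code, against a labelled set of relevant document ids:

```python
def precision_recall_at_k(retrieved, relevant, k):
    # retrieved: ranked doc ids from your retriever; relevant: ground-truth relevant ids
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k, (hits / len(relevant) if relevant else 0.0)

p, r = precision_recall_at_k(["d2", "d5", "d9", "d1"], {"d2", "d1", "d7"}, k=4)
print(p, r)   # precision 0.5 (2 of 4 retrieved are relevant), recall ~0.67 (2 of 3 relevant found)
```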
Reranker
A second-pass model that re-scores retrieved candidates by reading query + doc together.
[RAG] Cross-encoders (the 'right tool' for reranking) score query/document pairs directly, modelling fine-grained interactions that bi-encoder vector similarity can't. Examples: Cohere Rerank, BGE Reranker. Two-stage retrieval (recall + rerank) is a major quality win.
RLHF
Reinforcement Learning from Human Feedback — the alignment technique behind ChatGPT.
[Training] Train a reward model on human preference labels, then optimize the LLM (often via PPO) to score well. Newer methods (DPO, IPO) skip the explicit RL step. Critical for making raw next-token models actually useful.
Sampling
Choosing a token from the model's output probability distribution.
[Foundations, Prompting] Greedy = always pick the most likely token. Temperature reshapes the distribution (low → deterministic, high → varied). Top-p / top-k clip the tail. For most production work, start with temperature 0–0.3 and only raise it for creative tasks.
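A sketch of greedy, temperature, and top-p sampling over a vector of logits:

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.95, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))                     # greedy: always the most likely token
    scaled = logits / temperature                         # low temp sharpens, high temp flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                       # most likely first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    kept = order[:cutoff]                                 # smallest set covering top_p probability
    return int(rng.choice(kept, p=probs[kept] / probs[kept].sum()))

print(sample_next_token([2.0, 1.0, 0.2, -1.0], temperature=0))    # always 0
print(sample_next_token([2.0, 1.0, 0.2, -1.0], temperature=1.0))  # usually 0 or 1; varies per call
```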
Speculative decoding
Use a small fast model to draft tokens, then verify with the big model in parallel.
[Production] When the small model agrees with the big one (most of the time), you get the big model's quality at much higher speed. 2-4× throughput is typical.
Structured output
Forcing the model to output valid JSON (or another schema) — usually via API mode or strict prompting.
[Prompting, Production] Native structured-output / tool-use APIs guarantee valid JSON against a schema. Prompt-only JSON works with strict instructions ('output starts with {', 'no markdown fences') but is more brittle.
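A sketch of the defensive parsing you want around prompt-only JSON (the required keys here are an example):

```python
import json

REQUIRED_KEYS = {"sentiment", "confidence"}

def parse_structured_output(raw: str) -> dict:
    # Pull out the outermost JSON object in case the model wrapped it in prose or fences,
    # then validate required keys; on failure, retry the call rather than passing junk downstream.
    text = raw[raw.find("{"): raw.rfind("}") + 1]
    data = json.loads(text)                       # raises on invalid JSON
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

print(parse_structured_output('{"sentiment": "positive", "confidence": 0.92}'))
```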
System prompt
Durable instructions placed at the top of the conversation that set role, tone, and constraints.
[Prompting] Treated by the model as higher-priority than user messages. Long, stable system prompts are the prime target for prompt caching.
Temperature
Sampling parameter that controls how random the model's output is.
[Prompting] Lower (0.0–0.3) → deterministic, repeatable, good for code and extraction. Higher (0.7–1.0) → creative, varied, good for ideation. Above 1.0 generally hurts coherence.
Test-time compute
Letting the model think for longer at inference, instead of (only) scaling training.
[Foundations] The 2024-2026 frontier. Reasoning models, ensembling, multi-sample voting, beam search, agent self-correction — all variants of "spend more compute per query for higher quality".
Token
A unit of text the model actually sees — usually a sub-word.
[Foundations] Tokenizers (most use Byte-Pair Encoding) split input into integer IDs. A common rule of thumb is 1 token ≈ 0.75 English words ≈ 4 characters. Cost, context limits, and rate limits are all denominated in tokens — never in words.
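A back-of-the-envelope cost check using that rule of thumb (the price is a made-up example; exact counts need the model's own tokenizer):

```python
prompt = "Summarize the attached incident report and list action items. " * 200  # a long prompt
approx_tokens = len(prompt) / 4                  # rule of thumb: ~4 characters per token
price_per_mtok = 3.00                            # hypothetical $ per million input tokens
print(f"~{approx_tokens:,.0f} tokens, ~${approx_tokens / 1e6 * price_per_mtok:.4f} per call")
```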
Tool use
Letting an LLM call functions you define (search, DB query, send email, etc).
[Production] Native APIs (Anthropic, OpenAI) let you declare tool schemas. The model returns a tool call; you execute it; you feed results back. Privilege containment is critical — gate dangerous tools behind separate auth.
Top-p (nucleus sampling)
Only sample from the smallest set of tokens whose cumulative probability is ≤ p.
[Prompting] Useful safety net when temperature is non-zero — clips the long tail of unlikely tokens. Common default: 0.9–0.95.
Transformer
The neural-network architecture (introduced in 2017) that powers modern LLMs.
[Foundations] A stack of identical blocks, each containing layer normalization, multi-head attention, and a feed-forward network — connected by residuals. Replaced RNN/LSTM as the dominant sequence model.
Vector database
A specialized index for "find the k nearest vectors" at scale.
[RAG] Examples: pgvector (Postgres), Pinecone, Weaviate, Qdrant, ChromaDB. Use approximate nearest neighbor (HNSW, IVF) for speed at the cost of 1-5% recall. You don't need one until you have ~100k+ chunks — a flat array is fine before that.
vLLM
An inference engine that uses PagedAttention to serve LLMs efficiently at scale.
[Production] Standard self-hosted serving stack alongside TGI (HuggingFace) and Triton (NVIDIA). Massive throughput improvements over naive Hugging Face Transformers serving.