
Glossary

The AI vocabulary, in plain English.

50 terms you'll meet in real AI work. Each one links back to the lesson that covers it in depth.

  • Agent

    An LLM in a loop with tools, a goal, and the ability to act.

    Production

    Patterns: ReAct (reason → act), planner+executor, multi-agent. The hard parts are tool definitions, error recovery, plan re-evaluation, and not getting stuck in loops. Most production 'agents' are 2-3 step pipelines, not autonomous workers.

  • ANN (Approximate Nearest Neighbor)

    Algorithms that find approximately-closest vectors orders of magnitude faster than exhaustive search.

    RAG

    Common variants: HNSW (graph-based, default in pgvector / Qdrant / FAISS), IVF (cluster-based), ScaNN (Google's hybrid). Trade ~1-5% recall for 10-1000x speedup.

  • Attention

    The mechanism that lets a transformer "look at" every previous token when computing each new token.

    Foundations

    For each token, the model computes Query/Key/Value projections and uses softmax-weighted sums of values to produce a context-aware representation. Self-attention is the architectural innovation behind transformers.
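
    A minimal single-head self-attention sketch in NumPy (toy sizes, random weights, causal masking omitted), just to make the Query/Key/Value mechanics concrete:

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        """Single attention head over token vectors X of shape (seq_len, d_model)."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # project into Query/Key/Value spaces
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # how strongly each token attends to the others
        weights = softmax(scores, axis=-1)           # each row sums to 1
        return weights @ V                           # softmax-weighted sum of values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))                     # 5 tokens, d_model = 16
    W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape)    # (5, 8): one context-aware vector per token
    ```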

  • Bi-encoder

    A model that encodes query and document independently, then compares with a similarity metric.

    RAG

    Fast (you can pre-compute doc embeddings) but limited — it can't model interactions between query and doc. Used for first-stage retrieval. Compare with cross-encoder (reranker).

  • BM25

    A classic keyword-based scoring function used by Elasticsearch and most full-text search.

    RAG

    Strong baseline for retrieval. Particularly good at exact identifiers (error codes, model numbers) where embeddings struggle. Combined with vector search → hybrid search.
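
    A from-scratch sketch of the Okapi BM25 scoring function with the usual k1/b defaults; real engines like Elasticsearch add text analysis and tuning on top:

    ```python
    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.5, b=0.75):
        """Score each doc (a list of tokens) against a query (a list of tokens)."""
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        df = Counter()                               # number of docs each term appears in
        for d in docs:
            df.update(set(d))
        scores = []
        for d in docs:
            tf = Counter(d)
            score = 0.0
            for term in query:
                if term not in tf:
                    continue
                idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
                score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
            scores.append(score)
        return scores

    docs = [s.split() for s in ["error E4021 connection refused",
                                "how to reset your password",
                                "troubleshooting connection timeouts"]]
    print(bm25_scores("error E4021".split(), docs))  # the doc with the exact identifier wins
    ```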

  • Chain-of-thought (CoT)

    Asking the model to reason out loud before answering.

    Prompting

    Sometimes helps a lot (multi-step logic, math). Sometimes does nothing (lookup, classification). On modern reasoning models with extended thinking, forcing visible CoT can hurt — let them think internally.

  • Chunking

    Splitting documents into retrieval-friendly fragments before embedding them.

    RAG

    Most important boring decision in RAG. Defaults: 400-800 tokens per chunk, 10-20% overlap, structure-aware (split on headers/sentences/function definitions, not raw character counts).
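
    A rough structure-aware chunker along those lines, using whitespace-separated words as a stand-in for tokens and markdown headers as split points (the ~600-word cap and 15% overlap are illustrative defaults):

    ```python
    import re

    def chunk_markdown(text, max_words=600, overlap=0.15):
        """Split on headers first, then cap each section with a sliding window and overlap."""
        sections = re.split(r"\n(?=#{1,3} )", text)        # keep each header with its section
        step = int(max_words * (1 - overlap))              # 15% overlap between neighbours
        chunks = []
        for section in sections:
            words = section.split()
            for start in range(0, max(len(words), 1), step):
                chunk = " ".join(words[start:start + max_words])
                if chunk:
                    chunks.append(chunk)
        return chunks
    ```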

  • Context window

    The maximum number of tokens (input + output) a model can process in a single call.

    Foundations

    Modern models range from 4k to 1M+ tokens. Larger windows enable long documents and rich agent traces but cost grows roughly linearly in tokens, and quality often degrades in the middle of very long contexts.

  • Cosine similarity

    A metric for how similar two vectors are by angle (ignoring magnitude).

    RAG

    Defined as dot(a, b) / (|a| × |b|). Range is [-1, 1]. For typical sentence embeddings, paraphrases land around 0.85–0.97; unrelated content around 0.05–0.30.
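
    The formula above as a few lines of NumPy:

    ```python
    import numpy as np

    def cosine_similarity(a, b):
        """dot(a, b) / (|a| * |b|): similarity by angle only, magnitude ignored."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity([1, 0, 1], [2, 0, 2]))   # 1.0  -> same direction, magnitude ignored
    print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # 0.0  -> orthogonal
    print(cosine_similarity([1, 0, 0], [-1, 0, 0]))  # -1.0 -> opposite direction
    ```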

  • Cross-encoder

    A model that scores query/document pairs together, much more accurately than bi-encoder.

    RAG

    Used as a reranker on top of bi-encoder retrieval. Too slow to run on every doc in your corpus, perfect for re-scoring the top 30-50 candidates.

  • DPO (Direct Preference Optimization)

    An alternative to RLHF that optimizes directly on preference pairs without an explicit reward model.

    Training

    Simpler and often more stable than PPO-RLHF. Increasingly the default for open-source alignment.

  • Drift

    When production data starts looking different from training data, degrading model quality.

    Production

    Two flavors: covariate drift (input distribution changes) and concept drift (the right answer for the same input changes). Detected via monitoring; remediated by re-training or alerting.

  • Embedding

    A vector representation of a token or text fragment in a learned semantic space.

    Foundations, RAG

    Trained such that semantically related items land nearby. Embedding similarity (usually cosine) is the basis of retrieval, clustering, and many lightweight classifiers.

  • Embedding model

    A specialized model whose output is a single vector representing input text.

    RAG

    Different from the LLM you generate with. Examples: text-embedding-3-small (OpenAI), voyage-3 (Voyage), BGE (open-source). For RAG, the embedding model choice often matters more than the generation model.

  • Eval / Evaluation

    Automatically measuring how well a prompt or model performs against a labelled test set.

    Production

    Skip eval and you're shipping vibes. Minimum: 20-50 inputs with expected outputs and a scoring function (exact match, regex, JSON-validity, LLM-as-judge). Track regressions per case, not just averages.
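
    A minimal eval-harness sketch of that shape; `generate` is a placeholder for whatever calls your model, and the two cases are invented for illustration:

    ```python
    import json

    def is_valid_json(text):
        try:
            json.loads(text)
            return True
        except ValueError:
            return False

    # Hypothetical labelled cases: an input and a scoring function per case.
    CASES = [
        {"id": "extract-invoice-number", "input": "Invoice #123, total $40",
         "check": lambda out: "123" in out},
        {"id": "returns-valid-json", "input": "Return the user record as JSON.",
         "check": is_valid_json},
    ]

    def run_eval(generate, cases=CASES):
        """generate(prompt) -> model output. Report per-case results, not just the average."""
        results = {c["id"]: bool(c["check"](generate(c["input"]))) for c in cases}
        print(f"{sum(results.values())}/{len(results)} passed")
        return results   # diff against the previous run to catch per-case regressions
    ```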

  • Faithfulness

    A RAG metric: does the answer use only what's in the retrieved context?

    RAG, Production

    An unfaithful RAG hallucinates politely from training data instead of grounding in retrieved chunks. Measured via LLM-as-judge by checking each claim against the context.

  • Few-shot prompting

    Including 3–5 input/output examples in the prompt to teach the model your format.

    Prompting

    Often outperforms paragraphs of instructions. Choose examples adversarially — include edge cases, not just easy ones. The model is a pattern-matcher; pattern-match it.

  • Fine-tuning

    Continuing a model's training on your own labelled data.

    Training

    Worth it when you need a capability or style the base model lacks (formatting, vocabulary, niche tone). Rarely worth it for adding knowledge — RAG is usually cheaper and stays fresh.

  • Function calling / Tool calling

    Native API mode that lets an LLM emit structured calls to functions you declare.

    Production

    You declare tools with JSON schemas; the model emits {name, args}; you run them; you feed results back. The basis of every modern agent. Anthropic, OpenAI, Google all support it.
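
    A provider-agnostic sketch of that loop. The message and reply shapes are simplified stand-ins, not any vendor's exact wire format, and `call_model` is a placeholder for your SDK call:

    ```python
    import json

    # One tool declared with a JSON schema, roughly the shape all major providers accept.
    TOOLS = [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }]

    def get_weather(city):
        return f"18°C and cloudy in {city}"              # stand-in for a real weather API

    def run_tool_loop(call_model, user_message):
        messages = [{"role": "user", "content": user_message}]
        while True:
            reply = call_model(messages, TOOLS)           # model either answers or requests a tool
            call = reply.get("tool_call")                 # e.g. {"name": "get_weather", "args": {...}}
            if call is None:
                return reply["content"]                   # final answer for the user
            result = {"get_weather": get_weather}[call["name"]](**call["args"])
            messages.append({"role": "assistant", "tool_call": call})
            messages.append({"role": "tool", "content": json.dumps({"result": result})})
    ```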

  • Guardrails

    Extra checks around an LLM that filter unsafe inputs or outputs.

    Production, Security

    Inputs: PII redaction, prompt-injection detection. Outputs: toxicity filter, schema validation, fact-check against retrieved context. Layered defense — no single guardrail is sufficient.

  • Hallucination

    When the model produces confident-sounding text that is factually wrong.

    Production, RAG

    Caused by: distribution mismatch, missing context, ambiguous prompt, training-cutoff knowledge gaps. Mitigations: RAG with strict 'use only context' instructions, eval, citations, output guardrails.

  • Inference

    Running a trained model to produce outputs (vs *training*, which updates weights).

    Production

    Where most production cost lives. Optimization stack: serving framework (vLLM, TGI, Triton, SGLang), batching, KV-cache management, quantization, speculative decoding.

  • KV cache

    Stored Key/Value vectors from previous tokens, reused so the model doesn't recompute them.

    Foundations, Production

    During inference, the cache grows linearly with context length — and so does memory and compute. Provider-side prompt caching lets multiple requests share the same KV state for stable prefixes.

  • LLM

    Large Language Model — a transformer trained on text to predict the next token.

    Foundations

    Generative model that maps a sequence of input tokens to a distribution over the next token. Modern LLMs (Claude, GPT, Gemini, Llama) range from a few billion to hundreds of billions of parameters and excel at language tasks because next-token prediction at scale turns out to require modelling syntax, semantics, and a surprising amount of world knowledge.

  • LLM-as-judge

    Using a (usually stronger) LLM to score the output of another LLM against a rubric.

    Production

    Practical for tasks where regex/exact-match doesn't work (summaries, rewrites, open-ended answers). Cache aggressively. Validate the judge against a small human-labelled subset to catch judge-specific biases.

  • LoRA

    Low-Rank Adaptation. Fine-tune efficiently by updating small "adapter" matrices instead of the full model.

    Training

    Trains in a fraction of the time and memory. QLoRA combines LoRA with 4-bit quantization, putting 7–13B fine-tunes within reach of consumer GPUs.

  • Lost in the middle

    Empirical finding that LLMs attend less to information in the middle of long contexts.

    Production, RAG

    Accuracy follows a U-shape: high at the start and end, low in the middle. Mitigations: rerank before stuffing; deduplicate; place the best chunks at the top and bottom of the context.
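
    One way to implement the placement trick: after reranking, alternate chunks between the front and the back of the context so the weakest ones land in the middle.

    ```python
    def order_for_context(scored_chunks):
        """scored_chunks: list of {'text': ..., 'score': ...}. Best chunks go to both ends."""
        ranked = sorted(scored_chunks, key=lambda c: c["score"], reverse=True)
        front, back = [], []
        for i, chunk in enumerate(ranked):
            (front if i % 2 == 0 else back).append(chunk)   # 1st -> front, 2nd -> back, ...
        return front + back[::-1]                            # strongest at both ends, weakest mid

    chunks = [{"text": f"c{i}", "score": s} for i, s in enumerate([0.9, 0.7, 0.5, 0.3, 0.1])]
    print([c["score"] for c in order_for_context(chunks)])   # [0.9, 0.5, 0.1, 0.3, 0.7]
    ```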

  • MLOps

    The discipline of running ML systems in production.

    Production

    Inference serving, monitoring (latency, drift, accuracy), CI/CD for models, cost tracking, A/B testing, rollback strategy. Closer to SRE than to data science.

  • MoE (Mixture of Experts)

    Architecture where only a subset of model parameters activates per token, via a learned router.

    Foundations

    Mixtral and DeepSeek are openly MoE; GPT-4 is widely believed to be. Trades static parameter count for compute efficiency — the model can be huge but only a fraction runs per token.

  • Multimodal

    A model (or system) that processes and generates across more than one modality — text, image, audio, or video.

    Foundations

    Modern frontier models (GPT-4o, Gemini 2.0, Claude with vision) accept images interleaved with text inside the same context window. A modality encoder — a ViT for images, a Whisper-style encoder for audio — converts raw pixels or waveforms into tokens the LLM backbone already understands.

    The four most production-ready multimodal workloads in 2026: document understanding, chart/diagram Q&A, screenshot analysis, and video-frame captioning. For purely text tasks, use a text-only model — vision tokens cost 2–5× more per equivalent token at most providers.

    Example: sending a screenshot of an error traceback to a vision-capable model and asking it to explain the failure is faster than manually transcribing the text.

  • Prompt caching

    Reusing a model's computed KV state across requests for the stable prefix of a prompt.

    Production

    Anthropic, OpenAI, and Google all support this. Putting the system prompt and few-shot examples first (variable user input last) makes the prefix stable — typical savings: 80–90% on the cached portion at a 5-minute TTL.

  • Prompt injection

    An attack where untrusted input causes the model to abandon its intended task.

    Security

    Direct: 'Ignore previous instructions...'. Indirect: hidden instructions in fetched content (web pages, emails). Mitigations: tagged delimiters, instruction-level guards, output filtering, privilege containment, red-teaming. There is no perfect defense yet — defense in depth.

  • Quantization

    Storing model weights at lower precision (8-bit, 4-bit) to shrink size and speed up inference.

    Production

    Modern quantization (GPTQ, AWQ, GGUF) can hit 4-bit with surprisingly little quality loss. Critical for self-hosting big models on consumer GPUs.

  • RAG

    Retrieval-Augmented Generation. Fetch relevant snippets, stuff into prompt, generate.

    RAG

    A pattern, not a product. Two halves: an offline ingest (chunk → embed → store) and an online query (embed → retrieve → stuff → generate). Most failures come from chunking and retrieval, not the LLM.
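
    A toy version of both halves in one class. The `embed` stub just hashes text so the sketch runs; swap in a real embedding model:

    ```python
    import numpy as np

    def embed(text):
        """Placeholder embedding: deterministic random vector per text. Replace with a real model."""
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        return rng.normal(size=64)

    class TinyRAG:
        def __init__(self):
            self.chunks, self.vectors = [], []

        def ingest(self, chunks):                          # offline half: chunk -> embed -> store
            for chunk in chunks:
                self.chunks.append(chunk)
                self.vectors.append(embed(chunk))

        def retrieve(self, query, k=5):                    # online half, step 1: embed -> retrieve
            q = embed(query)
            sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in self.vectors]
            top = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:k]
            return [self.chunks[i] for i in top]

        def build_prompt(self, query, k=5):                # online half, step 2: stuff -> generate
            context = "\n\n".join(self.retrieve(query, k))
            return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    ```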

  • Reasoning model

    A model trained to produce internal reasoning before its final answer.

    Foundations

    Examples: Claude Opus extended thinking, OpenAI o-series. Burns tokens internally to reason, returns just the answer. Helpful for multi-step tasks; overkill for simple lookups.

  • Recall@k / Precision@k

    Retrieval metrics: of the top-k results, how many are relevant (precision); of all relevant items, how many appear in top-k (recall).

    RAG, Production

    Both matter. High precision + low recall = you missed answers. High recall + low precision = your LLM has to wade through noise. Track both per query.
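
    Both metrics in a few lines, with an invented query to show how they diverge:

    ```python
    def precision_at_k(retrieved, relevant, k):
        """Of the top-k retrieved IDs, what fraction are relevant?"""
        return sum(doc_id in relevant for doc_id in retrieved[:k]) / k

    def recall_at_k(retrieved, relevant, k):
        """Of all relevant IDs, what fraction appear in the top-k?"""
        top_k = set(retrieved[:k])
        return sum(doc_id in top_k for doc_id in relevant) / len(relevant)

    retrieved = ["d3", "d7", "d1", "d9", "d2"]       # ranked retriever output
    relevant = {"d1", "d4", "d7"}                    # ground-truth relevant docs for this query
    print(precision_at_k(retrieved, relevant, 5))    # 0.4  -> 2 of the top 5 are relevant
    print(recall_at_k(retrieved, relevant, 5))       # 0.67 -> 2 of the 3 relevant docs were found
    ```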

  • Reranker

    A second-pass model that re-scores retrieved candidates by reading query + doc together.

    RAG

    Cross-encoders (the 'right tool' for reranking) score query/document pairs directly, modelling fine-grained interactions that bi-encoder vector similarity can't. Examples: Cohere Rerank, BGE Reranker. Two-stage retrieval (recall + rerank) is a major quality win.
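
    A second-stage rerank sketch using the sentence-transformers CrossEncoder class; the checkpoint name is just one commonly used open reranker, not a specific recommendation:

    ```python
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # any cross-encoder checkpoint works

    def rerank(query, candidates, top_n=5):
        """Re-score the top 30-50 first-stage candidates by reading (query, doc) pairs together."""
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]
    ```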

  • RLHF

    Reinforcement Learning from Human Feedback — the alignment technique behind ChatGPT.

    Training

    Train a reward model on human preference labels, then optimize the LLM (often via PPO) to score well. Newer methods (DPO, IPO) skip the explicit RL step. Critical for making raw next-token models actually useful.

  • Sampling

    Choosing a token from the model's output probability distribution.

    Foundations, Prompting

    Greedy = always pick the most likely. Temperature reshapes the distribution (low → deterministic, high → varied). Top-p / top-k clip the tail. For most production work, start with temperature 0–0.3 and only raise it for creative tasks.
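
    A sketch of greedy, temperature, and top-p sampling applied to a raw logit vector (NumPy only; assumes the logits have already been computed by the model):

    ```python
    import numpy as np

    def sample_token(logits, temperature=0.7, top_p=0.95, seed=None):
        """temperature == 0 -> greedy; otherwise temperature-scaled nucleus (top-p) sampling."""
        logits = np.asarray(logits, dtype=float)
        if temperature == 0:
            return int(np.argmax(logits))                          # always the most likely token
        probs = np.exp((logits - logits.max()) / temperature)      # low T sharpens, high T flattens
        probs /= probs.sum()
        order = np.argsort(probs)[::-1]                            # most to least likely
        cumulative = np.cumsum(probs[order])
        nucleus = order[: np.searchsorted(cumulative, top_p) + 1]  # smallest set covering >= top_p
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()
        return int(np.random.default_rng(seed).choice(nucleus, p=nucleus_probs))
    ```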

  • Speculative decoding

    Use a small fast model to draft tokens, then verify with the big model in parallel.

    Production

    When the small model agrees with the big one (most of the time), you get the big model's quality at much higher speed. 2-4× throughput is typical.

  • Structured output

    Forcing the model to output valid JSON (or another schema) — usually via API mode or strict prompting.

    Prompting, Production

    Native structured-output / tool-use APIs guarantee valid JSON against a schema. Prompt-only JSON works with strict instructions ('output starts with {', 'no markdown fences') but is more brittle.
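
    A typical validation layer on the receiving side, here sketched with Pydantic and an invented `Invoice` schema; on failure you re-prompt with the validation error attached:

    ```python
    import json
    from pydantic import BaseModel, ValidationError

    class Invoice(BaseModel):                    # hypothetical schema for illustration
        invoice_id: str
        total_usd: float
        line_items: list[str]

    def parse_model_output(raw):
        """Return a validated Invoice, or None so the caller can retry with the error message."""
        try:
            return Invoice.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError):
            return None

    print(parse_model_output('{"invoice_id": "INV-7", "total_usd": 120.5, "line_items": ["consulting"]}'))
    ```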

  • System prompt

    Durable instructions placed at the top of the conversation that set role, tone, and constraints.

    Prompting

    Treated by the model as higher-priority than user messages. Long, stable system prompts are the prime target for prompt caching.

  • Temperature

    Sampling parameter that controls how random the model's output is.

    Prompting

    Lower (0.0–0.3) → deterministic, repeatable, good for code and extraction. Higher (0.7–1.0) → creative, varied, good for ideation. Above 1.0 generally hurts coherence.

  • Test-time compute

    Letting the model think for longer at inference, instead of (only) scaling training.

    Foundations

    The 2024-2026 frontier. Reasoning models, ensembling, multi-sample voting, beam search, agent self-correction — all variants of "spend more compute per query for higher quality".

  • Token

    A unit of text the model actually sees — usually a sub-word.

    Foundations

    Tokenizers (most use Byte-Pair Encoding) split input into integer IDs. A common rule of thumb is 1 token ≈ 0.75 English words ≈ 4 characters. Cost, context limits, and rate limits are all denominated in tokens — never in words.
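
    You can see the word-to-token ratio directly with a tokenizer library such as tiktoken (cl100k_base is one OpenAI BPE vocabulary; other providers use their own):

    ```python
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    text = "Tokenizers split text into sub-word pieces before the model ever sees it."
    ids = enc.encode(text)
    print(len(text.split()), "words ->", len(ids), "tokens")   # roughly 4 characters per token
    print(ids[:6], "...")                                       # the integer IDs the model consumes
    ```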

  • Tool use

    Letting an LLM call functions you define (search, DB query, send email, etc).

    Production

    Native APIs (Anthropic, OpenAI) let you declare tool schemas. The model returns a tool call; you execute it; you feed results back. Privilege containment is critical — gate dangerous tools behind separate auth.

  • Top-p (nucleus sampling)

    Only sample from the smallest set of tokens whose cumulative probability is ≥ p.

    Prompting

    Useful safety net when temperature is non-zero — clips the long tail of unlikely tokens. Common default: 0.9–0.95.

  • Transformer

    The neural-network architecture (introduced in 2017) that powers modern LLMs.

    Foundations

    A stack of identical blocks, each containing layer normalization, multi-head attention, and a feed-forward network — connected by residuals. Replaced RNN/LSTM as the dominant sequence model.

  • Vector database

    A specialized index for "find the k nearest vectors" at scale.

    RAG

    Examples: pgvector (Postgres), Pinecone, Weaviate, Qdrant, ChromaDB. Use approximate nearest neighbor (HNSW, IVF) for speed at the cost of 1-5% recall. You don't need one until you have ~100k+ chunks — a flat array is fine before that.
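
    The "flat array" baseline is just exhaustive cosine search over a NumPy matrix, as sketched below; only when that gets too slow do ANN indexes earn their complexity.

    ```python
    import numpy as np

    def top_k_exact(query_vec, matrix, k=5):
        """Exhaustive cosine search over row-normalised embeddings. Fine up to ~100k chunks."""
        sims = matrix @ query_vec / (np.linalg.norm(query_vec) + 1e-12)
        return np.argsort(sims)[::-1][:k]            # indices of the k most similar chunks

    corpus = np.random.default_rng(0).normal(size=(10_000, 384))
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)     # normalise once at ingest time
    query = corpus[42] + 0.01                                    # a vector very close to chunk 42
    print(top_k_exact(query, corpus, k=3))                       # chunk 42 should rank first
    ```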

  • vLLM

    An inference engine that uses PagedAttention to serve LLMs efficiently at scale.

    Production

    Standard self-hosted serving stack alongside TGI (HuggingFace) and Triton (NVIDIA). Massive throughput improvements over naive Hugging Face Transformers serving.