
Glossary

The AI vocabulary, in plain English.

50 terms you'll meet in real AI work. Each one links back to the lesson that covers it in depth.

  • Agent

    An LLM in a loop with tools, a goal, and the ability to act.

    Production

    Patterns: ReAct (reason → act), planner+executor, multi-agent. The hard parts are tool definitions, error recovery, plan re-evaluation, and not getting stuck in loops. Most production 'agents' are 2-3 step pipelines, not autonomous workers.

  • ANN (Approximate Nearest Neighbor)

    Algorithms that find approximately-closest vectors orders of magnitude faster than exhaustive search.

    RAG

    Common variants: HNSW (graph-based, default in pgvector / Qdrant / FAISS), IVF (cluster-based), ScaNN (Google's hybrid). Trade ~1-5% recall for 10-1000x speedup.

  • Attention

    The mechanism that lets a transformer "look at" every previous token when computing each new token.

    Foundations

    For each token, the model computes Query/Key/Value projections and uses softmax-weighted sums of values to produce a context-aware representation. Self-attention is the architectural innovation behind transformers.
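
    A minimal single-head self-attention sketch in NumPy (toy sizes, random weights, causal masking omitted), just to make the Query/Key/Value mechanics concrete:

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        """Single attention head over token vectors X of shape (seq_len, d_model)."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # project into Query/Key/Value spaces
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # how strongly each token attends to the others
        weights = softmax(scores, axis=-1)           # each row sums to 1
        return weights @ V                           # softmax-weighted sum of values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))                     # 5 tokens, d_model = 16
    W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape)    # (5, 8): one context-aware vector per token
    ```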

  • Bi-encoder

    A model that encodes query and document independently, then compares with a similarity metric.

    RAG

    Fast (you can pre-compute doc embeddings) but limited — it can't model interactions between query and doc. Used for first-stage retrieval. Compare with cross-encoder (reranker).

  • BM25

    A classic keyword-based scoring function used by Elasticsearch and most full-text search.

    RAG

    Strong baseline for retrieval. Particularly good at exact identifiers (error codes, model numbers) where embeddings struggle. Combined with vector search → hybrid search.
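
    A from-scratch sketch of the Okapi BM25 scoring function with the usual k1/b defaults; real engines like Elasticsearch add text analysis and tuning on top:

    ```python
    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.5, b=0.75):
        """Score each doc (a list of tokens) against a query (a list of tokens)."""
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        df = Counter()                               # number of docs each term appears in
        for d in docs:
            df.update(set(d))
        scores = []
        for d in docs:
            tf = Counter(d)
            score = 0.0
            for term in query:
                if term not in tf:
                    continue
                idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
                score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
            scores.append(score)
        return scores

    docs = [s.split() for s in ["error E4021 connection refused",
                                "how to reset your password",
                                "troubleshooting connection timeouts"]]
    print(bm25_scores("error E4021".split(), docs))  # the doc with the exact identifier wins
    ```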

  • Chain-of-thought (CoT)

    Asking the model to reason out loud before answering.

    Prompting

    Sometimes helps a lot (multi-step logic, math). Sometimes does nothing (lookup, classification). On modern reasoning models with extended thinking, forcing visible CoT can hurt — let them think internally.

  • Chunking

    Splitting documents into retrieval-friendly fragments before embedding them.

    RAG

    Most important boring decision in RAG. Defaults: 400-800 tokens per chunk, 10-20% overlap, structure-aware (split on headers/sentences/function definitions, not raw character counts).
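
    A rough structure-aware chunker along those lines, using whitespace-separated words as a stand-in for tokens and markdown headers as split points (the ~600-word cap and 15% overlap are illustrative defaults):

    ```python
    import re

    def chunk_markdown(text, max_words=600, overlap=0.15):
        """Split on headers first, then cap each section with a sliding window and overlap."""
        sections = re.split(r"\n(?=#{1,3} )", text)        # keep each header with its section
        step = int(max_words * (1 - overlap))              # 15% overlap between neighbours
        chunks = []
        for section in sections:
            words = section.split()
            for start in range(0, max(len(words), 1), step):
                chunk = " ".join(words[start:start + max_words])
                if chunk:
                    chunks.append(chunk)
        return chunks
    ```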

  • Context window

    The maximum number of tokens (input + output) a model can process in a single call.

    Foundations

    Modern models range from 4k to 1M+ tokens. Larger windows enable long documents and rich agent traces but cost grows roughly linearly in tokens, and quality often degrades in the middle of very long contexts.

  • Cosine similarity

    A metric for how similar two vectors are by angle (ignoring magnitude).

    RAG

    Defined as dot(a, b) / (|a| × |b|). Range is [-1, 1]. For typical sentence embeddings, paraphrases land around 0.85–0.97; unrelated content around 0.05–0.30.
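
    The formula above as a few lines of NumPy:

    ```python
    import numpy as np

    def cosine_similarity(a, b):
        """dot(a, b) / (|a| * |b|): similarity by angle only, magnitude ignored."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity([1, 0, 1], [2, 0, 2]))   # 1.0  -> same direction, magnitude ignored
    print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # 0.0  -> orthogonal
    print(cosine_similarity([1, 0, 0], [-1, 0, 0]))  # -1.0 -> opposite direction
    ```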

  • Cross-encoder

    A model that scores query/document pairs together, much more accurately than bi-encoder.

    RAG

    Used as a reranker on top of bi-encoder retrieval. Too slow to run on every doc in your corpus, perfect for re-scoring the top 30-50 candidates.

  • DPO (Direct Preference Optimization)

    An alternative to RLHF that optimizes directly on preference pairs without an explicit reward model.

    Training

    Simpler and often more stable than PPO-RLHF. Increasingly the default for open-source alignment.

  • Drift

    When production data starts looking different from training data, degrading model quality.

    Production

    Two flavors: covariate drift (input distribution changes) and concept drift (the right answer for the same input changes). Detected via monitoring; remediated by re-training or alerting.

  • Embedding

    A vector representation of a token or text fragment in a learned semantic space.

    Foundations, RAG

    Trained such that semantically related items land nearby. Embedding similarity (usually cosine) is the basis of retrieval, clustering, and many lightweight classifiers.

  • Embedding model

    A specialized model whose output is a single vector representing input text.

    RAG

    Different from the LLM you generate with. Examples: text-embedding-3-small (OpenAI), voyage-3 (Voyage), BGE (open-source). For RAG, the embedding model choice often matters more than the generation model.

  • Eval / Evaluation

    Automatically measuring how well a prompt or model performs against a labelled test set.

    Production

    Skip eval and you're shipping vibes. Minimum: 20-50 inputs with expected outputs and a scoring function (exact match, regex, JSON-validity, LLM-as-judge). Track regressions per case, not just averages.
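
    A minimal eval-harness sketch of that shape; `generate` is a placeholder for whatever calls your model, and the two cases are invented for illustration:

    ```python
    import json

    def is_valid_json(text):
        try:
            json.loads(text)
            return True
        except ValueError:
            return False

    # Hypothetical labelled cases: an input and a scoring function per case.
    CASES = [
        {"id": "extract-invoice-number", "input": "Invoice #123, total $40",
         "check": lambda out: "123" in out},
        {"id": "returns-valid-json", "input": "Return the user record as JSON.",
         "check": is_valid_json},
    ]

    def run_eval(generate, cases=CASES):
        """generate(prompt) -> model output. Report per-case results, not just the average."""
        results = {c["id"]: bool(c["check"](generate(c["input"]))) for c in cases}
        print(f"{sum(results.values())}/{len(results)} passed")
        return results   # diff against the previous run to catch per-case regressions
    ```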

  • Faithfulness

    A RAG metric: does the answer use only what's in the retrieved context?

    RAG, Production

    An unfaithful RAG hallucinates politely from training data instead of grounding in retrieved chunks. Measured via LLM-as-judge by checking each claim against the context.

  • Few-shot prompting

    Including 3–5 input/output examples in the prompt to teach the model your format.

    Prompting

    Often outperforms paragraphs of instructions. Choose examples adversarially — include edge cases, not just easy ones. The model is a pattern-matcher; pattern-match it.

  • Fine-tuning

    Continuing a model's training on your own labelled data.

    Training

    Worth it when you need a capability or style the base model lacks (formatting, vocabulary, niche tone). Rarely worth it for adding knowledge — RAG is usually cheaper and stays fresh.

  • Function calling / Tool calling

    Native API mode that lets an LLM emit structured calls to functions you declare.

    Production

    You declare tools with JSON schemas; the model emits {name, args}; you run them; you feed results back. The basis of every modern agent. Anthropic, OpenAI, Google all support it.
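
    A provider-agnostic sketch of that loop. The message and reply shapes are simplified stand-ins, not any vendor's exact wire format, and `call_model` is a placeholder for your SDK call:

    ```python
    import json

    # One tool declared with a JSON schema, roughly the shape all major providers accept.
    TOOLS = [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }]

    def get_weather(city):
        return f"18°C and cloudy in {city}"              # stand-in for a real weather API

    def run_tool_loop(call_model, user_message):
        messages = [{"role": "user", "content": user_message}]
        while True:
            reply = call_model(messages, TOOLS)           # model either answers or requests a tool
            call = reply.get("tool_call")                 # e.g. {"name": "get_weather", "args": {...}}
            if call is None:
                return reply["content"]                   # final answer for the user
            result = {"get_weather": get_weather}[call["name"]](**call["args"])
            messages.append({"role": "assistant", "tool_call": call})
            messages.append({"role": "tool", "content": json.dumps({"result": result})})
    ```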

  • Guardrails

    Extra checks around an LLM that filter unsafe inputs or outputs.

    Production, Security

    Inputs: PII redaction, prompt-injection detection. Outputs: toxicity filter, schema validation, fact-check against retrieved context. Layered defense — no single guardrail is sufficient.

  • Hallucination

    When the model produces confident-sounding text that is factually wrong.

    Production, RAG

    Caused by: distribution mismatch, missing context, ambiguous prompt, training-cutoff knowledge gaps. Mitigations: RAG with strict 'use only context' instructions, eval, citations, output guardrails.

  • Inference

    Running a trained model to produce outputs (vs *training*, which updates weights).

    Production

    Where most production cost lives. Optimization stack: serving framework (vLLM, TGI, Triton, SGLang), batching, KV-cache management, quantization, speculative decoding.

  • KV cache

    Stored Key/Value vectors from previous tokens, reused so the model doesn't recompute them.

    Foundations, Production

    During inference, the cache grows linearly with context length — and so does memory and compute. Provider-side prompt caching lets multiple requests share the same KV state for stable prefixes.

  • LLM

    Large Language Model — a transformer trained on text to predict the next token.

    Foundations

    Generative model that maps a sequence of input tokens to a distribution over the next token. Modern LLMs (Claude, GPT, Gemini, Llama) range from a few billion to hundreds of billions of parameters and excel at language tasks because next-token prediction at scale turns out to require modelling syntax, semantics, and a surprising amount of world knowledge.

  • LLM-as-judge

    Using a (usually stronger) LLM to score the output of another LLM against a rubric.

    Production

    Practical for tasks where regex/exact-match doesn't work (summaries, rewrites, open-ended answers). Cache aggressively. Validate the judge against a small human-labelled subset to catch judge-specific biases.

  • LoRA

    Low-Rank Adaptation. Fine-tune efficiently by updating small "adapter" matrices instead of the full model.

    Training

    Trains in a fraction of the time and memory. QLoRA combines LoRA with 4-bit quantization, putting 7–13B fine-tunes within reach of consumer GPUs.

  • Lost in the middle

    Empirical finding that LLMs attend less to information in the middle of long contexts.

    Production, RAG

    Accuracy follows a U-shape: high at the start and end, low in the middle. Mitigations: rerank before stuffing; deduplicate; place the best chunks at the top and bottom of the context.
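
    One way to implement the placement trick: after reranking, alternate chunks between the front and the back of the context so the weakest ones land in the middle.

    ```python
    def order_for_context(scored_chunks):
        """scored_chunks: list of {'text': ..., 'score': ...}. Best chunks go to both ends."""
        ranked = sorted(scored_chunks, key=lambda c: c["score"], reverse=True)
        front, back = [], []
        for i, chunk in enumerate(ranked):
            (front if i % 2 == 0 else back).append(chunk)   # 1st -> front, 2nd -> back, ...
        return front + back[::-1]                            # strongest at both ends, weakest mid

    chunks = [{"text": f"c{i}", "score": s} for i, s in enumerate([0.9, 0.7, 0.5, 0.3, 0.1])]
    print([c["score"] for c in order_for_context(chunks)])   # [0.9, 0.5, 0.1, 0.3, 0.7]
    ```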

  • MLOps

    The discipline of running ML systems in production.

    Production

    Inference serving, monitoring (latency, drift, accuracy), CI/CD for models, cost tracking, A/B testing, rollback strategy. Closer to SRE than to data science.

  • MoE (Mixture of Experts)

    Architecture where only a subset of model parameters activates per token, via a learned router.

    Foundations

    Mixtral and DeepSeek are openly MoE; GPT-4 is widely believed to be. Trades static parameter count for compute efficiency — the model can be huge but only a fraction runs per token.

  • Multimodal

    A model (or system) that processes and generates across more than one modality — text, image, audio, or video.

    Foundations

    Modern frontier models (GPT-4o, Gemini 2.0, Claude with vision) accept images interleaved with text inside the same context window. A modality encoder — a ViT for images, a Whisper-style encoder for audio — converts raw pixels or waveforms into tokens the LLM backbone already understands.

    The four most production-ready multimodal workloads in 2026: document understanding, chart/diagram Q&A, screenshot analysis, and video-frame captioning. For purely text tasks, use a text-only model — vision tokens cost 2–5× more per equivalent token at most providers.

    Example: sending a screenshot of an error traceback to a vision-capable model and asking it to explain the failure is faster than manually transcribing the text.

  • Prompt caching

    Reusing a model's computed KV state across requests for the stable prefix of a prompt.

    Production

    Anthropic, OpenAI, and Google all support this. Putting the system prompt and few-shot examples first (variable user input last) makes the prefix stable — typical savings: 80–90% on the cached portion at a 5-minute TTL.

  • Prompt injection

    An attack where untrusted input causes the model to abandon its intended task.

    Security

    Direct: 'Ignore previous instructions...'. Indirect: hidden instructions in fetched content (web pages, emails). Mitigations: tagged delimiters, instruction-level guards, output filtering, privilege containment, red-teaming. There is no perfect defense yet — defense in depth.

  • Quantization

    Storing model weights at lower precision (8-bit, 4-bit) to shrink size and speed up inference.

    Production

    Modern quantization (GPTQ, AWQ, GGUF) can hit 4-bit with surprisingly little quality loss. Critical for self-hosting big models on consumer GPUs.

  • RAG

    Retrieval-Augmented Generation. Fetch relevant snippets, stuff into prompt, generate.

    RAG

    A pattern, not a product. Two halves: an offline ingest (chunk → embed → store) and an online query (embed → retrieve → stuff → generate). Most failures come from chunking and retrieval, not the LLM.
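
    A toy version of both halves in one class. The `embed` stub just hashes text so the sketch runs; swap in a real embedding model:

    ```python
    import numpy as np

    def embed(text):
        """Placeholder embedding: deterministic random vector per text. Replace with a real model."""
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        return rng.normal(size=64)

    class TinyRAG:
        def __init__(self):
            self.chunks, self.vectors = [], []

        def ingest(self, chunks):                          # offline half: chunk -> embed -> store
            for chunk in chunks:
                self.chunks.append(chunk)
                self.vectors.append(embed(chunk))

        def retrieve(self, query, k=5):                    # online half, step 1: embed -> retrieve
            q = embed(query)
            sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in self.vectors]
            top = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:k]
            return [self.chunks[i] for i in top]

        def build_prompt(self, query, k=5):                # online half, step 2: stuff -> generate
            context = "\n\n".join(self.retrieve(query, k))
            return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    ```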

  • Reasoning model

    A model trained to produce internal reasoning before its final answer.

    Foundations

    Examples: Claude Opus extended thinking, OpenAI o-series. Burns tokens internally to reason, returns just the answer. Helpful for multi-step tasks; overkill for simple lookups.

  • Recall@k / Precision@k

    Retrieval metrics: of the top-k results, how many are relevant (precision); of all relevant items, how many appear in top-k (recall).

    RAG, Production

    Both matter. High precision + low recall = you missed answers. High recall + low precision = your LLM has to wade through noise. Track both per query.
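
    Both metrics in a few lines, with an invented query to show how they diverge:

    ```python
    def precision_at_k(retrieved, relevant, k):
        """Of the top-k retrieved IDs, what fraction are relevant?"""
        return sum(doc_id in relevant for doc_id in retrieved[:k]) / k

    def recall_at_k(retrieved, relevant, k):
        """Of all relevant IDs, what fraction appear in the top-k?"""
        top_k = set(retrieved[:k])
        return sum(doc_id in top_k for doc_id in relevant) / len(relevant)

    retrieved = ["d3", "d7", "d1", "d9", "d2"]       # ranked retriever output
    relevant = {"d1", "d4", "d7"}                    # ground-truth relevant docs for this query
    print(precision_at_k(retrieved, relevant, 5))    # 0.4  -> 2 of the top 5 are relevant
    print(recall_at_k(retrieved, relevant, 5))       # 0.67 -> 2 of the 3 relevant docs were found
    ```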

  • Reranker

    A second-pass model that re-scores retrieved candidates by reading query + doc together.

    RAG

    Cross-encoders (the 'right tool' for reranking) score query/document pairs directly, modelling fine-grained interactions that bi-encoder vector similarity can't. Examples: Cohere Rerank, BGE Reranker. Two-stage retrieval (recall + rerank) is a major quality win.
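
    A second-stage rerank sketch using the sentence-transformers CrossEncoder class; the checkpoint name is just one commonly used open reranker, not a specific recommendation:

    ```python
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # any cross-encoder checkpoint works

    def rerank(query, candidates, top_n=5):
        """Re-score the top 30-50 first-stage candidates by reading (query, doc) pairs together."""
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]
    ```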

  • RLHF

    Reinforcement Learning from Human Feedback — the alignment technique behind ChatGPT.

    Training

    Train a reward model on human preference labels, then optimize the LLM (often via PPO) to score well. Newer methods (DPO, IPO) skip the explicit RL step. Critical for making raw next-token models actually useful.

  • Sampling

    Choosing a token from the model's output probability distribution.

    Foundations, Prompting

    Greedy = always pick the most likely. Temperature reshapes the distribution (low → deterministic, high → varied). Top-p / top-k clip the tail. For most production work, start with temperature 0–0.3 and only raise it for creative tasks.
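
    A sketch of greedy, temperature, and top-p sampling applied to a raw logit vector (NumPy only; assumes the logits have already been computed by the model):

    ```python
    import numpy as np

    def sample_token(logits, temperature=0.7, top_p=0.95, seed=None):
        """temperature == 0 -> greedy; otherwise temperature-scaled nucleus (top-p) sampling."""
        logits = np.asarray(logits, dtype=float)
        if temperature == 0:
            return int(np.argmax(logits))                          # always the most likely token
        probs = np.exp((logits - logits.max()) / temperature)      # low T sharpens, high T flattens
        probs /= probs.sum()
        order = np.argsort(probs)[::-1]                            # most to least likely
        cumulative = np.cumsum(probs[order])
        nucleus = order[: np.searchsorted(cumulative, top_p) + 1]  # smallest set covering >= top_p
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()
        return int(np.random.default_rng(seed).choice(nucleus, p=nucleus_probs))
    ```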

  • Speculative decoding

    Use a small fast model to draft tokens, then verify with the big model in parallel.

    Production

    When the small model agrees with the big one (most of the time), you get the big model's quality at much higher speed. 2-4× throughput is typical.

  • Structured output

    Forcing the model to output valid JSON (or another schema) — usually via API mode or strict prompting.

    Prompting, Production

    Native structured-output / tool-use APIs guarantee valid JSON against a schema. Prompt-only JSON works with strict instructions ('output starts with {', 'no markdown fences') but is more brittle.
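
    A typical validation layer on the receiving side, here sketched with Pydantic and an invented `Invoice` schema; on failure you re-prompt with the validation error attached:

    ```python
    import json
    from pydantic import BaseModel, ValidationError

    class Invoice(BaseModel):                    # hypothetical schema for illustration
        invoice_id: str
        total_usd: float
        line_items: list[str]

    def parse_model_output(raw):
        """Return a validated Invoice, or None so the caller can retry with the error message."""
        try:
            return Invoice.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError):
            return None

    print(parse_model_output('{"invoice_id": "INV-7", "total_usd": 120.5, "line_items": ["consulting"]}'))
    ```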

  • System prompt

    Durable instructions placed at the top of the conversation that set role, tone, and constraints.

    Prompting

    Treated by the model as higher-priority than user messages. Long, stable system prompts are the prime target for prompt caching.

  • Temperature

    Sampling parameter that controls how random the model's output is.

    Prompting

    Lower (0.0–0.3) → deterministic, repeatable, good for code and extraction. Higher (0.7–1.0) → creative, varied, good for ideation. Above 1.0 generally hurts coherence.

  • Test-time compute

    Letting the model think for longer at inference, instead of (only) scaling training.

    Foundations

    The 2024-2026 frontier. Reasoning models, ensembling, multi-sample voting, beam search, agent self-correction — all variants of "spend more compute per query for higher quality".

  • Token

    A unit of text the model actually sees — usually a sub-word.

    Foundations

    Tokenizers (most use Byte-Pair Encoding) split input into integer IDs. A common rule of thumb is 1 token ≈ 0.75 English words ≈ 4 characters. Cost, context limits, and rate limits are all denominated in tokens — never in words.
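
    You can see the word-to-token ratio directly with a tokenizer library such as tiktoken (cl100k_base is one OpenAI BPE vocabulary; other providers use their own):

    ```python
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    text = "Tokenizers split text into sub-word pieces before the model ever sees it."
    ids = enc.encode(text)
    print(len(text.split()), "words ->", len(ids), "tokens")   # roughly 4 characters per token
    print(ids[:6], "...")                                       # the integer IDs the model consumes
    ```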

  • Tool use

    Letting an LLM call functions you define (search, DB query, send email, etc).

    Production

    Native APIs (Anthropic, OpenAI) let you declare tool schemas. The model returns a tool call; you execute it; you feed results back. Privilege containment is critical — gate dangerous tools behind separate auth.

  • Top-p (nucleus sampling)

    Only sample from the smallest set of tokens whose cumulative probability is ≥ p.

    Prompting

    Useful safety net when temperature is non-zero — clips the long tail of unlikely tokens. Common default: 0.9–0.95.

  • Transformer

    The neural-network architecture (introduced in 2017) that powers modern LLMs.

    Foundations

    A stack of identical blocks, each containing layer normalization, multi-head attention, and a feed-forward network — connected by residuals. Replaced RNN/LSTM as the dominant sequence model.

  • Vector database

    A specialized index for "find the k nearest vectors" at scale.

    RAG

    Examples: pgvector (Postgres), Pinecone, Weaviate, Qdrant, ChromaDB. Use approximate nearest neighbor (HNSW, IVF) for speed at the cost of 1-5% recall. You don't need one until you have ~100k+ chunks — a flat array is fine before that.
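
    The "flat array" baseline is just exhaustive cosine search over a NumPy matrix, as sketched below; only when that gets too slow do ANN indexes earn their complexity.

    ```python
    import numpy as np

    def top_k_exact(query_vec, matrix, k=5):
        """Exhaustive cosine search over row-normalised embeddings. Fine up to ~100k chunks."""
        sims = matrix @ query_vec / (np.linalg.norm(query_vec) + 1e-12)
        return np.argsort(sims)[::-1][:k]            # indices of the k most similar chunks

    corpus = np.random.default_rng(0).normal(size=(10_000, 384))
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)     # normalise once at ingest time
    query = corpus[42] + 0.01                                    # a vector very close to chunk 42
    print(top_k_exact(query, corpus, k=3))                       # chunk 42 should rank first
    ```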

  • vLLM

    An inference engine that uses PagedAttention to serve LLMs efficiently at scale.

    Production

    Standard self-hosted serving stack alongside TGI (HuggingFace) and Triton (NVIDIA). Massive throughput improvements over naive Hugging Face Transformers serving.