Lesson 2 · 12 min

Inference servers — vLLM, TGI, Triton, SGLang

Don't serve LLMs from raw Hugging Face Transformers. Dedicated inference engines exist for a reason.

Why a dedicated engine

Naive serving (one blocking model.generate() call per request in raw Transformers) wastes 80%+ of your GPU's throughput. Modern engines do three things you can't get for free, each sketched in code after this list:

  1. PagedAttention / KV-cache management — like virtual memory for the KV cache: the cache is allocated in small fixed-size blocks, so concurrent requests pack GPU memory without fragmentation and can share blocks for common prompt prefixes.
  2. Continuous batching — instead of waiting for a fixed batch, the engine swaps requests in/out as they finish, keeping the GPU always full.
  3. Optimized kernels — fused attention (FlashAttention), CUDA graphs, kernel autotuning.
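For concreteness, here is the naive pattern in question: a minimal sketch using the standard Hugging Face API (the model name is a placeholder, and a GPU is assumed). Each request blocks on its own generate() call at batch size 1, so the GPU idles between requests and short requests queue behind long ones.

```python
# Naive serving: one blocking generate() per request, batch size 1.
# The GPU sits idle between requests, and nothing overlaps.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # placeholder; use your model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

def handle_request(prompt: str) -> str:
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=64)  # blocks until done
    return tok.decode(out[0], skip_special_tokens=True)
```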
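To see why "virtual memory" is the right analogy for point 1, here is a toy sketch of the block-table idea. This is my illustration, not vLLM's actual code, and the class names are invented: the KV cache is carved into fixed-size physical blocks, and each sequence keeps a page table mapping logical token positions to whatever blocks it was handed.

```python
# Toy PagedAttention-style bookkeeping (illustrative only, not vLLM's
# implementation): per-sequence block tables translate logical token
# positions to physical KV-cache blocks, like OS page tables.

BLOCK_SIZE = 16  # tokens per physical block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()      # hand out any free physical block

    def release(self, block: int) -> None:
        self.free.append(block)     # reusable by another request immediately

class Sequence:
    """One request's KV cache, grown one token at a time."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # current block is full
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, pos: int) -> tuple[int, int]:
        # Logical position -> (physical block, offset within block),
        # exactly like a virtual-to-physical address translation.
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
```

Because blocks are handed out on demand and returned the moment a request finishes, fragmentation mostly disappears, and two sequences sharing a prompt prefix can point their tables at the same physical blocks.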
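Continuous batching (point 2) is easiest to see as pure scheduling logic. The toy loop below is illustrative only, with no real model: waiting requests are admitted the moment a slot frees up, instead of the whole batch draining before any new work starts.

```python
import random
from collections import deque

# Toy continuous-batching loop (scheduling only, no model): finished
# requests leave the batch mid-flight and waiting ones are admitted
# immediately, so the "GPU" never idles waiting on a fixed batch.

MAX_BATCH = 4

waiting = deque(range(10))    # request ids waiting to be admitted
running: dict[int, int] = {}  # request id -> tokens left to generate

step = 0
while waiting or running:
    # Admit new requests into free slots: the key difference from static
    # batching, which would wait for the entire batch to finish first.
    while waiting and len(running) < MAX_BATCH:
        req = waiting.popleft()
        running[req] = random.randint(3, 8)  # pretend output length

    # One "forward pass": every running request emits one token.
    for req in list(running):
        running[req] -= 1
        if running[req] == 0:
            del running[req]  # request done; its slot frees up right now

    step += 1
    print(f"step {step:2d}: batch={sorted(running)}")
```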
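For point 3, PyTorch already exposes a fused attention entry point: F.scaled_dot_product_attention dispatches to a FlashAttention-style fused kernel on supported GPUs, so the full seq-by-seq score matrix is never materialized. A minimal comparison against the unfused math (assumes a CUDA device):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in fp16 on GPU
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Unfused reference: softmax(QK^T / sqrt(d)) @ V, with the full
# 1024x1024 attention matrix materialized in memory.
scores = q @ k.transpose(-2, -1) / (64 ** 0.5)
ref = torch.softmax(scores, dim=-1) @ v

# Fused kernel: one call, no full attention matrix.
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(ref, fused, atol=1e-2))  # True up to fp16 noise
```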
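In practice you implement none of this yourself. A minimal vLLM example (assuming `pip install vllm`, a GPU, and a placeholder model name) gets all three optimizations by default:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one sentence.",
    "What is a KV cache?",
]
params = SamplingParams(temperature=0.7, max_tokens=64)

llm = LLM(
    model="facebook/opt-125m",    # placeholder; use your model
    gpu_memory_utilization=0.90,  # VRAM fraction for weights + paged KV cache
)

# The engine schedules all prompts together: requests join and leave the
# running batch as they finish rather than waiting on each other.
for out in llm.generate(prompts, params):
    print(out.prompt, "->", out.outputs[0].text)
```

TGI, Triton, and SGLang package the same ideas as standalone servers behind HTTP/gRPC, and vLLM also ships an OpenAI-compatible server entrypoint for production use.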