Lesson 2 · 12 min
Inference servers — vLLM, TGI, Triton, SGLang
Don't serve LLMs straight from raw Hugging Face Transformers. Dedicated inference engines exist for a reason.
Why a dedicated engine
Naive serving — one transformers.generate() call per request — wastes 80%+ of your GPU. Modern engines do three things you can't get for free:
- PagedAttention / KV-cache management — the KV cache is allocated in fixed-size blocks, like pages in virtual memory, so concurrent requests share GPU memory without fragmentation.
- Continuous batching — instead of waiting for a fixed batch to drain, the engine admits and retires requests at token granularity as they finish, keeping the GPU full.
- Optimized kernels — fused attention (FlashAttention), CUDA graphs, kernel autotuning.
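To see why continuous batching matters, here is a toy scheduler simulation — not any real engine's code, just counting decode steps under two admission policies. Static batching pads every batch out to its longest request; continuous batching refills a slot the moment one frees up.

```python
# Toy simulation of static vs. continuous batching.
# All names here are illustrative, not part of any real engine's API.

def static_batching(lengths, batch_size):
    """Process requests in fixed batches; each batch runs until its
    longest request finishes (shorter requests idle in padding)."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching(lengths, batch_size):
    """Refill a freed slot immediately from the queue, so the batch
    stays full until the queue drains."""
    queue = list(lengths)
    running = []          # remaining decode steps per active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))   # admit a new request mid-flight
        steps += 1                         # one decode step for the batch
        running = [r - 1 for r in running if r > 1]
    return steps

# Eight requests with very uneven output lengths, batch of 4.
lengths = [32, 4, 4, 4, 32, 4, 4, 4]
print(static_batching(lengths, 4))      # 64: each batch gated by a 32-token request
print(continuous_batching(lengths, 4))  # 36: short requests drain and free slots early
```

The gap widens as output lengths get more skewed, which is exactly the regime real LLM traffic lives in.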