Lesson 9 · 9 min
Hosting your fine-tune
Once you have an adapter, where does it run? Three honest options.
The three real options
1. Managed inference (Hugging Face Inference Endpoints, Together, Anyscale, Replicate)
- Pros: Upload adapter, get an HTTPS URL in minutes. No infra to run.
- Cons: Per-token pricing carries a markup, and you get less control over latency and batching.
- Best for: Prototyping, low-volume production, teams without ops capacity.
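
To make option 1 concrete: most managed providers expose an OpenAI-compatible HTTP endpoint once the adapter is deployed. Here's a minimal sketch of calling one, assuming an OpenAI-style chat schema; the URL, model id, and environment variable are placeholders you'd replace with whatever your provider gives you after upload.

```python
import os
import requests

# Placeholder endpoint and model id -- substitute the values your provider
# returns after the adapter is deployed. Many providers (Together, HF
# Inference Endpoints) speak the OpenAI chat-completions schema.
ENDPOINT = "https://api.example-provider.com/v1/chat/completions"

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {os.environ['PROVIDER_API_KEY']}"},
    json={
        "model": "your-org/your-fine-tune",  # placeholder fine-tune id
        "messages": [{"role": "user", "content": "Summarise this ticket: ..."}],
        "max_tokens": 256,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```
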
2. Self-host with vLLM or TGI on cloud GPUs
- Pros: Full control. Lower cost at scale (breakeven is typically somewhere above 100k requests/day). Modern serving features (PagedAttention, continuous batching).
- Cons: GPU instances are expensive when idle. You own monitoring, autoscaling, deploys.
- Best for: Steady-state production at meaningful scale.
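
A minimal sketch of option 2 using vLLM's offline Python API, serving the base model with a LoRA adapter attached per request. The base model id and adapter path are assumptions; in production you'd typically run vLLM's OpenAI-compatible server instead, but the adapter-loading mechanics are the same.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder base model -- swap in whatever your adapter was trained on.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["Summarise this support ticket: ..."],
    SamplingParams(max_tokens=256, temperature=0.2),
    # LoRARequest(name, unique int id, local path to the adapter weights).
    # Different requests can reference different adapters on the same base.
    lora_request=LoRARequest("my-adapter", 1, "/path/to/adapter"),
)
print(outputs[0].outputs[0].text)
```

Note the per-request `lora_request`: this is exactly the hot-swap capability you give up if you merge the adapter (option 3).
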
3. Merge adapter into base model and serve as a "new model"
- Pros: Slightly faster inference (no adapter-dispatch overhead). Compatible with any serving stack that can load the base architecture.
- Cons: You lose the ability to hot-swap adapters at runtime, and you ship the full model weights rather than a small adapter.
- Best for: When you have one stable adapter and need maximum throughput.
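
Merging is a one-time offline step. A minimal sketch with PEFT's `merge_and_unload`, assuming a Hugging Face-format adapter checkpoint; the model id and paths below are placeholders.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model
adapter_dir = "/path/to/adapter"              # placeholder adapter directory

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the LoRA deltas into the base weights, then drop the adapter wrapper.
merged = model.merge_and_unload()

# Save as a plain Transformers checkpoint; any serving stack that handles
# the base architecture (vLLM, TGI, etc.) can now load it directly.
merged.save_pretrained("/path/to/merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("/path/to/merged-model")
```
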