Lesson 9 · 9 min

Hosting your fine-tune

Once you have an adapter, where does it run? Three honest options.

The three real options

1. Managed inference (Hugging Face Inference Endpoints, Together, Anyscale, Replicate)

  • Pros: Upload adapter, get an HTTPS URL in minutes. No infra to run.
  • Cons: Pay per token at a markup. Less control over latency and batching.
  • Best for: Prototyping, low-volume production, teams without ops capacity (a minimal client call is sketched below).
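
Here is a minimal sketch of what calling a managed endpoint looks like. The URL, token variable, and JSON payload are hypothetical placeholders; every provider documents its own request format, so check theirs before copying this.

```python
# Call a managed inference endpoint over HTTPS.
# ENDPOINT_URL, the token env var, and the payload shape are illustrative only.
import os
import requests

ENDPOINT_URL = "https://my-adapter.endpoints.example.com"  # hypothetical
headers = {"Authorization": f"Bearer {os.environ['INFERENCE_API_TOKEN']}"}

resp = requests.post(
    ENDPOINT_URL,
    headers=headers,
    json={
        "inputs": "Summarise this ticket: ...",
        "parameters": {"max_new_tokens": 200},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```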

2. Self-host with vLLM or TGI on cloud GPUs

  • Pros: Full control. Lower per-request cost at scale (breakeven is typically somewhere above 100k requests/day). Modern serving stack (PagedAttention, continuous batching).
  • Cons: GPU instances are expensive when idle. You own monitoring, autoscaling, deploys.
  • Best for: Steady-state production at meaningful scale (see the vLLM sketch below).
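
If you self-host, vLLM can load your LoRA adapter on top of the base model at serve time. A minimal sketch using vLLM's Python API follows; the model name and adapter path are placeholders, and the exact LoRA arguments can differ between vLLM versions, so confirm against the docs for the version you run.

```python
# Serve a LoRA adapter on top of a base model with vLLM (offline/Python API).
# Model name and adapter path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

prompts = ["Summarise this ticket: ..."]
params = SamplingParams(max_tokens=200, temperature=0.2)

# LoRARequest(adapter_name, unique_int_id, local_adapter_path)
outputs = llm.generate(
    prompts,
    params,
    lora_request=LoRARequest("my-adapter", 1, "/models/my-adapter"),
)
print(outputs[0].outputs[0].text)
```

The same setup is available through vLLM's OpenAI-compatible server, which is usually what you'd put behind a load balancer in production.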

3. Merge adapter into base model and serve as a "new model"

  • Pros: Slightly faster inference (no adapter overhead). Compatible with all serving stacks.
  • Cons: Loses the swap-on-the-fly capability. Bigger artefact to ship.
  • Best for: When you have one stable adapter and need maximum throughput (see the merge sketch below).
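
Merging is typically a one-off script run before deployment. A minimal sketch using PEFT's merge_and_unload, with placeholder model and adapter paths:

```python
# Merge a LoRA adapter into its base model and save a standalone checkpoint.
# Base model name and adapter/output paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER = "/models/my-adapter"
OUTPUT = "/models/my-merged-model"

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base, ADAPTER)

# Fold the LoRA weights into the base weights and drop the adapter layers.
merged = model.merge_and_unload()

merged.save_pretrained(OUTPUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUTPUT)
```

The saved directory is an ordinary model checkpoint, so any serving stack that can load the base model can load the merged one.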