Lesson 9 · 9 min

Hosting your fine-tune

Once you have an adapter, where does it run? Three honest options.

The three real options

1. Managed inference (Hugging Face Inference Endpoints, Together, Anyscale, Replicate)

  • Pros: Upload adapter, get an HTTPS URL in minutes. No infra to run.
  • Cons: Pay per token at a markup. Less control over latency and batching.
  • Best for: Prototyping, low-volume production, teams without ops capacity (a minimal client call is sketched below).
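
Here is a minimal sketch of what calling a managed endpoint looks like. The URL, token variable, and JSON payload are hypothetical placeholders; every provider documents its own request format, so check theirs before copying this.

```python
# Call a managed inference endpoint over HTTPS.
# ENDPOINT_URL, the token env var, and the payload shape are illustrative only.
import os
import requests

ENDPOINT_URL = "https://my-adapter.endpoints.example.com"  # hypothetical
headers = {"Authorization": f"Bearer {os.environ['INFERENCE_API_TOKEN']}"}

resp = requests.post(
    ENDPOINT_URL,
    headers=headers,
    json={
        "inputs": "Summarise this ticket: ...",
        "parameters": {"max_new_tokens": 200},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```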

2. Self-host with vLLM or TGI on cloud GPUs

  • Pros: Full control. Lower per-request cost at scale (breakeven is typically somewhere above 100k requests/day). Modern serving stack (PagedAttention, continuous batching).
  • Cons: GPU instances are expensive when idle. You own monitoring, autoscaling, deploys.
  • Best for: Steady-state production at meaningful scale (see the vLLM sketch below).
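
If you self-host, vLLM can load your LoRA adapter on top of the base model at serve time. A minimal sketch using vLLM's Python API follows; the model name and adapter path are placeholders, and the exact LoRA arguments can differ between vLLM versions, so confirm against the docs for the version you run.

```python
# Serve a LoRA adapter on top of a base model with vLLM (offline/Python API).
# Model name and adapter path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

prompts = ["Summarise this ticket: ..."]
params = SamplingParams(max_tokens=200, temperature=0.2)

# LoRARequest(adapter_name, unique_int_id, local_adapter_path)
outputs = llm.generate(
    prompts,
    params,
    lora_request=LoRARequest("my-adapter", 1, "/models/my-adapter"),
)
print(outputs[0].outputs[0].text)
```

The same setup is available through vLLM's OpenAI-compatible server, which is usually what you'd put behind a load balancer in production.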

3. Merge adapter into base model and serve as a "new model"

  • Pros: Slightly faster inference (no adapter overhead). Compatible with all serving stacks.
  • Cons: Loses the swap-on-the-fly capability. Bigger artefact to ship.
  • Best for: When you have one stable adapter and need maximum throughput (see the merge sketch below).
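
Merging is typically a one-off script run before deployment. A minimal sketch using PEFT's merge_and_unload, with placeholder model and adapter paths:

```python
# Merge a LoRA adapter into its base model and save a standalone checkpoint.
# Base model name and adapter/output paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER = "/models/my-adapter"
OUTPUT = "/models/my-merged-model"

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base, ADAPTER)

# Fold the LoRA weights into the base weights and drop the adapter layers.
merged = model.merge_and_unload()

merged.save_pretrained(OUTPUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUTPUT)
```

The saved directory is an ordinary model checkpoint, so any serving stack that can load the base model can load the merged one.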