Compare · Models
Frontier AI models, side by side.
12 models from closed labs and open-weight releases. For each: pricing per 1M tokens (input / output), context window, native reasoning, tool use, multimodal capability, and license, plus a one-line "best for" and a one-line "watch out".
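Since every card below prices input and output tokens separately, per-request cost is a two-term sum. A minimal sketch (the helper name and the example token counts are illustrative, not from any vendor SDK):

```python
# Hypothetical helper: estimate per-request cost from per-1M-token rates.
# Token counts and rates below are illustrative examples.

def request_cost(input_tokens: int, output_tokens: int,
                 in_per_m: float, out_per_m: float) -> float:
    """Dollar cost of one request at the given per-1M-token rates."""
    return (input_tokens / 1_000_000) * in_per_m \
         + (output_tokens / 1_000_000) * out_per_m

# A 2,000-token prompt with a 500-token reply at $3.00 in / $15.00 out:
cost = request_cost(2_000, 500, 3.00, 15.00)
print(f"${cost:.4f}")  # → $0.0135
```

Note that output tokens usually cost 3–5x input tokens on the cards below, so verbose completions dominate the bill faster than long prompts do.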
Gemini 2.5 Flash
Google
1M ctx · $0.10 in / $0.30 out per 1M tokens
Best for: Cheapest frontier-adjacent option. Production cheap-tier slot for many workloads.
Watch out: Quality varies more than tier-down siblings at other labs. Run an eval before swapping it in.
DeepSeek V4 (open)
DeepSeek
128k ctx · $0.27 in / $1.10 out per 1M tokens
Best for: Cheap reasoning. Open weights with frontier-tier reasoning quality at small sizes.
Watch out: Tool-use ergonomics are improving but still trail Anthropic. Verify on your own eval.
Qwen 3 (open)
Alibaba
128k ctx · $0.40 in / $1.20 out per 1M tokens
Best for: Strong multilingual coverage (esp. CJK), good math, permissive license; easy to self-host.
Watch out: English-language quality slightly below Llama 4 on most tasks.
Llama 4 70B (open)
Meta
256k ctx · $0.50 in / $1.50 out per 1M tokens
Best for: Self-hosting at scale. Quality competitive with mid-tier API models, much cheaper at >30k req/day.
Watch out: License restricts use above 700M MAU. Ops cost is real if you self-host.
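A break-even threshold like ">30k req/day" falls out of simple arithmetic: fixed GPU cost per day divided by the API cost of one request. A sketch with deliberately hypothetical numbers (GPU rate, token counts, and API prices are placeholders; plug in your own):

```python
# Back-of-envelope break-even: self-hosted fixed cost vs. per-token API cost.
# Every number here is a hypothetical placeholder, not a quoted price.

def breakeven_requests_per_day(gpu_cost_per_day: float,
                               api_cost_per_request: float) -> float:
    """Requests/day above which a fixed-cost deployment beats API pricing."""
    return gpu_cost_per_day / api_cost_per_request

# Say a request averages 1,500 input + 500 output tokens at a mid-tier
# API rate of $0.80 in / $4.00 out per 1M tokens:
api_per_request = (1_500 / 1e6) * 0.80 + (500 / 1e6) * 4.00  # $0.0032
# ...and a self-hosted node costs $96/day (e.g. two GPUs at $2/hr):
print(round(breakeven_requests_per_day(96.0, api_per_request)))  # → 30000
```

The threshold moves linearly with both inputs, so a cheaper API tier or pricier GPUs pushes break-even well above 30k.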
Claude Haiku 4.5
Anthropic
200k ctx · $0.80 in / $4.00 out per 1M tokens
Best for: High-volume cheap-tier work: classification, simple extraction, fast autocomplete.
Watch out: Quality drops on multi-step reasoning. Tier up to Sonnet when the task is non-trivial.
Gemini 2.5 Pro
Google
2M ctx · $0.85 in / $3.50 out per 1M tokens
Best for: Massive context (2M tokens), native video and audio in. Good cost/quality.
Watch out: Tool-use ergonomics still trail Anthropic / OpenAI in some patterns.
Mistral Large 3
Mistral AI
128k ctx · $2.00 in / $6.00 out per 1M tokens
Best for: European data residency; strong multilingual, esp. EU languages.
Watch out: Smaller context window, no native multimodal. Quality lags Anthropic / OpenAI on the hardest tasks.
Llama 4 405B (open)
Meta
1M ctx · $2.50 in / $5.00 out per 1M tokens
Best for: Frontier-tier open model. Self-host if you have the GPUs; otherwise run it via Together / Anyscale.
Watch out: Heavy. Won't fit on a single H100 at FP16; multi-GPU or aggressive quantization required.
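The FP16 claim above is quick arithmetic: weight memory is roughly parameter count times bytes per parameter, before counting KV cache or activations. A sketch:

```python
# Why a 405B model won't fit on one 80 GB H100 at FP16: weights alone need
# params x bytes-per-param, and that ignores KV cache and activations.

def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

fp16 = weight_gb(405, 2.0)   # FP16: 2 bytes/param -> 810.0 GB, ~11 x 80 GB GPUs
int4 = weight_gb(405, 0.5)   # 4-bit: 0.5 bytes/param -> 202.5 GB, still multi-GPU
print(fp16, int4)  # → 810.0 202.5
```

Even aggressive 4-bit quantization leaves the weights at roughly three H100s' worth of memory, which is why the card lists multi-GPU as a hard requirement.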
Claude Sonnet 4.6
Anthropic
200k ctx · $3.00 in / $15.00 out per 1M tokens
Best for: Production-grade default. Best price/quality on most workloads; native tool use is excellent.
Watch out: No video / audio in. Reach for Gemini for those.
GPT-5
OpenAI
400k ctx · $5.00 in / $15.00 out per 1M tokens
Best for: Strong overall. Best ecosystem (Assistants API, native voice, code interpreter, web search).
Watch out: Pricing-tier complexity. GPT-5-mini and -nano variants offer real cost wins on simpler work.
Claude Opus 4.7
Anthropic
1M ctx · $15.00 in / $75.00 out per 1M tokens
Best for: Hardest reasoning + agent tasks, code generation at scale, long-context analysis.
Watch out: Premium pricing; tier down to Sonnet for the easy 80% of traffic.
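"Tier down for the easy 80%" implies a router in front of the models. A minimal sketch of that pattern; the difficulty heuristic and the tier names are hypothetical placeholders, not any vendor's API:

```python
# Hypothetical tiered router: send the easy majority of traffic to a cheaper
# tier, reserve the premium model for hard requests. The heuristic below is
# a deliberately crude placeholder; real routers use a classifier or evals.

def pick_tier(prompt: str, needs_tools: bool) -> str:
    hard_markers = ("prove", "refactor", "multi-step", "plan")
    is_hard = (
        needs_tools
        or len(prompt) > 4_000
        or any(m in prompt.lower() for m in hard_markers)
    )
    return "premium-tier" if is_hard else "default-tier"

print(pick_tier("Summarize this email", needs_tools=False))
# → default-tier
print(pick_tier("Plan a multi-step refactor of the billing module", needs_tools=False))
# → premium-tier
```

At the Opus/Sonnet price gap above ($15/$75 vs. $3/$15 per 1M tokens), routing even the easy 80% down a tier cuts the blended bill several-fold.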
OpenAI o-series (reasoning)
OpenAI
400k ctx · $15.00 in / $60.00 out per 1M tokens
Best for: Math, hard reasoning, multi-step planning. Burns tokens on internal reasoning before answering.
Watch out: Latency ~10x non-reasoning models. Wrong tool for chat / autocomplete.
Numbers move quickly. Treat this as a snapshot. Verify against the vendor docs before committing to a tier in your stack.