Live pricing · updated daily

The definitive ranking of 319 AI models — by quality, cost, and value.

Independent. Open-source. Every number traced to its source. Compare 319 models across 19 providers on 60 GPUs — the data desk for everyone deciding which model to ship.

319 Models
60 GPUs
19 Providers
12 mo Price history
Free Public API
Open methodology
MIT-licensed
No signup
Refreshed daily
Pricing · last 12 months

Output prices fell 41% YoY.

Median $/1M output across the top 20 models, weighted by request volume. Markers show the major model releases.

[Chart: median $/1M output over the last 12 months, y-axis $2 to $4. Release markers: GPT-4o-mini, Llama 3.1 405B, o1-preview, DeepSeek R1, Llama 3.3, Today]
Today: $2.14
12 mo ago: $3.62
Analysis
The Leaderboard
Open full view
🥇 #1 · Most Popular · Alibaba · Qwen 2.5 7B (Qwen 2.5 · 7.6B · 128K ctx) · Quality: 70 · Cost $/M: $0.200 · Value: 350.0
🥈 #2 · Alibaba · Qwen 3 8B (Qwen 3 · 8.2B · 128K ctx) · Quality: 70 · Cost $/M: $0.200 · Value: 350.0
🥉 #3 · Alibaba · Qwen 2.5 1.5B (Qwen 2.5 · 1.5B · 32K ctx) · Quality: not shown · Cost $/M~: $0.027 · Value: 1862.0
#4 · Alibaba · Qwen 2.5 3B (Qwen 2.5 · 3.1B · 32K ctx) · Quality: 58 · Cost $/M: $0.100 · Value: 580.0
#5 · Meta · Llama 3.1 8B (Llama 3.1 · 8B · 128K ctx) · Quality: not shown · Cost $/M: $0.180 · Value: 322.2
#6 · Alibaba · Qwen 3 4B (Qwen 3 · 4B · 128K ctx) · Quality: 57 · Cost $/M: $0.100 · Value: 570.0
#7 · Pareto Q×C×S · Meta · Llama 3.2 3B (Llama 3.2 · 3.2B · 128K ctx) · Quality: 55 · Cost $/M: $0.060 · Value: 916.7
#8 · Alibaba · Qwen 3 32B (Qwen 3 · 32.8B · 128K ctx) · Quality: 74 · Cost $/M: $0.800 · Value: 92.5
#9 · Pareto Q×C×S · Meta · Llama 3.2 1B (Llama 3.2 · 1.2B · 128K ctx) · Quality: 38 · Cost $/M: $0.030 · Value: 1266.7
#10 · Meta · Llama 3 8B (Llama 3 · 8B · 8K ctx) · Quality: 63 · Cost $/M: $0.200 · Value: 315.0
#11 · Meta · HelpSteer2 Llama 3.1 70B (Llama 3.1 · 70.6B · 128K ctx) · Quality: 82 · Cost $/M: $0.500 · Value: 164.0
#12 · Meta · Llama 3.1 70B (Llama 3.1 · 70.6B · 128K ctx) · Quality: 75 · Cost $/M: $0.880 · Value: 85.2
#13 · Meta · Llama 3.1 70B Turbo (Llama 3.1 · 70.6B · 128K ctx) · Quality: not shown · Cost $/M: $0.880 · Value: 56.8
#14 · Mistral · NV EmbedQA Mistral 7B (NV EmbedQA · 7.2B · 32K ctx) · Quality: not shown · Cost $/M: $0.012 · Value: 4166.7
#15 · Mistral · E5 Mistral 7B (E5 · 7.1B · 32K ctx) · Quality: not shown · Cost $/M: $0.016 · Value: 3125.0
#16 · Google · Gemma 3 1B (Gemma 3 · 1B · 32K ctx) · Quality: 35 · Cost $/M~: $0.018 · Value: 1955.1
#17 · Mistral · BioMistral 7B (BioMistral · 7.2B · 32K ctx) · Quality: not shown · Cost $/M~: $0.129 · Value: 387.9
#18 · Mistral · Mistral 7B (Mistral · 7.3B · 32K ctx) · Quality: 56 · Cost $/M: $0.200 · Value: 280.0
#19 · Meta · TinyLlama 1.1B Chat (TinyLlama · 1.1B · 2K ctx) · Quality: not shown · Cost $/M~: $0.021 · Value: 2412.1
#20 · Meta · TinyLlama 1.1B (TinyLlama · 1.1B · 2K ctx) · Quality: not shown · Cost $/M~: $0.021 · Value: 2412.1
#21 · Pareto Q×C×S · Alibaba · Qwen 2.5 14B (Qwen 2.5 · 14.8B · 128K ctx) · Quality: 76 · Cost $/M: $0.400 · Value: 190.0
#22 · Alibaba · Qwen 2.5 72B (Qwen 2.5 · 72.7B · 128K ctx) · Quality: 77 · Cost $/M: $1.20 · Value: 64.2
#23 · Microsoft · Phi 2 (Phi · 2.7B · 2K ctx) · Quality: not shown · Cost $/M~: $0.054 · Value: 931.0
#24 · Alibaba · Qwen 2.5 32B (Qwen 2.5 · 32.5B · 128K ctx) · Quality: 73 · Cost $/M: $0.800 · Value: 91.3
#25 · DeepSeek · DeepSeek R1 Distill 1.5B (DeepSeek R1 · 1.5B · 128K ctx) · Quality: 42 · Cost $/M~: $0.027 · Value: 1564.1
#26 · DeepSeek · DeepSeek R1 Distill 8B (DeepSeek R1 · 8B · 128K ctx) · Quality: not shown · Cost $/M: $0.200 · Value: 440.0
#27 · DeepSeek · DeepSeek R1 Distill 14B (DeepSeek R1 · 14.8B · 128K ctx) · Quality: not shown · Cost $/M: $0.300 · Value: 293.3
#28 · DeepSeek · DeepSeek R1 Distill 32B (DeepSeek R1 · 32.8B · 128K ctx) · Quality: not shown · Cost $/M: $0.600 · Value: 146.7
#29 · DeepSeek · DeepSeek R1 Distill 70B (DeepSeek R1 · 70.6B · 128K ctx) · Quality: not shown · Cost $/M: $0.880 · Value: 100.0
#30 · Pareto Q×C×S · DeepSeek · DeepSeek R1 (DeepSeek R1 · 671B MoE (37B active) · 128K ctx) · Quality: 88 · Cost $/M: $2.19 · Value: 40.2

Showing 319 of 319 models

Tracking 319 AI models across 60 GPUs and 19 providers, updated daily. The top-ranked model for overall quality is BGE Small EN v1.5, available from $0.00/million output tokens. Rankings use InferenceBench's composite scoring, which combines benchmark results (MMLU, HumanEval, GSM8K), inference cost, and throughput efficiency.
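In the leaderboard rows above, the Value figure appears to be the quality score divided by the cost per million tokens (for example, 70 / 0.200 = 350.0). That relationship is inferred from the listed numbers, not from a documented formula, so treat the sketch below as an assumption. `valueScore` is a hypothetical helper name.

```typescript
// Hypothetical helper: the ratio is inferred from the leaderboard rows,
// not taken from InferenceBench's published methodology.
function valueScore(quality: number, costPerMillionUsd: number): number {
  if (costPerMillionUsd <= 0) {
    // A $0.00 price would make the ratio undefined.
    throw new RangeError("cost must be positive to compute a ratio");
  }
  return quality / costPerMillionUsd;
}

// Three rows copied from the leaderboard.
const sampleRows = [
  { model: "Qwen 2.5 7B", quality: 70, cost: 0.2 },
  { model: "Llama 3.2 3B", quality: 55, cost: 0.06 },
  { model: "DeepSeek R1", quality: 88, cost: 2.19 },
];

for (const row of sampleRows) {
  console.log(`${row.model}: ${valueScore(row.quality, row.cost).toFixed(1)}`);
}
```

Rows whose cost is marked with `~` do not reproduce exactly from the rounded price shown, which suggests the site computes Value from unrounded costs.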

Top picks

One winner per axis — best value, best quality, cheapest, fastest

Market Map

Where every model sits on the price-vs-quality curve, and which few set the floor for everyone else.

Open comparator
As of 20 JUN 2026
$2.19/M buys quality 88 vs $150/M for the same tier · 68× spread
[Scatter plot: quality (70 to 100) against $/1M output (log scale, $0.03 to $30), with the Pareto frontier highlighted. Frontier points are listed under On the Frontier.]
Plotted off the frontier:
  • o1 · $60/M · Q 93
  • GPT-4.5 Preview · $150/M · Q 93
  • Grok 3 · $15/M · Q 90
  • Claude Opus 4 · $75/M · Q 90
  • Gemini 2.0 Pro · $4.00/M · Q 88
  • o3-mini · $4.40/M · Q 86
  • Nemotron Ultra 253B · $6.00/M · Q 86
  • Claude Sonnet 4 · $15/M · Q 86
  • Nemotron 340B · $4.20/M · Q 85
  • GPT-4o · $10/M · Q 85
  • Llama 4 Maverick · $1.80/M · Q 84
  • Nemotron-3 Super 120B · $2.40/M · Q 84
  • Llama 3.1 Nemotron 70B Instruct · $1.00/M · Q 83
  • Qwen 3 235B · $3.00/M · Q 83
  • o1-mini · $12/M · Q 83
  • MiniMax M2.7 · $2.80/M · Q 82
  • Llama 3.1 405B · $3.50/M · Q 81
  • Command A · $10/M · Q 81
  • Llama 3.1 Nemotron 70B Reward · $0.500/M · Q 80
  • Qwen 2.5 Coder 32B · $0.800/M · Q 80
  • Llama 3 70B · $0.880/M · Q 80
  • Gemini 1.5 Pro · $5.00/M · Q 80
30 plotted · 8 on frontier · Pareto · $/1M out · log
On the Frontier
#  Model                            $/M     Q
1  Gemini 2.0 Flash (Pick)          $0.400  80
2  DeepSeek V3                      $0.420  81
3  HelpSteer2 Llama 3.1 70B         $0.500  82
4  Nemotron 70B                     $0.880  83
5  Llama 3.2 90B Vision Instruct    $1.20   84
6  DeepSeek R1                      $2.19   88
7  Grok-3                           $15     91
8  Llama 4 Behemoth                 $16     93
Cheapest 85+ quality: $2.19/M
Median price plotted: $4.00/M
Frontier vs median: 10.0× cheaper
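A price-vs-quality Pareto frontier keeps only the models that no other model beats on both axes, meaning nothing else is at least as cheap and at least as good. The sketch below shows one common way to compute it with a sort and a single sweep. It is not necessarily how InferenceBench builds its frontier, and `ModelPoint` and `paretoFrontier` are hypothetical names.

```typescript
interface ModelPoint {
  name: string;
  costPerM: number; // USD per 1M output tokens
  quality: number;  // quality score, higher is better
}

// Returns the points not dominated by any cheaper-or-equal, better-or-equal point.
function paretoFrontier(points: ModelPoint[]): ModelPoint[] {
  // Cheapest first; on a price tie, the higher-quality point comes first.
  const sorted = [...points].sort(
    (a, b) => a.costPerM - b.costPerM || b.quality - a.quality
  );
  const frontier: ModelPoint[] = [];
  let bestQuality = -Infinity;
  for (const p of sorted) {
    // A point survives only if it beats every cheaper point on quality.
    if (p.quality > bestQuality) {
      frontier.push(p);
      bestQuality = p.quality;
    }
  }
  return frontier;
}

// A subset of the plotted models, with prices and scores as listed on this page.
const plotted: ModelPoint[] = [
  { name: "Gemini 2.0 Flash", costPerM: 0.4, quality: 80 },
  { name: "Qwen 2.5 Coder 32B", costPerM: 0.8, quality: 80 },
  { name: "DeepSeek V3", costPerM: 0.42, quality: 81 },
  { name: "DeepSeek R1", costPerM: 2.19, quality: 88 },
  { name: "Gemini 2.0 Pro", costPerM: 4.0, quality: 88 },
  { name: "GPT-4o", costPerM: 10, quality: 85 },
];

console.log(paretoFrontier(plotted).map((p) => p.name).join(", "));
```

On this subset the sweep keeps Gemini 2.0 Flash, DeepSeek V3, and DeepSeek R1, which agrees with their placement on the frontier list above.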

Market Moves

Biggest price drops and the freshest releases — past 30 days

Class Leaders

Best for Code · Math · Reasoning · Vision · Small (<15B) · Large (70B+)

All categories

Workloads

Curated model + GPU shortlist for the workload you're building

Provider spotlight

The top inference providers ranked by reputation

Sandbox

Chat with any model · compare two at once · vote in the arena

Open arena

News & research

Latest deep-dives from the benchmark team

All posts

Worked examples

Pre-computed scenarios from the engine — Build vs Buy · Bottleneck · Forecast

See all analyses
Build vs Buy

Llama 3 70B 1M Context chatbot, 1B tokens / month

Two cost paths for the same workload: rent the API per-token from the cheapest provider, or rent 2× H100 PCIe 24×7 and serve it yourself.

API (cheapest provider): $740/mo
Self-host (2× H100 PCIe): $3.3k/mo
$2.6k extra/mo · self-hosting is 352% pricier
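The comparison above is simple arithmetic, sketched below. The $0.74 per 1M tokens and $2.29 per GPU-hour inputs are assumptions back-calculated from the monthly totals shown here ($740 for 1B tokens, roughly $3.3k for two GPUs running around the clock). They are not prices quoted by the page, and `buildVsBuy` is a hypothetical helper.

```typescript
interface BuildVsBuy {
  apiUsd: number;      // monthly cost of paying per token
  selfHostUsd: number; // monthly cost of renting GPUs 24×7
  extraUsd: number;    // self-host minus API
  extraPct: number;    // extra cost as a percentage of the API cost
}

function buildVsBuy(
  tokensPerMonth: number,
  apiPricePerM: number, // USD per 1M tokens
  gpuCount: number,
  gpuHourlyUsd: number, // USD per GPU-hour
  hoursPerMonth = 730   // about 24 × 365 / 12
): BuildVsBuy {
  const apiUsd = (tokensPerMonth / 1e6) * apiPricePerM;
  const selfHostUsd = gpuCount * gpuHourlyUsd * hoursPerMonth;
  const extraUsd = selfHostUsd - apiUsd;
  return { apiUsd, selfHostUsd, extraUsd, extraPct: (extraUsd / apiUsd) * 100 };
}

// Assumed inputs: 1B tokens/month, $0.74/M via API, 2 GPUs at $2.29/hour.
const scenario = buildVsBuy(1e9, 0.74, 2, 2.29);
console.log(
  `API $${scenario.apiUsd.toFixed(0)}/mo vs self-host $${scenario.selfHostUsd.toFixed(0)}/mo ` +
    `(+${scenario.extraPct.toFixed(0)}%)`
);
```

The sketch ignores utilization, engineering time, and whether two GPUs can actually sustain the workload, all of which can shift the break-even point.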
Bottleneck X-Ray

Llama 3 70B 1M Context FP8 on H100 PCIe, batch 1

Decode is memory-bandwidth dominated at batch 1: every generated token re-reads the full set of model weights from GPU memory, so compute sits idle. Splitting across 2 GPUs (TP=2) doubles the BW ceiling.

99% BW-bound
BW ceiling: 28 tok/s
Compute ceiling: 11k tok/s
+28 tok/s at TP=2 (2× BW)
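The two ceilings can be approximated with a back-of-the-envelope roofline estimate. The bandwidth ceiling is memory bandwidth divided by the bytes of weights read per token, and the compute ceiling is peak FLOP/s divided by roughly 2 FLOPs per parameter per token. The hardware figures below (about 2.0 TB/s and about 1,513 dense FP8 TFLOPS for an H100 PCIe) are assumed round numbers, not values from this page. The sketch also ignores KV-cache traffic and inter-GPU communication, and `decodeCeilings` is a hypothetical name.

```typescript
interface GpuSpec {
  memBandwidthBytesPerS: number;
  peakFlopsPerS: number;
}

interface Ceilings {
  bwTokPerS: number;
  computeTokPerS: number;
  bound: "bandwidth" | "compute";
}

// Batch-1 decode: each token streams all weights once and costs ~2 FLOPs per parameter.
function decodeCeilings(
  paramCount: number,
  bytesPerParam: number,
  gpu: GpuSpec,
  tensorParallel = 1
): Ceilings {
  const weightBytes = paramCount * bytesPerParam;
  const bwTokPerS = (gpu.memBandwidthBytesPerS * tensorParallel) / weightBytes;
  const computeTokPerS = (gpu.peakFlopsPerS * tensorParallel) / (2 * paramCount);
  return {
    bwTokPerS,
    computeTokPerS,
    bound: bwTokPerS < computeTokPerS ? "bandwidth" : "compute",
  };
}

// Assumed H100 PCIe figures (approximate, see lead-in).
const h100Pcie: GpuSpec = { memBandwidthBytesPerS: 2.0e12, peakFlopsPerS: 1513e12 };

const tp1 = decodeCeilings(70e9, 1, h100Pcie);    // 70B parameters at FP8 (1 byte each)
const tp2 = decodeCeilings(70e9, 1, h100Pcie, 2); // same model split across 2 GPUs

console.log(`TP=1: ${tp1.bwTokPerS.toFixed(1)} tok/s BW ceiling, ${tp1.bound}-bound`);
console.log(`TP=2 adds ${(tp2.bwTokPerS - tp1.bwTokPerS).toFixed(1)} tok/s`);
```

Under these assumptions the estimate lands close to the 28 tok/s bandwidth ceiling and 11k tok/s compute ceiling shown above.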
Where prices live

Output-token prices · top 20 by quality

The spread between budget and premium $/M, today. Sparkline shows the sorted distribution on log scale.

63× spread · p10 → p90
  • Budget · p10: Llama 3.2 90B Vision Instruct · $1.20
  • Median: Nemotron Ultra 253B · $6.00
  • Premium · p90: Claude Opus 4 · $75.00
20 models with verified pricing & quality

Build on it

The data that powers the leaderboard, exposed as a free public API

pricing.ts
curl-friendly · zero auth
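// Get the three cheapest providers for Llama 3.1 70B
interface ProviderPrice {
  provider: string;
  output_per_m: number; // USD per 1M output tokens
}

const res = await fetch(
  'https://inferencebench.io/api/v1/models/meta-llama/llama-3.1-70b/pricing'
);
if (!res.ok) throw new Error(`Pricing request failed: HTTP ${res.status}`);
const { providers } = (await res.json()) as { providers: ProviderPrice[] };

// Copy before sorting so the parsed response is not reordered in place.
[...providers]
  .sort((a, b) => a.output_per_m - b.output_per_m)
  .slice(0, 3)
  .forEach((p) =>
    // toFixed(2) keeps trailing zeros, e.g. 0.4 prints as 0.40
    console.log(`${p.provider} · $${p.output_per_m.toFixed(2)}/M`)
  );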

// DeepInfra   · $0.40/M
// Groq        · $0.79/M
// Openrouter  · $0.79/M
Public REST API

Build whatever you want on the data

The same data the leaderboard renders from, exposed as plain JSON. No API key. Generous rate limits.

  • GET /api/v1/models · 297 models
  • GET /api/v1/gpus · 60 GPUs
  • GET /api/v1/providers · 19 providers
  • GET /api/v1/pricing · live snapshots
  • GET /api/v1/leaderboard · 9 categories
Read the API docs

Tools & calculators

Go deeper into the parts you care about — every page is free, no signup

Methodology & trust

How we measure, where the data comes from, how to build on it

Frequently asked questions

Common questions about the benchmark, methodology, and how the data is sourced

What is an AI inference benchmark?

An AI inference benchmark measures how fast a GPU or cloud provider can generate tokens from a large language model (LLM). Key metrics include tokens per second (throughput), time to first token (TTFT), inter-token latency (ITL), and cost per million tokens.
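As an illustration of how these metrics relate, the sketch below derives TTFT, mean ITL, and throughput from the arrival times of streamed tokens. The function name, input shape, and sample timestamps are hypothetical. This is not InferenceBench's measurement code, and it uses one common definition of throughput (tokens over total request time).

```typescript
interface LatencyMetrics {
  ttftMs: number;          // time to first token
  meanItlMs: number;       // mean gap between consecutive tokens
  tokensPerSecond: number; // tokens generated per second of total request time
}

// requestStartMs: when the request was sent.
// tokenTimesMs: arrival time of each token, in ascending order.
function latencyMetrics(requestStartMs: number, tokenTimesMs: number[]): LatencyMetrics {
  if (tokenTimesMs.length === 0) {
    throw new Error("at least one token timestamp is required");
  }
  const first = tokenTimesMs[0];
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  const gaps = tokenTimesMs.length - 1;
  return {
    ttftMs: first - requestStartMs,
    meanItlMs: gaps > 0 ? (last - first) / gaps : 0,
    tokensPerSecond: tokenTimesMs.length / ((last - requestStartMs) / 1000),
  };
}

// Made-up example: first token after 200 ms, then one token every 50 ms.
const metrics = latencyMetrics(0, [200, 250, 300, 350, 400]);
console.log(metrics); // { ttftMs: 200, meanItlMs: 50, tokensPerSecond: 12.5 }
```

Some benchmarks exclude TTFT from the throughput window and report decode-only tokens per second, so check which definition a number uses before comparing it.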

How does InferenceBench measure GPU performance?

InferenceBench uses a roofline performance model combined with CUDA kernel-level modeling (FlashAttention, PagedAttention, fused kernels) to predict real-world inference throughput. Results are validated against actual benchmarks from the HuggingFace LLM Perf Leaderboard and provider-reported data.

Which GPU is fastest for LLM inference?

Performance depends on model size. For large models (70B+), the NVIDIA B200 and H200 lead in throughput. For mid-size models (7B–30B), the H100 SXM offers the best price-performance. For budget deployments, the RTX 4090 and L40S are strong contenders.

How often is benchmark data updated?

Pricing data is refreshed every 6 hours via automated API calls to providers. Benchmark results are updated when new GPU hardware or model architectures are released. Community-submitted data is verified before inclusion.

Ready to pick a model?

The full live ranking is one click away — sort, filter, and compare every model by quality, cost, and value.

Built with care · Open source · MIT-licensed data · No signup