Live pricing · updated daily

The definitive ranking of 319 AI models — by quality, cost, and value.

Independent. Open-source. Every number traced to its source. Compare 319 models across 19 providers on 60 GPUs — the data desk for everyone deciding which model to ship.

Open the leaderboard · Cost calculator · Use the API →

319 Models

60 GPUs

19 Providers

12 mo Price history

Free Public API

Open methodology

MIT-licensed

No signup

Refreshed daily

THE WIRE · EDITORS

Updated 2w · 0 / 24h

PRICE · Fireworks cuts GPT-4o-mini · −18% · 2w

LAUNCH · Qwen3 Coder · 94 HumanEval · MIT · OPEN · 2w

BENCH · DeepSeek R1 retakes Math · 96.4 GSM8K · #1 · 2w

INFRA · Together adds 12 new H100 SXM5 · +12 · 2w

FCAST · Median $/1M output tokens dropped MoM · −8% · 2w

Pricing · last 12 months

Output prices fell −41% YoY.

Median $/1M output across the top 20 models, weighted by request volume. Markers flag the major model releases.

Today

$2.14

12 mo ago

$3.62
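The headline figure can be reproduced from the two medians shown above. A minimal sketch, where `pctChange` is a hypothetical helper name and not part of the site's API:

```typescript
// Percent change from an earlier value to a later one.
// pctChange is an illustrative helper, not part of the InferenceBench API.
function pctChange(before: number, after: number): number {
  if (before === 0) throw new Error("before must be non-zero");
  return ((after - before) / before) * 100;
}

// Median $/1M output tokens, taken from the two figures above.
const yoy = pctChange(3.62, 2.14);
console.log(`${yoy.toFixed(1)}%`); // -40.9%, which the page rounds to −41%
```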

Analysis

The Leaderboard

Open full view→

Just announced
Nano Banana 2 (Gemini 3.1 Flash Image) · Jun. 2026
Nano Banana Pro (Gemini 3 Pro Image) · Jun. 2026
Claude Fable 5 · Jun. 2026
Qwen3.7 Plus · Jun. 2026
Claude Opus 4.8 · May 2026
Qwen3.7 Max · May 2026
Gemini 3.5 Flash · May 2026
Gemini 3.1 Flash Lite · May 2026
22 entries with verified release dates but no full architecture/pricing yet

#	Model	Family	Params	Quality	Input $/M	Output $/M	Speed	Tokens/$	Context	Providers	Reasoning×	Value	Badge	Released
1	Qwen 2.5 7B	Qwen 2.5	7.6B	70	$0.200	$0.200	27 tok/s	135.50 M	128K	4	—	350.0	Most Popular	Sep. 2024
2	Qwen 3 8B	Qwen 3	8.2B	70	$0.200	$0.200	49 tok/s	243.90 M	128K	4	12.7×	350.0		Apr. 2025
3	Qwen 2.5 1.5B	Qwen 2.5	1.5B	—	$0.027~	$0.027~	—	—	32K	—	—	1862.0		Sep. 2024
4	Qwen 2.5 3B	Qwen 2.5	3.1B	58	$0.100	$0.100	49 tok/s*	490.00 M	32K	1	—	580.0		Sep. 2024
5	Llama 3.1 8B	Llama 3.1	8B	—	$0.180	$0.180	35 tok/s	193.24 M	128K	10	—	322.2		Jul. 2024
6	Qwen 3 4B	Qwen 3	4B	57	$0.100	$0.100	12 tok/s	117.30 M	128K	1	3.7×	570.0		Apr. 2025
7	Llama 3.2 3B	Llama 3.2	3.2B	55	$0.060	$0.060	154 tok/s	2562.50 M	128K	3	—	916.7	Pareto Q×C×S	Sep. 2024
8	Qwen 3 32B	Qwen 3	32.8B	74	$0.800	$0.800	76 tok/s	94.70 M	128K	7	11.5×	92.5		Apr. 2025
9	Llama 3.2 1B	Llama 3.2	1.2B	38	$0.030	$0.030	33 tok/s	1112.66 M	128K	5	—	1266.7	Pareto Q×C×S	Sep. 2024
10	Llama 3 8B	Llama 3	8B	63	$0.200	$0.200	—	—	8K	2	—	315.0		Apr. 2024
11	HelpSteer2 Llama 3.1 70B	Llama 3.1	70.6B	82	$0.500	$0.500	—	—	128K	5	—	164.0		Aug. 2024
12	Llama 3.1 70B	Llama 3.1	70.6B	75	$0.880	$0.880	37 tok/s	42.57 M	128K	8	—	85.2		Jul. 2024
13	Llama 3.1 70B Turbo	Llama 3.1	70.6B	—	$0.880	$0.880	—	—	128K	2	—	56.8		Jul. 2024
14	NV EmbedQA Mistral 7B	NV EmbedQA	7.2B	—	$0.012	$0.012	157 tok/s*	13083.33 M	32K	1	—	4166.7		Jun. 2024
15	E5 Mistral 7B	E5	7.1B	—	$0.016	$0.016	160 tok/s*	10000.00 M	32K	1	—	3125.0		Dec. 2023
16	Gemma 3 1B	Gemma 3	1B	35	$0.018~	$0.018~	—	—	32K	—	—	1955.1		Mar. 2025
17	BioMistral 7B	BioMistral	7.2B	—	$0.129~	$0.129~	—	—	32K	—	—	387.9		Feb. 2024
18	Mistral 7B	Mistral	7.3B	56	$0.200	$0.200	—	—	32K	3	—	280.0		Sep. 2023
19	TinyLlama 1.1B Chat	TinyLlama	1.1B	—	$0.021~	$0.021~	—	—	2K	—	—	2412.1		Jan. 2024
20	TinyLlama 1.1B	TinyLlama	1.1B	—	$0.021~	$0.021~	—	—	2K	—	—	2412.1		Jan. 2024
21	Qwen 2.5 14B	Qwen 2.5	14.8B	76	$0.400	$0.400	49 tok/s*	122.50 M	128K	2	—	190.0	Pareto Q×C×S	Sep. 2024
22	Qwen 2.5 72B	Qwen 2.5	72.7B	77	$1.20	$1.20	21 tok/s	17.58 M	128K	6	—	64.2		Sep. 2024
23	Phi 2	Phi	2.7B	—	$0.054~	$0.054~	—	—	2K	—	—	931.0		Dec. 2023
24	Qwen 2.5 32B	Qwen 2.5	32.5B	73	$0.800	$0.800	23 tok/s*	28.75 M	128K	2	—	91.3		Sep. 2024
25	DeepSeek R1 Distill 1.5B	DeepSeek R1	1.5B	42	$0.027~	$0.027~	—	—	128K	—	9.0×	1564.1		Jan. 2025
26	DeepSeek R1 Distill 8B	DeepSeek R1	8B	—	$0.200	$0.200	41 tok/s*	203.58 M	128K	1	—	440.0		Jan. 2025
27	DeepSeek R1 Distill 14B	DeepSeek R1	14.8B	—	$0.300	$0.300	22 tok/s	73.98 M	128K	1	7.8×	293.3		Jan. 2025
28	DeepSeek R1 Distill 32B	DeepSeek R1	32.8B	—	$0.600	$0.600	—	—	128K	3	8.3×	146.7		Jan. 2025
29	DeepSeek R1 Distill 70B	DeepSeek R1	70.6B	—	$0.880	$0.880	31 tok/s	34.90 M	128K	6	2.4×	100.0		Jan. 2025
30	DeepSeek R1	DeepSeek R1	671B	88	$0.550	$2.19	37 tok/s	16.91 M	128K	5	19.0×	40.2	Pareto Q×C×S	Jan. 2025

Showing 1–30 of 319 models
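
For the rows above that list both a quality score and an exact (non-`~`) output price, the Value column matches quality divided by output $/M. That relationship is read off the table, not taken from the published methodology, so treat the sketch below as an assumption; `valueScore` is a hypothetical name.

```typescript
// Inferred relationship: value ≈ quality / output price per 1M tokens.
// This is an observation about the table rows, not a documented formula.
function valueScore(quality: number, outputPerM: number): number {
  if (outputPerM <= 0) throw new Error("price must be positive");
  return quality / outputPerM;
}

const rows = [
  { model: "Qwen 2.5 7B", quality: 70, outputPerM: 0.2 },   // listed value 350.0
  { model: "Llama 3.2 3B", quality: 55, outputPerM: 0.06 }, // listed value 916.7
  { model: "DeepSeek R1", quality: 88, outputPerM: 2.19 },  // listed value 40.2
];

for (const r of rows) {
  console.log(r.model, valueScore(r.quality, r.outputPerM).toFixed(1));
}
```

Rows priced with a `~` marker land close to, but not exactly on, this ratio, which suggests the displayed price is rounded.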

…

Data freshness: pricing 1 mo ago · latency 1 mo ago · quality benchmarks 1 yr ago

Tracking 319 AI models across 60 GPUs and 19 providers, updated daily. The top-ranked model for overall quality is BGE Small EN v1.5 (no quality score listed), available from $0.00/million output tokens. Rankings use InferenceBench's composite scoring, which combines benchmark results (MMLU, HumanEval, GSM8K), inference cost, and throughput efficiency.

Top picks

One winner per axis — best value, best quality, cheapest, fastest

Best value

Market Map

Where every model sits on the price-vs-quality curve, and which few set the floor for everyone else.

Open comparator→

As of 20 JUN 2026

$2.19/M buys quality 88

$150/M for the same tier

68× spread

30 plotted · 8 on frontier · Pareto · $/1M out · log

On the Frontier

#	Model	$/M	Q
1	Gemini 2.0 Flash (Pick)	$0.400	80
2	DeepSeek V3	$0.420	81
3	HelpSteer2 Llama 3.1 70B	$0.500	82
4	Nemotron 70B	$0.880	83
5	Llama 3.2 90B Vision Instruct	$1.20	84
6	DeepSeek R1	$2.19	88
7	Grok-3	$15	91
8	Llama 4 Behemoth	$16	93

Cheapest 85+ quality: $2.19/M

Median price plotted: $4.00/M

Frontier vs median: 10.0× cheaper
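
A price-vs-quality frontier of this kind can be computed with a single sorted sweep. The sketch below is one common way to do it, not necessarily how the site builds its chart; `Point` and `paretoFrontier` are hypothetical names, and the sample points are taken from the tables on this page.

```typescript
interface Point {
  model: string;
  price: number;   // output $/1M tokens
  quality: number; // composite quality score
}

// Drop any model for which some model at the same or lower price
// has equal or higher quality; what remains is the frontier.
function paretoFrontier(points: Point[]): Point[] {
  // Cheapest first; at equal price, the higher-quality model comes first.
  const sorted = [...points].sort(
    (a, b) => a.price - b.price || b.quality - a.quality
  );
  const frontier: Point[] = [];
  let bestQuality = -Infinity;
  for (const p of sorted) {
    if (p.quality > bestQuality) {
      frontier.push(p);
      bestQuality = p.quality;
    }
  }
  return frontier;
}

const sample: Point[] = [
  { model: "Llama 3.1 70B", price: 0.88, quality: 75 },
  { model: "DeepSeek R1", price: 2.19, quality: 88 },
  { model: "Gemini 2.0 Flash", price: 0.4, quality: 80 },
  { model: "Qwen 2.5 72B", price: 1.2, quality: 77 },
  { model: "DeepSeek V3", price: 0.42, quality: 81 },
];

console.log(paretoFrontier(sample).map((p) => p.model));
// [ 'Gemini 2.0 Flash', 'DeepSeek V3', 'DeepSeek R1' ]
```

Llama 3.1 70B and Qwen 2.5 72B fall off because DeepSeek V3 is both cheaper and higher quality than either.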

Market Moves

Biggest price drops and the freshest releases — past 30 days

Biggest provider-price spreads

Switch & save

Qwen 3 Coder 8B · Cheapest on Alibaba · −92% · $1.80 → $0.15
DeepSeek V3 · Cheapest on DeepSeek · −85% · $2.80 → $0.42
Llama 3.2 1B · Cheapest on Together · −85% · $0.20 → $0.03
Llama Guard 3 8B · Cheapest on OpenRouter · −85% · $0.20 → $0.03
Code Llama 7B · Cheapest on Together · −83% · $1.20 → $0.20

Just announced

By release date

Workloads

Curated model + GPU shortlist for the workload you're building

Chatbot

Realtime UX · low p50

HumanEval-tuned picks

Long context · RAG-ready

Start with

Llama 3.2 1B · $0.030/M

132 models matched

Real-time translation

Low-latency multilingual

Provider spotlight

The top inference providers ranked by reputation

Reputation = pricing competitiveness · uptime · model coverage · feature parity

See all 10 providers →

Sandbox

Chat with any model · compare two at once · vote in the arena

Open arena→

Draft a release note for our new pricing API.

Llama 3.1 70B · via Together

We just shipped a public pricing API — live $/M tokens across 19 providers in one call. No key required. Full schema in the docs…

Streaming · swap providers · no login

Playground

Chat live with any model

Streamed responses, side-by-side providers, real measured throughput. No login.

Open

When should we use RAG vs fine-tuning?

GPT-5

RAG is the right move when answers must cite source documents that change daily — search beats fine-tuning here.

Claude Sonnet 4

Use RAG for fresh, citable knowledge; fine-tune when you need a fixed persona or domain-specific style.

Both streaming · latency · $/M side-by-side

Head-to-head

Compare two models on the same prompt

Watch both stream at once. See latency, $/M tokens and answer quality side-by-side.

Open

Write a SQL query: top 10 customers by revenue last quarter.

Model A

SELECT c.id, SUM(o.amount) AS rev FROM customers c JOIN orders o ON o.customer_id = c.id WHERE…

Model B

WITH last_q AS (SELECT * FROM orders WHERE quarter(date) = …) SELECT customer_id, SUM(amount)…

Blind · identities revealed after vote · Elo updates live

Arena

Blind-vote the head-to-heads

Two anonymous models answer. You pick the winner. Elo ranking updates live.

Open
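
For readers unfamiliar with Elo, a single vote moves both ratings by the same amount in opposite directions. The sketch below uses the textbook formula; the K-factor of 32 and the 400-point scale are conventional defaults assumed here, and the arena's actual parameters are not stated on this page.

```typescript
// Standard Elo update for one head-to-head vote.
// scoreA is 1 if model A wins, 0 if it loses, 0.5 for a tie.
// k = 32 is an assumed default, not a documented arena setting.
function eloUpdate(
  ratingA: number,
  ratingB: number,
  scoreA: 0 | 0.5 | 1,
  k = 32
): [number, number] {
  const expectedA = 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
  const delta = k * (scoreA - expectedA);
  return [ratingA + delta, ratingB - delta];
}

const [a, b] = eloUpdate(1500, 1500, 1);
console.log(a, b); // 1516 1484
```

An upset (a lower-rated model winning) produces a larger swing than an expected result, which is why rankings settle as votes accumulate.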

News & research

Latest deep-dives from the benchmark team

All posts→

NVIDIA · Apr 14, 2026 · 22 min read

Nemotron Super 120B vs Ultra 253B: NVIDIA's Best Open-Weight Models Benchmarked

Nemotron Ultra FP8 scores 9.47 MT-Bench, beating its own BF16 at 9.2. Super hits 6,567 tok/s. Both fail tool use and vision at 0%. Full SWOT analysis.

Read the full post

Latest research

All posts

Worked examples

Pre-computed scenarios from the engine — Build vs Buy · Bottleneck · Forecast

See all analyses→

Build vs Buy

Llama 3 70B 1M Context chatbot, 1B tokens / month

Two cost paths for the same workload: rent the API per-token from the cheapest provider, or rent 2× H100 PCIe 24×7 and serve it yourself.

API (cheapest provider): $740/mo

Self-host (2× H100 PCIe): $3.3k/mo

$2.6k extra spent/mo · 352% pricier to self-host

Open the calculator
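
The two cost paths reduce to simple arithmetic. In the sketch below, the $0.74/M token price and the $2.26 per GPU-hour rate are back-calculated from the rounded monthly figures above and are assumptions, as are the function names and the 730-hour month.

```typescript
// Monthly API cost: tokens per month times the $/1M-token price.
function apiCostPerMonth(tokensPerMonth: number, pricePerM: number): number {
  return (tokensPerMonth / 1e6) * pricePerM;
}

// Monthly self-hosting cost for GPUs rented around the clock.
// 730 hours is the average length of a month (8,760 h / 12).
function selfHostCostPerMonth(
  gpus: number,
  dollarsPerGpuHour: number,
  hoursPerMonth = 730
): number {
  return gpus * dollarsPerGpuHour * hoursPerMonth;
}

const api = apiCostPerMonth(1e9, 0.74);         // assumed $0.74/M
const selfHost = selfHostCostPerMonth(2, 2.26); // assumed $2.26/GPU-hour
const extra = selfHost - api;

console.log(`API: $${Math.round(api)}/mo`);
console.log(`Self-host: $${Math.round(selfHost)}/mo`);
console.log(`Extra: $${Math.round(extra)}/mo`);
```

With these rounded inputs the premium comes out slightly below the page's 352%, which presumably reflects unrounded source prices.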

Bottleneck X-Ray

Llama 3 70B 1M Context FP8 on H100 PCIe, batch 1

Decode is memory-bandwidth-bound at batch 1: every generated token re-reads all of the model weights from memory while compute sits idle. Splitting across 2 GPUs (TP=2) doubles the aggregate bandwidth, and with it the BW ceiling.

99% BW-bound

BW ceiling: 28 tok/s

Compute ceiling: 11k tok/s

+28 tok/s at TP=2 (2× BW)

See the X-Ray method
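
The two ceilings follow from a basic roofline estimate. The sketch below is a simplification that ignores KV-cache reads and inter-GPU communication; the H100 PCIe figures (2,000 GB/s memory bandwidth, roughly 1,513 dense FP8 TFLOPS) are assumed hardware numbers, not values quoted on this page, and the function names are hypothetical.

```typescript
// Bandwidth ceiling: each decoded token streams all weights from memory once.
function bandwidthCeilingTokPerSec(
  paramsBillions: number,
  bytesPerParam: number,
  memBandwidthGBs: number
): number {
  const weightGB = paramsBillions * bytesPerParam;
  return memBandwidthGBs / weightGB;
}

// Compute ceiling: roughly 2 FLOPs per parameter per generated token.
function computeCeilingTokPerSec(
  paramsBillions: number,
  peakTflops: number
): number {
  return (peakTflops * 1e12) / (2 * paramsBillions * 1e9);
}

// 70B parameters at FP8 (1 byte each) on an assumed H100 PCIe.
const bw1 = bandwidthCeilingTokPerSec(70, 1, 2000);     // about 28.6 tok/s
const bw2 = bandwidthCeilingTokPerSec(70, 1, 2 * 2000); // about 57.1 tok/s at TP=2
const compute = computeCeilingTokPerSec(70, 1513);      // about 10,800 tok/s

console.log(bw1.toFixed(1), bw2.toFixed(1), Math.round(compute));
```

The compute ceiling sits hundreds of times above the bandwidth ceiling, which is why batch-1 decode is almost entirely bandwidth-bound.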

Where prices live

Output-token prices · top 20 by quality

The spread between budget and premium $/M, today. Sparkline shows the sorted distribution on log scale.

63× spread · p10 → p90

Budget · p10: Llama 3.2 90B Vision Instruct · $1.20
Median: Nemotron Ultra 253B · $6.00
Premium · p90: Claude Opus 4 · $75.00

20 models with verified pricing & quality

Browse provider pricing

Build on it

The data that powers the leaderboard, exposed as a free public API

pricing.ts

curl-friendly · zero auth

// Get the three cheapest providers for Llama 3.1 70B
const res = await fetch(
  'https://inferencebench.io/api/v1/models/meta-llama/llama-3.1-70b/pricing'
);
if (!res.ok) throw new Error(`Pricing request failed: HTTP ${res.status}`);
const { providers } = await res.json();

// Copy before sorting so the parsed response is left untouched
[...providers]
  .sort((a, b) => a.output_per_m - b.output_per_m)
  .slice(0, 3)
  .forEach((p) =>
    console.log(`${p.provider.padEnd(11)} · $${p.output_per_m.toFixed(2)}/M`)
  );

// Example output:
// DeepInfra   · $0.40/M
// Groq        · $0.79/M
// Openrouter  · $0.79/M

Public REST API

Build whatever you want on the data

Same data the leaderboard renders from, exposed as plain JSON. No API key. Generous rate limits.

GET /api/v1/models · 297
GET /api/v1/gpus · 60
GET /api/v1/providers · 19
GET /api/v1/pricing · live snapshots
GET /api/v1/leaderboard · 9 categories
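
A thin wrapper over these endpoints might look like the sketch below. The base URL and paths come from this page; the response schema is not shown here, so the result is typed as `unknown` for the caller to narrow. `getJson`, `Endpoint`, and `FetchLike` are hypothetical names, and the fetch function is passed in so the helper can be exercised without a network call.

```typescript
const API_BASE = "https://inferencebench.io/api/v1";

type Endpoint = "models" | "gpus" | "providers" | "pricing" | "leaderboard";

// Minimal shape of the fetch function this helper needs.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

// Fetch one endpoint and return its parsed JSON body.
// Throws on any non-2xx status instead of returning an error payload.
async function getJson(
  endpoint: Endpoint,
  fetchImpl: FetchLike
): Promise<unknown> {
  const res = await fetchImpl(`${API_BASE}/${endpoint}`);
  if (!res.ok) {
    throw new Error(`GET /api/v1/${endpoint} failed: HTTP ${res.status}`);
  }
  return res.json();
}
```

In a runtime with a global `fetch`, calling `getJson("gpus", fetch)` would request the GPU catalog; check the API docs for the actual response fields before relying on any of them.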

Read the API docs

Tools & calculators

Go deeper into the parts you care about — every page is free, no signup

Inference cost calculator

Pick a model, set throughput, compare $/M tokens across 10 providers.

319 models · 60 GPUs · 19 providers

Training cost calculator

23 training methods, real GPU pricing, full epoch & GPU-hour breakdown.

LoRA · QLoRA · FSDP · ZeRO-3 · DPO

Compare models

Side-by-side: quality, latency, $/M tokens, context window, license.

Pin up to 4 models · radar + tables

GPU catalog

Memory bandwidth, FP8 FLOPS, TDP, MSRP and current $/hr.

60 GPUs · NVIDIA, AMD, Intel, TPU

Provider directory

Pricing history, reliability score, regions, features per provider.

19 providers · 12-month price history

Workload matcher

Describe what you’re building, get a ranked shortlist of model + GPU.

8 workload presets · custom inputs

Methodology & trust

How we measure, where the data comes from, how to build on it

How we measure

Roofline model, kernel-level perf, KV-cache memory math.

10 CUDA kernel models
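
As an illustration of the KV-cache memory math, the cache grows linearly with sequence length. The sketch below uses the standard per-token formula; the layer and head counts are an assumed Llama 3 70B-style configuration (80 layers, 8 KV heads, head dimension 128), and `kvBytesPerToken` is a hypothetical name.

```typescript
// KV-cache bytes per token: one key and one value vector (the factor of 2)
// for every layer and every KV head.
function kvBytesPerToken(
  layers: number,
  kvHeads: number,
  headDim: number,
  bytesPerValue: number
): number {
  return 2 * layers * kvHeads * headDim * bytesPerValue;
}

// Assumed Llama 3 70B-style config with a 16-bit cache (2 bytes per value).
const perToken = kvBytesPerToken(80, 8, 128, 2);
const gibAt128k = (perToken * 131072) / 2 ** 30;

console.log(perToken);  // 327680 bytes, i.e. 320 KiB per token
console.log(gibAt128k); // 40 GiB for a single 128K-token sequence
```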

Data provenance

Every number traced back to its source, with refresh dates.

Ladder of justification

Benchmark docs

Run our 18-point test matrix yourself with vLLM in Docker.

STANDARD_TEST_MATRIX · GH Actions

Developer API

Free public REST API · models, GPUs, providers, pricing.

No key required · /api/v1

Frequently asked questions

Common questions about the benchmark, methodology, and how the data is sourced

What is an AI inference benchmark?

An AI inference benchmark measures how fast a GPU or cloud provider can generate tokens from a large language model (LLM). Key metrics include tokens per second (throughput), time to first token (TTFT), inter-token latency (ITL), and cost per million tokens.
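
These metrics can all be derived from token arrival timestamps. The sketch below shows one common set of definitions; exact definitions vary between benchmarks, and the function and field names are hypothetical.

```typescript
interface StreamMetrics {
  ttftMs: number;       // time to first token
  meanItlMs: number;    // mean inter-token latency
  tokensPerSec: number; // end-to-end throughput for this request
}

// requestStartMs: when the request was sent.
// tokenTimesMs: arrival time of each generated token, in order.
function streamMetrics(
  requestStartMs: number,
  tokenTimesMs: number[]
): StreamMetrics {
  if (tokenTimesMs.length < 2) throw new Error("need at least two tokens");
  const first = tokenTimesMs[0];
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  return {
    ttftMs: first - requestStartMs,
    meanItlMs: (last - first) / (tokenTimesMs.length - 1),
    tokensPerSec: (tokenTimesMs.length / (last - requestStartMs)) * 1000,
  };
}

// Cost per million tokens for hardware billed by the hour.
function costPerMillionTokens(
  dollarsPerHour: number,
  tokensPerSec: number
): number {
  return (dollarsPerHour / (tokensPerSec * 3600)) * 1e6;
}

const m = streamMetrics(0, [200, 225, 250, 275, 300]);
console.log(m.ttftMs, m.meanItlMs, m.tokensPerSec.toFixed(2)); // 200 25 16.67
console.log(costPerMillionTokens(2, 1000).toFixed(2));         // 0.56
```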

How does InferenceBench measure GPU performance?

InferenceBench uses a roofline performance model combined with CUDA kernel-level modeling (FlashAttention, PagedAttention, fused kernels) to predict real-world inference throughput. Results are validated against actual benchmarks from the HuggingFace LLM Perf Leaderboard and provider-reported data.

Which GPU is fastest for LLM inference?

Performance depends on model size. For large models (70B+), the NVIDIA B200 and H200 lead in throughput. For mid-size models (7B–30B), the H100 SXM offers the best price-performance. For budget deployments, the RTX 4090 and L40S are strong contenders.

How often is benchmark data updated?

Pricing data is refreshed every 6 hours via automated API calls to providers. Benchmark results are updated when new GPU hardware or model architectures are released. Community-submitted data is verified before inclusion.

Ready to pick a model?

The full live ranking is one click away — sort, filter, and compare every model by quality, cost, and value.

Open the leaderboard · Open the calculator

Built with care · Open source · MIT-licensed data · No signup

Top picks

BGE Small EN v1.5

Llama 4 Behemoth

BGE Small EN v1.5

Class Leaders

Best for Code

Best for Math

Best for Reasoning

Best for Vision

Best Small (<15B)

Best Large (70B+)

News & research

Qwen3 Coder: The Model That Does Everything Right

Whisper v3-Turbo on H100: 597x Realtime ASR Benchmark

FLUX.2-klein-4B on H100: Image Generation Benchmark

Gemma 4 vs the MoE Field: When a 31B Dense Model Wins and When It Doesn't

MiniMax M2.5: A 229B MoE Model That Defies Easy Judgment
