Live pricing · updated daily

The definitive ranking of 338 AI models — by quality, cost, and value.

Independent. Open-source. Every number traced to its source. Compare 338 models across 19 providers on 60 GPUs — the data desk for everyone deciding which model to ship.

Open the leaderboard · Cost calculator · Use the API →

338 Models

60 GPUs

19 Providers

12 mo Price history

Free Public API

Open methodology

MIT-licensed

No signup

Refreshed daily

Pricing · last 12 months

Output prices fell −41% YoY.

Median $/1M output across the top 20 models, weighted by request volume. Markers on the chart flag major model releases.

Today: $2.14

12 mo ago: $3.62
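
The headline figure can be checked from the two prices above. The sketch below also shows one common way to take a volume-weighted median; the `PricePoint` shape and the function names are assumptions made for illustration, not the site's actual pipeline.

```typescript
// Hypothetical shape: one entry per model in the top 20.
interface PricePoint {
  outputPerM: number; // $ per 1M output tokens
  requests: number;   // request volume, used as the weight
}

// Weighted median: the lowest price at which the cumulative weight
// reaches half of the total weight.
function weightedMedian(points: PricePoint[]): number {
  if (points.length === 0) throw new Error("no price points");
  const sorted = [...points].sort((a, b) => a.outputPerM - b.outputPerM);
  const half = sorted.reduce((sum, p) => sum + p.requests, 0) / 2;
  let cumulative = 0;
  let median = NaN;
  for (const p of sorted) {
    cumulative += p.requests;
    median = p.outputPerM;
    if (cumulative >= half) break;
  }
  return median;
}

// Year-over-year change as a signed percentage.
function yoyChangePct(today: number, yearAgo: number): number {
  return ((today - yearAgo) / yearAgo) * 100;
}

console.log(yoyChangePct(2.14, 3.62).toFixed(0)); // "-41"
```

A weighted median is less sensitive than a mean to a single very expensive model, which may be why it suits a headline price index.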

Analysis

The Leaderboard

Open full view→

Just announced

Nano Banana 2 (Gemini 3.1 Flash Image) · Jun. 2026
Nano Banana Pro (Gemini 3 Pro Image) · Jun. 2026
Kimi-K2.7-Code · Jun. 2026
Claude Fable 5 · Jun. 2026
Qwen3.7 Plus · Jun. 2026
Claude Opus 4.8 · May 2026
Universal_Audio_Tokenizer · May 2026
Qwen3.7 Max · May 2026

28 entries with verified release dates but no full architecture/pricing yet

#	Model	Family	Params	Quality	Input $/M	Output $/M	Speed	Tokens/$	Context	Providers	Reasoning×	Value	Badge	Released
1	Qwen 2.5 7B	Qwen 2.5	7.6B	70	$0.200	$0.200	27 tok/s	135.50 M	128K	4	—	350.0	Most Popular	Sep. 2024
2	Qwen 3 8B	Qwen 3	8.2B	70	$0.200	$0.200	49 tok/s	243.90 M	128K	4	12.7×	350.0		Apr. 2025
3	Qwen 2.5 1.5B	Qwen 2.5	1.5B	—	$0.027~	$0.027~	—	—	32K	—	—	1862.0		Sep. 2024
4	Qwen 2.5 3B	Qwen 2.5	3.1B	58	$0.100	$0.100	49 tok/s*	490.00 M	32K	1	—	580.0		Sep. 2024
5	Llama 3.1 8B	Llama 3.1	8B	—	$0.180	$0.180	35 tok/s	193.24 M	128K	10	—	322.2		Jul. 2024
6	Qwen 3 4B	Qwen 3	4B	57	$0.100	$0.100	12 tok/s	117.30 M	128K	1	3.7×	570.0		Apr. 2025
7	Llama 3.2 3B	Llama 3.2	3.2B	55	$0.060	$0.060	154 tok/s	2562.50 M	128K	3	—	916.7	Pareto Q×C×S	Sep. 2024
8	Qwen 3 32B	Qwen 3	32.8B	74	$0.800	$0.800	76 tok/s	94.70 M	128K	7	11.5×	92.5		Apr. 2025
9	Llama 3.2 1B	Llama 3.2	1.2B	38	$0.030	$0.030	33 tok/s	1112.66 M	128K	5	—	1266.7	Pareto Q×C×S	Sep. 2024
10	Llama 3 8B	Llama 3	8B	63	$0.200	$0.200	—	—	8K	2	—	315.0		Apr. 2024
11	HelpSteer2 Llama 3.1 70B	Llama 3.1	70.6B	82	$0.500	$0.500	—	—	128K	5	—	164.0		Aug. 2024
12	Llama 3.1 70B	Llama 3.1	70.6B	75	$0.880	$0.880	37 tok/s	42.57 M	128K	8	—	85.2		Jul. 2024
13	Llama 3.1 70B Turbo	Llama 3.1	70.6B	—	$0.880	$0.880	—	—	128K	2	—	56.8		Jul. 2024
14	NV EmbedQA Mistral 7B	NV EmbedQA	7.2B	—	$0.012	$0.012	157 tok/s*	13083.33 M	32K	1	—	4166.7		Jun. 2024
15	Gemma 3 1B	Gemma 3	1B	35	$0.018~	$0.018~	—	—	32K	—	—	1955.1		Mar. 2025
16	BioMistral 7B	BioMistral	7.2B	—	$0.129~	$0.129~	—	—	32K	—	—	387.9		Feb. 2024
17	Mistral 7B	Mistral	7.3B	56	$0.200	$0.200	—	—	32K	3	—	280.0		Sep. 2023
18	e5-mistral-7b-instruct	intfloat	0B	—	—	—	—	—	—	—	—	—		Dec. 2023
19	TinyLlama 1.1B Chat	TinyLlama	1.1B	—	$0.021~	$0.021~	—	—	2K	—	—	2412.1		Jan. 2024
20	TinyLlama 1.1B	TinyLlama	1.1B	—	$0.021~	$0.021~	—	—	2K	—	—	2412.1		Jan. 2024
21	Qwen 2.5 14B	Qwen 2.5	14.8B	76	$0.400	$0.400	49 tok/s*	122.50 M	128K	2	—	190.0	Pareto Q×C×S	Sep. 2024
22	Qwen 2.5 72B	Qwen 2.5	72.7B	77	$1.20	$1.20	21 tok/s	17.58 M	128K	6	—	64.2		Sep. 2024
23	Phi 2	Phi	2.7B	—	$0.054~	$0.054~	—	—	2K	—	—	931.0		Dec. 2023
24	Qwen 2.5 32B	Qwen 2.5	32.5B	73	$0.800	$0.800	23 tok/s*	28.75 M	128K	2	—	91.3		Sep. 2024
25	DeepSeek R1 Distill 1.5B	DeepSeek R1	1.5B	42	$0.027~	$0.027~	—	—	128K	—	9.0×	1564.1		Jan. 2025
26	DeepSeek R1 Distill 8B	DeepSeek R1	8B	—	$0.200	$0.200	41 tok/s*	203.58 M	128K	1	—	440.0		Jan. 2025
27	DeepSeek R1 Distill 14B	DeepSeek R1	14.8B	—	$0.300	$0.300	22 tok/s	73.98 M	128K	1	7.8×	293.3		Jan. 2025
28	DeepSeek R1 Distill 32B	DeepSeek R1	32.8B	—	$0.600	$0.600	—	—	128K	3	8.3×	146.7		Jan. 2025
29	DeepSeek R1 Distill 70B	DeepSeek R1	70.6B	—	$0.880	$0.880	31 tok/s	34.90 M	128K	6	2.4×	100.0		Jan. 2025
30	DeepSeek R1	DeepSeek R1	671B	88	$0.550	$2.19	37 tok/s	16.91 M	128K	5	19.0×	40.2	Pareto Q×C×S	Jan. 2025
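
In the rows above that show both a quality score and an exact output price, the Value column matches quality divided by output $/M. This is an observation from the table rather than a documented formula, and `valueScore` is a hypothetical name. Rows with a `~` price or a missing quality score cannot be checked this way.

```typescript
// Observed relationship (an assumption, not a published definition):
// value = quality score / output price in $ per 1M tokens.
function valueScore(quality: number, outputPerM: number): number {
  if (outputPerM <= 0) throw new Error("price must be positive");
  return quality / outputPerM;
}

console.log(valueScore(70, 0.2).toFixed(1));  // Qwen 2.5 7B  -> "350.0"
console.log(valueScore(55, 0.06).toFixed(1)); // Llama 3.2 3B -> "916.7"
console.log(valueScore(88, 2.19).toFixed(1)); // DeepSeek R1  -> "40.2"
```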

Showing 1–30 of 330 models

Data freshness: pricing 1 mo ago · latency 1 mo ago · quality benchmarks 1 yr ago

Tracking 338 AI models across 60 GPUs and 19 providers, updated daily. The top-ranked model for overall quality is BGE Small EN v1.5 with a quality score of —, available from $0.00/million output tokens. Rankings use InferenceBench's composite scoring combining benchmark results (MMLU, HumanEval, GSM8K), inference cost, and throughput efficiency.

Top picks

One winner per axis — best value, best quality, cheapest, fastest

Best value

Market Map

Where every model sits on the price-vs-quality curve, and which few set the floor for everyone else.

Open comparator→

As of 25 JUN 2026

$2.19/M buys quality 88

$150/M for the same tier

68× spread

30 plotted · 8 on frontier · Pareto · $/1M out (log scale)

On the Frontier

#	Model	$/M	Q
1	Gemini 2.0 Flash (Pick)	$0.400	80
2	DeepSeek V3	$0.420	81
3	HelpSteer2 Llama 3.1 70B	$0.500	82
4	Nemotron 70B	$0.880	83
5	Llama 3.2 90B Vision Instruct	$1.20	84
6	DeepSeek R1	$2.19	88
7	Grok-3	$15	91
8	Llama 4 Behemoth	$16	93

Cheapest 85+ quality: $2.19/M

Median price plotted: $4.00/M

Frontier vs median: 10.0× cheaper
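
A price-vs-quality frontier like the one listed above can be computed with a single sorted pass. This is a minimal sketch under stated assumptions: the `ModelPoint` shape is hypothetical, and the input mixes the eight frontier models with three leaderboard rows rather than the full set of 30 plotted models.

```typescript
interface ModelPoint {
  name: string;
  outputPerM: number; // $ per 1M output tokens
  quality: number;
}

// Keep a model only if nothing cheaper (or equally priced) scores at least as high.
function paretoFrontier(models: ModelPoint[]): ModelPoint[] {
  // Cheapest first; at equal price, highest quality first.
  const sorted = [...models].sort(
    (a, b) => a.outputPerM - b.outputPerM || b.quality - a.quality
  );
  const frontier: ModelPoint[] = [];
  let bestQuality = -Infinity;
  for (const m of sorted) {
    if (m.quality > bestQuality) {
      frontier.push(m);
      bestQuality = m.quality;
    }
  }
  return frontier;
}

const plotted: ModelPoint[] = [
  { name: "Gemini 2.0 Flash", outputPerM: 0.4, quality: 80 },
  { name: "DeepSeek V3", outputPerM: 0.42, quality: 81 },
  { name: "HelpSteer2 Llama 3.1 70B", outputPerM: 0.5, quality: 82 },
  { name: "Qwen 3 32B", outputPerM: 0.8, quality: 74 },
  { name: "Nemotron 70B", outputPerM: 0.88, quality: 83 },
  { name: "Llama 3.1 70B", outputPerM: 0.88, quality: 75 },
  { name: "Llama 3.2 90B Vision Instruct", outputPerM: 1.2, quality: 84 },
  { name: "Qwen 2.5 72B", outputPerM: 1.2, quality: 77 },
  { name: "DeepSeek R1", outputPerM: 2.19, quality: 88 },
  { name: "Grok-3", outputPerM: 15, quality: 91 },
  { name: "Llama 4 Behemoth", outputPerM: 16, quality: 93 },
];

const frontierNames = paretoFrontier(plotted).map((m) => m.name);
console.log(frontierNames.length); // 8
```

The three added rows (Qwen 3 32B, Llama 3.1 70B, Qwen 2.5 72B) drop out because Gemini 2.0 Flash is both cheaper and higher-scoring than each of them.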

Market Moves

Biggest price drops and the freshest releases — past 30 days

Biggest provider-price spreads

Switch & save

Qwen 3 Coder 8B · Cheapest on Alibaba · −92% · $1.80 → $0.15
DeepSeek V3 · Cheapest on DeepSeek · −85% · $2.80 → $0.42
Llama 3.2 1B · Cheapest on Together · −85% · $0.20 → $0.03
Llama Guard 3 8B · Cheapest on Openrouter · −85% · $0.20 → $0.03
Code Llama 7B · Cheapest on Together · −83% · $1.20 → $0.20
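
The percentages above are consistent with comparing the most expensive and cheapest provider for each model. The sketch below assumes that reading. The `provider` and `output_per_m` field names follow the page's own `pricing.ts` snippet, while `providerSpread` and the provider name `ExampleHost` are hypothetical.

```typescript
interface ProviderPrice {
  provider: string;
  output_per_m: number; // $ per 1M output tokens
}

interface Spread {
  cheapest: string;
  from: number;      // highest listed price
  to: number;        // lowest listed price
  savingPct: number; // negative value = saving when switching
}

function providerSpread(prices: ProviderPrice[]): Spread {
  if (prices.length === 0) throw new Error("no providers");
  const low = prices.reduce((a, b) => (b.output_per_m < a.output_per_m ? b : a));
  const high = prices.reduce((a, b) => (b.output_per_m > a.output_per_m ? b : a));
  return {
    cheapest: low.provider,
    from: high.output_per_m,
    to: low.output_per_m,
    savingPct: ((low.output_per_m - high.output_per_m) / high.output_per_m) * 100,
  };
}

// Qwen 3 Coder 8B: $1.80 -> $0.15, cheapest on Alibaba.
const qwenCoder = providerSpread([
  { provider: "ExampleHost", output_per_m: 1.8 }, // hypothetical provider name
  { provider: "Alibaba", output_per_m: 0.15 },
]);
console.log(qwenCoder.cheapest, Math.round(qwenCoder.savingPct)); // Alibaba -92
```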

Just announced

By release date

Workloads

Curated model + GPU shortlist for the workload you're building

Chatbot

Realtime UX · low p50

HumanEval-tuned picks

Long context · RAG-ready

Start with: Llama 3.2 1B · $0.030/M

137 models matched

Real-time translation

Low-latency multilingual

Provider spotlight

The top inference providers ranked by reputation

Reputation = pricing competitiveness · uptime · model coverage · feature parity

See all 10 providers →

Sandbox

Chat with any model · compare two at once · vote in the arena

Open arena→

Draft a release note for our new pricing API.

Llama 3.1 70B · via Together

We just shipped a public pricing API — live $/M tokens across 19 providers in one call. No key required. Full schema in the docs…

Streaming · swap providers · no login

Playground

Chat live with any model

Streamed responses, side-by-side providers, real measured throughput. No login.

Open

When should we use RAG vs fine-tuning?

GPT-5

RAG is the right move when answers must cite source documents that change daily — search beats fine-tuning here.

Claude Sonnet 4

Use RAG for fresh, citable knowledge; fine-tune when you need a fixed persona or domain-specific style.

Both streaming · latency · $/M side-by-side

Head-to-head

Compare two models on the same prompt

Watch both stream at once. See latency, $/M tokens and answer quality side-by-side.

Open

Write a SQL query: top 10 customers by revenue last quarter.

Model A

SELECT c.id, SUM(o.amount) AS rev FROM customers c JOIN orders o ON o.customer_id = c.id WHERE…

Model B

WITH last_q AS (SELECT * FROM orders WHERE quarter(date) = …) SELECT customer_id, SUM(amount)…

Blind · identities revealed after vote · Elo updates live

Arena

Blind-vote the head-to-heads

Two anonymous models answer. You pick the winner. Elo ranking updates live.
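
For readers unfamiliar with Elo, a single vote moves both ratings by the same amount in opposite directions. The sketch below uses the standard Elo update. The K-factor of 32 and the function names are assumptions, since the arena's actual parameters are not stated here.

```typescript
const K_FACTOR = 32; // assumed; the arena's real K-factor is not documented on this page

// Probability that A beats B under the standard Elo model.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// scoreA is 1 if A wins the vote, 0 if B wins, 0.5 for a tie.
function updateElo(
  ratingA: number,
  ratingB: number,
  scoreA: 0 | 0.5 | 1
): [number, number] {
  const delta = K_FACTOR * (scoreA - expectedScore(ratingA, ratingB));
  return [ratingA + delta, ratingB - delta];
}

console.log(updateElo(1500, 1500, 1)); // [ 1516, 1484 ]
```

An upset (a low-rated model beating a high-rated one) produces a larger shift than an expected win, which is what lets new models climb quickly.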

Open

News & research

Latest deep-dives from the benchmark team

All posts→

Qwen3 · Apr 14, 2026 · 20 min read

Qwen3 Coder: The Model That Does Everything Right

100% coding accuracy across 8 categories, 9.57 MT-Bench, 93% tool use, 8,407 tok/s. Our deployment evaluation for engineering teams considering self-hosted code AI.

Read the full post

Latest research

All posts

Worked examples

Pre-computed scenarios from the engine — Build vs Buy · Bottleneck · Forecast

See all analyses→

Build vs Buy

Llama 3 70B 1M Context chatbot, 1B tokens / month

Two cost paths for the same workload: rent the API per-token from the cheapest provider, or rent 2× H100 PCIe 24×7 and serve it yourself.

API (cheapest provider): $740/mo

Self-host (2× H100 PCIe): $3.3k/mo

$2.6k extra per month · self-hosting is 352% pricier
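
The comparison reduces to two multiplications. In the sketch below, the $0.74/M API price is implied by $740 for 1B tokens. The $2.32 per GPU-hour rate and the 720-hour month are placeholders chosen to land near the $3.3k shown, not published figures, and the function names are hypothetical.

```typescript
const HOURS_PER_MONTH = 24 * 30; // assumed 720-hour billing month

// Pay-per-token cost for a month of traffic.
function apiMonthlyCost(tokensPerMonth: number, pricePerM: number): number {
  return (tokensPerMonth / 1e6) * pricePerM;
}

// Rented GPUs running 24x7, regardless of utilization.
function selfHostMonthlyCost(gpuCount: number, gpuHourlyRate: number): number {
  return gpuCount * gpuHourlyRate * HOURS_PER_MONTH;
}

// How much more self-hosting costs, as a percentage of the API bill.
function selfHostPremiumPct(apiCost: number, selfHostCost: number): number {
  return ((selfHostCost - apiCost) / apiCost) * 100;
}

const apiCost = apiMonthlyCost(1e9, 0.74);         // 1B tokens at $0.74/M
const selfHostCost = selfHostMonthlyCost(2, 2.32); // 2 GPUs, placeholder hourly rate

console.log(apiCost.toFixed(0));                   // "740"
console.log(selfHostCost.toFixed(0));              // "3341"
console.log(selfHostPremiumPct(apiCost, selfHostCost).toFixed(0));
```

The self-host line is a fixed cost, so the gap narrows as monthly token volume grows and the GPUs stay busy.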

Open the calculator

Bottleneck X-Ray

Llama 3 70B 1M Context FP8 on H100 PCIe, batch 1

Decode is memory-bandwidth dominated at batch 1: every generated token re-reads the full set of model weights from GPU memory while compute sits mostly idle. Splitting across 2 GPUs (TP=2) doubles the BW ceiling.

99% BW-bound

BW ceiling: 28 tok/s

Compute ceiling: 11k tok/s

+28 tok/s at TP=2 (2× BW)
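
These ceilings can be approximated with a two-line roofline estimate. The sketch below assumes 70.6B parameters at 1 byte each (FP8), roughly 2 FLOPs per parameter per token, and H100 PCIe figures of about 2,000 GB/s memory bandwidth and about 1,513 dense FP8 TFLOPS. It ignores KV-cache reads and inter-GPU communication, so treat it as an illustration rather than the site's model.

```typescript
interface GpuSpec {
  memBandwidthGBs: number; // GB/s
  fp8Tflops: number;       // dense FP8 TFLOPS
}

interface ModelSpec {
  params: number;        // parameter count
  bytesPerParam: number; // 1 for FP8
}

interface Ceilings {
  bwTokS: number;       // bandwidth-limited tokens/s
  computeTokS: number;  // compute-limited tokens/s
  bwBoundShare: number; // fraction of per-token time spent on weight reads
}

function decodeCeilings(gpu: GpuSpec, model: ModelSpec, tensorParallel = 1): Ceilings {
  const weightBytes = model.params * model.bytesPerParam; // read once per token at batch 1
  const flopsPerToken = 2 * model.params;                 // one multiply-add per weight
  const bwTokS = (gpu.memBandwidthGBs * 1e9 * tensorParallel) / weightBytes;
  const computeTokS = (gpu.fp8Tflops * 1e12 * tensorParallel) / flopsPerToken;
  const tBw = 1 / bwTokS;
  const tCompute = 1 / computeTokS;
  return { bwTokS, computeTokS, bwBoundShare: tBw / (tBw + tCompute) };
}

const h100Pcie: GpuSpec = { memBandwidthGBs: 2000, fp8Tflops: 1513 }; // assumed spec values
const llama70bFp8: ModelSpec = { params: 70.6e9, bytesPerParam: 1 };

const tp1 = decodeCeilings(h100Pcie, llama70bFp8);
const tp2 = decodeCeilings(h100Pcie, llama70bFp8, 2);
console.log(Math.round(tp1.bwTokS));              // 28
console.log(Math.round(tp1.computeTokS / 1000));  // 11 (thousand tok/s)
console.log(Math.round(tp2.bwTokS - tp1.bwTokS)); // 28
```

Under these assumptions the bandwidth share comes out just above 99%, in line with the figure shown.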

See the X-Ray method

Where prices live

Output-token prices · top 20 by quality

The spread between budget and premium $/M, today. Sparkline shows the sorted distribution on log scale.

63× spread · p10 → p90

Budget · p10 · Llama 3.2 90B Vision Instruct · $1.20
Median · Nemotron Ultra 253B · $6.00
Premium · p90 · Claude Opus 4 · $75.00

20 models with verified pricing & quality

Browse provider pricing

Build on it

The data that powers the leaderboard, exposed as a free public API

pricing.ts

curl-friendly · zero auth

// Get the cheapest provider for Llama 3.1 70B
const res = await fetch(
  'https://inferencebench.io/api/v1/models/meta-llama/llama-3.1-70b/pricing'
);
const { providers } = await res.json();

providers
  .sort((a, b) => a.output_per_m - b.output_per_m)
  .slice(0, 3)
  .forEach((p) =>
    console.log(`${p.provider} · $${p.output_per_m}/M`)
  );

// DeepInfra   · $0.40/M
// Groq        · $0.79/M
// Openrouter  · $0.79/M

Public REST API

Build whatever you want on the data

The same data the leaderboard renders from, exposed as plain JSON. No API key. Generous rate limits.

GET /api/v1/models · 302
GET /api/v1/gpus · 60
GET /api/v1/providers · 19
GET /api/v1/pricing · live snapshots
GET /api/v1/leaderboard · 9 categories

Read the API docs

Tools & calculators

Go deeper into the parts you care about — every page is free, no signup

Inference cost calculator

Pick a model, set throughput, compare $/M tokens across 10 providers.

338 models · 60 GPUs · 19 providers

Training cost calculator

23 training methods, real GPU pricing, full epoch & GPU-hour breakdown.

LoRA · QLoRA · FSDP · ZeRO-3 · DPO

Compare models

Side-by-side: quality, latency, $/M tokens, context window, license.

Pin up to 4 models · radar + tables

GPU catalog

Memory bandwidth, FP8 FLOPS, TDP, MSRP and current $/hr.

60 GPUs · NVIDIA, AMD, Intel, TPU

Provider directory

Pricing history, reliability score, regions, features per provider.

19 providers · 12-month price history

Workload matcher

Describe what you’re building, get a ranked shortlist of model + GPU.

8 workload presets · custom inputs

Methodology & trust

How we measure, where the data comes from, how to build on it

How we measure

Roofline model, kernel-level perf, KV-cache memory math.

10 CUDA kernel models

Data provenance

Every number traced back to its source, with refresh dates.

Ladder of justification

Benchmark docs

Run our 18-point test matrix yourself with vLLM in Docker.

STANDARD_TEST_MATRIX · GH Actions

Developer API

Free public REST API · models, GPUs, providers, pricing.

No key required · /api/v1

Frequently asked questions

Common questions about the benchmark, methodology, and how the data is sourced

What is an AI inference benchmark?

An AI inference benchmark measures how fast a GPU or cloud provider can generate tokens from a large language model (LLM). Key metrics include tokens per second (throughput), time to first token (TTFT), inter-token latency (ITL), and cost per million tokens.
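
As a rough illustration of how the latency metrics relate, the sketch below derives TTFT, mean ITL, and end-to-end throughput from per-token arrival times. The `Trace` shape and function name are assumptions for this example, not part of any InferenceBench tooling.

```typescript
// Hypothetical trace of one streamed response; all times in milliseconds.
interface Trace {
  requestStartMs: number;
  tokenTimesMs: number[]; // arrival time of each generated token
}

interface LatencyMetrics {
  ttftMs: number;        // time to first token
  itlMs: number;         // mean inter-token latency
  tokensPerSec: number;  // end-to-end throughput, including TTFT
}

function latencyMetrics(trace: Trace): LatencyMetrics {
  const n = trace.tokenTimesMs.length;
  if (n === 0) throw new Error("no tokens in trace");
  const first = Math.min(...trace.tokenTimesMs);
  const last = Math.max(...trace.tokenTimesMs);
  return {
    ttftMs: first - trace.requestStartMs,
    itlMs: n > 1 ? (last - first) / (n - 1) : 0,
    tokensPerSec: n / ((last - trace.requestStartMs) / 1000),
  };
}

const sample = latencyMetrics({
  requestStartMs: 0,
  tokenTimesMs: [200, 250, 300, 350, 400],
});
console.log(sample); // { ttftMs: 200, itlMs: 50, tokensPerSec: 12.5 }
```

TTFT is dominated by prompt processing, while ITL reflects decode speed, so the two can move independently across providers.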

How does InferenceBench measure GPU performance?

InferenceBench uses a roofline performance model combined with CUDA kernel-level modeling (FlashAttention, PagedAttention, fused kernels) to predict real-world inference throughput. Results are validated against actual benchmarks from the HuggingFace LLM Perf Leaderboard and provider-reported data.

Which GPU is fastest for LLM inference?

Performance depends on model size. For large models (70B+), the NVIDIA B200 and H200 lead in throughput. For mid-size models (7B–30B), the H100 SXM offers the best price-performance. For budget deployments, the RTX 4090 and L40S are strong contenders.

How often is benchmark data updated?

Pricing data is refreshed every 6 hours via automated API calls to providers. Benchmark results are updated when new GPU hardware or model architectures are released. Community-submitted data is verified before inclusion.

Ready to pick a model?

The full live ranking is one click away — sort, filter, and compare every model by quality, cost, and value.

Open the leaderboard · Open the calculator

Built with care · Open source · MIT-licensed data · No signup

Top picks

BGE Small EN v1.5

Llama 4 Behemoth

BGE Small EN v1.5

Class Leaders

Best for Code

Best for Math

Best for Reasoning

Best for Vision

Best Small (<15B)

Best Large (70B+)

News & research

Qwen3 Coder: The Model That Does Everything Right

Nemotron Super 120B vs Ultra 253B: NVIDIA's Best Open-Weight Models Benchmarked

Whisper v3-Turbo on H100: 597x Realtime ASR Benchmark

The GPU Memory Wall: Forecasting AI Demand to 2028

NVIDIA Rubin and Vera: The Next GPU Revolution for AI Infrastructure

MiniMax M2.7: The Bigger MoE Paradox
