🏆 AI Model Performance Leaderboard

Compare 319 AI models by quality, cost & value

319 Models · 60 GPUs · 19 Providers


Just announced: Nano Banana 2 (Gemini 3.1 Flash Image) · Jun. 2026; Nano Banana Pro (Gemini 3 Pro Image) · Jun. 2026; Claude Fable 5 · Jun. 2026; Qwen3.7 Plus · Jun. 2026; Claude Opus 4.8 · May 2026; Qwen3.7 Max · May 2026; Gemini 3.5 Flash · May 2026; Gemini 3.1 Flash Lite · May 2026

22 entries with verified release dates but no full architecture/pricing yet

#	Model	Family	Params	Quality	Input $/M	Output $/M	Speed	Tokens/$	Context	Providers	Reasoning×	Value	Badge	Released
🥇	Qwen 2.5 7B	Qwen 2.5	7.6B	70	$0.200	$0.200	27 tok/s	135.50 M	128K	4	—	350.0	Most Popular	Sep. 2024
🥈	Qwen 3 8B	Qwen 3	8.2B	70	$0.200	$0.200	49 tok/s	243.90 M	128K	4	12.7×	350.0		Apr. 2025
🥉	Qwen 2.5 1.5B	Qwen 2.5	1.5B	—	$0.027~	$0.027~	—	—	32K	—	—	1862.0		Sep. 2024
4	Qwen 2.5 3B	Qwen 2.5	3.1B	58	$0.100	$0.100	49 tok/s*	490.00 M	32K	1	—	580.0		Sep. 2024
5	Llama 3.1 8B	Llama 3.1	8B	—	$0.180	$0.180	35 tok/s	193.24 M	128K	10	—	322.2		Jul. 2024
6	Qwen 3 4B	Qwen 3	4B	57	$0.100	$0.100	12 tok/s	117.30 M	128K	1	3.7×	570.0		Apr. 2025
7	Llama 3.2 3B	Llama 3.2	3.2B	55	$0.060	$0.060	154 tok/s	2562.50 M	128K	3	—	916.7	Pareto Q×C×S	Sep. 2024
8	Qwen 3 32B	Qwen 3	32.8B	74	$0.800	$0.800	76 tok/s	94.70 M	128K	7	11.5×	92.5		Apr. 2025
9	Llama 3.2 1B	Llama 3.2	1.2B	38	$0.030	$0.030	33 tok/s	1112.66 M	128K	5	—	1266.7	Pareto Q×C×S	Sep. 2024
10	Llama 3 8B	Llama 3	8B	63	$0.200	$0.200	—	—	8K	2	—	315.0		Apr. 2024
11	HelpSteer2 Llama 3.1 70B	Llama 3.1	70.6B	82	$0.500	$0.500	—	—	128K	5	—	164.0		Aug. 2024
12	Llama 3.1 70B	Llama 3.1	70.6B	75	$0.880	$0.880	37 tok/s	42.57 M	128K	8	—	85.2		Jul. 2024
13	Llama 3.1 70B Turbo	Llama 3.1	70.6B	—	$0.880	$0.880	—	—	128K	2	—	56.8		Jul. 2024
14	NV EmbedQA Mistral 7B	NV EmbedQA	7.2B	—	$0.012	$0.012	157 tok/s*	13083.33 M	32K	1	—	4166.7		Jun. 2024
15	E5 Mistral 7B	E5	7.1B	—	$0.016	$0.016	160 tok/s*	10000.00 M	32K	1	—	3125.0		Dec. 2023
16	Gemma 3 1B	Gemma 3	1B	35	$0.018~	$0.018~	—	—	32K	—	—	1955.1		Mar. 2025
17	BioMistral 7B	BioMistral	7.2B	—	$0.129~	$0.129~	—	—	32K	—	—	387.9		Feb. 2024
18	Mistral 7B	Mistral	7.3B	56	$0.200	$0.200	—	—	32K	3	—	280.0		Sep. 2023
19	TinyLlama 1.1B Chat	TinyLlama	1.1B	—	$0.021~	$0.021~	—	—	2K	—	—	2412.1		Jan. 2024
20	TinyLlama 1.1B	TinyLlama	1.1B	—	$0.021~	$0.021~	—	—	2K	—	—	2412.1		Jan. 2024
21	Qwen 2.5 14B	Qwen 2.5	14.8B	76	$0.400	$0.400	49 tok/s*	122.50 M	128K	2	—	190.0	Pareto Q×C×S	Sep. 2024
22	Qwen 2.5 72B	Qwen 2.5	72.7B	77	$1.20	$1.20	21 tok/s	17.58 M	128K	6	—	64.2		Sep. 2024
23	Phi 2	Phi	2.7B	—	$0.054~	$0.054~	—	—	2K	—	—	931.0		Dec. 2023
24	Qwen 2.5 32B	Qwen 2.5	32.5B	73	$0.800	$0.800	23 tok/s*	28.75 M	128K	2	—	91.3		Sep. 2024
25	DeepSeek R1 Distill 1.5B	DeepSeek R1	1.5B	42	$0.027~	$0.027~	—	—	128K	—	9.0×	1564.1		Jan. 2025
26	DeepSeek R1 Distill 8B	DeepSeek R1	8B	—	$0.200	$0.200	41 tok/s*	203.58 M	128K	1	—	440.0		Jan. 2025
27	DeepSeek R1 Distill 14B	DeepSeek R1	14.8B	—	$0.300	$0.300	22 tok/s	73.98 M	128K	1	7.8×	293.3		Jan. 2025
28	DeepSeek R1 Distill 32B	DeepSeek R1	32.8B	—	$0.600	$0.600	—	—	128K	3	8.3×	146.7		Jan. 2025
29	DeepSeek R1 Distill 70B	DeepSeek R1	70.6B	—	$0.880	$0.880	31 tok/s	34.90 M	128K	6	2.4×	100.0		Jan. 2025
30	DeepSeek R1	DeepSeek R1	671B	88	$0.550	$2.19	37 tok/s	16.91 M	128K	5	19.0×	40.2	Pareto Q×C×S	Jan. 2025
31	DeepSeek V3-0324	DeepSeek V3	685B	—	$0.280	$0.420	13 tok/s	30.23 M	128K	8	2.8×	192.9		Mar. 2025
32	DeepSeek V3	DeepSeek V3	671B	81	$0.280	$0.420	20 tok/s	46.53 M	128K	5	3.2×	192.9	Pareto Q×C×S	Dec. 2024
33	Llama 3 70B	Llama 3	70.6B	80	$0.880	$0.880	19 tok/s	21.66 M	8K	4	—	90.9		Apr. 2024
34	Llama 3 70B 1M Context	Llama 3	70.6B	—	$1.50	$1.50	—	—	1024K	2	—	33.3		Jun. 2024
35	Mixtral 8x7B Instruct	Mixtral	46.7B	69	$0.240	$0.240	47 tok/s	196.04 M	32K	2	—	287.5		Jan. 2024
36	Mixtral 8x7B	Mixtral	46.7B	67	$0.600	$0.600	47 tok/s	78.42 M	32K	2	—	111.7		Dec. 2023
37	Gemma 2 9B	Gemma 2	9.2B	68	$0.200	$0.200	9 tok/s*	45.00 M	8K	3	—	340.0		Jun. 2024
38	Phi-4	Phi	14.7B	73	$0.070	$0.140	35 tok/s	250.00 M	16K	3	—	521.4	Pareto Q×C×S	Dec. 2024
39	Phi 4 Mini	Phi	3.8B	70	$0.080	$0.350	—	—	128K	1	—	200.0		Feb. 2025
40	Llama 2 7B	Llama 2	7B	40	$0.125~	$0.125~	—	—	4K	—	—	319.2		Jul. 2023
41	Llama 2 13B	Llama 2	13B	47	$0.233~	$0.233~	—	—	4K	—	—	202.0		Jul. 2023
42	Llama 2 70B	Llama 2	70B	62	$0.900	$0.900	6 tok/s	6.54 M	4K	1	—	68.9		Jul. 2023
43	Llama Guard 3 1B	Llama Guard	1B	—	$0.019~	$0.019~	—	—	128K	—	—	2653.3		Dec. 2024
44	Llama 3.3 8B	Llama 3.3	8B	—	$0.180	$0.180	83 tok/s*	461.11 M	128K	1	—	277.8		Jan. 2025
45	Llama Guard 3 8B	Llama Guard	8B	—	$0.200	$0.200	—	—	128K	2	—	250.0		Jul. 2024
46	Llama 4 Scout	Llama 4	109B	73	$0.180	$0.300	75 tok/s	250.54 M	10240K	5	—	243.3	Best Context	Apr. 2025
47	Code Llama 13B	Code Llama	13B	44	$0.220	$0.220	88 tok/s*	400.00 M	16K	1	—	200.0		Aug. 2023
48	Code Llama 7B	Code Llama	7B	39	$0.200	$0.200	163 tok/s*	815.00 M	16K	2	—	195.0		Aug. 2023
49	Llama 3.1 Nemotron 51B	Llama 3.1	51B	78	$0.400	$0.400	15 tok/s*	37.50 M	128K	1	—	195.0	Pareto Q×C×S	Oct. 2024
50	Llama 3.1 Nemotron 70B Reward	Llama 3.1	70.6B	80	$0.500	$0.500	15 tok/s*	30.00 M	128K	1	—	160.0		Oct. 2024

Showing 1–50 of 319 models
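
The page does not document how the Value and Tokens/$ columns are computed. In the rows that report a quality score, though, Value matches quality divided by the output price, and Tokens/$ matches speed divided by the output price. The sketch below checks that inferred relationship on three rows copied from the table. The function name `derived_columns` and both formulas are assumptions drawn from the listed numbers, not a published InferenceBench definition.

```python
# Rows copied from the table: (model, quality, output $/M, speed tok/s,
# listed Value, listed Tokens/$ in the table's "M" units).
ROWS = [
    ("Qwen 2.5 3B", 58, 0.100, 49, 580.0, 490.00),
    ("Phi-4", 73, 0.140, 35, 521.4, 250.00),
    ("Llama 3.1 Nemotron 51B", 78, 0.400, 15, 195.0, 37.50),
]


def derived_columns(quality: float, output_price: float, speed: float):
    """Assumed formulas, inferred from the table rather than documented."""
    value = round(quality / output_price, 1)          # quality points per $/M
    tokens_per_dollar = round(speed / output_price, 2)  # tok/s per $/M
    return value, tokens_per_dollar


def check_rows(rows=ROWS):
    """Return (model, matches) pairs comparing derived and listed columns."""
    results = []
    for model, quality, price, speed, listed_value, listed_tpd in rows:
        value, tpd = derived_columns(quality, price, speed)
        results.append((model, value == listed_value and tpd == listed_tpd))
    return results


if __name__ == "__main__":
    for model, ok in check_rows():
        print(f"{model}: {'matches' if ok else 'differs'}")
```

Other rows agree only approximately on Tokens/$, presumably because the displayed speed is rounded to a whole number. Rows without a quality score still show a Value, so some unlisted quality estimate appears to be in use there.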

Data freshness: pricing 1 mo ago · latency 1 mo ago · quality benchmarks 1 yr ago

Tracking 319 AI models across 60 GPUs and 19 providers, updated daily. The top-ranked model for overall quality is listed as BGE Small EN v1.5, with no quality score reported, available from $0.00 per million output tokens. Rankings use InferenceBench's composite scoring, which combines benchmark results (MMLU, HumanEval, GSM8K), inference cost, and throughput efficiency.

Frequently Asked Questions

What is an AI inference benchmark?

An AI inference benchmark measures how fast a GPU or cloud provider can generate tokens from a large language model (LLM). Key metrics include tokens per second (throughput), time to first token (TTFT), inter-token latency (ITL), and cost per million tokens.
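
As a minimal sketch of how these four metrics relate, the code below derives them from one request's token arrival timestamps. The `RequestTrace` type and the function names are hypothetical and chosen only for illustration. Real benchmark harnesses differ in details such as whether throughput includes the time to first token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one request: send time and token arrival times (seconds)."""
    start_s: float
    token_times_s: List[float]


def time_to_first_token(trace: RequestTrace) -> float:
    """TTFT: delay between sending the request and receiving the first token."""
    return trace.token_times_s[0] - trace.start_s


def mean_inter_token_latency(trace: RequestTrace) -> float:
    """ITL: average gap between consecutive tokens after the first."""
    times = trace.token_times_s
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


def tokens_per_second(trace: RequestTrace) -> float:
    """Throughput over the whole request, TTFT included (one common convention)."""
    elapsed = trace.token_times_s[-1] - trace.start_s
    return len(trace.token_times_s) / elapsed


def cost_usd(num_tokens: int, price_per_million: float) -> float:
    """Cost of num_tokens at a price quoted in dollars per million tokens."""
    return num_tokens * price_per_million / 1_000_000


if __name__ == "__main__":
    trace = RequestTrace(start_s=0.0, token_times_s=[0.5, 0.6, 0.7, 0.8, 0.9])
    print(f"TTFT: {time_to_first_token(trace):.3f} s")
    print(f"ITL:  {mean_inter_token_latency(trace):.3f} s")
    print(f"Throughput: {tokens_per_second(trace):.2f} tok/s")
    print(f"Cost of 1M tokens at $0.20/M: ${cost_usd(1_000_000, 0.20):.2f}")
```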

How does InferenceBench measure GPU performance?

InferenceBench uses a roofline performance model combined with CUDA kernel-level modeling (FlashAttention, PagedAttention, fused kernels) to predict real-world inference throughput. Results are validated against actual benchmarks from the HuggingFace LLM Perf Leaderboard and provider-reported data.
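
InferenceBench's actual model is not shown on this page. For orientation, here is a textbook roofline estimate for single-model decoding. It assumes each decode step reads every weight once and performs about 2 FLOPs per parameter per sequence, and it ignores KV-cache traffic and kernel overheads, so it does not cover the kernel-level modeling described above. The function names are hypothetical, and the hardware figures in the usage example are illustrative round numbers, not measurements of any specific GPU.

```python
def attainable_flops(peak_flops: float, mem_bw_bytes_s: float,
                     intensity_flops_per_byte: float) -> float:
    """Classic roofline: performance is capped by compute or by memory traffic."""
    return min(peak_flops, mem_bw_bytes_s * intensity_flops_per_byte)


def decode_tokens_per_s(params: float, bytes_per_param: float,
                        mem_bw_bytes_s: float, peak_flops: float,
                        batch: int = 1) -> float:
    """Upper-bound decode throughput under the simplifying assumptions above."""
    bytes_moved = params * bytes_per_param   # weights read once per decode step
    flops = 2.0 * params * batch             # ~2 FLOPs per parameter per sequence
    intensity = flops / bytes_moved          # FLOPs per byte of weight traffic
    rate = attainable_flops(peak_flops, mem_bw_bytes_s, intensity)
    steps_per_s = rate / flops               # decode steps completed per second
    return steps_per_s * batch               # one token per sequence per step


if __name__ == "__main__":
    # Illustrative inputs: 8e9 parameters in 16-bit weights, 1 TB/s memory
    # bandwidth, 1 PFLOP/s peak compute.
    for batch in (1, 64, 2000):
        tps = decode_tokens_per_s(8e9, 2, 1e12, 1e15, batch=batch)
        print(f"batch={batch:5d}: {tps:,.1f} tok/s (upper bound)")
```

At batch size 1 the estimate is memory-bound, so throughput scales with bandwidth divided by model size in bytes. At large batch sizes it flattens once the compute ceiling is reached.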

Which GPU is fastest for LLM inference?

Performance depends on model size. For large models (70B+), the NVIDIA B200 and H200 lead in throughput. For mid-size models (7B-30B), the H100 SXM offers the best price-performance. For budget deployments, the RTX 4090 and L40S are strong contenders.

How often is benchmark data updated?

Pricing data is refreshed every 6 hours via automated API calls to providers. Benchmark results are updated when new GPU hardware or model architectures are released. Community-submitted data is verified before inclusion.


Built with care · Open Source · Inference Bench