Llama 3 70B 1M Context chatbot, 1B tokens / month
Two cost paths for the same workload: rent the API per-token from the cheapest provider, or rent 2× H100 PCIe 24×7 and serve it yourself.
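The comparison above comes down to simple arithmetic, sketched here under stated assumptions. The $0.40/M token price and the $2.50 per GPU-hour rate are illustrative placeholders, not quotes, and the function names are hypothetical rather than part of any InferenceBench API.

```javascript
// Build-vs-buy sketch. All prices below are illustrative assumptions.

// Cost of buying tokens from an API at a flat per-million price.
function apiMonthlyCost(tokensPerMonth, pricePerMillion) {
  return (tokensPerMonth / 1e6) * pricePerMillion;
}

// Cost of renting GPUs around the clock (24 hours x 30 days by default).
function selfHostMonthlyCost(gpuCount, hourlyRatePerGpu, hoursPerMonth = 24 * 30) {
  return gpuCount * hourlyRatePerGpu * hoursPerMonth;
}

// Monthly token volume at which both paths cost the same.
function breakEvenTokens(selfHostCost, pricePerMillion) {
  return (selfHostCost / pricePerMillion) * 1e6;
}

const apiCost = apiMonthlyCost(1e9, 0.4);        // 1B tokens at an assumed $0.40/M
const rentedCost = selfHostMonthlyCost(2, 2.5);  // 2 GPUs at an assumed $2.50/hr each
const breakEven = breakEvenTokens(rentedCost, 0.4);

console.log(`API: $${apiCost.toFixed(2)} / month`);
console.log(`Self-host: $${rentedCost.toFixed(2)} / month`);
console.log(`Break-even: ${(breakEven / 1e9).toFixed(1)}B tokens / month`);
```

With these placeholder prices the API path costs about $400 per month against about $3,600 for the rented GPUs. The sketch ignores whether two GPUs can actually sustain the required throughput or context length, so treat the break-even volume as a first-pass estimate only.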
Independent. Open-source. Every number traced to its source. Compare 319 models across 19 providers on 60 GPUs — the data desk for everyone deciding which model to ship.
Median price per 1M output tokens across the top 20 models, weighted by request volume. Markers flag major model releases.
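A volume-weighted median can be computed as sketched below. This is one common definition (the lowest price at which cumulative weight reaches half the total), assumed here for illustration; the site does not state which variant it uses, and the sample data is hypothetical.

```javascript
// Weighted median: the price at which cumulative weight first reaches half the total.
function weightedMedian(items) {
  const sorted = [...items].sort((a, b) => a.price - b.price); // copy, then sort by price
  const total = sorted.reduce((sum, item) => sum + item.weight, 0);
  let cumulative = 0;
  for (const item of sorted) {
    cumulative += item.weight;
    if (cumulative >= total / 2) return item.price;
  }
  return NaN; // empty input
}

// Hypothetical sample: price in $/1M output tokens, weight as a request count.
const sample = [
  { price: 15, weight: 50 },
  { price: 0.4, weight: 500 },
  { price: 2.19, weight: 300 },
];
console.log(weightedMedian(sample));
```

Because the cheapest model in the sample carries more than half of the request volume, the weighted median lands on its price even though the unweighted median would be the middle entry.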
Llama 3.2 · 1.2B · 128K ctx
Llama 3.1 · 70.6B · 128K ctx
Qwen 2.5 · 14.8B · 128K ctx
DeepSeek R1 · 1.5B · 128K ctx
DeepSeek R1 · 14.8B · 128K ctx
DeepSeek R1 · 32.8B · 128K ctx
DeepSeek R1 · 70.6B · 128K ctx
DeepSeek R1 · 671B MoE (37B active) · 128K ctx
Tracking 319 AI models across 60 GPUs and 19 providers, updated daily. The top-ranked model for overall quality is BGE Small EN v1.5 with a quality score of —, available from $0.00/million output tokens. Rankings use InferenceBench's composite scoring combining benchmark results (MMLU, HumanEval, GSM8K), inference cost, and throughput efficiency.
One winner per axis — best value, best quality, cheapest, fastest
Where every model sits on the price-vs-quality curve, and which few set the floor for everyone else.
| # | Model | $/M | Q |
|---|---|---|---|
| 1 | Gemini 2.0 Flash (Pick) | $0.400 | 80 |
| 2 | DeepSeek V3 | $0.420 | 81 |
| 3 | HelpSteer2 Llama 3.1 70B | $0.500 | 82 |
| 4 | Nemotron 70B | $0.880 | 83 |
| 5 | Llama 3.2 90B Vision Instruct | $1.20 | 84 |
| 6 | DeepSeek R1 | $2.19 | 88 |
| 7 | Grok-3 | $15 | 91 |
| 8 | Llama 4 Behemoth | $16 | 93 |
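
The models that "set the floor" on a price-vs-quality curve are the ones no other model beats on both axes at once. A minimal sketch of that check, using the rows from the table above, might look like this. The dominance rule and the extra `Hypothetical X` row are assumptions for illustration, not the site's documented method.

```javascript
// Rows copied from the table above: [model, $/M, Q].
const models = [
  ["Gemini 2.0 Flash", 0.4, 80],
  ["DeepSeek V3", 0.42, 81],
  ["HelpSteer2 Llama 3.1 70B", 0.5, 82],
  ["Nemotron 70B", 0.88, 83],
  ["Llama 3.2 90B Vision Instruct", 1.2, 84],
  ["DeepSeek R1", 2.19, 88],
  ["Grok-3", 15, 91],
  ["Llama 4 Behemoth", 16, 93],
];

// A row is dominated when some other row is no more expensive and no worse in
// quality, and strictly better on at least one of the two.
function paretoFrontier(rows) {
  return rows.filter(([, price, quality]) =>
    !rows.some(([, p2, q2]) =>
      p2 <= price && q2 >= quality && (p2 < price || q2 > quality)
    )
  );
}

// Hypothetical extra entry: pricier than Gemini 2.0 Flash yet lower quality.
const withExtra = [...models, ["Hypothetical X", 1.0, 79]];
const frontier = paretoFrontier(withExtra);
console.log(frontier.map(([name]) => name));
```

Every row in the table survives the filter because quality rises with price at each step, while the hypothetical entry is dropped.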
Biggest price drops and the freshest releases — past 30 days
Best for Code · Math · Reasoning · Vision · Small (<15B) · Large (70B+)
Curated model + GPU shortlist for the workload you're building
The top inference providers ranked by reputation
Reputation = pricing competitiveness · uptime · model coverage · feature parity
See all 10 providers →
Chat with any model · compare two at once · vote in the arena
We just shipped a public pricing API — live $/M tokens across 19 providers in one call. No key required. Full schema in the docs…
Streamed responses, side-by-side providers, real measured throughput. No login.
RAG is the right move when answers must cite source documents that change daily — search beats fine-tuning here.
Use RAG for fresh, citable knowledge; fine-tune when you need a fixed persona or domain-specific style.
Watch both stream at once. See latency, $/M tokens and answer quality side-by-side.
SELECT c.id, SUM(o.amount) AS rev FROM customers c JOIN orders o ON o.customer_id = c.id WHERE…
WITH last_q AS (SELECT * FROM orders WHERE quarter(date) = …) SELECT customer_id, SUM(amount)…
Two anonymous models answer. You pick the winner. Elo ranking updates live.
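For readers unfamiliar with Elo, the update after a single vote can be sketched as follows. This is the standard Elo formula; the step size `K = 32` and the starting rating of 1500 are assumptions, since the arena's actual parameters are not stated here.

```javascript
// Probability that A beats B, given their current ratings.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
}

// scoreA is 1 if A wins the vote, 0 if B wins, 0.5 for a tie.
// k = 32 is an assumed step size, not a documented value.
function updateElo(ratingA, ratingB, scoreA, k = 32) {
  const delta = k * (scoreA - expectedScore(ratingA, ratingB));
  return [ratingA + delta, ratingB - delta];
}

const [newA, newB] = updateElo(1500, 1500, 1); // equal ratings, A wins
console.log(newA, newB); // 1516 1484
```

The update is zero-sum: whatever one model gains, the other loses, and an upset against a higher-rated model moves both ratings further than an expected win.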
Latest deep-dives from the benchmark team

Nemotron Ultra FP8 scores 9.47 MT-Bench, beating its own BF16 at 9.2. Super hits 6,567 tok/s. Both fail tool use and vision at 0%. Full SWOT analysis.
100% coding accuracy across 8 categories, 9.57 MT-Bench, 93% tool use, 8,407 tok/s. Our deployment evaluation for engineering team…
Whisper Large-v3-Turbo benchmarked on H100: 597x realtime transcription, 404x at batch=32, $0.00007/min self-hosted, but 44% hallu…
FLUX.2-klein-4B benchmarked on H100: 0.19s per image at 512x512, CLIP 0.335, 97% multi-GPU efficiency, and $0.0004/image self-host…
Gemma 4 31B scores 9.73/10 MT-Bench from 31B dense params. We compare it against Mixtral 8x22B and DeepSeek V3 on cost, latency, a…
MiniMax M2.5 229B MoE benchmarked on 8x H100: 8,876 tok/s peak, 100% needle-in-haystack, 87% tool use, but 1.57/10 MT-Bench. The f…
Pre-computed scenarios from the engine — Build vs Buy · Bottleneck · Forecast
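Decode is memory-bandwidth bound at batch 1: generating each token re-reads every model weight from GPU memory, so compute sits mostly idle. Splitting the model across 2 GPUs with tensor parallelism (TP=2) doubles the aggregate bandwidth ceiling, before communication overhead.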
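A back-of-the-envelope version of this bound divides memory bandwidth by the bytes of weights read per token. The sketch below assumes 70.6B parameters stored in 16-bit weights and roughly 2,000 GB/s of memory bandwidth per GPU; both are illustrative inputs, and the function is hypothetical rather than the engine's actual model. It also ignores KV-cache reads and inter-GPU communication.

```javascript
// Upper bound on batch-1 decode speed when each token streams all weights once:
//   tokens/s <= aggregate memory bandwidth / bytes of weights
function decodeTokensPerSecBound({ paramsBillions, bytesPerParam, bandwidthGBps, tensorParallel = 1 }) {
  const weightGB = paramsBillions * bytesPerParam;        // 1e9 params x bytes each = GB
  const aggregateGBps = bandwidthGBps * tensorParallel;   // ideal scaling, no comms cost
  return aggregateGBps / weightGB;
}

// Assumed inputs: 70.6B params, 2 bytes per param, ~2000 GB/s per GPU.
const base = { paramsBillions: 70.6, bytesPerParam: 2, bandwidthGBps: 2000 };
const tp1 = decodeTokensPerSecBound(base);
const tp2 = decodeTokensPerSecBound({ ...base, tensorParallel: 2 });

console.log(`TP=1 bound: ${tp1.toFixed(1)} tok/s`);
console.log(`TP=2 bound: ${tp2.toFixed(1)} tok/s`);
```

The TP=1 figure is a reference point only, since about 141 GB of 16-bit weights would not fit on a single 80 GB card. Real throughput falls below both bounds once KV-cache traffic and all-reduce latency are included.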
The spread between budget and premium $/M, today. Sparkline shows the sorted distribution on log scale.
The data that powers the leaderboard, exposed as a free public API
// Get the cheapest provider for Llama 3.1 70B
const res = await fetch(
  'https://inferencebench.io/api/v1/models/meta-llama/llama-3.1-70b/pricing'
);
const { providers } = await res.json();
providers
  .sort((a, b) => a.output_per_m - b.output_per_m)
  .slice(0, 3)
  .forEach((p) =>
    console.log(`${p.provider} · $${p.output_per_m}/M`)
  );
// DeepInfra · $0.40/M
// Groq · $0.79/M
// Openrouter · $0.79/M
Same data the leaderboard renders from, exposed as plain JSON. No API key. Generous rate limits.
/api/v1/models · /api/v1/gpus · /api/v1/providers · /api/v1/pricing · /api/v1/leaderboard
Go deeper into the parts you care about: every page is free, no signup
Pick a model, set throughput, compare $/M tokens across 10 providers.
23 training methods, real GPU pricing, full epoch & GPU-hour breakdown.
Side-by-side: quality, latency, $/M tokens, context window, license.
Memory bandwidth, FP8 FLOPS, TDP, MSRP and current $/hr.
Pricing history, reliability score, regions, features per provider.
Describe what you’re building, get a ranked shortlist of model + GPU.
How we measure, where the data comes from, how to build on it
Roofline model, kernel-level perf, KV-cache memory math.
Every number traced back to its source, with refresh dates.
Run our 18-point test matrix yourself with vLLM in Docker.
Free public REST API · models, GPUs, providers, pricing.
Common questions about the benchmark, methodology, and how the data is sourced
An AI inference benchmark measures how fast a GPU or cloud provider can generate tokens from a large language model (LLM). Key metrics include tokens per second (throughput), time to first token (TTFT), inter-token latency (ITL), and cost per million tokens.
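As a rough illustration of how these metrics relate, the sketch below derives TTFT, mean ITL, and decode throughput from the arrival times of streamed tokens, and converts a GPU hourly rate into cost per million tokens. The function names and the sample trace are hypothetical, and measuring throughput over the decode phase only is one convention among several.

```javascript
// requestStartMs: when the request was sent.
// tokenTimesMs: arrival time of each streamed token, in order.
function streamMetrics(requestStartMs, tokenTimesMs) {
  const n = tokenTimesMs.length;
  if (n === 0) throw new Error("no tokens received");
  const ttftMs = tokenTimesMs[0] - requestStartMs;         // time to first token
  const decodeMs = tokenTimesMs[n - 1] - tokenTimesMs[0];  // first token to last token
  const meanItlMs = n > 1 ? decodeMs / (n - 1) : 0;        // mean gap between tokens
  const tokensPerSec = n > 1 ? ((n - 1) / decodeMs) * 1000 : 0;
  return { ttftMs, meanItlMs, tokensPerSec };
}

// Cost per million tokens for rented hardware at a sustained throughput.
function costPerMillionTokens(dollarsPerHour, tokensPerSec) {
  return (dollarsPerHour / (tokensPerSec * 3600)) * 1e6;
}

// Hypothetical trace: request at t=0 ms, first token at 200 ms, then one every 20 ms.
const metrics = streamMetrics(0, [200, 220, 240, 260, 280]);
console.log(metrics);
console.log(costPerMillionTokens(3.6, 1000).toFixed(2));
```

In the sample trace the mean ITL is 20 ms, which is the reciprocal of the 50 tok/s decode rate. Cost per million tokens falls in direct proportion as sustained throughput rises.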
InferenceBench uses a roofline performance model combined with CUDA kernel-level modeling (FlashAttention, PagedAttention, fused kernels) to predict real-world inference throughput. Results are validated against actual benchmarks from the HuggingFace LLM Perf Leaderboard and provider-reported data.
Performance depends on model size. For large models (70B+), the NVIDIA B200 and H200 lead in throughput. For mid-size models (7B–30B), the H100 SXM offers the best price-performance. For budget deployments, the RTX 4090 and L40S are strong contenders.
Pricing data is refreshed every 6 hours via automated API calls to providers. Benchmark results are updated when new GPU hardware or model architectures are released. Community-submitted data is verified before inclusion.
The full live ranking is one click away — sort, filter, and compare every model by quality, cost, and value.
Built with care · Open source · MIT-licensed data · No signup