Llama 3 70B 1M Context chatbot, 1B tokens / month
Two cost paths for the same workload: rent the API per-token from the cheapest provider, or rent 2× H100 PCIe 24×7 and serve it yourself.
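The arithmetic behind the two paths can be sketched as follows. This is a minimal illustration, not the engine's actual calculation: the function names are hypothetical, and both prices are placeholder assumptions rather than quoted rates.

```javascript
// Monthly cost of the two paths. Both prices below are placeholders, not quotes.
function apiMonthlyCost(tokensPerMonth, pricePerMillion) {
  return (tokensPerMonth / 1e6) * pricePerMillion;
}

// Renting GPUs 24x7 is a flat bill: it does not depend on how many tokens you serve.
function selfHostMonthlyCost(gpuCount, pricePerGpuHour, hoursPerMonth = 24 * 30) {
  return gpuCount * pricePerGpuHour * hoursPerMonth;
}

const tokensPerMonth = 1e9;       // 1B tokens / month, from the scenario
const apiPricePerMillion = 0.40;  // assumed blended $/M tokens
const gpuHourlyRate = 2.00;       // assumed $/hr for one H100 PCIe

const api = apiMonthlyCost(tokensPerMonth, apiPricePerMillion);
const selfHost = selfHostMonthlyCost(2, gpuHourlyRate);

// Break-even volume: tokens per month at which the API bill equals the GPU bill.
const breakEvenTokens = (selfHost / apiPricePerMillion) * 1e6;

console.log(`API: $${api.toFixed(2)} / month`);
console.log(`Self-host: $${selfHost.toFixed(2)} / month`);
console.log(`Break-even: ${(breakEvenTokens / 1e9).toFixed(1)}B tokens / month`);
```

The sketch compares bills only. Spread evenly over a 30-day month, 1B tokens is roughly 386 tokens per second, so whether 2 GPUs can sustain that rate is a separate question.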
Independent. Open-source. Every number traced to its source. Compare 338 models across 19 providers on 60 GPUs — the data desk for everyone deciding which model to ship.
Median $ per 1M output tokens across the top 20 models, weighted by request volume. Milestone markers indicate major model releases.
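One common way to compute a volume-weighted median is sketched below. The page does not say which definition it uses; this version returns the lower weighted median, and the field names `pricePerM` and `requests` and the sample figures are hypothetical.

```javascript
// Weighted median: the price at which cumulative weight first reaches half the total.
function weightedMedian(items) {
  const sorted = [...items].sort((a, b) => a.pricePerM - b.pricePerM);
  const total = sorted.reduce((sum, it) => sum + it.requests, 0);
  if (total <= 0) throw new Error("total weight must be positive");
  let cumulative = 0;
  for (const it of sorted) {
    cumulative += it.requests;
    if (cumulative >= total / 2) return it.pricePerM;
  }
}

// Hypothetical sample: three models with made-up request counts.
const sample = [
  { pricePerM: 15.0, requests: 60 },
  { pricePerM: 0.40, requests: 10 },
  { pricePerM: 2.19, requests: 30 },
];

console.log(weightedMedian(sample)); // 15
```

Weighting by requests pulls the median toward the models people actually call, which is why it can sit far from the unweighted middle price.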
Llama 3.2 · 1.2B · 128K ctx
Llama 3.1 · 70.6B · 128K ctx
Qwen 2.5 · 14.8B · 128K ctx
DeepSeek R1 · 1.5B · 128K ctx
DeepSeek R1 · 14.8B · 128K ctx
DeepSeek R1 · 32.8B · 128K ctx
DeepSeek R1 · 70.6B · 128K ctx
DeepSeek R1 · 671B MoE (37B active) · 128K ctx
Tracking 338 AI models across 60 GPUs and 19 providers, updated daily. The top-ranked model for overall quality is BGE Small EN v1.5 (quality score not shown), available from $0.00 per million output tokens. Rankings use InferenceBench's composite scoring, which combines benchmark results (MMLU, HumanEval, GSM8K), inference cost, and throughput efficiency.
One winner per axis — best value, best quality, cheapest, fastest
Where every model sits on the price-vs-quality curve, and which few set the floor for everyone else.
| # | Model | $/M | Q |
|---|---|---|---|
| 1 | Gemini 2.0 Flash (Pick) | $0.400 | 80 |
| 2 | DeepSeek V3 | $0.420 | 81 |
| 3 | HelpSteer2 Llama 3.1 70B | $0.500 | 82 |
| 4 | Nemotron 70B | $0.880 | 83 |
| 5 | Llama 3.2 90B Vision Instruct | $1.20 | 84 |
| 6 | DeepSeek R1 | $2.19 | 88 |
| 7 | Grok-3 | $15 | 91 |
| 8 | Llama 4 Behemoth | $16 | 93 |
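The models that "set the floor" can be read as the Pareto frontier of the price-vs-quality curve. A minimal sketch, assuming that reading and using the eight rows above plus one hypothetical dominated entry for contrast (the function and field names are illustrative):

```javascript
// A model is on the frontier if no other model is at least as cheap AND at least
// as good, with a strict improvement on one of the two axes.
function paretoFrontier(models) {
  return models.filter((m) =>
    !models.some((o) =>
      o.price <= m.price && o.quality >= m.quality &&
      (o.price < m.price || o.quality > m.quality)
    )
  );
}

const table = [
  { name: "Gemini 2.0 Flash", price: 0.40, quality: 80 },
  { name: "DeepSeek V3", price: 0.42, quality: 81 },
  { name: "HelpSteer2 Llama 3.1 70B", price: 0.50, quality: 82 },
  { name: "Nemotron 70B", price: 0.88, quality: 83 },
  { name: "Llama 3.2 90B Vision Instruct", price: 1.20, quality: 84 },
  { name: "DeepSeek R1", price: 2.19, quality: 88 },
  { name: "Grok-3", price: 15, quality: 91 },
  { name: "Llama 4 Behemoth", price: 16, quality: 93 },
];

// Hypothetical entry: same quality as row 1 at a higher price, so it is dominated.
const withExtra = [...table, { name: "hypothetical-model", price: 1.00, quality: 80 }];

console.log(paretoFrontier(withExtra).map((m) => m.name));
```

In the eight rows listed, quality rises strictly with price, so every row is on the frontier; only the hypothetical entry is filtered out.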
Biggest price drops and the freshest releases — past 30 days
Best for Code · Math · Reasoning · Vision · Small (<15B) · Large (70B+)
Curated model + GPU shortlist for the workload you're building
The top inference providers ranked by reputation
Reputation = pricing competitiveness · uptime · model coverage · feature parity
See all 10 providers →
Chat with any model · compare two at once · vote in the arena
We just shipped a public pricing API — live $/M tokens across 19 providers in one call. No key required. Full schema in the docs…
Streamed responses, side-by-side providers, real measured throughput. No login.
RAG is the right move when answers must cite source documents that change daily — search beats fine-tuning here.
Use RAG for fresh, citable knowledge; fine-tune when you need a fixed persona or domain-specific style.
Watch both stream at once. See latency, $/M tokens and answer quality side-by-side.
SELECT c.id, SUM(o.amount) AS rev FROM customers c JOIN orders o ON o.customer_id = c.id WHERE…
WITH last_q AS (SELECT * FROM orders WHERE quarter(date) = …) SELECT customer_id, SUM(amount)…
Two anonymous models answer. You pick the winner. Elo ranking updates live.
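For reference, a standard Elo update after one vote looks like the sketch below. The page does not state its K-factor or whether it uses an Elo variant, so `k = 32` and the function names are assumptions.

```javascript
// Probability that A beats B under the standard Elo model (400-point scale).
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
}

// scoreA is 1 if A won the vote, 0 if B won, 0.5 for a tie.
function updateElo(ratingA, ratingB, scoreA, k = 32) {
  const delta = k * (scoreA - expectedScore(ratingA, ratingB));
  return { ratingA: ratingA + delta, ratingB: ratingB - delta };
}

// Two equally rated models, A wins the vote.
console.log(updateElo(1500, 1500, 1)); // { ratingA: 1516, ratingB: 1484 }
```

The update is zero-sum: whatever A gains, B loses, and an upset win against a higher-rated model moves both ratings further than an expected win.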
Latest deep-dives from the benchmark team

100% coding accuracy across 8 categories, 9.57 MT-Bench, 93% tool use, 8,407 tok/s. Our deployment evaluation for engineering teams considering self-hosted code AI.
Nemotron Ultra FP8 scores 9.47 MT-Bench, beating its own BF16 at 9.2. Super hits 6,567 tok/s. Both fail tool use and vision at 0%.…
Whisper Large-v3-Turbo benchmarked on H100: 597x realtime transcription, 404x at batch=32, $0.00007/min self-hosted, but 44% hallu…
GPU memory is the defining bottleneck of AI infrastructure. We analyze the demand curve from HBM3e through HBM4E, forecast require…
NVIDIA Rubin brings HBM4, NVLink 6, and 2x Blackwell performance. Paired with the Vera ARM CPU, it reshapes AI inference economics…
MiniMax M2.7 456B MoE on 8x H100: 9,854 tok/s peak, 93% tool use, but MT-Bench dropped to 1.30. Bigger is not always better.
Pre-computed scenarios from the engine — Build vs Buy · Bottleneck · Forecast
Decode at batch 1 is memory-bandwidth bound: generating each token streams all model weights from GPU memory, so compute sits mostly idle. Splitting the model across 2 GPUs with tensor parallelism (TP=2) doubles the aggregate bandwidth ceiling.
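A back-of-envelope version of that ceiling: tokens per second cannot exceed aggregate memory bandwidth divided by the bytes of weights read per token. The sketch below assumes FP16 weights (2 bytes per parameter) and roughly 2.0 TB/s per H100 PCIe; both are assumptions, and the function name is illustrative. It ignores KV-cache reads and inter-GPU communication.

```javascript
// Upper bound on batch-1 decode speed when every token must read all weights once.
function decodeCeilingTokPerSec({ params, bytesPerParam, bandwidthBytesPerSec, tensorParallel = 1 }) {
  const weightBytes = params * bytesPerParam;
  // With tensor parallelism each GPU reads only its shard, so bandwidth adds up.
  return (bandwidthBytesPerSec * tensorParallel) / weightBytes;
}

const llama70b = {
  params: 70.6e9,              // Llama 3.1 70B parameter count, as listed on this page
  bytesPerParam: 2,            // assumed FP16/BF16 weights
  bandwidthBytesPerSec: 2.0e12 // assumed per-GPU memory bandwidth
};

const tp1 = decodeCeilingTokPerSec(llama70b);
const tp2 = decodeCeilingTokPerSec({ ...llama70b, tensorParallel: 2 });

console.log(`TP=1 ceiling: ${tp1.toFixed(1)} tok/s`);
console.log(`TP=2 ceiling: ${tp2.toFixed(1)} tok/s`);
```

Under these assumptions the weights total about 141 GB, which is more than a single 80 GB card holds, so TP=2 is needed for capacity as well as for bandwidth.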
The spread between budget and premium $/M, today. Sparkline shows the sorted distribution on log scale.
The data that powers the leaderboard, exposed as a free public API
// Get the cheapest provider for Llama 3.1 70B
const res = await fetch(
  'https://inferencebench.io/api/v1/models/meta-llama/llama-3.1-70b/pricing'
);
const { providers } = await res.json();
providers
  .sort((a, b) => a.output_per_m - b.output_per_m)
  .slice(0, 3)
  .forEach((p) =>
    console.log(`${p.provider} · $${p.output_per_m}/M`)
  );
// DeepInfra · $0.40/M
// Groq · $0.79/M
// Openrouter · $0.79/M
The same data the leaderboard renders from, exposed as plain JSON. No API key. Generous rate limits.
/api/v1/models · /api/v1/gpus · /api/v1/providers · /api/v1/pricing · /api/v1/leaderboard
Go deeper into the parts you care about. Every page is free, no signup.
Pick a model, set throughput, compare $/M tokens across 10 providers.
23 training methods, real GPU pricing, full epoch & GPU-hour breakdown.
Side-by-side: quality, latency, $/M tokens, context window, license.
Memory bandwidth, FP8 FLOPS, TDP, MSRP and current $/hr.
Pricing history, reliability score, regions, features per provider.
Describe what you’re building, get a ranked shortlist of model + GPU.
How we measure, where the data comes from, how to build on it
Roofline model, kernel-level perf, KV-cache memory math.
Every number traced back to its source, with refresh dates.
Run our 18-point test matrix yourself with vLLM in Docker.
Free public REST API · models, GPUs, providers, pricing.
Common questions about the benchmark, methodology, and how the data is sourced
An AI inference benchmark measures how fast a GPU or cloud provider can generate tokens from a large language model (LLM). Key metrics include tokens per second (throughput), time to first token (TTFT), inter-token latency (ITL), and cost per million tokens.
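These metrics fall out of simple timestamp arithmetic. A minimal sketch, assuming you have the request start time and the arrival time of each streamed token in milliseconds; the function names and sample numbers are illustrative, not part of any InferenceBench tooling.

```javascript
// Derive TTFT, mean ITL and throughput for one streamed response.
function latencyMetrics(requestStartMs, tokenTimesMs) {
  if (tokenTimesMs.length === 0) throw new Error("need at least one token");
  const first = tokenTimesMs[0];
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  const gaps = tokenTimesMs.length - 1;
  return {
    ttftMs: first - requestStartMs,                 // time to first token
    itlMs: gaps > 0 ? (last - first) / gaps : 0,    // mean inter-token latency
    tokensPerSec: tokenTimesMs.length / ((last - requestStartMs) / 1000),
  };
}

// Cost per million tokens for hardware billed by the hour at a sustained rate.
function costPerMillionTokens(hourlyRateUsd, tokensPerSec) {
  return (hourlyRateUsd / (tokensPerSec * 3600)) * 1e6;
}

// Five tokens: the first after 200 ms, then one every 50 ms.
console.log(latencyMetrics(0, [200, 250, 300, 350, 400]));
// { ttftMs: 200, itlMs: 50, tokensPerSec: 12.5 }
```

Note that throughput here is end-to-end and includes the wait for the first token; benchmarks that report decode-only throughput exclude TTFT.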
InferenceBench uses a roofline performance model combined with CUDA kernel-level modeling (FlashAttention, PagedAttention, fused kernels) to predict real-world inference throughput. Results are validated against actual benchmarks from the HuggingFace LLM Perf Leaderboard and provider-reported data.
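The core of a roofline model is a single `min`: attainable throughput is capped either by peak compute or by memory bandwidth times arithmetic intensity. The sketch below shows only that textbook relation, not InferenceBench's kernel-level model, and the device numbers are illustrative rather than specs of a particular GPU.

```javascript
// Attainable FLOP/s for a kernel with the given arithmetic intensity (FLOPs per byte moved).
function rooflineFlops(peakFlops, bandwidthBytesPerSec, flopsPerByte) {
  return Math.min(peakFlops, bandwidthBytesPerSec * flopsPerByte);
}

// Intensity at which a kernel stops being memory-bound and becomes compute-bound.
function ridgePoint(peakFlops, bandwidthBytesPerSec) {
  return peakFlops / bandwidthBytesPerSec;
}

// Illustrative device: 1 PFLOP/s peak, 2 TB/s memory bandwidth.
const peak = 1e15;
const bandwidth = 2e12;

console.log(ridgePoint(peak, bandwidth));          // 500
console.log(rooflineFlops(peak, bandwidth, 1));    // 2000000000000 (memory-bound)
console.log(rooflineFlops(peak, bandwidth, 1000)); // 1000000000000000 (compute-bound)
```

Batch-1 decode sits far to the left of the ridge point, while large-batch prefill moves toward the compute-bound side.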
Performance depends on model size. For large models (70B+), the NVIDIA B200 and H200 lead in throughput. For mid-size models (7B–30B), the H100 SXM offers the best price-performance. For budget deployments, the RTX 4090 and L40S are strong contenders.
Pricing data is refreshed every 6 hours via automated API calls to providers. Benchmark results are updated when new GPU hardware or model architectures are released. Community-submitted data is verified before inclusion.
The full live ranking is one click away — sort, filter, and compare every model by quality, cost, and value.
Built with care · Open source · MIT-licensed data · No signup