1. Inference Performance Calculation

InferenceBench predicts inference throughput using a roofline model adapted for transformer workloads. The roofline model identifies whether a given model/GPU combination is compute-bound or memory-bandwidth-bound by comparing the workload's arithmetic intensity (FLOPs per byte of memory traffic) against the GPU's ratio of peak FLOP/s to memory bandwidth.
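
As a rough illustration of the roofline check, the sketch below computes the attainable FLOP/s and the regime for a given arithmetic intensity. The type and function names (`GpuSpec`, `roofline`) and the GPU numbers are hypothetical and are not taken from the InferenceBench codebase.

```typescript
// Hypothetical GPU description; field names are illustrative only.
interface GpuSpec {
  peakFlops: number;    // peak FLOP/s at the chosen precision
  memBandwidth: number; // memory bandwidth in bytes/s
}

type Regime = "compute-bound" | "memory-bound";

interface RooflineResult {
  attainableFlops: number; // min(compute ceiling, bandwidth ceiling)
  regime: Regime;
  ridgePoint: number;      // FLOPs/byte where the two ceilings meet
}

// Classic roofline: performance is capped by whichever ceiling is lower.
function roofline(gpu: GpuSpec, arithmeticIntensity: number): RooflineResult {
  const ridgePoint = gpu.peakFlops / gpu.memBandwidth;
  const bandwidthCeiling = arithmeticIntensity * gpu.memBandwidth;
  const attainableFlops = Math.min(gpu.peakFlops, bandwidthCeiling);
  const regime: Regime =
    arithmeticIntensity < ridgePoint ? "memory-bound" : "compute-bound";
  return { attainableFlops, regime, ridgePoint };
}

// Round, made-up numbers: 1000 TFLOP/s and 2 TB/s give a ridge point of 500 FLOPs/byte.
const exampleGpu: GpuSpec = { peakFlops: 1e15, memBandwidth: 2e12 };
console.log(roofline(exampleGpu, 1));    // low intensity, as in single-sequence decode
console.log(roofline(exampleGpu, 1000)); // high intensity, as in long-prompt prefill
```

Workloads below the ridge point are limited by memory traffic, so adding compute does not help them; workloads above it are limited by peak FLOP/s.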

On top of the roofline baseline, we apply CUDA kernel-level modeling for 10 kernel types including FlashAttention, PagedAttention, fused GEMM, quantized matmul, and RoPE kernels. Each kernel model accounts for:

Arithmetic intensity (FLOPs per byte of memory traffic)
Occupancy and warp scheduling efficiency on the target GPU architecture
Memory access patterns (coalesced vs. strided, L2 cache hit rates)
Kernel fusion effects (e.g., fused QKV projection, SwiGLU activation fusion)

The final throughput estimate is the harmonic mean of per-layer kernel predictions, weighted by each layer’s share of total compute.
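
A weighted harmonic mean is the natural way to combine rates when the weights are shares of work, because total time is the sum of work divided by rate. The following is a minimal sketch of that aggregation; the `LayerPrediction` shape and the sample numbers are assumptions made for illustration.

```typescript
// Hypothetical per-layer prediction; not the engine's actual data structure.
interface LayerPrediction {
  throughput: number;   // predicted rate for this layer (any consistent unit)
  computeShare: number; // this layer's share of total compute, used as the weight
}

// Weighted harmonic mean: sum(w) / sum(w / x).
function weightedHarmonicMean(layers: LayerPrediction[]): number {
  let weightSum = 0;
  let weightedInverseSum = 0;
  for (const { throughput, computeShare } of layers) {
    if (throughput <= 0 || computeShare < 0) {
      throw new RangeError("throughput must be > 0 and computeShare >= 0");
    }
    weightSum += computeShare;
    weightedInverseSum += computeShare / throughput;
  }
  if (weightSum === 0) {
    throw new RangeError("weights sum to zero");
  }
  return weightSum / weightedInverseSum;
}

// Two layers with equal compute share: the slower layer dominates the result.
const combined = weightedHarmonicMean([
  { throughput: 100, computeShare: 0.5 },
  { throughput: 300, computeShare: 0.5 },
]);
console.log(combined.toFixed(1)); // "150.0", below the arithmetic mean of 200
```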

2. Memory Estimation Methodology

GPU memory (VRAM) usage during inference has three primary components:

Model weights: Parameter count multiplied by bytes-per-parameter for the chosen precision (FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes, etc.). We include quantization group metadata overhead where applicable.
KV-cache: Key and value tensors across all layers, sized per sequence as 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element and multiplied by the number of concurrent sequences (batch size). For models using Grouped Query Attention (GQA) or Multi-Query Attention (MQA), the KV-head count is reduced accordingly.
Activation memory: Intermediate tensors during the forward pass. We estimate peak activation memory per layer and account for in-place operations that reduce the footprint.

An overhead allowance (typically 5-10% of the total) is added for CUDA context, framework buffers, and memory fragmentation. The resulting sum determines whether a model fits on a single GPU or requires tensor parallelism across multiple devices.
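
The three components and the overhead can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: the field names are hypothetical, activation memory is passed in by the caller because no formula is given here, the overhead is applied as a fraction of the subtotal, and quantization group metadata is ignored.

```typescript
// Illustrative shapes; field names are assumptions, not InferenceBench's schema.
interface ModelShape {
  paramCount: number;
  numLayers: number;
  numKvHeads: number; // already reduced for GQA/MQA
  headDim: number;
}

interface ServingConfig {
  bytesPerParam: number;     // 2 for FP16, 1 for INT8, 0.5 for INT4
  kvBytesPerElement: number; // precision of the cached keys and values
  seqLen: number;
  batchSize: number;         // concurrent sequences
  activationBytes: number;   // peak activation estimate, supplied by the caller
  overheadFraction: number;  // e.g. 0.05 to 0.10
}

// 2 (K and V) x layers x KV heads x head dim x tokens x bytes, per sequence, times batch.
function kvCacheBytes(m: ModelShape, c: ServingConfig): number {
  return 2 * m.numLayers * m.numKvHeads * m.headDim *
    c.seqLen * c.kvBytesPerElement * c.batchSize;
}

function estimateVramBytes(m: ModelShape, c: ServingConfig): number {
  const weights = m.paramCount * c.bytesPerParam;
  const subtotal = weights + kvCacheBytes(m, c) + c.activationBytes;
  return subtotal * (1 + c.overheadFraction); // assumption: overhead scales with subtotal
}

function fitsOnSingleGpu(m: ModelShape, c: ServingConfig, vramBytes: number): boolean {
  return estimateVramBytes(m, c) <= vramBytes;
}

// An 8B-parameter GQA-style shape in FP16 with four 4096-token sequences (made-up workload).
const shape: ModelShape = { paramCount: 8e9, numLayers: 32, numKvHeads: 8, headDim: 128 };
const config: ServingConfig = {
  bytesPerParam: 2, kvBytesPerElement: 2, seqLen: 4096,
  batchSize: 4, activationBytes: 1e9, overheadFraction: 0.08,
};
console.log(kvCacheBytes(shape, config)); // 2147483648 bytes (2 GiB)
console.log(fitsOnSingleGpu(shape, config, 24 * 2 ** 30));
```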

3. Throughput Prediction

We predict tokens per second in two phases:

Prefill (prompt processing): Compute-bound phase where all input tokens are processed in parallel. Throughput is limited by GPU TFLOP/s and model FLOPs per token.
Decode (autoregressive generation): Memory-bandwidth-bound phase where each sequence generates one token per step. Throughput is limited by how fast model weights can be read from VRAM.

Batch size scaling is modeled using an efficiency curve: throughput increases near-linearly at small batch sizes (memory-bandwidth phase) and plateaus as the GPU saturates its compute capacity. We use empirical correction factors derived from vLLM and TensorRT-LLM benchmarks to calibrate the curve for each GPU family (Ampere, Ada Lovelace, Hopper, Blackwell).
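
A first-order version of the two phases and the batch plateau can be sketched as below. It assumes the common approximation of about 2 FLOPs per parameter per token for a forward pass, ignores KV-cache reads, and applies none of the empirical correction factors described above, so it is an idealized upper bound rather than the calibrated curve. All names and numbers are hypothetical.

```typescript
// Hypothetical inputs; not the engine's actual types.
interface Hardware {
  peakFlops: number;    // FLOP/s at the serving precision
  memBandwidth: number; // bytes/s
}

interface ModelSize {
  paramCount: number;
  bytesPerParam: number;
}

// Assumption: roughly 2 FLOPs per parameter per token in the forward pass.
const flopsPerToken = (m: ModelSize): number => 2 * m.paramCount;

// Prefill: all prompt tokens in parallel, so the compute ceiling applies.
function prefillTokensPerSec(hw: Hardware, m: ModelSize): number {
  return hw.peakFlops / flopsPerToken(m);
}

// Decode: each step reads the weights once and yields one token per sequence.
// Throughput grows linearly with batch size until the compute ceiling is reached.
function decodeTokensPerSec(hw: Hardware, m: ModelSize, batchSize: number): number {
  const weightBytes = m.paramCount * m.bytesPerParam;
  const bandwidthLimit = (batchSize * hw.memBandwidth) / weightBytes;
  const computeLimit = hw.peakFlops / flopsPerToken(m);
  return Math.min(bandwidthLimit, computeLimit);
}

// Made-up hardware and a 10B-parameter FP16 model.
const hw: Hardware = { peakFlops: 1e15, memBandwidth: 2e12 };
const model: ModelSize = { paramCount: 1e10, bytesPerParam: 2 };
for (const batch of [1, 8, 64, 512, 1024]) {
  console.log(batch, decodeTokensPerSec(hw, model, batch));
}
```

With these inputs the curve is linear up to a batch of 500 and flat afterwards; a calibrated model would bend earlier and more smoothly.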

4. Cost Calculation Methodology

Our cost model covers three pricing paradigms:

Per-token pricing: Direct $/M-token rates from inference API providers (e.g., OpenAI, Anthropic, Together AI, Fireworks). We track separate input and output token prices, plus reasoning-token surcharges for models like DeepSeek R1.
Per-GPU-hour pricing: Cloud GPU rental rates (on-demand, reserved, spot) converted to per-token cost by dividing by predicted throughput.
Self-hosted TCO: Capital expenditure (GPU purchase), power consumption (GPU TDP + cooling overhead), amortization period, rack space, and networking costs rolled into a per-token equivalent.

ROI calculations compare self-hosted TCO against API pricing at the user’s specified traffic volume and time horizon to determine break-even points and cumulative savings.
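
The per-GPU-hour conversion and a simple break-even calculation might look like the sketch below. It is a simplified illustration: the function names are hypothetical, self-hosted cost is reduced to one up-front figure plus a flat monthly operating cost, and amortization schedules, utilization, and input/output price splits are left out.

```typescript
// Convert a GPU rental rate into $/M tokens using a predicted throughput.
function costPerMillionTokens(gpuHourlyUsd: number, tokensPerSec: number): number {
  if (tokensPerSec <= 0) {
    throw new RangeError("tokensPerSec must be positive");
  }
  const tokensPerHour = tokensPerSec * 3600;
  return (gpuHourlyUsd / tokensPerHour) * 1e6;
}

// Months until cumulative API spend exceeds capex plus self-hosted operating cost.
// Returns null when self-hosting never catches up at this traffic volume.
function breakEvenMonths(
  capexUsd: number,
  monthlyOpexUsd: number,
  monthlyTokensMillions: number,
  apiUsdPerMillion: number,
): number | null {
  const apiMonthlyUsd = monthlyTokensMillions * apiUsdPerMillion;
  const monthlySavingUsd = apiMonthlyUsd - monthlyOpexUsd;
  return monthlySavingUsd > 0 ? capexUsd / monthlySavingUsd : null;
}

// Made-up figures: $3.60/hour at 1000 tokens/s is about $1 per million tokens.
console.log(costPerMillionTokens(3.6, 1000).toFixed(2));
// $30,000 capex, $500/month opex, 1000M tokens/month at $2/M via an API.
console.log(breakEvenMonths(30000, 500, 1000, 2)); // 20
```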

5. Data Sources

Model specifications: Official model cards from HuggingFace, architecture configs from model repositories, and published technical reports.
GPU specifications: Official NVIDIA, AMD, and Intel datasheets for compute (TFLOP/s at each precision), memory bandwidth (GB/s), VRAM capacity, and TDP.
Provider pricing: Scraped from official pricing pages and provider APIs, refreshed daily (see Update Frequency). Community-submitted pricing reports supplement automated collection.
Quality benchmarks: MMLU, HumanEval, GSM8K, and other evaluation scores sourced from the Open LLM Leaderboard, published papers, and official model documentation.
Community reports: User-submitted benchmark results, pricing observations, and experience reports validated through our crowdsource pipeline.

6. Update Frequency

Model catalog: Updated within 48 hours of major model releases.
GPU catalog: Updated when new GPU SKUs are announced or specifications are revised.
Provider pricing: Automated refresh daily; community reports processed continuously.
Benchmark scores: Updated as new evaluation results become available on public leaderboards.
Engine models: Recalibrated quarterly against new empirical benchmark data.

7. Limitations and Disclaimers

All performance predictions are estimates based on analytical models, not measured benchmarks on every possible hardware/software combination.
Real-world throughput varies with serving framework (vLLM, TensorRT-LLM, TGI), driver version, system configuration, input distribution, and concurrent load.
Cost estimates use published pricing which may differ from negotiated enterprise rates or promotional discounts.
Memory estimates assume standard serving configurations. Custom optimizations (e.g., speculative decoding, prefix caching) may alter actual usage.
We do not guarantee accuracy for unreleased or heavily customized model architectures.
InferenceBench is not affiliated with any GPU vendor, cloud provider, or model developer.

8. How to Contribute Data

We welcome community contributions to improve our data and models:

Pricing reports: Share your GPU pricing observations and provider experiences through the Share Your Configuration form on our Community page.
Benchmark results: Share your real-world inference performance measurements (throughput, latency, VRAM usage) via the Community page to help validate our predictions.
Bug reports and corrections: If you spot inaccurate specs, outdated pricing, or calculation errors, raise a ticket through our Support page.
Model requests: Request new models or GPUs to be added through our Support page.

9. Leaderboard Column Sources

Every cell on the public leaderboard is either a verified value with provenance metadata, or the sentinel —. We never display modeled, estimated, or interpolated values on the public leaderboard. Modeled values exist only inside the calculator engine. The table below documents the source type, derivation, and staleness rules for each column. Hovering any cell in the leaderboard exposes the underlying citation.

Column	Source type	Formula / file	Staleness
Params	OFFICIAL	Per-model JSON in `src/data/models/`; verified at import via Zod.	None (immutable architecture spec)
Quality	OFFICIAL	`src/data/quality-scores.json` — each score carries `{value, source_url, source_date}`. Scores without a `source_url` render as `—`. Loader: `src/data/quality-scores-loader.ts`.	None (benchmarks are one-time publications)
Input $/M	PROVIDER	Scraped via `src/lib/pricing-refresh.ts` — each row carries `verified_at` and `source_url`.	>7d: yellow tint; >30d: `—`
Output $/M	PROVIDER	Same pipeline as Input $/M.	>7d: yellow tint; >30d: `—`
Context	OFFICIAL	Per-model JSON in `src/data/models/` (model-card link).	None
Speed (TTFT + tok/s)	MEASURED	Probe scheduler (Phase 3) writes to `data/latency-snapshots/`; aggregator in `src/engine/latency.ts` (planned).	>14d: yellow tint; >60d: `—`; no probe in 48h: `—`
Tok/s per $	DERIVED (MEASURED ÷ PROVIDER)	`measured_throughput / output_price_per_M` — cell is `—` when either input is missing or stale.	Inherits worst of inputs
Providers	MEASURED (probe-gated)	Count of `model.providers[]` with a successful probe in the last 7 days.	Provider with no successful probe in 7d does not count
Reasoning expansion	MEASURED	100-prompt eval suite in `src/data/reasoning-eval-suite/prompts.json` (planned). Mean output-to-prompt token multiplier per effort level.	No run in 60d: `—`
Value	DERIVED (OFFICIAL ÷ PROVIDER)	`qualityScore / outputPerM` — engine helper in `src/engine/leaderboard.ts`.	Inherits worst of inputs
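
One way to read the "inherits worst of inputs" rule for the derived columns is sketched below. This is an interpretation, not the code in `src/engine/leaderboard.ts`: the `Freshness` states, the `CellInput` shape, and the choice to show a dash for a non-positive price are all assumptions.

```typescript
// Hypothetical freshness states, ordered from best to worst.
type Freshness = "fresh" | "stale" | "dropped";

interface CellInput {
  value: number | null; // null when the source cell itself is a dash
  freshness: Freshness;
}

interface DerivedCell {
  display: string;
  freshness: Freshness;
}

const SENTINEL = "—";
const rank: Record<Freshness, number> = { fresh: 0, stale: 1, dropped: 2 };

function worstFreshness(a: Freshness, b: Freshness): Freshness {
  return rank[a] >= rank[b] ? a : b;
}

// Ratio of two verified inputs, e.g. qualityScore / outputPerM.
function deriveRatio(numerator: CellInput, denominator: CellInput): DerivedCell {
  const freshness = worstFreshness(numerator.freshness, denominator.freshness);
  if (
    numerator.value === null ||
    denominator.value === null ||
    denominator.value <= 0 || // assumption: no ratio for free or invalid prices
    freshness === "dropped"
  ) {
    return { display: SENTINEL, freshness: "dropped" };
  }
  return { display: (numerator.value / denominator.value).toFixed(2), freshness };
}

console.log(deriveRatio({ value: 80, freshness: "fresh" }, { value: 4, freshness: "stale" }));
console.log(deriveRatio({ value: 80, freshness: "fresh" }, { value: null, freshness: "fresh" }));
```

The first call yields a value that carries the stale tint from its price input; the second yields the sentinel because one input is missing.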

Provenance schema

Each verified cell is bound to a Provenance record (see src/lib/provenance.ts):

type ProvenanceSource = "OFFICIAL" | "PROVIDER" | "MEASURED";

interface Provenance {
  source: ProvenanceSource;
  url: string;             // citation (provider page, paper, run log)
  timestamp: string;       // ISO 8601 — when verified / scraped / measured
  sampleSize?: number;     // MEASURED only — N prompts in the run
}

Staleness thresholds

Pricing > 7 days old: yellow “stale” tint (constant PRICING_STALE_YELLOW_MS)
Pricing > 30 days old: cell becomes — (constant PRICING_STALE_DROP_MS)
Measured perf > 14 days old: yellow tint (MEASURED_STALE_YELLOW_MS)
Measured perf > 60 days old: — (MEASURED_STALE_DROP_MS)
No probe in the last 48 hours for a live-speed cell: —
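
The day-based thresholds can be expressed as a small classifier. The constant names come from the list above and their values follow the stated day counts; the `CellState` type and the function names are hypothetical, and the 48-hour probe rule is left out because it is not specified here how it combines with the 14-day and 60-day thresholds.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Names as listed above; values follow the stated day counts.
const PRICING_STALE_YELLOW_MS = 7 * DAY_MS;
const PRICING_STALE_DROP_MS = 30 * DAY_MS;
const MEASURED_STALE_YELLOW_MS = 14 * DAY_MS;
const MEASURED_STALE_DROP_MS = 60 * DAY_MS;

// Hypothetical display states for a leaderboard cell.
type CellState = "fresh" | "yellow" | "dash";

function classifyAge(ageMs: number, yellowMs: number, dropMs: number): CellState {
  if (!Number.isFinite(ageMs)) return "dash"; // unparseable timestamp: refuse to guess
  if (ageMs > dropMs) return "dash";
  if (ageMs > yellowMs) return "yellow";
  return "fresh";
}

// `timestamp` is an ISO 8601 string, as in the Provenance record.
function pricingState(timestamp: string, now: Date): CellState {
  const ageMs = now.getTime() - Date.parse(timestamp);
  return classifyAge(ageMs, PRICING_STALE_YELLOW_MS, PRICING_STALE_DROP_MS);
}

function measuredState(timestamp: string, now: Date): CellState {
  const ageMs = now.getTime() - Date.parse(timestamp);
  return classifyAge(ageMs, MEASURED_STALE_YELLOW_MS, MEASURED_STALE_DROP_MS);
}

const verifiedAt = "2024-01-01T00:00:00Z";
console.log(pricingState(verifiedAt, new Date("2024-01-11T00:00:00Z")));  // 10 days old
console.log(measuredState(verifiedAt, new Date("2024-01-11T00:00:00Z"))); // 10 days old
```

The same 10-day-old timestamp is tinted yellow as a price but still counts as fresh for a measured value, since the two source types use different windows.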

Why some cells show “—”

We refuse to guess. A cell is a dash when the value would otherwise be modeled, estimated, interpolated, or sourced from a citation we can't verify. Tooltips on hover explain the specific reason: “no published score”, “not currently measured”, or “data older than retention window”.

10. References

InferenceBench's methodology draws on the following authoritative technical sources:

NVIDIA GPU Architecture Documentation: Official specifications for Ampere, Ada Lovelace, Hopper, and Blackwell architectures, including compute throughput (TFLOP/s), memory bandwidth, and Tensor Core capabilities. developer.nvidia.com/cuda-gpus

HuggingFace Open LLM Leaderboard: Community-standard evaluation framework for large language models covering MMLU, HellaSwag, ARC, TruthfulQA, and other benchmarks. Quality scores referenced in our model rankings are sourced from this leaderboard and verified against original publications. huggingface.co/open-llm-leaderboard

vLLM and PagedAttention: Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (SOSP 2023). Our KV-cache memory model and batch scheduling assumptions are informed by PagedAttention's block-based memory management approach. docs.vllm.ai

FlashAttention and FlashAttention-2: Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” (NeurIPS 2022) and Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning” (2023). Our CUDA kernel performance models for attention layers are calibrated against FlashAttention throughput characteristics. github.com/Dao-AILab/flash-attention

Roofline Model: Williams, Waterman & Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures” (Communications of the ACM, 2009). The foundational performance modeling framework used to determine compute- vs. memory-bandwidth-bound regimes for transformer inference.
