Benchmark Methodology
Last updated: April 1, 2026
1. Inference Performance Calculation
InferenceBench predicts inference throughput using a roofline model adapted for transformer workloads. The roofline model identifies whether a given model/GPU combination is compute-bound or memory-bandwidth-bound by comparing the workload's arithmetic intensity (FLOPs per byte of memory traffic) against the GPU's ratio of peak FLOP/s to memory bandwidth.
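As an illustration only, the basic roofline classification can be sketched in a few lines of Python. This is the textbook roofline model, not InferenceBench's actual engine; the function name, parameters, and the example GPU figures (100 TFLOP/s, 1 TB/s) are hypothetical.

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a workload with the textbook roofline model.

    flops          -- floating-point operations performed by the workload
    bytes_moved    -- bytes of memory traffic generated by the workload
    peak_flops     -- GPU peak compute in FLOP/s
    peak_bandwidth -- GPU peak memory bandwidth in bytes/s
    Returns (regime, attainable FLOP/s).
    """
    intensity = flops / bytes_moved        # FLOPs per byte
    ridge = peak_flops / peak_bandwidth    # intensity where the two ceilings meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    regime = "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"
    return regime, attainable


# Hypothetical GPU: 100 TFLOP/s and 1 TB/s, so the ridge point is 100 FLOPs/byte.
low_intensity = roofline_bound(2e9, 1e9, 100e12, 1e12)     # 2 FLOPs/byte
high_intensity = roofline_bound(400e9, 1e9, 100e12, 1e12)  # 400 FLOPs/byte
print(low_intensity)
print(high_intensity)
```

In this sketch the low-intensity workload is capped by bandwidth, while the high-intensity workload reaches the compute ceiling.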
On top of the roofline baseline, we apply CUDA kernel-level modeling for 10 kernel types including FlashAttention, PagedAttention, fused GEMM, quantized matmul, and RoPE kernels. Each kernel model accounts for:
- Arithmetic intensity (FLOPs per byte of memory traffic)
- Occupancy and warp scheduling efficiency on the target GPU architecture
- Memory access patterns (coalesced vs. strided, L2 cache hit rates)
- Kernel fusion effects (e.g., fused QKV projection, SwiGLU activation fusion)
The final throughput estimate is the harmonic mean of per-layer kernel predictions, weighted by each layer’s share of total compute.
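A weighted harmonic mean of this kind can be sketched as below. The function name and the two-layer example values are hypothetical, and the weights are assumed to be each layer's share of total compute, as described above.

```python
def weighted_harmonic_mean(rates, weights):
    """Weighted harmonic mean: sum(w) / sum(w_i / r_i).

    rates   -- per-layer predicted throughput (any consistent unit, must be > 0)
    weights -- each layer's share of total compute (need not sum to 1)
    """
    if not rates or len(rates) != len(weights):
        raise ValueError("rates and weights must be non-empty and the same length")
    if any(r <= 0 for r in rates):
        raise ValueError("rates must be positive")
    return sum(weights) / sum(w / r for w, r in zip(weights, rates))


# Hypothetical example: two layers with equal compute share.
layer_rates = [100.0, 50.0]
compute_share = [0.5, 0.5]
estimate = weighted_harmonic_mean(layer_rates, compute_share)
print(round(estimate, 2))
```

The harmonic mean is the natural choice for combining rates: the slower layer takes more time per unit of work, so it pulls the estimate below the arithmetic mean of 75.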
2. Memory Estimation Methodology
GPU memory (VRAM) usage during inference has three primary components:
- Model weights: Parameter count multiplied by bytes-per-parameter for the chosen precision (FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes, etc.). We include quantization group metadata overhead where applicable.
- KV-cache: Per-layer key and value tensors sized as 2 x num_layers x num_kv_heads x head_dim x seq_len x bytes_per_element, scaled by the number of concurrent sequences (batch size). For models using Grouped Query Attention (GQA) or Multi-Query Attention (MQA), the KV-head count is reduced accordingly.
- Activation memory: Intermediate tensors during the forward pass. We estimate peak activation memory per layer and account for in-place operations that reduce the footprint.
An overhead allowance (typically 5-10% of the total) is added for CUDA context, framework buffers, and memory fragmentation. The sum determines whether a model fits on a single GPU or requires tensor parallelism across multiple devices.
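The components above can be combined in a short sketch. Everything here is illustrative: the function name, the model shape (7B parameters, 32 layers, 8 KV heads), and the 1 GB activation figure are hypothetical. The overhead is assumed to be a fraction of the subtotal, and quantization group metadata is left out for brevity.

```python
def estimate_vram_bytes(num_params, bytes_per_param, num_layers, num_kv_heads,
                        head_dim, seq_len, batch_size, kv_bytes_per_element,
                        activation_bytes, overhead_fraction=0.10):
    """Rough VRAM estimate: weights + KV-cache + activations + overhead."""
    weights = num_params * bytes_per_param
    # Factor of 2 covers the separate key and value tensors.
    kv_cache = (2 * num_layers * num_kv_heads * head_dim * seq_len
                * kv_bytes_per_element * batch_size)
    subtotal = weights + kv_cache + activation_bytes
    # Assumption: overhead is applied as a fraction of the subtotal.
    total = subtotal * (1 + overhead_fraction)
    return {"weights": weights, "kv_cache": kv_cache, "total": total}


# Hypothetical 7B-parameter GQA model in FP16, one 4096-token sequence.
usage = estimate_vram_bytes(
    num_params=7_000_000_000, bytes_per_param=2,
    num_layers=32, num_kv_heads=8, head_dim=128,
    seq_len=4096, batch_size=1, kv_bytes_per_element=2,
    activation_bytes=1_000_000_000,
)
fits_on_24gb = usage["total"] <= 24e9
print(usage["kv_cache"], fits_on_24gb)
```

Comparing the total against a GPU's VRAM capacity gives the fit check; raising batch_size or seq_len grows only the KV-cache term.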
3. Throughput Prediction
We predict tokens per second in two phases:
- Prefill (prompt processing): Compute-bound phase where all input tokens are processed in parallel. Throughput is limited by GPU TFLOP/s and model FLOPs per token.
- Decode (autoregressive generation): Memory-bandwidth-bound phase where one token is generated at a time. Throughput is limited by how fast model weights can be read from VRAM.
Batch size scaling is modeled using an efficiency curve: throughput increases near-linearly at small batch sizes (memory-bandwidth phase) and plateaus as the GPU saturates its compute capacity. We use empirical correction factors derived from vLLM and TensorRT-LLM benchmarks to calibrate the curve for each GPU family (Ampere, Ada Lovelace, Hopper, Blackwell).
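The two phases and the batch-size plateau can be illustrated with a simplified roofline-style sketch. It is not the calibrated curve described above: the empirical correction factors are replaced by a single hypothetical efficiency parameter, KV-cache reads are ignored, and the GPU and model figures are invented for the example.

```python
def prefill_tokens_per_s(peak_flops, flops_per_token, efficiency=1.0):
    """Compute-bound phase: tokens/s limited by FLOP/s per token."""
    return efficiency * peak_flops / flops_per_token


def decode_tokens_per_s(batch_size, weight_bytes, bandwidth,
                        peak_flops, flops_per_token, efficiency=1.0):
    """Decode phase: each step reads the weights once for the whole batch.

    Throughput grows linearly with batch size until it reaches the compute
    ceiling, then plateaus.
    """
    memory_limit = batch_size * bandwidth / weight_bytes
    compute_limit = peak_flops / flops_per_token
    return efficiency * min(memory_limit, compute_limit)


# Hypothetical setup: 14 GB of weights, 1 TB/s, 100 TFLOP/s,
# and an assumed 14 GFLOPs per token.
cfg = dict(weight_bytes=14e9, bandwidth=1e12,
           peak_flops=100e12, flops_per_token=14e9)
single = decode_tokens_per_s(1, **cfg)
batch8 = decode_tokens_per_s(8, **cfg)
batch256 = decode_tokens_per_s(256, **cfg)
print(round(single, 1), round(batch8, 1), round(batch256, 1))
```

In this sketch batch 8 is still in the linear region, while batch 256 has hit the compute ceiling and matches the prefill rate.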
4. Cost Calculation Methodology
Our cost model covers three pricing paradigms:
- Per-token pricing: Direct $/M-token rates from inference API providers (e.g., OpenAI, Anthropic, Together AI, Fireworks). We track separate input and output token prices, plus reasoning-token surcharges for models like DeepSeek R1.
- Per-GPU-hour pricing: Cloud GPU rental rates (on-demand, reserved, spot) converted to per-token cost by dividing by predicted throughput.
- Self-hosted TCO: Capital expenditure (GPU purchase), power consumption (GPU TDP + cooling overhead), amortization period, rack space, and networking costs rolled into a per-token equivalent.
ROI calculations compare self-hosted TCO against API pricing at the user’s specified traffic volume and time horizon to determine break-even points and cumulative savings.
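The GPU-hour conversion and the break-even comparison reduce to simple arithmetic, sketched below. The function names and all prices, throughputs, and volumes are hypothetical, and the self-hosted side is collapsed into one up-front cost plus a flat monthly operating cost rather than the full TCO breakdown listed above.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_s):
    """Convert a GPU rental rate into $ per million tokens."""
    tokens_per_hour = tokens_per_s * 3600
    return gpu_hourly_usd / tokens_per_hour * 1e6


def break_even_months(capex_usd, monthly_opex_usd,
                      api_usd_per_million, monthly_tokens):
    """Months until self-hosting savings repay the up-front cost.

    Returns None when the API is cheaper than the monthly operating cost,
    meaning self-hosting never breaks even at this volume.
    """
    monthly_api_cost = api_usd_per_million * monthly_tokens / 1e6
    monthly_savings = monthly_api_cost - monthly_opex_usd
    if monthly_savings <= 0:
        return None
    return capex_usd / monthly_savings


# Hypothetical figures: a $2/hour GPU serving 1,000 tokens/s,
# and a $30,000 server compared against a $1 per million token API.
rental_cost = cost_per_million_tokens(2.0, 1000.0)
months = break_even_months(30_000.0, 500.0, 1.0, 2e9)
print(round(rental_cost, 4), months)
```

At low traffic the monthly savings term goes negative and the function returns None, which corresponds to the case where the API remains the cheaper option over any time horizon.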
5. Data Sources
- Model specifications: Official model cards from HuggingFace, architecture configs from model repositories, and published technical reports.
- GPU specifications: Official NVIDIA, AMD, and Intel datasheets for compute (TFLOP/s at each precision), memory bandwidth (GB/s), VRAM capacity, and TDP.
- Provider pricing: Scraped from official pricing pages and provider APIs, refreshed daily. Community-submitted pricing reports supplement automated collection.
- Quality benchmarks: MMLU, HumanEval, GSM8K, and other evaluation scores sourced from the Open LLM Leaderboard, published papers, and official model documentation.
- Community reports: User-submitted benchmark results, pricing observations, and experience reports validated through our crowdsource pipeline.
6. Update Frequency
- Model catalog: Updated within 48 hours of major model releases.
- GPU catalog: Updated when new GPU SKUs are announced or specifications are revised.
- Provider pricing: Automated refresh daily; community reports processed continuously.
- Benchmark scores: Updated as new evaluation results become available on public leaderboards.
- Engine models: Recalibrated quarterly against new empirical benchmark data.
7. Limitations and Disclaimers
- All performance predictions are estimates based on analytical models, not measured benchmarks on every possible hardware/software combination.
- Real-world throughput varies with serving framework (vLLM, TensorRT-LLM, TGI), driver version, system configuration, input distribution, and concurrent load.
- Cost estimates use published pricing, which may differ from negotiated enterprise rates or promotional discounts.
- Memory estimates assume standard serving configurations. Custom optimizations (e.g., speculative decoding, prefix caching) may alter actual usage.
- We do not guarantee accuracy for unreleased or heavily customized model architectures.
- InferenceBench is not affiliated with any GPU vendor, cloud provider, or model developer.
8. How to Contribute Data
We welcome community contributions to improve our data and models:
- Pricing reports: Share your GPU pricing observations and provider experiences through the Share Your Configuration form on our Community page.
- Benchmark results: Share your real-world inference performance measurements (throughput, latency, VRAM usage) via the Community page to help validate our predictions.
- Bug reports and corrections: If you spot inaccurate specs, outdated pricing, or calculation errors, raise a ticket through our Support page.
- Model requests: Request the addition of new models or GPUs through our Support page.
9. References
InferenceBench's methodology draws on the following authoritative technical sources: