Benchmark Methodology

Last updated: April 1, 2026

1. Inference Performance Calculation

InferenceBench predicts inference throughput using a roofline model adapted for transformer workloads. The roofline model determines whether a given model/GPU combination is compute-bound or memory-bandwidth-bound by comparing each workload's arithmetic intensity against the GPU's ratio of peak FLOP/s to memory bandwidth: workloads below that ratio are limited by bandwidth, those above it by compute.
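The bound check at the heart of the roofline model can be sketched in a few lines. The GPU figures below (989 TFLOP/s FP16, 3350 GB/s, roughly H100-class) and the decode-phase FLOP and byte counts are illustrative assumptions, not InferenceBench's calibrated values:

```python
def roofline_bound(flops: float, bytes_moved: float,
                   peak_tflops: float, bandwidth_gbs: float) -> str:
    """Classify a workload as compute- or memory-bandwidth-bound."""
    intensity = flops / bytes_moved                            # FLOPs per byte
    machine_balance = (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)
    return "compute-bound" if intensity >= machine_balance else "memory-bound"

# Decode-phase matrix-vector work on a 7B-parameter FP16 model:
# ~2 FLOPs per parameter, ~2 bytes read per parameter, so intensity ~1.
print(roofline_bound(flops=2 * 7e9, bytes_moved=2 * 7e9,
                     peak_tflops=989, bandwidth_gbs=3350))   # memory-bound
```

With an intensity of ~1 FLOP/byte against a machine balance near 300, single-sequence decode lands deep in the bandwidth-bound regime, which is why Section 3 treats decode as bandwidth-limited.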

On top of the roofline baseline, we apply CUDA kernel-level modeling for 10 kernel types including FlashAttention, PagedAttention, fused GEMM, quantized matmul, and RoPE kernels. Each kernel model accounts for:

  • Arithmetic intensity (FLOPs per byte of memory traffic)
  • Occupancy and warp scheduling efficiency on the target GPU architecture
  • Memory access patterns (coalesced vs. strided, L2 cache hit rates)
  • Kernel fusion effects (e.g., fused QKV projection, SwiGLU activation fusion)

The final throughput estimate is the harmonic mean of per-layer kernel predictions, weighted by each layer’s share of total compute.
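The weighted harmonic mean is the natural aggregate here because total time is the sum of per-layer times. A minimal sketch, with hypothetical per-layer throughputs and compute shares (not real calibration data):

```python
def weighted_harmonic_mean(throughputs, weights):
    """Weighted harmonic mean: total work divided by total time.

    `weights` are each layer group's share of total compute (sum to 1);
    `throughputs` are the per-layer kernel predictions.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    return 1.0 / sum(w / t for w, t in zip(weights, throughputs))

# Hypothetical per-kernel-type predictions (tokens/s) and compute shares.
layers = [
    (120.0, 0.6),   # GEMM-dominated projection layers
    (200.0, 0.3),   # attention kernels
    (800.0, 0.1),   # elementwise / normalization kernels
]
estimate = weighted_harmonic_mean([t for t, _ in layers],
                                  [w for _, w in layers])
```

Note that the slowest, most compute-heavy kernels dominate the result (the estimate lands near the 120 tokens/s GEMM figure), which an arithmetic mean would overstate.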

2. Memory Estimation Methodology

GPU memory (VRAM) usage during inference has three primary components:

  • Model weights: Parameter count multiplied by bytes-per-parameter for the chosen precision (FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes, etc.). We include quantization group metadata overhead where applicable.
  • KV-cache: Per-layer key and value tensors sized as 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element, scaled by the number of concurrent sequences (batch size). For models using Grouped Query Attention (GQA) or Multi-Query Attention (MQA), the KV-head count is reduced accordingly.
  • Activation memory: Intermediate tensors during the forward pass. We estimate peak activation memory per layer and account for in-place operations that reduce the footprint.

A fixed overhead (typically 5-10% of the total) is added for CUDA context, framework buffers, and memory fragmentation. The sum determines whether a model fits on a single GPU or requires tensor parallelism across multiple devices.
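A simplified version of this estimate can be sketched as follows. For brevity, activation memory is folded into the fixed overhead term, and the model configuration (Llama-3-8B-like: 32 layers, 8 KV heads via GQA, head_dim 128) is an illustrative assumption, not output of InferenceBench's calibrated model:

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float,
                     num_layers: int, num_kv_heads: int, head_dim: int,
                     seq_len: int, batch: int, kv_bytes: float = 2,
                     overhead: float = 0.08) -> float:
    """Rough VRAM estimate in GB: weights + KV-cache, plus fixed overhead.

    Activation memory is folded into `overhead` here for brevity.
    """
    weights = params_b * 1e9 * bytes_per_param
    kv_cache = (2 * num_layers * num_kv_heads * head_dim
                * seq_len * kv_bytes * batch)
    return (weights + kv_cache) * (1 + overhead) / 1e9

# An 8B-parameter model in FP16 with an 8K context, single sequence.
vram = estimate_vram_gb(params_b=8, bytes_per_param=2, num_layers=32,
                        num_kv_heads=8, head_dim=128, seq_len=8192, batch=1)
```

Under these assumptions the estimate lands around 18 GB: 16 GB of weights plus roughly 1 GB of KV-cache, so the model fits on a 24 GB card but not a 16 GB one.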

3. Throughput Prediction

We predict tokens per second in two phases:

  • Prefill (prompt processing): Compute-bound phase where all input tokens are processed in parallel. Throughput is limited by GPU TFLOP/s and model FLOPs per token.
  • Decode (autoregressive generation): Memory-bandwidth-bound phase where one token is generated at a time. Throughput is limited by how fast model weights can be read from VRAM.

Batch size scaling is modeled with an efficiency curve: throughput rises near-linearly at small batch sizes (the bandwidth-bound regime) and plateaus as the GPU saturates its compute capacity. We calibrate the curve for each GPU family (Ampere, Ada Lovelace, Hopper, Blackwell) using empirical correction factors derived from vLLM and TensorRT-LLM benchmarks.
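A rough sketch of decode throughput under this model: single-sequence decode is bounded by reading all weights once per token, and batching amortizes that read until compute saturates. The saturating curve below is an assumed functional form standing in for the empirically calibrated one, and the hardware figures are illustrative:

```python
def decode_tokens_per_sec(model_bytes: float, bandwidth_gbs: float,
                          batch: int, saturation_batch: int = 64) -> float:
    """Decode-phase throughput sketch (tokens/s across all sequences).

    At batch 1, throughput is bandwidth / bytes-of-weights-per-token.
    The efficiency term grows near-linearly in batch size, then
    plateaus as the GPU approaches compute saturation.
    """
    base = (bandwidth_gbs * 1e9) / model_bytes            # tokens/s at batch 1
    efficiency = batch / (1 + batch / saturation_batch)   # assumed curve shape
    return base * efficiency

# A 7B FP16 model (~14 GB of weights) on a GPU with 2000 GB/s bandwidth.
single = decode_tokens_per_sec(14e9, 2000, batch=1)    # ~140 tokens/s
batched = decode_tokens_per_sec(14e9, 2000, batch=32)  # ~3000 tokens/s
```

Batching 32 sequences yields roughly a 20x aggregate gain in this sketch, because each pass over the weights now produces 32 tokens instead of one.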

4. Cost Calculation Methodology

Our cost model covers three pricing paradigms:

  • Per-token pricing: Direct $/M-token rates from inference API providers (e.g., OpenAI, Anthropic, Together AI, Fireworks). We track separate input and output token prices, plus reasoning-token surcharges for models like DeepSeek R1.
  • Per-GPU-hour pricing: Cloud GPU rental rates (on-demand, reserved, spot) converted to per-token cost by dividing by predicted throughput.
  • Self-hosted TCO: Capital expenditure (GPU purchase), power consumption (GPU TDP + cooling overhead), amortization period, rack space, and networking costs rolled into a per-token equivalent.

ROI calculations compare self-hosted TCO against API pricing at the user’s specified traffic volume and time horizon to determine break-even points and cumulative savings.
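The break-even logic reduces to comparing two cumulative cost lines over time. A minimal sketch with hypothetical prices (not real provider rates or hardware costs):

```python
def breakeven_month(capex: float, monthly_self_opex: float,
                    monthly_api_cost: float, horizon: int = 60):
    """First month at which cumulative self-hosted cost (capex + opex)
    drops to or below cumulative API spend; None if never within horizon.
    """
    for m in range(1, horizon + 1):
        if capex + monthly_self_opex * m <= monthly_api_cost * m:
            return m
    return None

# Hypothetical: $60k GPU purchase + $1.5k/month power & hosting,
# versus $6k/month of equivalent API spend at the same traffic volume.
month = breakeven_month(capex=60_000, monthly_self_opex=1_500,
                        monthly_api_cost=6_000)   # break-even at month 14
```

Cumulative savings after break-even grow by the monthly cost gap ($4.5k/month in this example); in the real model the API and self-hosted costs both derive from the predicted per-token throughput above.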

5. Data Sources

  • Model specifications: Official model cards from HuggingFace, architecture configs from model repositories, and published technical reports.
  • GPU specifications: Official NVIDIA, AMD, and Intel datasheets for compute (TFLOP/s at each precision), memory bandwidth (GB/s), VRAM capacity, and TDP.
  • Provider pricing: Scraped from official pricing pages and provider APIs, refreshed regularly. Community-submitted pricing reports supplement automated collection.
  • Quality benchmarks: MMLU, HumanEval, GSM8K, and other evaluation scores sourced from the Open LLM Leaderboard, published papers, and official model documentation.
  • Community reports: User-submitted benchmark results, pricing observations, and experience reports validated through our crowdsource pipeline.

6. Update Frequency

  • Model catalog: Updated within 48 hours of major model releases.
  • GPU catalog: Updated when new GPU SKUs are announced or specifications are revised.
  • Provider pricing: Automated refresh daily; community reports processed continuously.
  • Benchmark scores: Updated as new evaluation results become available on public leaderboards.
  • Engine models: Recalibrated quarterly against new empirical benchmark data.

7. Limitations and Disclaimers

  • All performance predictions are estimates based on analytical models, not measured benchmarks on every possible hardware/software combination.
  • Real-world throughput varies with serving framework (vLLM, TensorRT-LLM, TGI), driver version, system configuration, input distribution, and concurrent load.
  • Cost estimates use published pricing which may differ from negotiated enterprise rates or promotional discounts.
  • Memory estimates assume standard serving configurations. Custom optimizations (e.g., speculative decoding, prefix caching) may alter actual usage.
  • We do not guarantee accuracy for unreleased or heavily customized model architectures.
  • InferenceBench is not affiliated with any GPU vendor, cloud provider, or model developer.

8. How to Contribute Data

We welcome community contributions to improve our data and models:

  • Pricing reports: Share your GPU pricing observations and provider experiences through the Share Your Configuration form on our Community page.
  • Benchmark results: Share your real-world inference performance measurements (throughput, latency, VRAM usage) via the Community page to help validate our predictions.
  • Bug reports and corrections: If you spot inaccurate specs, outdated pricing, or calculation errors, raise a ticket through our Support page.
  • Model requests: Request new models or GPUs to be added through our Support page.

9. References

InferenceBench's methodology draws on the following authoritative technical sources:

  • NVIDIA GPU Architecture Documentation — Official specifications for Ampere, Ada Lovelace, Hopper, and Blackwell architectures, including compute throughput (TFLOP/s), memory bandwidth, and Tensor Core capabilities. developer.nvidia.com/cuda-gpus
  • HuggingFace Open LLM Leaderboard — Community-standard evaluation framework for large language models covering MMLU, HellaSwag, ARC, TruthfulQA, and other benchmarks. Quality scores referenced in our model rankings are sourced from this leaderboard and verified against original publications. huggingface.co/open-llm-leaderboard
  • vLLM PagedAttention — Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (SOSP 2023). Our KV-cache memory model and batch scheduling assumptions are informed by PagedAttention's block-based memory management approach. docs.vllm.ai
  • FlashAttention & FlashAttention-2 — Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” (NeurIPS 2022) and “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning” (2023). Our CUDA kernel performance models for attention layers are calibrated against FlashAttention throughput characteristics. github.com/Dao-AILab/flash-attention
  • Roofline Model — Williams, Waterman & Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures” (2009). The foundational performance modeling framework used to determine compute- vs. memory-bandwidth-bound regimes for transformer inference.