Methodology · v1
How we compute InferenceScore™
Last updated 2026-04-19. We publish this page (and the equivalent arXiv PDF) so that any third party can reproduce our scores from raw data. The open benchmark code lives at github.com/inferencebench/bench (MIT-licensed).
The formula
score = 0.40 * price_component
      + 0.30 * performance_component
      + 0.20 * availability_component
      + 0.10 * reliability_component
Each component is normalised to 0–100 against the current corpus of tracked GPU × provider × region rows.
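As a rough illustration, the weighted sum can be sketched in a few lines of Python. The function name `inference_score`, the dictionary keys, and the input validation are assumptions made for this example and are not taken from the benchmark repository.

```python
# Weights as published on this page (40/30/20/10).
WEIGHTS = {
    "price": 0.40,
    "performance": 0.30,
    "availability": 0.20,
    "reliability": 0.10,
}


def inference_score(components: dict[str, float]) -> float:
    """Combine four components, each already normalised to 0-100.

    The name and the validation rules are hypothetical.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 100.0:
            raise ValueError(f"{name} must be in [0, 100]")
    # Weighted sum; the weights add up to 1, so the result stays in 0-100.
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


example = inference_score(
    {"price": 100.0, "performance": 50.0, "availability": 80.0, "reliability": 100.0}
)
```

Because the weights sum to 1, a row that scores 100 on every component receives an overall score of 100.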
Components
Price (40%)
On-demand $/hr from our scrapers (P039–P041). Inverted and scaled so the cheapest row in the corpus gets 100 and the most expensive gets 0. Spot / preemptible rates are tracked separately and reported as a second chart, not folded into the base score.
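One way to read the inversion is as a min-max scaling flipped so that lower prices score higher. The sketch below assumes the scaling is linear, which this page does not state explicitly; the function name `price_components` and the handling of a corpus where every row costs the same are also assumptions.

```python
def price_components(prices_per_hour: list[float]) -> list[float]:
    """Inverted min-max scaling: cheapest row -> 100, most expensive -> 0.

    Linear scaling is an assumption of this sketch.
    """
    lo, hi = min(prices_per_hour), max(prices_per_hour)
    if hi == lo:
        # Assumption: if every row has the same price, treat all as cheapest.
        return [100.0 for _ in prices_per_hour]
    return [100.0 * (hi - p) / (hi - lo) for p in prices_per_hour]


scaled = price_components([1.0, 2.0, 3.0])
```

With three rows priced at 1, 2, and 3 $/hr, the cheapest row gets 100, the most expensive gets 0, and the middle row lands halfway between them.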
Performance (30%)
Tokens/sec on a reference workload (Llama-3-8B, BF16, 128 tokens in / 128 tokens out, batch sizes 1 and 32), measured by our open benchmark harness. Where live telemetry isn't available yet (most providers at launch), we fall back to the GPU vendor's published FP16 TFLOPS × a fixed multiplier and flag the result on the detail page as "estimated".
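The fallback logic might look like the following sketch. The value of the fixed multiplier is not published on this page, so it is left as a required argument rather than guessed; `PerfReading` and `perf_reading` are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PerfReading:
    tokens_per_sec: float
    estimated: bool  # True when derived from vendor specs, not measured


def perf_reading(
    measured_tokens_per_sec: Optional[float],
    vendor_fp16_tflops: float,
    fixed_multiplier: float,  # hypothetical parameter; real value not stated here
) -> PerfReading:
    """Prefer a measured value; otherwise estimate from vendor FP16 TFLOPS."""
    if measured_tokens_per_sec is not None:
        return PerfReading(measured_tokens_per_sec, estimated=False)
    return PerfReading(vendor_fp16_tflops * fixed_multiplier, estimated=True)
```

Carrying the `estimated` flag alongside the number keeps the distinction between measured and spec-derived rows available wherever the value is displayed.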
Availability (20%)
Percentage of capacity probes in the last 24 hours that successfully returned a bookable instance. This replaces self-reported uptime numbers: what matters is whether you can actually get a GPU when you ask.
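A minimal sketch of the probe success rate, assuming each probe is recorded as a timestamp plus a success flag. The record shape, the function name, and the choice to return `None` when no probes fall inside the window are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def availability_component(
    probes: list[tuple[datetime, bool]], now: datetime
) -> Optional[float]:
    """Share of probes in the trailing 24 hours that succeeded, as 0-100."""
    cutoff = now - timedelta(hours=24)
    recent = [ok for ts, ok in probes if cutoff < ts <= now]
    if not recent:
        return None  # assumption: no probes in the window means no score
    return 100.0 * sum(recent) / len(recent)


now = datetime(2026, 4, 19, 12, 0, tzinfo=timezone.utc)
probes = [
    (now - timedelta(hours=1), True),
    (now - timedelta(hours=2), False),
    (now - timedelta(hours=3), True),
    (now - timedelta(hours=4), True),
    (now - timedelta(hours=30), False),  # outside the 24-hour window
]
availability = availability_component(probes, now)
```

In this example three of the four in-window probes succeeded, and the probe from 30 hours ago is ignored.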
Reliability (10%)
Penalty for provider-acknowledged incidents in the trailing 30 days, weighted by severity. A clean record scores 100; five or more major incidents score 0. Data is sourced from provider status pages and our crowdsourced incident feed.
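The page fixes only the two endpoints (a clean record scores 100, five or more major incidents score 0). The sketch below fills the gap with a linear penalty and illustrative severity weights; both the linear shape and the weight given to minor incidents are assumptions, not the published method.

```python
# Hypothetical severity weights, expressed in "major incident equivalents".
# Only the major endpoint is grounded in the text; the minor weight is illustrative.
SEVERITY_WEIGHTS = {"major": 1.0, "minor": 0.25}
MAJOR_EQUIVALENTS_FOR_ZERO = 5.0  # five or more major incidents score 0


def reliability_component(incident_severities: list[str]) -> float:
    """Linear penalty on severity-weighted incidents from the trailing 30 days."""
    # An unknown severity label raises KeyError in this sketch.
    load = sum(SEVERITY_WEIGHTS[s] for s in incident_severities)
    return max(0.0, 100.0 * (1.0 - load / MAJOR_EQUIVALENTS_FOR_ZERO))
```

Clamping at zero keeps a provider with more than five major incidents at 0 rather than letting the component go negative.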
Known limitations
- Regional coverage. Not every provider publishes pricing per region. Where region is unknown we score at the provider level and flag the row.
- Performance proxies. Until Intelligence v3 (Phase P098) lands with OTel-based real-world telemetry, the perf component is primarily vendor-spec driven and will over-index on peak-theoretical throughput.
- Small-sample reliability. A newly tracked provider can't have a 30-day incident history. We default their reliability component to the corpus median until they accumulate enough data.
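The small-sample default described in the last item can be sketched as follows. The 30-day threshold is an assumption borrowed from the trailing incident window, since this page does not say how much data counts as enough; the function and parameter names are hypothetical.

```python
from statistics import median
from typing import Optional


def reliability_with_default(
    own_score: Optional[float],
    tracked_days: int,
    corpus_scores: list[float],
    min_days: int = 30,  # assumption: matches the trailing 30-day window
) -> float:
    """Use the provider's own score once it has enough history, else the corpus median."""
    if own_score is not None and tracked_days >= min_days:
        return own_score
    # corpus_scores is assumed to hold only providers with a full history.
    return median(corpus_scores)
```

Restricting the median to providers with a full history avoids defaulted values feeding back into the default itself.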
Reproducibility & anti-gaming
Scrape times are randomised within each tier's cadence window (see Phase P138 in our roadmap). Performance benchmarks are double-blind: the provider does not know when the benchmark worker is scheduled. See our Independence & Objectivity Policy for the full disclosure of revenue sources and provider relationships.
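Randomising a scrape within its cadence window could be done roughly as below. This is a sketch under the assumption that each tier has a fixed-length window and that the offset is drawn uniformly; the function name and signature are hypothetical and do not describe the production scheduler.

```python
import random
from datetime import datetime, timedelta, timezone


def next_scrape_time(
    window_start: datetime, cadence: timedelta, rng: random.Random
) -> datetime:
    """Pick a uniformly random instant inside [window_start, window_start + cadence]."""
    offset_seconds = rng.uniform(0.0, cadence.total_seconds())
    return window_start + timedelta(seconds=offset_seconds)


start = datetime(2026, 4, 19, 0, 0, tzinfo=timezone.utc)
scheduled = next_scrape_time(start, timedelta(hours=6), random.Random())
```

Passing the random generator in explicitly makes the schedule reproducible in tests (by seeding it) while staying unpredictable in production.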
Changelog
- v1 (2026-04) — initial public release. 40/30/20/10 weights, 0–100 normalisation.