Methodology · v1

How we compute InferenceScore™

Last updated 2026-04-19. We publish this page (and the equivalent arXiv PDF) so that any third party can reproduce our scores from raw data. The open benchmark code lives at github.com/inferencebench/bench (MIT-licensed).

The formula

score = 0.40 * price_component
      + 0.30 * performance_component
      + 0.20 * availability_component
      + 0.10 * reliability_component

Each component is normalised to 0–100 against the current corpus of tracked GPU × provider × region rows.
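
As an illustration, the weighted sum above can be sketched in a few lines of Python. The function and dictionary names are ours for this example and are not taken from the benchmark repository; the sketch assumes each component has already been normalised to 0–100.

```python
# Weights as published in the formula above.
WEIGHTS = {
    "price": 0.40,
    "performance": 0.30,
    "availability": 0.20,
    "reliability": 0.10,
}


def inference_score(components: dict) -> float:
    """Combine four 0-100 components into one 0-100 score.

    `components` maps each key in WEIGHTS to an already-normalised value.
    The name and signature are illustrative assumptions.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 100.0:
            raise ValueError(f"{name} must be within 0-100")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Because the weights sum to 1.0, a row that scores 100 on every component scores 100 overall.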

Components

Price (40%)

On-demand $/hr from our scrapers (P039–P041). Inverted and scaled so the cheapest row in the corpus gets 100 and the most expensive gets 0. Spot / preemptible rates are tracked separately and reported as a second chart, not folded into the base score.
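
One way to read "inverted and scaled" is an inverted min-max normalisation. The sketch below assumes the scaling is linear between the cheapest and the most expensive row, which the text above does not state explicitly; the function name and the handling of a corpus with a single price are also assumptions.

```python
def price_components(prices_per_hour: list) -> list:
    """Map on-demand $/hr values to 0-100, cheapest = 100, priciest = 0.

    Assumes linear (min-max) scaling; only the two endpoints are
    specified by the methodology text.
    """
    if not prices_per_hour:
        return []
    lo, hi = min(prices_per_hour), max(prices_per_hour)
    if hi == lo:
        # Assumption: if every row has the same price, all rows get 100.
        return [100.0] * len(prices_per_hour)
    return [100.0 * (hi - p) / (hi - lo) for p in prices_per_hour]
```

For example, a corpus priced at $1.00, $2.00 and $3.00 per hour would map to 100, 50 and 0 under this linear assumption.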

Performance (30%)

Tokens/sec on a reference workload (Llama-3-8B, BF16, 128 input / 128 output tokens, batch sizes 1 and 32), measured by our open benchmark harness. Where live telemetry isn't available yet (most providers at launch), we fall back to the GPU vendor's published FP16 TFLOPS multiplied by a fixed multiplier; these rows are flagged on the detail page as "estimated".
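
The fallback logic can be sketched as follows. The value of the fixed multiplier is not given on this page, so the sketch takes it as a required parameter rather than guessing one; the type and function names are illustrative assumptions.

```python
from typing import NamedTuple, Optional


class PerfReading(NamedTuple):
    tokens_per_sec: float
    estimated: bool  # True when derived from vendor specs, not measured


def performance_reading(
    measured_tps: Optional[float],
    vendor_fp16_tflops: float,
    multiplier: float,
) -> PerfReading:
    """Prefer a measured tokens/sec value; otherwise estimate from specs.

    `multiplier` stands in for the fixed TFLOPS-to-tokens/sec factor,
    whose actual value is not published here (hypothetical parameter).
    """
    if measured_tps is not None:
        return PerfReading(measured_tps, estimated=False)
    return PerfReading(vendor_fp16_tflops * multiplier, estimated=True)
```

Carrying the `estimated` flag alongside the value keeps the "estimated" label on the detail page tied to the same code path that produced the number.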

Availability (20%)

Percentage of capacity probes in the last 24 hours that returned a bookable instance. This replaces self-reported uptime numbers: what matters is whether you can actually get a GPU when you ask.
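
A minimal sketch of the probe success rate, assuming each probe is stored as a (timestamp, success) pair. The data shape and function name are assumptions, and the sketch returns the raw percentage only; any further corpus-relative normalisation is out of scope here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def availability_pct(probes: list, now: datetime) -> Optional[float]:
    """Share of probes in the trailing 24 h that returned a bookable instance.

    `probes` is a list of (timestamp, succeeded) tuples (assumed shape).
    Returns None when there were no probes in the window.
    """
    cutoff = now - timedelta(hours=24)
    recent = [ok for ts, ok in probes if cutoff < ts <= now]
    if not recent:
        return None
    return 100.0 * sum(recent) / len(recent)


if __name__ == "__main__":
    now = datetime(2026, 4, 19, 12, 0, tzinfo=timezone.utc)
    sample = [(now - timedelta(hours=h), h % 2 == 0) for h in range(1, 5)]
    print(availability_pct(sample, now))
```

Returning None for an empty window, rather than 0 or 100, leaves the decision about rows with no probe data to the caller.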

Reliability (10%)

Penalty for provider-acknowledged incidents in the trailing 30 days, weighted by severity. A clean record scores 100; five or more major incidents score 0. Data is sourced from provider status pages and our crowdsourced incident feed.
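
The two stated endpoints (a clean record scores 100, five major incidents score 0) are consistent with a penalty of 20 points per major incident. The sketch below assumes the penalty is linear in between and uses a hypothetical weight for minor incidents; neither detail is specified on this page.

```python
# Relative severity weights. Only "major" is anchored by the text
# (five majors -> 0); the "minor" weight is a hypothetical placeholder.
SEVERITY_WEIGHTS = {"major": 1.0, "minor": 0.25}

POINTS_PER_MAJOR = 20.0  # 100 points / 5 major incidents


def reliability_component(incident_severities: list) -> float:
    """Score 0-100 from incident severities in the trailing 30 days.

    Assumes a linear penalty, clamped at 0.
    """
    load = sum(SEVERITY_WEIGHTS[s] for s in incident_severities)
    return max(0.0, 100.0 - POINTS_PER_MAJOR * load)
```

The clamp at 0 is what makes "five or more" major incidents all map to the same floor.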

Known limitations

Regional coverage. Not every provider publishes pricing per region. Where region is unknown we score at the provider level and flag the row.
Performance proxies. Until Intelligence v3 (Phase P098) lands with OTel-based real-world telemetry, the perf component is primarily vendor-spec driven and will over-index on peak-theoretical throughput.
Small-sample reliability. A newly tracked provider can't have a 30-day incident history, so we default its reliability component to the corpus median until it accumulates enough data.
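
The small-sample fallback can be sketched as below. The page does not say how much history counts as "enough data", so the 30-day threshold here is an assumption taken from the length of the trailing window; the function and parameter names are illustrative.

```python
from statistics import median


def reliability_with_fallback(
    own_score: float,
    days_tracked: int,
    corpus_scores: list,
    min_days: int = 30,  # assumed threshold, matching the 30-day window
) -> float:
    """Use the corpus median until a provider has enough history."""
    if days_tracked >= min_days or not corpus_scores:
        return own_score
    return median(corpus_scores)
```

With a corpus of established providers scoring 60, 80 and 100, a provider tracked for 3 days would receive 80 regardless of its own short record.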

Reproducibility & anti-gaming

Scrape times are randomised within each tier's cadence window (see Phase P138 in our roadmap). Performance benchmarks are double-blind: the provider does not know when the benchmark worker is scheduled. See our Independence & Objectivity Policy for the full disclosure of revenue sources and provider relationships.
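
Randomising a scrape time within a cadence window might look like the sketch below. It assumes a uniform draw over the window, which this page does not specify; the function name and parameters are ours, not taken from the roadmap or the scrapers.

```python
import random
from datetime import datetime, timedelta
from typing import Optional


def next_scrape_time(
    window_start: datetime,
    cadence: timedelta,
    rng: Optional[random.Random] = None,
) -> datetime:
    """Pick a scrape time uniformly at random inside one cadence window.

    Pass a seeded `random.Random` for reproducible schedules in tests.
    """
    if rng is None:
        rng = random.Random()
    offset_seconds = rng.uniform(0.0, cadence.total_seconds())
    return window_start + timedelta(seconds=offset_seconds)
```

Drawing a fresh offset per window keeps the average cadence unchanged while making the exact scrape moment hard for a provider to anticipate.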

Changelog

v1 (2026-04) — initial public release. 40/30/20/10 weights, 0–100 normalisation.