Methodology · v1

How we compute InferenceScore™

Last updated 2026-04-19. We publish this page (and the equivalent arXiv PDF) so that any third party can reproduce our scores from raw data. The open benchmark code lives at github.com/inferencebench/bench (MIT-licensed).

The formula

score = 0.40 * price_component
      + 0.30 * performance_component
      + 0.20 * availability_component
      + 0.10 * reliability_component

Each component is normalised to 0–100 against the current corpus of tracked GPU × provider × region rows.
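As a minimal sketch of the formula above, the weighted sum can be written as follows. The function name `inference_score`, the `WEIGHTS` dictionary, and the example component values are illustrative assumptions, not names or data from the published benchmark code.

```python
# Weights as stated in the formula; the key names are illustrative.
WEIGHTS = {
    "price": 0.40,
    "performance": 0.30,
    "availability": 0.20,
    "reliability": 0.10,
}


def inference_score(components: dict[str, float]) -> float:
    """Weighted sum of four components, each already normalised to 0-100."""
    for name in WEIGHTS:
        value = components[name]
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up component values for a single GPU x provider x region row.
example = {
    "price": 80.0,
    "performance": 60.0,
    "availability": 95.0,
    "reliability": 100.0,
}
print(round(inference_score(example), 2))  # 32 + 18 + 19 + 10 = 79.0
```

Because the weights sum to 1.0 and every component lies in 0–100, the resulting score also lies in 0–100.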

Components

Price (40%)

On-demand $/hr from our scrapers (P039–P041). Inverted and scaled so the cheapest row in the corpus gets 100 and the most expensive gets 0. Spot / preemptible rates are tracked separately and reported as a second chart, not folded into the base score.
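One way to read "inverted and scaled" is an inverted min-max normalisation. The sketch below assumes the scaling is linear between the cheapest and most expensive row, which the text does not state explicitly; the function name, row keys, and prices are hypothetical.

```python
def price_component(prices_per_hour: dict[str, float]) -> dict[str, float]:
    """Map on-demand $/hr to 0-100: cheapest row -> 100, most expensive -> 0.

    Assumes linear (min-max) scaling between the two extremes.
    """
    lowest = min(prices_per_hour.values())
    highest = max(prices_per_hour.values())
    if highest == lowest:
        # Assumption: if every row costs the same, treat them all as cheapest.
        return {row: 100.0 for row in prices_per_hour}
    span = highest - lowest
    return {
        row: 100.0 * (highest - price) / span
        for row, price in prices_per_hour.items()
    }


# Hypothetical rows and prices, for illustration only.
corpus = {"row-a": 1.00, "row-b": 2.50, "row-c": 4.00}
print(price_component(corpus))
```

Under this reading, a row priced exactly halfway between the extremes scores 50.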

Performance (30%)

Tokens/sec on a reference workload (Llama-3-8B, BF16, 128 input / 128 output tokens, batch sizes 1 and 32), measured by our open benchmark harness. Where live telemetry isn't available yet (true of most providers at launch), we fall back to the GPU vendor's published FP16 TFLOPS multiplied by a fixed multiplier; such rows are flagged on the detail page as "estimated".
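The measured-or-estimated fallback could be sketched roughly as below. The value of the fixed multiplier is not given on this page, so `ASSUMED_MULTIPLIER` is a placeholder, and the `PerfReading` type and function name are hypothetical. How the batch-size-1 and batch-size-32 results are combined is also not specified here, so the sketch takes a single tokens/sec figure as input.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder only: the real fixed multiplier is not published on this page.
ASSUMED_MULTIPLIER = 10.0


@dataclass
class PerfReading:
    tokens_per_sec: float
    estimated: bool  # True when derived from vendor TFLOPS, not measured


def performance_reading(
    measured_tokens_per_sec: Optional[float],
    vendor_fp16_tflops: float,
    multiplier: float = ASSUMED_MULTIPLIER,
) -> PerfReading:
    """Prefer a measured benchmark result; otherwise estimate from TFLOPS."""
    if measured_tokens_per_sec is not None:
        return PerfReading(measured_tokens_per_sec, estimated=False)
    return PerfReading(vendor_fp16_tflops * multiplier, estimated=True)


# Made-up inputs: one row with live telemetry, one without.
print(performance_reading(1234.5, vendor_fp16_tflops=300.0))
print(performance_reading(None, vendor_fp16_tflops=300.0))
```

Carrying the `estimated` flag alongside the number keeps the fallback visible to whatever renders the detail page.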

Availability (20%)

Percentage of capacity probes in the last 24 hours that successfully returned a bookable instance. This replaces self-reported uptime numbers: what matters is whether you can actually get a GPU when you ask.
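A minimal sketch of the probe success rate, assuming each probe is stored as a (timestamp, succeeded) pair. That storage shape, the function name, and the choice to return `None` when no probes fall inside the window are assumptions, not details from the benchmark code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def availability_component(
    probes: list[tuple[datetime, bool]],
    now: datetime,
) -> Optional[float]:
    """Percentage of probes in the trailing 24 hours that got an instance."""
    cutoff = now - timedelta(hours=24)
    recent = [succeeded for ts, succeeded in probes if cutoff <= ts <= now]
    if not recent:
        return None  # Assumption: no probes in the window means no score.
    return 100.0 * sum(recent) / len(recent)


# Made-up probe log: three successes and one failure inside the window,
# plus one stale probe that falls outside it.
now = datetime(2026, 4, 19, 12, 0, tzinfo=timezone.utc)
probes = [
    (now - timedelta(hours=1), True),
    (now - timedelta(hours=5), True),
    (now - timedelta(hours=9), False),
    (now - timedelta(hours=20), True),
    (now - timedelta(hours=30), False),
]
print(availability_component(probes, now))  # 3 of 4 recent probes -> 75.0
```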

Reliability (10%)

Penalty for provider-acknowledged incidents in the trailing 30 days, weighted by severity. A clean record scores 100; five or more major incidents score 0. Data is sourced from provider status pages and our crowdsourced incident feed.
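The two fixed points given above (a clean record scores 100, five or more major incidents score 0) are consistent with a linear penalty of 20 points per major incident. The sketch below assumes that linear shape and invents a weight for a "minor" severity purely for illustration; the actual severity levels and weights are not stated on this page.

```python
# "major" = 1.0 follows from the two fixed points in the text;
# the "minor" level and its weight are hypothetical.
SEVERITY_WEIGHTS = {"major": 1.0, "minor": 0.25}
POINTS_PER_MAJOR = 20.0  # 100 points / 5 major incidents


def reliability_component(incident_severities: list[str]) -> float:
    """Score 0-100 from incidents in the trailing 30 days (assumed linear)."""
    weighted = sum(SEVERITY_WEIGHTS[s] for s in incident_severities)
    return max(0.0, 100.0 - POINTS_PER_MAJOR * weighted)


print(reliability_component([]))                  # clean record
print(reliability_component(["major", "minor"]))  # one major, one minor
print(reliability_component(["major"] * 5))       # floor reached
```

Clamping at zero means a sixth or seventh major incident does not push the component negative.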

Known limitations

Reproducibility & anti-gaming

Scrape times are randomised within each tier's cadence window (see Phase P138 in our roadmap). Performance benchmarks are double-blind: the provider does not know when the benchmark worker is scheduled. See our Independence & Objectivity Policy for the full disclosure of revenue sources and provider relationships.
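Randomising a scrape time within a cadence window might look like the sketch below. It assumes a uniform draw over the window and uses a hypothetical six-hour cadence; the actual tier cadences and the scheduler's implementation are not described on this page.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_scrape_time(
    window_start: datetime,
    cadence: timedelta,
    rng: Optional[random.Random] = None,
) -> datetime:
    """Pick a uniformly random instant inside one cadence window."""
    # SystemRandom draws from the OS entropy source, so the schedule
    # cannot be reproduced by guessing a seed.
    rng = rng or random.SystemRandom()
    offset_seconds = rng.uniform(0.0, cadence.total_seconds())
    return window_start + timedelta(seconds=offset_seconds)


# Hypothetical tier with a six-hour cadence window.
start = datetime(2026, 4, 19, 0, 0, tzinfo=timezone.utc)
print(next_scrape_time(start, timedelta(hours=6)))
```

Passing a seeded `random.Random` as `rng` makes the draw repeatable for testing, while the default keeps production schedules unpredictable.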

Changelog