Recipe: verify and replay
Reproducibility is the product's moat. Anyone with a signed envelope and access to compatible hardware should be able to (a) verify the envelope's signature and content hash and (b) re-run the benchmark to produce a new envelope they can diff against the original.
1. Fetch
Grab the envelope you want to scrutinise. Hugging Face Hub, an HTTPS mirror, or a local file all work:
bench fetch hf://datasets/Yobitel/llama-3.1-8b__chatbot-short__abcdef123456
The fetched payload is validated against the Envelope schema before the command declares success. The local cache lives at ~/.cache/inferencebench/fetched/.
2. Verify
bench verify ~/.cache/inferencebench/fetched/3f9c1a2b8e7d.json
bench verify recomputes the content hash from the envelope minus the signature block, then validates the signature. Sigstore keyless and dev ed25519 keys are both supported. Any mismatch is a hard failure:
FAIL ~/.cache/inferencebench/fetched/3f9c1a2b8e7d.json
method: cosign-dev
reason: content hash mismatch (stored=60be8efd6d21..., recomputed=9c2f0a14...)
There are no warnings — verification either passes or it doesn't.
3. Replay
The envelope records every input needed to re-run the benchmark (suite id, model, engine, dataset, seed, quantization, SLO template). What it deliberately omits is the live engine endpoint, because that's host-specific. Point bench replay at your own engine:
bench replay ~/.cache/inferencebench/fetched/3f9c1a2b8e7d.json \
--base-url http://localhost:8000/v1 \
--output ./replay-results
The command verifies the source envelope first (refuses to replay an unverified envelope unless you pass --no-verify), spins up the same plugin configuration, and produces a new signed envelope. The replay summary table shows source vs. replay side by side for the identity fields and headline metrics.
4. Diff the replay
bench diff \
~/.cache/inferencebench/fetched/3f9c1a2b8e7d.json \
./replay-results/<hash>.json
If the replay landed on substantially different numbers, the diff table will surface it. Acceptable cross-host variation is a few-percent band; anything larger usually points at a hardware difference, an engine version mismatch, or a dataset hash drift — all of which the envelope captures and the diff context-match block will warn about.
Why this matters
Most "benchmarks" in the wild are screenshots. A signed envelope is a verifiable contract:
- The hardware fingerprint says exactly what silicon ran the test.
- The software provenance pins the engine version, CUDA toolkit, driver, kernel.
- The dataset hash makes "the harness used a different ShareGPT subset" detectable, not invisible.
- The Sigstore signature makes tampering with any of the above detectable.
Anyone who can run bench verify + bench replay + bench diff can independently check whether a published number holds up on their hardware.