Gemma 4 31B-IT
Google · dense · 31B parameters · 32,768 context
Parameters
31B
Context Window
32K tokens
Architecture
Dense
Best GPU
H20
Cheapest API
$0.00/M
Quality Score
77/100
Intelligence Brief
Gemma 4 31B-IT is a 31B-parameter dense model from Google that uses Grouped Query Attention (GQA), with 62 layers and a hidden dimension of 4,096. It has a 32,768-token context window and supports tools, vision, structured output, code, math, multilingual use, and reasoning. On standardized benchmarks it scores MMLU 83, HumanEval 68, and GSM8K 89.5. The cheapest listed API deployment is featherless at $0.00/M output tokens. For self-hosted inference, the H20 is the top-ranked GPU at an estimated $940/month.
Provider pricing
6 providers · canonical: google

| Provider | Input $/M | Output $/M | Notes |
|---|---|---|---|
| featherless | free | free | cheapest input · cheapest output |
| google (canonical) | $0.150 | $0.300 | — |
| openrouter | $0.120 | $0.370 | — |
| deepinfra | $0.130 | $0.380 | — |
| novita | $0.140 | $0.400 | — |
| together | $0.200 | $0.500 | — |
Prices update via the nightly pricing cron + admin approvals at /admin/ingest-queue. The leaderboard's Input/Output cells show the canonical rate above; this table shows the full spread.
Related models
5 suggestions
Gemma 2 27B · Gemma 2 · 27B · free/M out
Gemma 3 27B · Gemma 3 · 27B · free/M out
Gemini 1.5 Pro · Gemini · 40B · $5.00/M out
InternLM3 8B · InternLM · 8B · no price listed
Nemotron-3 Super 120B · Nemotron · 120B · $0.450/M out
Picks: same family first, then same vendor within ±2× params, then top tag-overlap matches. Price shown is the cheapest Output $/M across providers — the row's page shows the canonical anchor.
Architecture Details
Memory Requirements
BF16 Weights
62.0 GB
FP8 Weights
31.0 GB
INT4 Weights
15.5 GB
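The three weight figures follow from multiplying the parameter count by the storage size of one parameter. A minimal sketch, assuming exactly 31e9 parameters, decimal gigabytes (1 GB = 1e9 bytes), and no quantization overhead such as scales or zero points; the function name is illustrative and not part of any tool referenced on this page:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Assumption: the "31B" headline figure is taken as exactly 31e9 parameters.
    for precision in ("bf16", "fp8", "int4"):
        print(f"{precision}: {weight_memory_gb(31e9, precision):.1f} GB")
```

These numbers cover weights only; KV cache and activations come on top.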
GPU Compatibility Matrix
Gemma 4 31B-IT is compatible with 62% of GPU configurations across 41 GPUs at 3 precision levels.
GPU Recommendations
| Config | Score | Throughput | Latency (ITL) | Est. TTFT | Cost/Month | Cost/M Tokens |
|---|---|---|---|---|---|---|
| FP8 · 1 GPU · tensorrt-llm | 100/100 | 1.1K tok/s | 1.0ms | 0ms | $940 | $0.34 |
| FP8 · 1 GPU · tensorrt-llm | 95/100 | 1.1K tok/s | 1.0ms | 0ms | $2553 | $0.93 |
| FP8 · 1 GPU · tensorrt-llm | 95/100 | 904.0 tok/s | 1.1ms | 0ms | $1794 | $0.75 |
Deployment Options
API Deployment: featherless · $0.00/M output tokens
Single GPU: H20 · $940/mo · Min VRAM: 31 GB
Multi-GPU: A10G x4 · 172.7 tok/s · TP · $1139/mo
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| featherless | $0.00 | $0.00 | Cheapest |
| google | $0.15 | $0.30 | |
| openrouter | $0.12 | $0.37 | |
| deepinfra | $0.13 | $0.38 | |
| novita | $0.14 | $0.40 | |
| together | $0.20 | $0.50 | |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| featherless (Best Value) | $0.00 | $0.00 | $0 |
| google | $0.15 | $0.30 | $2 |
| openrouter | $0.12 | $0.37 | $2 |
| deepinfra | $0.13 | $0.38 | $3 |
| novita | $0.14 | $0.40 | $3 |
| together | $0.20 | $0.50 | $4 |
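The page does not state the workload behind the ~Monthly Cost column. As a rough sketch, the usual formula is input volume times input price plus output volume times output price. The workload of 5M input and 5M output tokens per month used below is an assumption, chosen only because it is consistent with the rounded dollar figures in the table; the function name is illustrative.

```python
# (input $/M, output $/M) per provider, copied from the pricing tables.
PRICES = {
    "featherless": (0.00, 0.00),
    "google": (0.15, 0.30),
    "openrouter": (0.12, 0.37),
    "deepinfra": (0.13, 0.38),
    "novita": (0.14, 0.40),
    "together": (0.20, 0.50),
}


def monthly_cost(provider: str, input_tokens_m: float, output_tokens_m: float) -> float:
    """Monthly API cost in USD for volumes given in millions of tokens."""
    in_price, out_price = PRICES[provider]
    return input_tokens_m * in_price + output_tokens_m * out_price


if __name__ == "__main__":
    # Assumed workload: 5M input + 5M output tokens per month (not from the page).
    for name in PRICES:
        print(f"{name}: ${monthly_cost(name, 5, 5):.2f}")
```

Swapping in your own token volumes gives a like-for-like comparison across providers.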
Cost per 1,000 Requests
Short (500 tok): $0.00 via featherless
Medium (2K tok): $0.00 via featherless
Long (8K tok): $0.00 via featherless
Performance Estimates
Throughput by GPU
VRAM Breakdown (H20, FP8)
Precision Impact
bf16: 62.0 GB weights/GPU
fp8: 31.0 GB weights/GPU · ~1.1K tok/s
int4: 15.5 GB weights/GPU
Similar Models
| Model | Params | Architecture | Quality | Price |
|---|---|---|---|---|
| Qwen 3 30B-A3B | 30.5B | moe | 70 | from $0.45/M |
| JAIS 30B | 30B | dense | 50 | |
| MPT 30B | 30B | dense | 48 | |
| Claude Haiku 4.5 | 30B | moe | 50 | from $5.00/M |
| Qwen 2.5 32B | 32.5B | dense | 73 | from $0.80/M |
Frequently Asked Questions
How much VRAM does Gemma 4 31B-IT need for inference?
Gemma 4 31B-IT requires approximately 62.0 GB of VRAM for weights at BF16 precision, 31.0 GB at FP8, or 15.5 GB with INT4 quantization. The KV cache (507,904 bytes per token) and activations (~1.80 GB) need additional VRAM on top of the weights.
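To see how the pieces add up, the sketch below combines the weight size, the per-token KV-cache figure, and the activation estimate quoted in this answer. It assumes a single sequence filling the full 32,768-token context, decimal gigabytes, and a KV-cache size that does not change with the weight precision; the function names are illustrative.

```python
KV_BYTES_PER_TOKEN = 507_904   # from this page
CONTEXT_TOKENS = 32_768        # full context window
ACTIVATIONS_GB = 1.80          # approximate figure from this page
WEIGHTS_GB = {"bf16": 62.0, "fp8": 31.0, "int4": 15.5}


def kv_cache_gb(tokens: int, sequences: int = 1) -> float:
    """KV-cache size in decimal GB for `sequences` concurrent sequences."""
    return KV_BYTES_PER_TOKEN * tokens * sequences / 1e9


def total_vram_gb(precision: str, tokens: int = CONTEXT_TOKENS, sequences: int = 1) -> float:
    """Weights + KV cache + activations, in decimal GB."""
    return WEIGHTS_GB[precision] + kv_cache_gb(tokens, sequences) + ACTIVATIONS_GB


if __name__ == "__main__":
    print(f"KV cache at full context: {kv_cache_gb(CONTEXT_TOKENS):.2f} GB")
    for precision in WEIGHTS_GB:
        print(f"{precision}: {total_vram_gb(precision):.2f} GB total")
```

Under these assumptions a full-context sequence adds roughly 16.6 GB of KV cache, so the FP8 total comes to about 49.4 GB rather than the 31 GB needed for weights alone. Serving several sequences at once scales the KV-cache term linearly.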
What is the best GPU for Gemma 4 31B-IT?
The top recommended GPU for Gemma 4 31B-IT is the H20 using FP8 precision. It achieves approximately 1.1K tokens/sec at an estimated cost of $940/month ($0.34/M tokens). Score: 100/100.
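The cost-per-million-tokens figure can be approximated from the monthly cost and the throughput. The page does not publish its formula, so the sketch below is an assumption: it uses 730 hours per month and 100% utilization by default, and the function name is illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def cost_per_million_tokens(
    monthly_cost_usd: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """USD per million generated tokens at a sustained throughput."""
    tokens_per_month = tokens_per_second * 3600 * HOURS_PER_MONTH * utilization
    return monthly_cost_usd / (tokens_per_month / 1e6)


if __name__ == "__main__":
    # H20 figures from this page: $940/month, about 1.1K tok/s.
    print(f"${cost_per_million_tokens(940, 1100):.2f}/M tokens")
```

With these inputs the result is about $0.33/M, close to the $0.34/M quoted here. The gap may come from rounding in the displayed throughput or from a different utilization assumption. Real deployments rarely sustain full utilization, and the per-token cost rises in proportion as utilization falls.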
How much does Gemma 4 31B-IT inference cost?
Gemma 4 31B-IT API inference starts from $0.00/M input tokens and $0.00/M output tokens. Self-hosted inference costs depend on your GPU configuration; use our ROI calculator for a detailed breakdown.