Gemma 4 31B-IT
Google · dense · 31B parameters · 32,768 context
Parameters
31B
Context Window
32K tokens
Architecture
Dense
Best GPU
H20
Cheapest API
$0.30/M
Quality Score
77/100
Intelligence Brief
Gemma 4 31B-IT is a 31B-parameter dense model from Google, using Grouped Query Attention (GQA) across 62 layers with a hidden dimension of 4,096. With a 32,768-token context window, it supports tool use, vision, structured output, code, math, multilingual tasks, and reasoning. On standardized benchmarks it scores 83 on MMLU, 68 on HumanEval, and 89.5 on GSM8K. The most cost-effective API deployment is via Google at $0.30/M output tokens; for self-hosted inference, the H20 delivers the best throughput at $940/month.
Architecture Details
Memory Requirements
BF16 Weights
62.0 GB
FP8 Weights
31.0 GB
INT4 Weights
15.5 GB
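The per-precision figures above follow directly from parameter count × bytes per weight; a minimal sketch (the helper name is ours, the numbers come from the table):

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Byte widths: BF16 = 2 bytes, FP8 = 1 byte, INT4 = 0.5 bytes.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in decimal GB for a dense model."""
    return params_billion * BYTES_PER_PARAM[precision]

for p in ("bf16", "fp8", "int4"):
    print(p, weight_memory_gb(31, p), "GB")
# bf16 62.0 GB, fp8 31.0 GB, int4 15.5 GB
```

This covers weights only; KV-cache and activation memory come on top (see the FAQ below for the per-token figures).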
GPU Compatibility Matrix
Gemma 4 31B-IT is compatible with 62% of GPU configurations across 41 GPUs at 3 precision levels.
GPU Recommendations
FP8 · 1 GPU · tensorrt-llm
100/100
score
Throughput
1.1K tok/s
Latency (ITL)
1.0ms
Est. TTFT
0ms
Cost/Month
$940
Cost/M Tokens
$0.34
FP8 · 1 GPU · tensorrt-llm
95/100
score
Throughput
1.1K tok/s
Latency (ITL)
1.0ms
Est. TTFT
0ms
Cost/Month
$2,553
Cost/M Tokens
$0.93
FP8 · 1 GPU · tensorrt-llm
95/100
score
Throughput
904.0 tok/s
Latency (ITL)
1.1ms
Est. TTFT
0ms
Cost/Month
$1,794
Cost/M Tokens
$0.75
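The Cost/M Tokens figures can be approximated from the monthly cost and sustained throughput. A sketch, assuming ~730 hours per month and ~95% average utilization (the utilization factor is our guess to match the table, not a published number):

```python
SECONDS_PER_MONTH = 730 * 3600  # roughly one month of wall-clock time

def cost_per_million_tokens(monthly_usd: float, tok_per_s: float,
                            utilization: float = 0.95) -> float:
    """Amortize a fixed monthly GPU cost over the tokens actually generated."""
    tokens_millions = tok_per_s * utilization * SECONDS_PER_MONTH / 1e6
    return monthly_usd / tokens_millions

print(round(cost_per_million_tokens(940, 1100), 2))   # first config: 0.34
print(round(cost_per_million_tokens(2553, 1100), 2))  # second config: 0.93
```

Under these assumptions the first two rows reproduce; the third row comes out slightly higher than the listed $0.75, so the exact utilization behind each row may differ.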
Deployment Options
API Deployment
$0.30/M
output tokens
Single GPU
H20
$940/mo
Min VRAM: 31 GB
Multi-GPU
A10G x4
172.7 tok/s
TP · $1,139/mo
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| google | $0.15 | $0.30 | Cheapest |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| google (Best Value) | $0.15 | $0.30 | $2 |
Cost per 1,000 Requests
Short (500 tok)
$0.14
via google
Medium (2K tok)
$0.54
via google
Long (8K tok)
$1.80
via google
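These per-1,000-request figures fall out of the google pricing above once a request's tokens are split into input and output. The split behind the table isn't stated; a 20/80 input/output split reproduces the short and medium rows, while the long-context row implies a heavier input share, so treat the split as a knob:

```python
INPUT_USD_PER_M, OUTPUT_USD_PER_M = 0.15, 0.30  # google pricing above

def cost_per_1k_requests(total_tokens: int, input_frac: float = 0.2) -> float:
    """USD cost of 1,000 requests of `total_tokens` each, split input/output."""
    inp = total_tokens * input_frac
    out = total_tokens - inp
    per_request = inp * INPUT_USD_PER_M / 1e6 + out * OUTPUT_USD_PER_M / 1e6
    return 1000 * per_request

print(round(cost_per_1k_requests(2000), 2))  # medium (2K tok): 0.54
```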
Performance Estimates
Throughput by GPU
VRAM Breakdown (H20, FP8)
Precision Impact
| Precision | Weights/GPU | Throughput (H20) |
|---|---|---|
| BF16 | 62.0 GB | — |
| FP8 | 31.0 GB | ~1.1K tok/s |
| INT4 | 15.5 GB | — |
Quality Benchmarks
Capabilities
Features
Supported Frameworks
Supported Precisions
Where to Deploy Gemma 4 31B-IT
Self-Hosted Infrastructure
Similar Models
Qwen 3 30B-A3B
30.5B params · moe
Quality: 70
JAIS 30B
30B params · dense
Quality: 50
MPT 30B
30B params · dense
Quality: 48
Qwen 2.5 32B
32.5B params · dense
Quality: 73
from $0.80/M
Qwen 2.5 Coder 32B
32.5B params · dense
Quality: 80
from $0.80/M
Frequently Asked Questions
How much VRAM does Gemma 4 31B-IT need for inference?
Gemma 4 31B-IT requires approximately 62.0 GB of VRAM at BF16 precision, 31.0 GB at FP8, or 15.5 GB at INT4 quantization. Additional VRAM is needed for KV-cache (507,904 bytes per token) and activations (~1.80 GB).
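Putting those three components together gives a total serving footprint. A sketch using the figures from the answer above (decimal GB throughout; the helper is ours):

```python
# Total serving footprint ~= weights + KV cache + activations.
KV_BYTES_PER_TOKEN = 507_904   # per-token KV-cache cost from the answer above
ACTIVATIONS_GB = 1.80

def serving_vram_gb(weights_gb: float, context_tokens: int) -> float:
    """Estimate total VRAM (decimal GB) for serving one sequence."""
    kv_gb = KV_BYTES_PER_TOKEN * context_tokens / 1e9
    return weights_gb + kv_gb + ACTIVATIONS_GB

# FP8 weights with the full 32,768-token context:
print(round(serving_vram_gb(31.0, 32_768), 1))  # ~49.4 GB
```

At FP8, a full-context sequence therefore fits comfortably in a single H20's VRAM; batching multiplies the KV-cache term per concurrent sequence.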
What is the best GPU for Gemma 4 31B-IT?
The top recommended GPU for Gemma 4 31B-IT is the H20 using FP8 precision. It achieves approximately 1.1K tokens/sec at an estimated cost of $940/month ($0.34/M tokens). Score: 100/100.
How much does Gemma 4 31B-IT inference cost?
Gemma 4 31B-IT API inference starts from $0.15/M input tokens and $0.30/M output tokens. Self-hosted inference costs depend on your GPU configuration — use our ROI calculator for a detailed breakdown.