Gemma 3 1B
Google · dense · 1B parameters · 32,768-token context
Parameters
1B
Context Window
32K tokens
Architecture
Dense
Best GPU
RTX 3080
Quality Score
35/100
Intelligence Brief
Gemma 3 1B is a 1B-parameter dense model from Google featuring Grouped Query Attention (GQA), with 26 layers and a hidden dimension of 1,536. Its 32,768-token context window supports structured output, code generation, and multilingual tasks. On standardized benchmarks it scores MMLU 42, HumanEval 18, and GSM8K 32. For self-hosted inference, the RTX 3080 delivers optimal throughput at an estimated $133/month.
Architecture Details
Memory Requirements
BF16 Weights
2.0 GB
FP8 Weights
1.0 GB
INT4 Weights
0.5 GB
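The weight figures above follow directly from parameter count times bytes per parameter. A minimal sketch of that arithmetic (assuming an even 1B parameters; the true count differs slightly):

```python
# Sketch: weights-only memory from parameter count and precision.
# PARAMS is an assumption -- the marketing "1B" is rounded.
PARAMS = 1_000_000_000

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(params: int, precision: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes, matching the table)."""
    return params * BYTES_PER_PARAM[precision] / 1e9

for p in BYTES_PER_PARAM:
    print(f"{p}: {weight_memory_gb(PARAMS, p):.1f} GB")
# bf16: 2.0 GB, fp8: 1.0 GB, int4: 0.5 GB
```

Note this covers weights only; the KV cache and activations add to the total (see the FAQ).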
GPU Compatibility Matrix
Gemma 3 1B fits on every tested configuration: all 41 GPUs at each of the 3 precision levels (100% compatibility).
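A compatibility check like this reduces to a fit test: weights plus KV cache plus activations must not exceed a GPU's VRAM. A sketch under assumed overheads (the 26,624 bytes/token and ~0.20 GB activation figures come from this card's FAQ; the fit rule itself is our simplification):

```python
# Sketch of the fit check behind a GPU compatibility matrix.
WEIGHTS_GB = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}  # from the table above
KV_BYTES_PER_TOKEN = 26_624   # per the FAQ on this card
ACTIVATIONS_GB = 0.20         # per the FAQ on this card

def fits(vram_gb: float, precision: str, context_tokens: int = 32_768) -> bool:
    """True if weights + full-context KV cache + activations fit in VRAM."""
    kv_gb = KV_BYTES_PER_TOKEN * context_tokens / 1e9
    return WEIGHTS_GB[precision] + kv_gb + ACTIVATIONS_GB <= vram_gb

# An RTX 3080's 10 GB easily holds BF16 weights (2.0 GB) plus a full
# 32K-token KV cache (~0.87 GB) plus activations:
print(fits(10, "bf16"))  # True
```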
GPU Recommendations
RTX 3080 · BF16 · 1 GPU · vLLM
90/100
score
Throughput
2.1K tok/s
Latency (ITL)
0.5ms
Est. TTFT
0ms
Cost/Month
$133
Cost/M Tokens
$0.02
BF16 · 1 GPU · vllm
90/100
score
Throughput
734.4 tok/s
Latency (ITL)
1.4ms
Est. TTFT
0ms
Cost/Month
$209
Cost/M Tokens
$0.11
BF16 · 1 GPU · vllm
90/100
score
Throughput
1.2K tok/s
Latency (ITL)
0.8ms
Est. TTFT
0ms
Cost/Month
$85
Cost/M Tokens
$0.03
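The Cost/M Tokens column is consistent with monthly cost divided by tokens generated at sustained throughput. A sketch of that arithmetic, assuming a 30-day month at full utilization:

```python
def cost_per_million_tokens(monthly_usd: float, tok_per_s: float) -> float:
    """USD per 1M tokens at full utilization (30-day month assumed)."""
    seconds_per_month = 30 * 24 * 3600          # 2,592,000 s
    tokens_per_month = tok_per_s * seconds_per_month
    return monthly_usd / (tokens_per_month / 1e6)

print(round(cost_per_million_tokens(133, 2100), 2))  # 0.02 (top config above)
```

Real utilization is rarely 100%, so effective per-token costs are higher in practice.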
Deployment Options
API Deployment
No API pricing available
Single GPU
RTX 3080
$133/mo
Min VRAM: 1 GB
Multi-GPU
RTX 3080
2.1K tok/s
Best available config
Performance Estimates
Throughput by GPU
VRAM Breakdown (RTX 3080, BF16)
Precision Impact
BF16
2.0 GB
weights/GPU
~2.1K tok/s
FP8
1.0 GB
weights/GPU
INT4
0.5 GB
weights/GPU
Where to Deploy Gemma 3 1B
Self-Hosted Infrastructure
Similar Models
Gemma 3 2B
2B params · dense
Quality: 42
Canary 1B
1B params · dense
Quality: 50
from $0.04/M
Llama Guard 3 1B
1B params · dense
Quality: 50
Falcon 3 1B
1B params · dense
Quality: 50
CSM-1B
1B params · dense
Quality: 50
Frequently Asked Questions
How much VRAM does Gemma 3 1B need for inference?
Gemma 3 1B requires approximately 2.0 GB of VRAM at BF16 precision, 1.0 GB at FP8, or 0.5 GB at INT4 quantization. Additional VRAM is needed for the KV cache (26,624 bytes per token) and activations (~0.20 GB).
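That 26,624 bytes/token figure matches standard KV-cache arithmetic for a 26-layer model at 2 bytes per element, assuming one KV head with head dimension 256 (a hypothetical breakdown; this page does not state the KV-head count or head dimension):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Per-token KV-cache bytes: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

print(kv_bytes_per_token(26, 1, 256))  # 26624
```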
What is the best GPU for Gemma 3 1B?
The top recommended GPU for Gemma 3 1B is the RTX 3080 at BF16 precision. It achieves approximately 2.1K tokens/sec at an estimated cost of $133/month ($0.02/M tokens), scoring 90/100.
How much does Gemma 3 1B inference cost?
Gemma 3 1B inference costs vary by provider and GPU setup; among the self-hosted configurations above, estimates range from $85/month to $209/month, with the top-scoring RTX 3080 setup at $133/month ($0.02/M tokens). Use our calculator for detailed cost estimates across all providers.