Gemma 3 2B
Google · dense · 2B parameters · 8,192 context
Parameters
2B
Context Window
8K tokens
Architecture
Dense
Best GPU
RTX 4060
Quality Score
42/100
Intelligence Brief
Gemma 3 2B is a 2B-parameter dense model from Google, featuring Grouped Query Attention (GQA) with 26 layers and a hidden dimension of 2,304. With an 8,192-token context window, it supports structured output, code, math, and multilingual text. On standardized benchmarks it scores MMLU 50, HumanEval 22, and GSM8K 42. For self-hosted inference, the RTX 4060 delivers optimal throughput at $209/month.
Architecture Details
Memory Requirements
BF16 Weights
4.0 GB
FP8 Weights
2.0 GB
INT4 Weights
1.0 GB
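These weight sizes follow the usual rule of thumb for dense models: parameter count times bytes per parameter. A minimal sketch reproducing the table above (2B parameters and decimal gigabytes are assumed from this page):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in decimal GB: params x bits / 8 bits per byte."""
    return n_params * bits_per_param / 8 / 1e9

# 2B parameters at BF16 (16-bit), FP8 (8-bit), and INT4 (4-bit)
for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(2e9, bits):.1f} GB")
# BF16: 4.0 GB / FP8: 2.0 GB / INT4: 1.0 GB
```

Note this covers weights only; KV cache and activations add on top (see the FAQ below for those components).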
GPU Compatibility Matrix
Gemma 3 2B is compatible with 100% of evaluated GPU configurations: 41 GPUs at 3 precision levels.
GPU Recommendations
BF16 · 1 GPU · vLLM · score 100/100
Throughput 330.5 tok/s · ITL 3.0 ms · est. TTFT 1 ms · $209/mo · $0.24/M tokens
BF16 · 1 GPU · vLLM · score 100/100
Throughput 544.3 tok/s · ITL 1.8 ms · est. TTFT <1 ms · $85/mo · $0.06/M tokens
BF16 · 1 GPU · vLLM · score 90/100
Throughput 544.3 tok/s · ITL 1.8 ms · est. TTFT <1 ms · $161/mo · $0.11/M tokens
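The cost-per-million-tokens figures above are consistent with dividing the monthly GPU cost by the tokens a configuration can generate in a month. A rough sketch (a 30-day month and 100% sustained utilization are assumptions; real deployments run lower, so real per-token costs are higher):

```python
def cost_per_m_tokens(monthly_cost_usd: float, tok_per_s: float) -> float:
    """USD per million generated tokens at sustained full utilization."""
    tokens_per_month = tok_per_s * 30 * 24 * 3600  # 30-day month
    return monthly_cost_usd / tokens_per_month * 1e6

print(round(cost_per_m_tokens(209, 330.5), 2))  # -> 0.24
print(round(cost_per_m_tokens(85, 544.3), 2))   # -> 0.06
```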
Deployment Options
API Deployment
No API pricing available
Single GPU
RTX 4060
$209/mo
Min VRAM: 2 GB
Multi-GPU
RTX 4060
330.5 tok/s
Best available config
API Pricing Comparison
No API pricing data available for this model.
Performance Estimates
[Charts omitted: throughput by GPU; VRAM breakdown (RTX 4060, BF16)]
Precision Impact
BF16: 4.0 GB weights/GPU · ~330.5 tok/s
FP8: 2.0 GB weights/GPU
INT4: 1.0 GB weights/GPU
Similar Models
Gemma 3 1B
1B params · dense
Quality: 35
Gemma 3 4B
4.3B params · dense
Quality: 54
from $0.10/M
Moondream 2B
1.86B params · dense
Quality: 50
Qwen 2 VL 2B
2.2B params · dense
Quality: 50
SeamlessM4T v2 Large
2.3B params · dense
Quality: 50
Frequently Asked Questions
How much VRAM does Gemma 3 2B need for inference?
Gemma 3 2B requires approximately 4.0 GB of VRAM at BF16 precision, 2.0 GB at FP8, or 1.0 GB with INT4 quantization. Additional VRAM is needed for the KV cache (26,624 bytes per token) and activations (~0.30 GB).
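These components sum directly to a total VRAM estimate. A sketch using the numbers in this answer (the ~0.30 GB activation figure is taken from the answer above, not measured):

```python
def vram_estimate_gb(weights_gb: float, kv_bytes_per_token: int,
                     context_tokens: int, activations_gb: float = 0.30) -> float:
    """Total inference VRAM in decimal GB: weights + KV cache + activations."""
    kv_gb = kv_bytes_per_token * context_tokens / 1e9
    return weights_gb + kv_gb + activations_gb

# BF16 weights with the full 8,192-token context
print(f"{vram_estimate_gb(4.0, 26624, 8192):.2f} GB")  # -> 4.52 GB
```

At the full context the KV cache adds only ~0.22 GB, which is why even INT4 deployments fit comfortably in the 2 GB minimum quoted above.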
What is the best GPU for Gemma 3 2B?
The top recommended GPU for Gemma 3 2B is the RTX 4060 at BF16 precision (score: 100/100). It achieves approximately 330.5 tokens/sec at an estimated cost of $209/month ($0.24/M tokens).
How much does Gemma 3 2B inference cost?
Gemma 3 2B inference costs vary by provider and GPU setup. Use our calculator for detailed cost estimates across all providers.