Grok 3
xAI · moe · 600B parameters · 131,072 context
Parameters
600B
Context Window
128K tokens
Architecture
MoE
Best GPU
B200 NVL (pair)
Cheapest API
$15.00/M
Quality Score
90/100
Intelligence Brief
Grok 3 is a 600B-parameter Mixture-of-Experts model (16 experts, 2 active per token) from xAI. It uses Grouped Query Attention (GQA) across 96 layers with a hidden dimension of 12,288, and offers a 131,072-token context window. Supported capabilities include tool use, vision, structured output, code, math, multilingual text, and reasoning. On standardized benchmarks it scores MMLU 89, HumanEval 70, and GSM8K 95. The most cost-effective API deployment is via xAI at $15.00/M output tokens; for self-hosted inference, the B200 NVL (pair) delivers the best throughput at roughly $39,858/month.
Architecture Details
Memory Requirements
BF16 Weights
1200.0 GB
FP8 Weights
600.0 GB
INT4 Weights
300.0 GB
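The weight figures above follow directly from the parameter count: parameters times bytes per parameter. A minimal sketch of the arithmetic (precision byte-widths are standard; the 600B count is from this page):

```python
# Weight memory = parameter count x bytes per parameter.
PARAMS = 600e9  # Grok 3 parameter count (from this page)

BYTES_PER_PARAM = {
    "BF16": 2.0,   # 16-bit brain float
    "FP8": 1.0,    # 8-bit float
    "INT4": 0.5,   # 4-bit integer quantization
}

for precision, width in BYTES_PER_PARAM.items():
    gb = PARAMS * width / 1e9  # decimal gigabytes, as used above
    print(f"{precision}: {gb:.1f} GB")
# -> BF16: 1200.0 GB, FP8: 600.0 GB, INT4: 300.0 GB
```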
Fits on (single GPU)
Fits on (multi-GPU with Tensor Parallelism)
Multi-GPU configurations use Tensor Parallelism (TP) to split model layers across GPUs. Requires NVLink or NVSwitch interconnect for optimal performance.
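As a rough sizing check, tensor parallelism divides the weight memory across the group (per-GPU activation and KV-cache overhead is not divided the same way). A minimal sketch using this page's figures:

```python
# Rough per-GPU weight memory under tensor parallelism (TP).
# Activations and KV-cache are NOT split the same way; treat this as a lower bound.
def weights_per_gpu_gb(total_weights_gb: float, tp_degree: int) -> float:
    return total_weights_gb / tp_degree

# BF16 weights (1200 GB) split across an 8-GPU TP group:
print(weights_per_gpu_gb(1200.0, 8))  # -> 150.0 GB per GPU
```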
GPU Compatibility Matrix
Grok 3 fits on only 1% of the evaluated GPU configurations (41 GPUs tested at 3 precision levels).
GPU Recommendations
| Configuration | Score | Throughput | Latency (ITL) | Est. TTFT | Cost/Month | Cost/M Tokens |
|---|---|---|---|---|---|---|
| BF16 · 4 GPUs · tensorrt-llm | 68/100 | 140.0 tok/s | 7.1 ms | 1 ms | $39,858 | $108.33 |
| BF16 · 8 GPUs · vllm | 65/100 | 140.0 tok/s | 7.1 ms | 1 ms | $18,904 | $51.38 |
| BF16 · 8 GPUs · tensorrt-llm | 63/100 | 140.0 tok/s | 7.1 ms | 1 ms | $34,088 | $92.65 |
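The Cost/M Tokens figures above can be reproduced from throughput and monthly cost, assuming sustained full utilization and a 730-hour month (8,760 h/year ÷ 12); these assumptions are mine, chosen because they reproduce the listed numbers:

```python
# Cost per million generated tokens from sustained throughput and monthly cost.
HOURS_PER_MONTH = 730  # assumed: 8760 hours/year / 12 months

def cost_per_m_tokens(monthly_usd: float, tok_per_s: float) -> float:
    tokens_per_month = tok_per_s * 3600 * HOURS_PER_MONTH  # tokens generated/month
    return monthly_usd / (tokens_per_month / 1e6)

print(round(cost_per_m_tokens(39858, 140.0), 2))  # -> 108.33
print(round(cost_per_m_tokens(18904, 140.0), 2))  # -> 51.38
```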
Deployment Options
API Deployment
xai
$15.00/M
output tokens
Single GPU
Requires a multi-GPU setup: even at FP8, the 600 GB of weights exceed any single GPU's VRAM.
Multi-GPU
B200 NVL (pair) x4
140.0 tok/s
TP · $39,858/mo
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| xai | $3.00 | $15.00 | Cheapest |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| xai (Best Value) | $3.00 | $15.00 | $90 |
Cost per 1,000 Requests
Short (500 tok)
$4.50
via xai
Medium (2K tok)
$18.00
via xai
Long (8K tok)
$54.00
via xai
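The short and medium figures above are consistent with a 50/50 input/output token split at xAI's $3.00/$15.00 rates (the long figure implies a more input-heavy mix). A sketch of the arithmetic, with the split ratio as an explicit assumption:

```python
# Cost of 1,000 requests at xAI pricing, assuming a 50/50 input/output token split.
INPUT_PER_M, OUTPUT_PER_M = 3.00, 15.00  # $/M tokens, from the pricing table above

def cost_per_1k_requests(tokens_per_request: int, input_frac: float = 0.5) -> float:
    inp = tokens_per_request * input_frac
    out = tokens_per_request * (1 - input_frac)
    per_request = (inp * INPUT_PER_M + out * OUTPUT_PER_M) / 1e6
    return per_request * 1000

print(round(cost_per_1k_requests(500), 2))   # -> 4.5
print(round(cost_per_1k_requests(2000), 2))  # -> 18.0
```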
Performance Estimates
Throughput by GPU
VRAM Breakdown (B200 NVL (pair), BF16)
Quality Benchmarks
Capabilities
Features
Supported Frameworks
Supported Precisions
Where to Deploy Grok 3
Similar Models
Grok-2
314B params · moe
Quality: 78
from $10.00/M
Grok-3
314B params · dense
Quality: 91
from $15.00/M
Gemini 2.0 Pro
600B params · moe
Quality: 88
from $4.00/M
Megatron-Turing NLG 530B
530B params · dense
Quality: 58
DeepSeek R1
671B params · moe
Quality: 88
from $2.19/M
Frequently Asked Questions
How much VRAM does Grok 3 need for inference?
Grok 3 requires approximately 1200 GB of VRAM at BF16 precision, 600 GB at FP8, or 300 GB at INT4 quantization. Additional VRAM is needed for the KV-cache (2,359,296 bytes, about 2.25 MB, per token) and activations (~10 GB).
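The per-token KV-cache figure is consistent with 96 layers storing BF16 keys and values over a 6,144-wide KV projection. The 6,144 KV width is my inference from the stated total (e.g. 48 KV heads × 128 head dim under GQA); it is not stated on this page:

```python
# Per-token KV-cache bytes: 2 (K and V) x layers x KV width x bytes per value.
LAYERS = 96          # from this page
KV_WIDTH = 6144      # assumed: e.g. 48 KV heads x 128 head dim (not stated)
BYTES_BF16 = 2       # BF16 cache

per_token = 2 * LAYERS * KV_WIDTH * BYTES_BF16
print(per_token)  # -> 2359296

# KV-cache for one sequence at the full 131,072-token context:
print(per_token * 131072 / 1e9)  # roughly 309 GB
```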
What is the best GPU for Grok 3?
The top recommended GPU for Grok 3 is the B200 NVL (pair) (x4) at BF16 precision. It achieves approximately 140 tokens/sec at an estimated $39,858/month ($108.33 per million tokens), scoring 68/100.
How much does Grok 3 inference cost?
Grok 3 API inference starts from $3.00/M input tokens and $15.00/M output tokens. Self-hosted inference costs depend on your GPU configuration — use our ROI calculator for a detailed breakdown.