o1
OpenAI · MoE · 200B parameters · 200K context
Parameters
200B
Context Window
200K tokens
Architecture
MoE
Best GPU
B200 SXM
Cheapest API
$60.00/M
Quality Score
93/100
Intelligence Brief
o1 is a 200B-parameter Mixture-of-Experts (16 experts, 2 active) model from OpenAI, featuring Grouped Query Attention (GQA) with 80 layers and a 10,240 hidden dimension. With a 200,000-token context window, it supports tool use, vision, structured output, code, math, multilingual tasks, and reasoning. On standardized benchmarks it achieves MMLU 92.3, HumanEval 83.4, and GSM8K 98.0. The most cost-effective API deployment is via openai at $60.00/M output tokens; for self-hosted inference, the B200 SXM (x4) delivers optimal throughput at $17,044/month.
Architecture Details
Memory Requirements
BF16 Weights
400.0 GB
FP8 Weights
200.0 GB
INT4 Weights
100.0 GB
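The weight footprints above follow directly from the parameter count: multiply 200B parameters by the bytes each precision uses per parameter. A minimal sketch (the helper name is ours, not from any library):

```python
# Sketch: weight memory scales linearly with bytes per parameter.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # 1B params at N bytes/param is N GB, using 1 GB = 1e9 bytes
    return params_billion * bytes_per_param

for name, bpp in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: {weight_memory_gb(200, bpp):.1f} GB")
# BF16: 400.0 GB, FP8: 200.0 GB, INT4: 100.0 GB
```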
Fits on (single GPU, most practical options first)
GPU Compatibility Matrix
o1 is compatible with 8% of GPU configurations across 41 GPUs at 3 precision levels.
GPU Recommendations
BF16 · 4 GPUs · tensorrt-llm
93/100
score
Throughput
280.0 tok/s
Latency (ITL)
3.6ms
Est. TTFT
1ms
Cost/Month
$17,044
Cost/M Tokens
$23.16
BF16 · 4 GPUs · tensorrt-llm
93/100
score
Throughput
280.0 tok/s
Latency (ITL)
3.6ms
Est. TTFT
1ms
Cost/Month
$17,082
Cost/M Tokens
$23.21
BF16 · 2 GPUs · tensorrt-llm
93/100
score
Throughput
280.0 tok/s
Latency (ITL)
3.6ms
Est. TTFT
1ms
Cost/Month
$19,929
Cost/M Tokens
$27.08
Deployment Options
API Deployment
openai
$60.00/M
output tokens
Single GPU
Requires a multi-GPU setup (200 GB VRAM needed at FP8)
Multi-GPU
B200 SXM x4
280.0 tok/s
TP · $17,044/mo
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| openai | $15.00 | $60.00 | Cheapest |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| openai (Best Value) | $15.00 | $60.00 | $375 |
Cost per 1,000 Requests
Short (500 tok)
$19.50
via openai
Medium (2K tok)
$78.00
via openai
Long (8K tok)
$240.00
via openai
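These per-1,000-request figures follow from the $15.00/M input and $60.00/M output rates once an input/output split is fixed. The page does not state the splits; the ones below are inferred to match the listed totals and should be treated as an assumption:

```python
# Sketch: reproducing the cost-per-1,000-requests figures from the
# $15.00/M input and $60.00/M output rates. The input/output token
# splits are inferred, not stated by the page.
INPUT_PER_M, OUTPUT_PER_M = 15.00, 60.00

def cost_per_1k_requests(input_toks: int, output_toks: int) -> float:
    per_request = (input_toks * INPUT_PER_M + output_toks * OUTPUT_PER_M) / 1e6
    return per_request * 1000

print(round(cost_per_1k_requests(500, 200), 2))    # 19.5  (Short)
print(round(cost_per_1k_requests(2000, 800), 2))   # 78.0  (Medium)
print(round(cost_per_1k_requests(8000, 2000), 2))  # 240.0 (Long)
```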
Performance Estimates
Throughput by GPU
VRAM Breakdown (B200 SXM, BF16)
Quality Benchmarks
Capabilities
Features
Supported Frameworks
Supported Precisions
Where to Deploy o1
Self-Hosted Infrastructure
Similar Models
o1-mini
70B params · dense
Quality: 83
from $12.00/M
Claude Opus 4
200B params · dense
Quality: 90
from $75.00/M
GPT-4o
200B params · moe
Quality: 85
from $10.00/M
GPT-4 Turbo
200B params · moe
Quality: 80
from $30.00/M
GLM-5
200B params · dense
Quality: 50
from $6.00/M
Frequently Asked Questions
How much VRAM does o1 need for inference?
o1 requires approximately 400.0 GB of VRAM at BF16 precision, 200.0 GB at FP8, or 100.0 GB with INT4 quantization. Additional VRAM is needed for the KV cache (204,800 bytes per token) and activations (~4.00 GB).
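Putting those three components together gives a total serving footprint. A minimal sketch at FP8, using only the figures from this answer (the function name is ours):

```python
# Sketch: total serving VRAM at FP8, from the figures in this answer:
# 200 GB weights, 204,800 bytes of KV cache per token, ~4 GB activations.
KV_BYTES_PER_TOKEN = 204_800
WEIGHTS_FP8_GB = 200.0
ACTIVATIONS_GB = 4.0

def total_vram_gb(context_tokens: int, batch_size: int = 1) -> float:
    kv_gb = KV_BYTES_PER_TOKEN * context_tokens * batch_size / 1e9
    return WEIGHTS_FP8_GB + kv_gb + ACTIVATIONS_GB

# One request at the full 200,000-token context window:
print(round(total_vram_gb(200_000), 1))  # 245.0
```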
What is the best GPU for o1?
The top recommended GPU for o1 is the B200 SXM (x4) using BF16 precision. It achieves approximately 280.0 tokens/sec at an estimated cost of $17,044/month ($23.16/M tokens). Score: 93/100.
How much does o1 inference cost?
o1 API inference starts from $15.00/M input tokens and $60.00/M output tokens. Self-hosted inference costs depend on your GPU configuration — use our ROI calculator for a detailed breakdown.
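A quick way to frame the API-vs-self-hosted decision is the breakeven volume: the monthly output at which the $17,044/month GPU bill matches the API bill at $60.00/M output tokens. The sketch below counts output tokens only and ignores input charges and operational overhead, so it is an illustration under those assumptions, not a full cost model:

```python
# Rough breakeven sketch: self-hosted vs API, output tokens only
# (ignores input-token charges and ops overhead).
def breakeven_output_tokens_m(monthly_gpu_cost: float, api_output_per_m: float) -> float:
    """Monthly output volume (millions of tokens) where the two costs match."""
    return monthly_gpu_cost / api_output_per_m

# B200 SXM x4 at $17,044/mo vs the $60.00/M API rate:
print(round(breakeven_output_tokens_m(17044, 60.00), 1))  # 284.1
```

Below roughly 284M output tokens per month, the API is cheaper under these assumptions; above it, self-hosting wins.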