Llama 3.1 405B
Meta · dense · 405B parameters · 131,072 context
Parameters
405B
Context Window
128K tokens
Architecture
Dense
Best GPU
B200 NVL (pair)
Cheapest API
$1.00/M
Quality Score
81/100
Intelligence Brief
Llama 3.1 405B is a 405B-parameter dense model from Meta that uses Grouped Query Attention (GQA), with 126 layers and a hidden dimension of 16,384. It has a 131,072-token context window and supports tool use, structured output, code, math, and multilingual tasks. On standardized benchmarks, it scores 88.6 on MMLU, 61 on HumanEval, and 96.8 on GSM8K. The cheapest API deployment is via openrouter at $1.00/M output tokens. For self-hosted inference, the top-ranked configuration is the B200 NVL (pair) at $19929/month.
Provider pricing
3 providers · canonical: together
| Provider | Input $/M | Output $/M | Notes |
|---|---|---|---|
| openrouter | $1.00 | $1.00 | cheapest input · cheapest output |
| fireworks | $3.00 | $3.00 | — |
| together | $3.50 | $3.50 | canonical |
Prices update via the nightly pricing cron + admin approvals at /admin/ingest-queue. The leaderboard's Input/Output cells show the canonical rate above; this table shows the full spread.
Related models
5 suggestions
Llama 3.1 70B · Llama 3.1 · 70.6B · free/M out
Llama 3.1 8B · Llama 3.1 · 8.03B · free/M out
HelpSteer2 Llama 3.1 70B · Llama 3.1 · 70.6B · $0.400/M out
Llama 3.1 Nemotron 51B · Llama 3.1 · 51B · $0.400/M out
Llama 3.1 Nemotron 70B Instruct · Llama 3.1 · 70.6B · $0.880/M out
Picks: same family first, then same vendor within ±2× params, then top tag-overlap matches. Price shown is the cheapest Output $/M across providers — the row's page shows the canonical anchor.
Architecture Details
Memory Requirements
BF16 Weights
810.0 GB
FP8 Weights
405.0 GB
INT4 Weights
202.5 GB
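The weight figures above follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming 1 GB = 1e9 bytes and ignoring KV-cache, activations, and quantization metadata; the function name and the `tp_degree` parameter are illustrative, not part of any library:

```python
def weight_memory_gb(params_billions: float, bits_per_param: int, tp_degree: int = 1) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes) per GPU.

    tp_degree is the number of GPUs the weights are sharded across
    (1 means the whole model on one device).
    """
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9 / tp_degree


for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    total = weight_memory_gb(405, bits)
    per_gpu = weight_memory_gb(405, bits, tp_degree=2)
    print(f"{name}: {total:.1f} GB total, {per_gpu:.2f} GB per GPU at TP=2")
```

With `tp_degree=2` the same function gives the per-GPU weight share for a two-GPU deployment.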
Fits on (single GPU) — most practical first
Fits on (multi-GPU with Tensor Parallelism)
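Multi-GPU configurations use Tensor Parallelism (TP) to shard each layer's weight matrices across GPUs, so every GPU holds a slice of every layer. Because the GPUs exchange partial results at each layer, TP needs a fast interconnect such as NVLink or NVSwitch for good performance.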
GPU Compatibility Matrix
Llama 3.1 405B is compatible with 2% of GPU configurations across 41 GPUs at 3 precision levels.
GPU Recommendations
| Configuration | Score | Throughput | Latency (ITL) | Est. TTFT | Cost/Month | Cost/M Tokens |
|---|---|---|---|---|---|---|
| FP8 · 2 GPUs · tensorrt-llm | 88/100 | 280.0 tok/s | 3.6 ms | 1 ms | $19929 | $27.08 |
| FP8 · 8 GPUs · tensorrt-llm | 85/100 | 280.0 tok/s | 3.6 ms | 1 ms | $7516 | $10.21 |
| FP8 · 4 GPUs · tensorrt-llm | 83/100 | 280.0 tok/s | 3.6 ms | 1 ms | $17044 | $23.16 |
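The Cost/M Tokens figures for the three configurations above can be reproduced from the monthly cost and throughput. The sketch below assumes a 730-hour month and 100% utilization; neither is stated on the page, but together they match the listed values. The function and constant names are illustrative:

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def cost_per_million_tokens(monthly_cost_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Self-hosted cost in USD per one million generated tokens."""
    tokens_per_month = tokens_per_second * 3600 * HOURS_PER_MONTH * utilization
    return monthly_cost_usd / (tokens_per_month / 1e6)


for gpus, monthly in [(2, 19929), (8, 7516), (4, 17044)]:
    print(f"{gpus} GPUs: ${cost_per_million_tokens(monthly, 280.0):.2f}/M tokens")
```

Real deployments rarely sustain full utilization. Passing, say, `utilization=0.5` doubles the effective cost per million tokens.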
Deployment Options
API Deployment
openrouter
$1.00/M
output tokens
Single GPU
Requires multi-GPU setup (405 GB VRAM needed for FP8 weights alone)
Multi-GPU
B200 NVL (pair) x2
280.0 tok/s
TP · $19929/mo
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| openrouter | $1.00 | $1.00 | Cheapest |
| fireworks | $3.00 | $3.00 | |
| together | $3.50 | $3.50 | |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| openrouter (Best Value) | $1.00 | $1.00 | $10 |
| fireworks | $3.00 | $3.00 | $30 |
| together | $3.50 | $3.50 | $35 |
Cost per 1,000 Requests
Short (500 tok)
$0.70
via openrouter
Medium (2K tok)
$2.80
via openrouter
Long (8K tok)
$10.00
via openrouter
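Per-request API cost is input tokens times the input rate plus output tokens times the output rate. The page does not state how each bucket splits between input and output, so the 500-input / 200-output split below is a hypothetical one that happens to total $0.70 per 1,000 requests at openrouter's listed rates. The function name is illustrative:

```python
def cost_per_1000_requests(input_tokens: int,
                           output_tokens: int,
                           input_price_per_m: float,
                           output_price_per_m: float) -> float:
    """USD cost of 1,000 identical requests, given per-million-token prices."""
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1e6
    return per_request * 1000


# Hypothetical split: 500 input + 200 output tokens at $1.00/M each way.
print(f"${cost_per_1000_requests(500, 200, 1.00, 1.00):.2f}")
```

The same function applies to the other providers by swapping in their listed input and output rates.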
Performance Estimates
Precision Impact
| Precision | Weights/GPU | Throughput |
|---|---|---|
| bf16 | 405.0 GB | |
| fp8 | 202.5 GB | ~280.0 tok/s |
| int4 | 101.3 GB | |
Similar Models
Llama 4 Maverick
400B params · moe
Quality: 84
from $0.60/M
Claude Opus 4.5
400B params · moe
Quality: 90
from $25.00/M
Grok 4
400B params · moe
Quality: 50
Jamba 1.5 Large
398B params · hybrid
Quality: 50
from $8.00/M
Snowflake Arctic 128x3B
395B params · moe
Quality: 50
Frequently Asked Questions
How much VRAM does Llama 3.1 405B need for inference?
Llama 3.1 405B requires approximately 810.0 GB of VRAM at BF16 precision, 405.0 GB at FP8, or 202.5 GB at INT4 quantization. Additional VRAM is needed for KV-cache (516096 bytes per token) and activations (~5.00 GB).
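The per-token KV-cache figure can be derived from the attention shape. The sketch below uses the 126 layers stated on this page; the 8 KV heads, head dimension of 128, and 2-byte (BF16) cache entries are assumptions taken from the publicly released model configuration rather than from this page. The function name is illustrative:

```python
def kv_cache_bytes_per_token(num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV-cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed shape: 126 layers, 8 KV heads (GQA), head_dim 128, BF16 cache.
per_token = kv_cache_bytes_per_token(126, 8, 128)
print(per_token)  # 516096

# Cache for a single sequence filling the 131,072-token context window.
full_context_bytes = per_token * 131_072
print(f"{full_context_bytes / 2**30:.1f} GiB")  # 63.0 GiB
```

Under these assumptions, one sequence at full context needs about 63 GiB of cache on top of the weights, which is why long-context serving can require more VRAM than the weight totals alone suggest.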
What is the best GPU for Llama 3.1 405B?
The top recommended GPU for Llama 3.1 405B is the B200 NVL (pair) (x2) using FP8 precision. It achieves approximately 280.0 tokens/sec at an estimated cost of $19929/month ($27.08/M tokens). Score: 88/100.
How much does Llama 3.1 405B inference cost?
Llama 3.1 405B API inference starts from $1.00/M input tokens and $1.00/M output tokens. Self-hosted inference costs depend on your GPU configuration — use our ROI calculator for a detailed breakdown.