Falcon 180B
TII · dense · 180B parameters · 2,048 context
| Spec | Value |
|---|---|
| Parameters | 180B |
| Context Window | 2K tokens |
| Architecture | Dense |
| Best GPU | B200 SXM |
| Cheapest API | $2.40/M |
| Quality Score | 60/100 |
Intelligence Brief
Falcon 180B is a 180B-parameter dense model from TII, featuring Grouped Query Attention (GQA) with 80 layers and a hidden dimension of 14,848. With a 2,048-token context window, it supports code generation and multilingual text. On standardized benchmarks it achieves MMLU 68.6, HumanEval 33, and GSM8K 55. The most cost-effective API deployment is via tii at $2.40/M output tokens. For self-hosted inference, B200 SXM delivers optimal throughput at $8522/month.
Architecture Details
Memory Requirements
| Precision | Weights |
|---|---|
| BF16 | 360.0 GB |
| FP8 | 180.0 GB |
| INT4 | 90.0 GB |
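The weight figures above follow directly from parameter count times bytes per parameter. A minimal sketch, using the model size and byte widths from this page (decimal GB, matching the table):

```python
def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in decimal GB: parameters x bytes per parameter."""
    return params_b * 1e9 * bytes_per_param / 1e9

PARAMS_B = 180  # Falcon 180B

print(weight_memory_gb(PARAMS_B, 2.0))  # BF16: 2 bytes/param -> 360.0
print(weight_memory_gb(PARAMS_B, 1.0))  # FP8:  1 byte/param  -> 180.0
print(weight_memory_gb(PARAMS_B, 0.5))  # INT4: 0.5 byte/param -> 90.0
```

Note this covers weights only; KV-cache and activations (see the FAQ below) add to the total.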
GPU Compatibility Matrix
Falcon 180B is compatible with 14% of GPU configurations across 41 GPUs at 3 precision levels.
GPU Recommendations
| Config | Score | Throughput | Latency (ITL) | Est. TTFT | Cost/Month | Cost/M Tokens |
|---|---|---|---|---|---|---|
| FP8 · 2 GPUs · tensorrt-llm | 98/100 | 280.0 tok/s | 3.6ms | 1ms | $8522 | $11.58 |
| FP8 · 2 GPUs · tensorrt-llm | 98/100 | 280.0 tok/s | 3.6ms | 1ms | $8541 | $11.61 |
| FP8 · 2 GPUs · tensorrt-llm | 95/100 | 280.0 tok/s | 3.6ms | 1ms | $5106 | $6.94 |
Deployment Options
- API Deployment: tii at $2.40/M output tokens
- Single GPU: B200 NVL (pair) at $9965/mo (min VRAM: 180 GB)
- Multi-GPU: B200 SXM x2, 280.0 tok/s, TP, $8522/mo
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| tii | $2.40 | $2.40 | Cheapest |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| tii (Best Value) | $2.40 | $2.40 | $24 |
Cost per 1,000 Requests
| Request Size | Cost | Provider |
|---|---|---|
| Short (500 tok) | $1.68 | tii |
| Medium (2K tok) | $6.72 | tii |
| Long (8K tok) | $24.00 | tii |
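Per-request costs at a flat token rate can be estimated with a one-liner. This is a sketch assuming input and output are both billed at $2.40/M, counting only the listed output size; the figures above run somewhat higher, presumably because they also fold in prompt tokens:

```python
def cost_per_1k_requests(tokens_per_request: int, price_per_m: float) -> float:
    """Cost in dollars of 1,000 requests at a flat $/M-token rate."""
    return 1_000 * tokens_per_request * price_per_m / 1_000_000

# tii bills $2.40/M for both input and output (per the pricing table above)
print(cost_per_1k_requests(2_000, 2.40))  # output tokens only -> ~4.80
```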
Performance Estimates
(Charts: Throughput by GPU · VRAM Breakdown for B200 SXM at FP8)
Precision Impact
| Precision | Weights/GPU | Est. Throughput |
|---|---|---|
| bf16 | 180.0 GB | |
| fp8 | 90.0 GB | ~280.0 tok/s |
| int4 | 45.0 GB | |
Similar Models
| Model | Params | Architecture | Quality | From |
|---|---|---|---|---|
| Gemini 1.5 Pro | 175B | moe | 80 | $5.00/M |
| Claude 3 Opus | 175B | dense | 80 | $75.00/M |
| Claude Opus 4 | 200B | dense | 90 | $75.00/M |
| GPT-4o | 200B | moe | 85 | $10.00/M |
| GPT-4 Turbo | 200B | moe | 80 | $30.00/M |
Frequently Asked Questions
How much VRAM does Falcon 180B need for inference?
Falcon 180B requires approximately 360.0 GB of VRAM at BF16 precision, 180.0 GB at FP8, or 90.0 GB at INT4 quantization. Additional VRAM is needed for the KV-cache (163,840 bytes per token, about 0.34 GB for one full 2,048-token sequence) and activations (~4.00 GB).
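The KV-cache figure can be turned into a per-deployment budget. A minimal sketch using the 163,840 bytes/token stated above (decimal GB; batch size is the number of concurrent sequences):

```python
KV_BYTES_PER_TOKEN = 163_840  # per-token K+V across all layers, from this page
CONTEXT = 2_048               # Falcon 180B context window

def kv_cache_gb(tokens: int, concurrent_seqs: int = 1) -> float:
    """KV-cache size in decimal GB for the given sequence length and batch."""
    return tokens * concurrent_seqs * KV_BYTES_PER_TOKEN / 1e9

print(kv_cache_gb(CONTEXT))      # one full-context sequence: ~0.336 GB
print(kv_cache_gb(CONTEXT, 32))  # 32 concurrent full-context sequences: ~10.7 GB
```

With a short 2K context, the KV-cache stays small relative to the 180 GB of FP8 weights even at high concurrency.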
What is the best GPU for Falcon 180B?
The top recommended GPU for Falcon 180B is the B200 SXM (x2) using FP8 precision. It achieves approximately 280.0 tokens/sec at an estimated cost of $8522/month ($11.58/M tokens). Score: 98/100.
How much does Falcon 180B inference cost?
Falcon 180B API inference starts from $2.40/M input tokens and $2.40/M output tokens. Self-hosted inference costs depend on your GPU configuration — use our ROI calculator for a detailed breakdown.
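A rough API-vs-self-hosted break-even can be computed from the figures on this page. This is a sketch, not the ROI calculator's full model: it ignores utilization, ops overhead, and input/output pricing splits.

```python
def breakeven_tokens_m(monthly_gpu_cost: float, api_price_per_m: float) -> float:
    """Monthly token volume (in millions) above which self-hosting beats the API."""
    return monthly_gpu_cost / api_price_per_m

# Numbers from this page: B200 SXM x2 at $8522/mo vs tii at $2.40/M tokens
print(breakeven_tokens_m(8522, 2.40))  # ~3551 M tokens/month
```

Below roughly 3.5B tokens per month, the tii API is the cheaper option at these rates.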