Llama 3.2 90B Vision
Meta · dense · 90B parameters · 131,072-token context
Parameters
90B
Context Window
128K tokens
Architecture
Dense
Best GPU
B200 SXM
Cheapest API
$0.90/M
Quality Score
84/100
Intelligence Brief
Llama 3.2 90B Vision is a 90B-parameter dense model from Meta, featuring Grouped Query Attention (GQA) across 80 layers with a hidden dimension of 8,192. It offers a 131,072-token context window and supports tool use, vision, structured output, code, math, and multilingual tasks. On standardized benchmarks it scores MMLU 86, HumanEval 58, and GSM8K 92. The most cost-effective API deployment is via fireworks at $0.90/M output tokens; for self-hosted inference, the B200 SXM delivers optimal throughput at an estimated $4261/month.
Architecture Details
Memory Requirements
BF16 Weights
180.0 GB
FP8 Weights
90.0 GB
INT4 Weights
45.0 GB
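The weight figures above follow directly from the parameter count: weight memory ≈ parameters × bytes per parameter. A minimal sketch of that arithmetic (byte-widths are the standard ones for each precision; the 90B count is taken from this page):

```python
# Estimate weight memory for a 90B-parameter model at common precisions.
PARAMS = 90e9  # parameter count from this page

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes, matching the table above)."""
    return params * BYTES_PER_PARAM[precision] / 1e9

for p in BYTES_PER_PARAM:
    print(f"{p}: {weight_gb(PARAMS, p):.1f} GB")
# bf16: 180.0 GB, fp8: 90.0 GB, int4: 45.0 GB
```

Note this covers weights only; KV-cache and activation memory (discussed in the FAQ below) come on top.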
GPU Compatibility Matrix
Llama 3.2 90B Vision is compatible with 32% of GPU configurations, evaluated across 41 GPUs at 3 precision levels.
GPU Recommendations
| Configuration | Score | Throughput | Latency (ITL) | Est. TTFT | Cost/Month | Cost/M Tokens |
|---|---|---|---|---|---|---|
| FP8 · 1 GPU · tensorrt-llm | 100/100 | 560.0 tok/s | 1.8ms | 0ms | $4261 | $2.90 |
| FP8 · 1 GPU · tensorrt-llm | 100/100 | 560.0 tok/s | 1.8ms | 0ms | $4271 | $2.90 |
| FP8 · 1 GPU · tensorrt-llm | 100/100 | 560.0 tok/s | 1.8ms | 0ms | $6169 | $4.19 |
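The cost-per-million-token figures in these recommendations can be reproduced from the monthly price and sustained throughput, assuming full utilization over a 30-day month (an idealized assumption; real-world utilization is lower):

```python
def cost_per_million(monthly_usd: float, tok_per_s: float, days: float = 30) -> float:
    """USD per million generated tokens at 100% utilization."""
    tokens_per_month = tok_per_s * 86_400 * days  # seconds/day * days
    return monthly_usd / (tokens_per_month / 1e6)

print(round(cost_per_million(4261, 560.0), 2))  # ~2.94, listed as $2.90 above
print(round(cost_per_million(6169, 560.0), 2))  # ~4.25, listed as $4.19 above
```

The small gap between the computed values and the listed ones suggests the page uses slightly different month-length or utilization assumptions.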
Deployment Options
API Deployment
fireworks
$0.90/M
output tokens
Single GPU
B200 SXM
$4261/mo
Min VRAM: 90 GB
Multi-GPU
H100 SXM x2
560.0 tok/s
Tensor parallel · $3587/mo
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| fireworks | $0.90 | $0.90 | Cheapest |
| together | $1.20 | $1.20 | |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| fireworks (Best Value) | $0.90 | $0.90 | $9 |
| together | $1.20 | $1.20 | $12 |
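The ~Monthly Cost column appears to assume roughly 10M tokens per month at a flat blended rate (an assumption inferred from the numbers, not stated on the page): 10M × $0.90/M = $9.

```python
def monthly_api_cost(price_per_m: float, tokens_m: float = 10.0) -> float:
    """Monthly API cost, assuming tokens_m million tokens at a flat $/M rate."""
    return price_per_m * tokens_m

print(monthly_api_cost(0.90))  # fireworks: 9.0
print(monthly_api_cost(1.20))  # together: 12.0
```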
Cost per 1,000 Requests
Short (500 tok)
$0.63
via fireworks
Medium (2K tok)
$2.52
via fireworks
Long (8K tok)
$9.00
via fireworks
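Working backwards from fireworks' flat $0.90/M rate, these per-request figures imply more billable tokens per request than the listed output lengths alone, suggesting each tier bundles some input tokens as well (an inference from the arithmetic, not stated on the page):

```python
# Back out the billable tokens per request implied by the figures above,
# given fireworks' flat $0.90/M rate (input and output priced equally).
PRICE_PER_M = 0.90

def tokens_per_request(cost_per_1k_requests: float) -> float:
    cost_per_request = cost_per_1k_requests / 1000
    return cost_per_request / PRICE_PER_M * 1e6

for label, cost in [("short", 0.63), ("medium", 2.52), ("long", 9.00)]:
    print(label, round(tokens_per_request(cost)))
# short ~700, medium ~2800, long ~10000 billable tokens per request
```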
Performance Estimates
Throughput by GPU
VRAM Breakdown (B200 SXM, FP8)
Precision Impact
| Precision | Weights/GPU | Est. Throughput |
|---|---|---|
| bf16 | 180.0 GB | |
| fp8 | 90.0 GB | ~560.0 tok/s |
| int4 | 45.0 GB | |
Similar Models
Llama 3.2 90B Vision Instruct
88.8B params · dense
Quality: 84
from $1.20/M
Inflection 3
100B params · dense
Quality: 74
from $15.00/M
YaLM 100B
100B params · dense
Quality: 50
Yi-Large
102.6B params · moe
Quality: 74
from $3.00/M
Command R+
104B params · dense
Quality: 68
from $2.00/M
Frequently Asked Questions
How much VRAM does Llama 3.2 90B Vision need for inference?
Llama 3.2 90B Vision requires approximately 180.0 GB of VRAM at BF16 precision, 90.0 GB at FP8, or 45.0 GB at INT4 quantization. Additional VRAM is needed for KV-cache (327,680 bytes per token) and activations (~3.00 GB).
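Putting those three components together, a rough serving-footprint estimate (illustrative; batch size 1, KV-cache at the per-token rate quoted above):

```python
# Per-token KV-cache from the FAQ: 80 layers x 8 KV heads x 128 head-dim
# x 2 (K and V) x 2 bytes = 327,680 bytes. Activation overhead as quoted.
KV_BYTES_PER_TOKEN = 327_680
ACTIVATIONS_GB = 3.0

def serving_vram_gb(weight_gb: float, context_tokens: int) -> float:
    """Approximate total VRAM: weights + KV-cache + activations."""
    kv_gb = KV_BYTES_PER_TOKEN * context_tokens / 1e9
    return weight_gb + kv_gb + ACTIVATIONS_GB

# FP8 weights with the full 131,072-token context:
print(round(serving_vram_gb(90.0, 131_072), 1))  # ~135.9 GB
```

This is why a 90 GB minimum-VRAM figure only holds for short contexts; filling the full window roughly adds another 43 GB of KV-cache.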
What is the best GPU for Llama 3.2 90B Vision?
The top recommended GPU for Llama 3.2 90B Vision is the B200 SXM using FP8 precision. It achieves approximately 560.0 tokens/sec at an estimated cost of $4261/month ($2.90/M tokens). Score: 100/100.
How much does Llama 3.2 90B Vision inference cost?
Llama 3.2 90B Vision API inference starts from $0.90/M input tokens and $0.90/M output tokens. Self-hosted inference costs depend on your GPU configuration — use our ROI calculator for a detailed breakdown.
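As a rough sanity check on that comparison (purely illustrative, using only the headline numbers from this page): the break-even monthly volume at which a $4261/month GPU beats a $0.90/M API exceeds what a single GPU at 560 tok/s can physically produce, so at these list prices the API stays cheaper even at full utilization.

```python
GPU_MONTHLY_USD = 4261
API_PRICE_PER_M = 0.90
THROUGHPUT_TOK_S = 560.0

break_even_m = GPU_MONTHLY_USD / API_PRICE_PER_M       # ~4,734M tokens/month
capacity_m = THROUGHPUT_TOK_S * 86_400 * 30 / 1e6      # ~1,452M tokens/month

print(f"break-even: {break_even_m:,.0f}M tokens/month")
print(f"single-GPU capacity: {capacity_m:,.0f}M tokens/month")
print("self-hosting pays off" if capacity_m > break_even_m
      else "API is cheaper at these rates")
```

Self-hosting can still win under negotiated GPU pricing, reserved instances, or multi-tenant batching; this sketch only covers the list prices shown above.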