Llama 3.2 90B Vision Instruct
Meta · dense · 88.8B parameters · 131,072-token context
Parameters
88.8B
Context Window
128K tokens
Architecture
Dense
Best GPU
B200 SXM
Cheapest API
$1.20/M
Quality Score
84/100
Intelligence Brief
Llama 3.2 90B Vision Instruct is an 88.8B-parameter dense model from Meta, featuring Grouped Query Attention (GQA) with 80 layers and an 8,192-dimensional hidden state. With a 131,072-token context window, it supports tool use, vision, structured output, code, math, and multilingual tasks. On standardized benchmarks it scores MMLU 86, HumanEval 58, and GSM8K 92. The most cost-effective API deployment is via Together AI at $1.20/M output tokens; for self-hosted inference, the B200 SXM delivers optimal throughput at an estimated $4261/month.
Architecture Details
Memory Requirements
BF16 Weights
177.6 GB
FP8 Weights
88.8 GB
INT4 Weights
44.4 GB
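The weight footprints above follow directly from the parameter count; below is a minimal sketch of the arithmetic, reusing the KV-cache and activation figures from the FAQ at the end of this page. The 32K-token batch size is an illustrative assumption.

```python
# Weight memory scales linearly with parameter count and bytes per parameter.
PARAMS = 88.8e9  # Llama 3.2 90B Vision Instruct

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}
KV_BYTES_PER_TOKEN = 655_360  # per-token KV-cache cost (see FAQ below)
ACTIVATIONS_GB = 4.0          # rough activation allowance (see FAQ below)

for precision, bpp in BYTES_PER_PARAM.items():
    print(f"{precision}: {PARAMS * bpp / 1e9:.1f} GB weights")
# BF16: 177.6 GB, FP8: 88.8 GB, INT4: 44.4 GB

# The KV cache grows with resident tokens; e.g. 32K tokens in flight (assumption):
kv_gb = 32_000 * KV_BYTES_PER_TOKEN / 1e9
print(f"KV cache @ 32K tokens: {kv_gb:.1f} GB, plus ~{ACTIVATIONS_GB:.0f} GB activations")
```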
GPU Compatibility Matrix
Llama 3.2 90B Vision Instruct fits 33% of the GPU configurations we track, spanning 41 GPUs at 3 precision levels.
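A hedged sketch of how such a compatibility percentage can be computed: treat a configuration as compatible when the quantized weights plus a fixed overhead fit in the GPU's VRAM. The GPU list and overhead below are illustrative assumptions, not the site's full 41-GPU matrix.

```python
# Compatibility check: do the weights (plus overhead) fit in a GPU's VRAM?
PARAMS_B = 88.8
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}
OVERHEAD_GB = 4.0  # activation allowance (assumption)

# Illustrative subset of GPUs and their VRAM in GB (assumption, not the full matrix).
GPUS_GB = {"B200 SXM": 192, "H200 SXM": 141, "H100 SXM": 80, "A100 80GB": 80, "L40S": 48}

compatible = [
    (gpu, prec)
    for gpu, vram in GPUS_GB.items()
    for prec, bpp in BYTES_PER_PARAM.items()
    if PARAMS_B * bpp + OVERHEAD_GB <= vram
]
total = len(GPUS_GB) * len(BYTES_PER_PARAM)
print(f"{len(compatible)}/{total} configurations fit ({100 * len(compatible) / total:.0f}%)")
```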
GPU Recommendations
| Configuration | Score | Throughput | Latency (ITL) | Est. TTFT | Cost/Month | Cost/M Tokens |
|---|---|---|---|---|---|---|
| B200 SXM · FP8 · 1 GPU · tensorrt-llm | 100/100 | 560.0 tok/s | 1.8 ms | 0 ms | $4261 | $2.90 |
| FP8 · 1 GPU · tensorrt-llm | 100/100 | 452.2 tok/s | 2.2 ms | 0 ms | $2553 | $2.15 |
| H100 SXM x2 · FP8 · tensorrt-llm | 100/100 | 560.0 tok/s | 1.8 ms | 0 ms | $3587 | $2.44 |
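The Cost/M Tokens column can be reproduced from the monthly cost and sustained throughput; a minimal sketch, assuming full utilization over a 30-day month (the listed figures differ by a few percent, likely from a slightly different utilization assumption):

```python
# $/M output tokens = monthly GPU cost / millions of tokens generated per month.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 (assumption: 30-day month, 100% duty)

def cost_per_m_tokens(monthly_usd: float, tok_per_sec: float) -> float:
    m_tokens = tok_per_sec * SECONDS_PER_MONTH / 1e6
    return monthly_usd / m_tokens

print(f"{cost_per_m_tokens(4261, 560.0):.2f}")  # ~2.94, vs $2.90 listed
print(f"{cost_per_m_tokens(2553, 452.2):.2f}")  # ~2.18, vs $2.15 listed
print(f"{cost_per_m_tokens(3587, 560.0):.2f}")  # ~2.47, vs $2.44 listed
```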
Deployment Options
API Deployment
together
$1.20/M
output tokens
Single GPU
B200 SXM
$4261/mo
Min VRAM: 89 GB
Multi-GPU
H100 SXM x2
560.0 tok/s
Tensor parallel (TP) · $3587/mo
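Why the FP8 build needs two H100s while a single B200 suffices: tensor parallelism shards the 88.8 GB of FP8 weights across GPUs. A minimal sketch using the FAQ's overhead figure (the 80 GB H100 capacity is an assumption about the variant used):

```python
# Per-GPU memory under tensor parallelism: weights are sharded, overhead is not.
FP8_WEIGHTS_GB = 88.8
H100_VRAM_GB = 80.0  # assumption: 80 GB H100 SXM variant
OVERHEAD_GB = 4.0    # activation allowance (see FAQ below); KV cache is extra

for tp in (1, 2):
    per_gpu = FP8_WEIGHTS_GB / tp + OVERHEAD_GB
    verdict = "fits" if per_gpu <= H100_VRAM_GB else "does not fit"
    print(f"TP={tp}: {per_gpu:.1f} GB/GPU -> {verdict}")
# TP=1: 92.8 GB/GPU -> does not fit (hence "Min VRAM: 89 GB" on a single GPU)
# TP=2: 48.4 GB/GPU -> fits, leaving headroom for the KV cache
```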
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| together | $1.20 | $1.20 | Cheapest |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| together (Best Value) | $1.20 | $1.20 | $12 |
Cost per 1,000 Requests
| Request size | Cost per 1K requests | Provider |
|---|---|---|
| Short (500 tok) | $0.84 | together |
| Medium (2K tok) | $3.36 | together |
| Long (8K tok) | $12.00 | together |
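These figures follow from the $1.20/M price once an input length is assumed for each request shape; a sketch of the arithmetic (the input-token counts below are assumptions chosen to reproduce the table, not published numbers):

```python
# Cost per 1,000 requests at $1.20/M tokens (input and output priced identically).
PRICE_PER_TOKEN = 1.20 / 1e6

def cost_per_1k_requests(input_tok: int, output_tok: int) -> float:
    return 1_000 * (input_tok + output_tok) * PRICE_PER_TOKEN

# Input lengths are assumptions chosen to reproduce the table above.
print(f"${cost_per_1k_requests(200, 500):.2f}")      # $0.84  short
print(f"${cost_per_1k_requests(800, 2_000):.2f}")    # $3.36  medium
print(f"${cost_per_1k_requests(2_000, 8_000):.2f}")  # $12.00 long
```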
Performance Estimates
Throughput by GPU
VRAM Breakdown (B200 SXM, FP8)
Precision Impact
| Precision | Weights/GPU | Est. Throughput |
|---|---|---|
| BF16 | 177.6 GB | n/a |
| FP8 | 88.8 GB | ~560.0 tok/s |
| INT4 | 44.4 GB | n/a |
Quality Benchmarks
| Benchmark | Score |
|---|---|
| MMLU | 86 |
| HumanEval | 58 |
| GSM8K | 92 |
Capabilities
Features
Tools · Vision · Structured output · Code · Math · Multilingual
Supported Frameworks
tensorrt-llm
Supported Precisions
BF16 · FP8 · INT4
Where to Deploy Llama 3.2 90B Vision Instruct
Self-Hosted Infrastructure
Similar Models
Llama 3.2 90B Vision
90B params · dense
Quality: 84
from $0.90/M
Inflection 3
100B params · dense
Quality: 74
from $15.00/M
YaLM 100B
100B params · dense
Quality: 50
Yi-Large
102.6B params · moe
Quality: 74
from $3.00/M
Command R+
104B params · dense
Quality: 68
from $2.00/M
Frequently Asked Questions
How much VRAM does Llama 3.2 90B Vision Instruct need for inference?
Llama 3.2 90B Vision Instruct requires approximately 177.6 GB of VRAM at BF16 precision, 88.8 GB at FP8, or 44.4 GB at INT4 quantization. Additional VRAM is needed for the KV cache (655,360 bytes, i.e. 640 KiB, per token) and activations (~4 GB).
What is the best GPU for Llama 3.2 90B Vision Instruct?
The top-recommended GPU for Llama 3.2 90B Vision Instruct is the B200 SXM at FP8 precision, which sustains approximately 560.0 tokens/sec at an estimated $4261/month ($2.90/M tokens), for a score of 100/100.
How much does Llama 3.2 90B Vision Instruct inference cost?
Llama 3.2 90B Vision Instruct API inference starts from $1.20/M input tokens and $1.20/M output tokens. Self-hosted inference costs depend on your GPU configuration — use our ROI calculator for a detailed breakdown.
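As a quick stand-in for the ROI calculator, a minimal sketch comparing the two options at a given monthly volume, using the B200 SXM estimate from above (the 30-day month and full utilization are assumptions):

```python
# API vs. self-hosted monthly cost at a given token volume (millions of tokens).
API_PRICE_PER_M = 1.20   # Together, $/M tokens
GPU_MONTHLY_USD = 4261   # B200 SXM estimate from above
GPU_TOK_PER_SEC = 560.0

def monthly_costs(m_tokens: float) -> tuple[float, float]:
    api = m_tokens * API_PRICE_PER_M
    capacity_m = GPU_TOK_PER_SEC * 30 * 24 * 3600 / 1e6  # ~1,452M tokens per GPU
    gpus = max(1, -(-m_tokens // capacity_m))            # ceiling division
    return api, gpus * GPU_MONTHLY_USD

for volume in (100, 1_000, 3_000):
    api, hosted = monthly_costs(volume)
    print(f"{volume}M tok/mo: API ${api:,.0f} vs self-hosted ${hosted:,.0f}")
# At these rates the API stays cheaper per token ($1.20/M vs ~$2.90/M self-hosted);
# self-hosting trades cost for data control, latency, and capacity guarantees.
```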