NVLM-D 72B
NVIDIA · dense · 72B parameters · 32,768 context
Parameters: 72B
Context Window: 32K tokens
Architecture: Dense
Best GPU: H200 SXM
Quality Score: 79/100
Intelligence Brief
NVLM-D 72B is a 72B-parameter dense model from NVIDIA, featuring Grouped Query Attention (GQA) across 80 layers with a hidden dimension of 8,192. With a 32,768-token context window, it supports tool use, vision, structured output, code, math, multilingual tasks, and reasoning. On standardized benchmarks it scores 82 on MMLU and 65 on HumanEval. For self-hosted inference, the H200 SXM delivers optimal throughput at an estimated $2553/month.
Architecture Details
Memory Requirements
BF16 Weights: 144.0 GB
FP8 Weights: 72.0 GB
INT4 Weights: 36.0 GB
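These figures follow directly from the parameter count times bytes per parameter. A minimal sketch, assuming exactly 72e9 parameters and decimal gigabytes:

```python
# Weight memory = parameter count x bytes per parameter.
# Assumes exactly 72e9 parameters and decimal GB (1 GB = 1e9 bytes).
PARAMS = 72e9
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}

for precision, bpp in BYTES_PER_PARAM.items():
    print(f"{precision}: {PARAMS * bpp / 1e9:.1f} GB")
# BF16: 144.0 GB / FP8: 72.0 GB / INT4: 36.0 GB, matching the table above.
```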
GPU Compatibility Matrix
NVLM-D 72B is compatible with 37% of GPU configurations across 41 GPUs at 3 precision levels.
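As a rough illustration of what such a fit check involves, here is a hedged sketch: the 8 GB working margin and the example GPU list are illustrative assumptions, not the matrix's actual rules.

```python
# Illustrative fit check: a (GPU, precision) pair is "compatible" when the
# weights plus a working margin for KV-cache and activations fit in VRAM.
# The 8 GB margin is an assumed placeholder, not the site's actual rule.
WEIGHTS_GB = {"BF16": 144.0, "FP8": 72.0, "INT4": 36.0}
MARGIN_GB = 8.0

EXAMPLE_GPUS = {"H200 SXM": 141, "H100 SXM": 80, "L40S": 48}  # VRAM in GB

def fits(vram_gb: float, precision: str) -> bool:
    return WEIGHTS_GB[precision] + MARGIN_GB <= vram_gb

for gpu, vram in EXAMPLE_GPUS.items():
    ok = [p for p in WEIGHTS_GB if fits(vram, p)]
    print(f"{gpu}: {', '.join(ok) if ok else 'no single-GPU fit'}")
```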
GPU Recommendations
1. H200 SXM · FP8 · 1 GPU · tensorrt-llm
Score: 100/100
Throughput: 557.7 tok/s · Latency (ITL): 1.8 ms · Est. TTFT: 0 ms
Cost: $2553/month · $1.74/M tokens

2. FP8 · 1 GPU · tensorrt-llm
Score: 98/100
Throughput: 560.0 tok/s · Latency (ITL): 1.8 ms · Est. TTFT: 0 ms
Cost: $4261/month · $2.90/M tokens

3. FP8 · 1 GPU · tensorrt-llm
Score: 98/100
Throughput: 560.0 tok/s · Latency (ITL): 1.8 ms · Est. TTFT: 0 ms
Cost: $4271/month · $2.90/M tokens
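The cost-per-token figures above can be reproduced from monthly cost and sustained throughput. A sketch, assuming full utilization and a 730-hour month (the exact accounting behind these numbers may differ):

```python
# $/M tokens = monthly cost / millions of tokens generated per month,
# assuming the GPU streams tokens at the quoted rate around the clock.
HOURS_PER_MONTH = 730  # assumption: ~365 * 24 / 12

def cost_per_million_tokens(monthly_usd: float, tok_per_s: float) -> float:
    tokens = tok_per_s * 3600 * HOURS_PER_MONTH
    return monthly_usd / (tokens / 1e6)

print(f"${cost_per_million_tokens(2553, 557.7):.2f}/M")  # ~$1.74/M (rank 1)
print(f"${cost_per_million_tokens(4261, 560.0):.2f}/M")  # ~$2.90/M (rank 2)
```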
Deployment Options
API Deployment: no API pricing available
Single GPU: H200 SXM · $2553/mo · min VRAM 72 GB
Multi-GPU: H100 SXM x2 with tensor parallelism (TP) · 560.0 tok/s · $3587/mo
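The two-GPU option works because tensor parallelism shards the weights across the group. A simplified sketch, assuming FP8 weights and an even split (real shards also carry per-GPU KV-cache, activation, and communication overhead):

```python
# Even-split tensor-parallel memory model: each GPU holds 1/TP of the
# weights; the rest of its VRAM is headroom for KV-cache and activations.
# Assumes FP8 weights and ignores communication buffers.
FP8_WEIGHTS_GB = 72.0
H100_SXM_VRAM_GB = 80.0
TP_DEGREE = 2

weights_per_gpu = FP8_WEIGHTS_GB / TP_DEGREE       # 36.0 GB per card
headroom = H100_SXM_VRAM_GB - weights_per_gpu      # 44.0 GB per card
print(f"{weights_per_gpu:.1f} GB weights/GPU, {headroom:.1f} GB headroom")
```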
API Pricing Comparison
No API pricing data available for this model.
Performance Estimates
[Chart: Throughput by GPU]
[Chart: VRAM Breakdown (H200 SXM, FP8)]
Precision Impact
BF16: 144.0 GB weights/GPU
FP8: 72.0 GB weights/GPU · ~557.7 tok/s
INT4: 36.0 GB weights/GPU
Where to Deploy NVLM-D 72B
Self-hosted infrastructure only; no API providers currently list this model.
Similar Models
Dolphin 2.9 72B · 72B params · dense · Quality: 50
Molmo 72B · 72B params · dense · Quality: 78
Qwen 2.5 72B · 72.7B params · dense · Quality: 77 · from $0.90/M
Qwen 2.5 Math 72B · 72.7B params · dense · Quality: 50 · from $0.90/M
Qwen 2.5 VL 72B · 72.7B params · dense · Quality: 50 · from $0.90/M
Frequently Asked Questions
How much VRAM does NVLM-D 72B need for inference?
NVLM-D 72B requires approximately 144.0 GB of VRAM at BF16 precision, 72.0 GB at FP8, or 36.0 GB at INT4 quantization. Additional VRAM is needed for the KV-cache (327,680 bytes per token) and activations (~2.5 GB).
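The per-token KV-cache figure is consistent with the GQA geometry described in the brief. A sketch that reproduces the estimate; the 8 KV heads, head dimension of 128, and 2-byte cache are assumptions chosen to match the quoted 327,680 bytes, not confirmed specs:

```python
# KV-cache per token = layers x 2 (K and V) x KV heads x head_dim x bytes.
# 80 layers come from the brief; 8 KV heads, head_dim 128, and a 2-byte
# (FP16) cache are assumptions that reproduce the page's 327,680 figure.
LAYERS, KV_HEADS, HEAD_DIM, CACHE_BYTES = 80, 8, 128, 2

kv_per_token = LAYERS * 2 * KV_HEADS * HEAD_DIM * CACHE_BYTES
assert kv_per_token == 327_680

def total_vram_gb(weights_gb: float, context_tokens: int,
                  activations_gb: float = 2.5) -> float:
    return weights_gb + kv_per_token * context_tokens / 1e9 + activations_gb

# FP8 weights with the full 32,768-token context:
print(f"{total_vram_gb(72.0, 32_768):.1f} GB")  # ~85.2 GB
```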
What is the best GPU for NVLM-D 72B?
The top recommended GPU for NVLM-D 72B is the H200 SXM using FP8 precision. It achieves approximately 557.7 tokens/sec at an estimated cost of $2553/month ($1.74/M tokens). Score: 100/100.
How much does NVLM-D 72B inference cost?
NVLM-D 72B inference costs vary by GPU setup and utilization. The top self-hosted configuration (H200 SXM at FP8) runs about $2553/month, roughly $1.74 per million tokens at full utilization. Use our calculator for detailed cost estimates across all providers.