Llama 2 13B
Meta · dense · 13B parameters · 4,096 context
Parameters: 13B
Context Window: 4K tokens
Architecture: Dense
Best GPU: A100 40GB SXM
Quality Score: 47/100
Intelligence Brief
Llama 2 13B is a dense 13B-parameter model from Meta that uses Multi-Head Attention (MHA), with 40 layers and a hidden dimension of 5,120. It has a 4,096-token context window and supports code tasks. On standardized benchmarks it scores 55 on MMLU, 20 on HumanEval, and 35 on GSM8K. For self-hosted inference, the A100 40GB SXM delivers the highest estimated single-GPU throughput, at $807/month.
Architecture Details
Memory Requirements
BF16 Weights: 26.0 GB
FP8 Weights: 13.0 GB
INT4 Weights: 6.5 GB
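The three weight sizes above follow from multiplying the parameter count by the bytes stored per parameter. A minimal sketch, assuming a nominal count of exactly 13e9 parameters and decimal gigabytes (1 GB = 1e9 bytes); the names `BYTES_PER_PARAM` and `weight_gb` are illustrative, not part of any library:

```python
# Sketch: weight memory = parameter count x bytes per parameter.
# Assumption: the card's "13B" is treated as exactly 13e9 parameters.
PARAMS = 13e9

# Storage cost per parameter at each precision listed on the card.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params: float, precision: str) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_gb(PARAMS, precision):.1f} GB")
```

This covers weights only; KV cache and activations come on top of these figures.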
GPU Compatibility Matrix
Llama 2 13B is compatible with 82% of GPU configurations across 41 GPUs at 3 precision levels.
GPU Recommendations
A100 40GB SXM · BF16 · 1 GPU · vLLM · score 95/100
Throughput: 322.9 tok/s · Latency (ITL): 3.1 ms · Est. TTFT: 1 ms
Cost/Month: $807 · Cost/M Tokens: $0.95
BF16 · 1 GPU · vLLM · score 95/100
Throughput: 159.5 tok/s · Latency (ITL): 6.3 ms · Est. TTFT: 1 ms
Cost/Month: $465 · Cost/M Tokens: $1.11
BF16 · 1 GPU · vLLM · score 95/100
Throughput: 144.5 tok/s · Latency (ITL): 6.9 ms · Est. TTFT: 1 ms
Cost/Month: $399 · Cost/M Tokens: $1.05
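The latency and cost-per-token figures in these recommendations can be reproduced from throughput and monthly cost alone. The sketch below rests on two assumptions that the page does not state: that inter-token latency is the reciprocal of throughput (single-stream decoding), and that the GPU runs at full utilization for 730 hours per month. Both happen to match the listed numbers after rounding, but they are inferred, not documented.

```python
# Sketch: derive ITL and cost per million tokens from throughput and monthly cost.
# Assumption (not stated in the source): 730 billable hours per month (8,760 / 12)
# at 100% utilization.
HOURS_PER_MONTH = 730


def itl_ms(tokens_per_sec: float) -> float:
    """Inter-token latency in milliseconds, assuming ITL = 1 / throughput."""
    return 1000.0 / tokens_per_sec


def cost_per_million_tokens(monthly_cost_usd: float,
                            tokens_per_sec: float,
                            hours_per_month: float = HOURS_PER_MONTH) -> float:
    """USD per one million generated tokens at constant throughput."""
    tokens_per_month = tokens_per_sec * 3600 * hours_per_month
    return monthly_cost_usd / tokens_per_month * 1e6


if __name__ == "__main__":
    # (monthly cost in USD, throughput in tok/s) for the three configurations.
    for cost, tps in [(807, 322.9), (465, 159.5), (399, 144.5)]:
        print(f"{tps} tok/s: ITL {itl_ms(tps):.1f} ms, "
              f"${cost_per_million_tokens(cost, tps):.2f}/M tokens")
```

Real deployments rarely sustain full utilization, so the effective cost per token will usually be higher than these estimates.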
Deployment Options
API Deployment
No API pricing available
Single GPU
A100 40GB SXM
$807/mo
Min VRAM: 13 GB
Multi-GPU
RTX 3090 x2
319.5 tok/s
TP · $361/mo
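Tensor parallelism (TP) matters here because the 26.0 GB of BF16 weights exceed the 24 GB of a single RTX 3090, while two cards hold roughly half each. A rough sketch of that fit check, assuming weights shard evenly across GPUs; `fits_weights` and its `reserve_gb` headroom parameter are hypothetical helpers for illustration:

```python
# Sketch: do the weights fit per GPU under tensor parallelism?
# Assumption: weights shard evenly across tp_size GPUs.
RTX_3090_VRAM_GB = 24.0  # RTX 3090 memory capacity
BF16_WEIGHTS_GB = 26.0   # from the memory requirements on this page


def weights_per_gpu_gb(total_weights_gb: float, tp_size: int) -> float:
    """Weight shard held by each GPU, in GB."""
    return total_weights_gb / tp_size


def fits_weights(total_weights_gb: float, tp_size: int,
                 vram_gb: float, reserve_gb: float = 0.0) -> bool:
    """True if each GPU's shard fits, leaving reserve_gb free for
    KV cache and activations (hypothetical headroom parameter)."""
    return weights_per_gpu_gb(total_weights_gb, tp_size) + reserve_gb <= vram_gb


if __name__ == "__main__":
    for tp in (1, 2):
        ok = fits_weights(BF16_WEIGHTS_GB, tp, RTX_3090_VRAM_GB)
        print(f"TP={tp}: {weights_per_gpu_gb(BF16_WEIGHTS_GB, tp):.1f} GB/GPU, fits={ok}")
```

This checks weights only; a real deployment also needs per-GPU room for the KV cache, activations, and framework overhead.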
API Pricing Comparison
No API pricing data available for this model.
Performance Estimates
Throughput by GPU
VRAM Breakdown (A100 40GB SXM, BF16)
Precision Impact
BF16 · 26.0 GB weights/GPU · ~322.9 tok/s
FP8 · 13.0 GB weights/GPU
INT4 · 6.5 GB weights/GPU
Quality Benchmarks
Capabilities
Features
Supported Frameworks
Supported Precisions
Where to Deploy Llama 2 13B
Similar Models
Llama 2 7B · 7B params · dense · Quality: 40
OLMo 2 13B · 13B params · dense · Quality: 50
Baichuan 2 13B · 13B params · dense · Quality: 50 · from $0.25/M
Vicuna 13B · 13B params · dense · Quality: 50
Code Llama 13B · 13B params · dense · Quality: 44 · from $0.22/M
Frequently Asked Questions
How much VRAM does Llama 2 13B need for inference?
Llama 2 13B requires approximately 26.0 GB of VRAM for weights at BF16 precision, 13.0 GB at FP8, or 6.5 GB at INT4 quantization. Additional VRAM is needed for the KV cache (819,200 bytes per token) and activations (~1.00 GB).
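The per-token KV-cache figure is consistent with the architecture described on this page: with Multi-Head Attention, each of the 40 layers stores one key and one value vector of 5,120 elements per token. A minimal sketch, assuming 2-byte (BF16/FP16) cache elements and a single sequence filling the full 4,096-token context; the function names are illustrative:

```python
# Sketch: KV-cache size per token and total VRAM for one full-context sequence.
# Assumptions: MHA (KV width equals the hidden size) and 2-byte cache elements.
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
CONTEXT_TOKENS = 4096


def kv_bytes_per_token(num_layers: int, hidden_size: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_elem


def kv_cache_gb(tokens: int) -> float:
    """KV-cache size in decimal GB for the given number of cached tokens."""
    return tokens * kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE) / 1e9


def total_vram_gb(weights_gb: float, tokens: int,
                  activations_gb: float = 1.0) -> float:
    """Weights + KV cache + activations, using the page's ~1.00 GB activation estimate."""
    return weights_gb + kv_cache_gb(tokens) + activations_gb


if __name__ == "__main__":
    print(kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE), "bytes per token")
    print(f"{kv_cache_gb(CONTEXT_TOKENS):.2f} GB KV cache at full context")
    print(f"{total_vram_gb(26.0, CONTEXT_TOKENS):.2f} GB total at BF16")
```

Under these assumptions, one full-context sequence at BF16 needs about 30.4 GB, which is why batching several concurrent sequences quickly consumes the remaining headroom on a 40 GB card.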
What is the best GPU for Llama 2 13B?
The top recommended GPU for Llama 2 13B is the A100 40GB SXM using BF16 precision. It achieves approximately 322.9 tokens/sec at an estimated cost of $807/month ($0.95/M tokens). Score: 95/100.
How much does Llama 2 13B inference cost?
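Llama 2 13B inference costs depend on the provider and GPU setup. No API pricing is available for this model; the self-hosted single-GPU estimates above range from $399 to $807/month, or $0.95 to $1.11 per million tokens. Use our calculator for detailed cost estimates across all providers.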