# Qwen 2.5 72B vs Llama 3.1 405B

## Architecture Comparison
| Spec | Qwen 2.5 72B | Llama 3.1 405B |
|---|---|---|
| Type | Dense | Dense |
| Total Parameters | 72.7B | 405B |
| Active Parameters | 72.7B | 405B |
| Layers | 80 | 126 |
| Hidden Dimension | 8,192 | 16,384 |
| Attention Heads | 64 | 128 |
| KV Heads | 8 | 8 |
| Context Length | 131,072 | 131,072 |
| Precision (default) | BF16 | BF16 |
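These specs determine the per-token KV-cache cost shown in the memory table below. A minimal sketch of the arithmetic, assuming standard grouped-query attention (`head_dim = hidden_dim / attention_heads`) and a 2-byte BF16 cache:

```python
# KV-cache bytes per token, derived from the architecture table above.
# Assumes grouped-query attention with head_dim = hidden_dim / n_heads
# and a BF16 (2-byte) cache; 2x covers both keys and values.

def kv_cache_bytes_per_token(layers, hidden_dim, n_heads, n_kv_heads, bytes_per_elem=2):
    head_dim = hidden_dim // n_heads
    return 2 * layers * n_kv_heads * head_dim * bytes_per_elem

print(kv_cache_bytes_per_token(80, 8192, 64, 8))     # Qwen 2.5 72B   -> 327680
print(kv_cache_bytes_per_token(126, 16384, 128, 8))  # Llama 3.1 405B -> 516096
```

Both models use a 128-dim head with 8 KV heads, so the gap in cache cost comes entirely from Llama's deeper stack (126 vs 80 layers).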
## Memory Requirements

| Precision | Qwen 2.5 72B | Llama 3.1 405B |
|---|---|---|
| BF16 Weights | 145.4 GB | 810.0 GB |
| FP8 Weights | 72.7 GB | 405.0 GB |
| INT4 Weights | 36.4 GB | 202.5 GB |
| KV Cache / Token | 327,680 B | 516,096 B |
| Activation Estimate | 2.50 GB | 5.00 GB |
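To see how these line items combine, here is a rough serving-memory estimate for a single full-context request. Treating weights, KV cache, and activations as simply additive (and ignoring fragmentation and runtime overhead) is a simplifying assumption:

```python
# Rough total-memory estimate: weights + KV cache + activations.
# Weight sizes, per-token KV cost, and activation estimates come from
# the table above; batch size and context length are illustrative.

GB = 1e9

def serving_memory_gb(weights_gb, kv_bytes_per_token, activations_gb,
                      context_len=131_072, batch=1):
    kv_gb = batch * context_len * kv_bytes_per_token / GB
    return weights_gb + kv_gb + activations_gb

# One full-context request at BF16:
print(f"{serving_memory_gb(145.4, 327_680, 2.50):.1f} GB")  # Qwen 2.5 72B   -> ~190.8 GB
print(f"{serving_memory_gb(810.0, 516_096, 5.00):.1f} GB")  # Llama 3.1 405B -> ~882.7 GB
```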
## Minimum GPUs Needed (BF16)

| GPU | Qwen 2.5 72B | Llama 3.1 405B |
|---|---|---|
| H100 SXM (80 GB) | 3 GPUs | N/A |
| L40S (48 GB) | 4 GPUs | N/A |

N/A: Llama 3.1 405B's BF16 weights (810 GB) exceed a typical 8-GPU node on either card; it requires multi-node serving or quantization.
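The counts above are consistent with a simple weights-divided-by-usable-memory rule. A hypothetical reconstruction (the 10% reserve fraction and single-node 8-GPU limit are assumptions, not vendor guidance):

```python
# Hypothetical helper reproducing the GPU counts above: divide BF16
# weight footprint by per-GPU memory, holding back a reserve for KV
# cache and runtime overhead. Reserve and node size are assumptions.
import math

def min_gpus(weights_gb, gpu_mem_gb, reserve_frac=0.10, max_gpus=8):
    usable = gpu_mem_gb * (1 - reserve_frac)
    n = math.ceil(weights_gb / usable)
    return n if n <= max_gpus else None  # None ~ "N/A" beyond one node

print(min_gpus(145.4, 80))  # Qwen 2.5 72B on H100 SXM -> 3
print(min_gpus(145.4, 48))  # Qwen 2.5 72B on L40S     -> 4
print(min_gpus(810.0, 80))  # Llama 3.1 405B           -> None (N/A)
```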
## Quality Benchmarks

| Benchmark | Qwen 2.5 72B | Llama 3.1 405B |
|---|---|---|
| Overall | 84 | 88 |
| MMLU | 85.3 | 88.6 |
| HumanEval | 56.0 | 61.0 |
| GSM8K | 91.6 | 96.8 |
| MT-Bench | 86.0 | 88.0 |
## Capabilities

| Feature | Qwen 2.5 72B | Llama 3.1 405B |
|---|---|---|
| Tool Use | ✓ Yes | ✓ Yes |
| Vision | ✗ No | ✗ No |
| Code | ✓ Yes | ✓ Yes |
| Math | ✓ Yes | ✓ Yes |
| Reasoning | ✗ No | ✗ No |
| Multilingual | ✓ Yes | ✓ Yes |
| Structured Output | ✓ Yes | ✓ Yes |
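Both models advertise tool use and structured output. A minimal sketch of requesting JSON-constrained output through an OpenAI-compatible endpoint (Together and Fireworks both expose one); the base URL, model ID, and `response_format` support shown here are assumptions to verify against your provider's docs:

```python
# Hedged sketch: structured output via an OpenAI-compatible chat endpoint.
# Assumes the provider supports response_format={"type": "json_object"};
# the base_url and model name below are illustrative, not authoritative.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",        # illustrative model ID
    response_format={"type": "json_object"},  # structured-output request
    messages=[
        {"role": "system", "content": "Reply with a JSON object."},
        {"role": "user", "content": 'List three primes as {"primes": [...]}.'},
    ],
)
print(resp.choices[0].message.content)
```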
## API Pricing Comparison

Cheapest output pricing: Qwen 2.5 72B at $0.90/M output ($0.90/M input); Llama 3.1 405B at $3.00/M output ($3.00/M input).
| Provider | Qwen 2.5 72B In $/M | Qwen 2.5 72B Out $/M | Llama 3.1 405B In $/M | Llama 3.1 405B Out $/M |
|---|---|---|---|---|
| Together | $0.90 | $0.90 | $3.50 | $3.50 |
| Fireworks | $0.90 | $0.90 | $3.00 | $3.00 |
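To turn these rates into a budget, here is an illustrative monthly-cost comparison; the 75M/25M input:output token split is an assumed workload, not a measured one:

```python
# Illustrative cost comparison using the per-million-token prices above.
# The input/output token volumes are an assumed example workload.

PRICES = {  # (input $/M, output $/M), from the pricing table
    ("Together",  "Qwen 2.5 72B"):   (0.90, 0.90),
    ("Fireworks", "Qwen 2.5 72B"):   (0.90, 0.90),
    ("Together",  "Llama 3.1 405B"): (3.50, 3.50),
    ("Fireworks", "Llama 3.1 405B"): (3.00, 3.00),
}

def monthly_cost(in_tokens_m, out_tokens_m, in_price, out_price):
    return in_tokens_m * in_price + out_tokens_m * out_price

# Example workload: 75M input + 25M output tokens per month.
for (provider, model), (p_in, p_out) in PRICES.items():
    print(f"{provider:9s} {model:15s} ${monthly_cost(75, 25, p_in, p_out):8.2f}/mo")
```

At this volume the Qwen deployments land at $90/mo against $300 to $350/mo for Llama 3.1 405B, matching the roughly 3x output-price gap in the summary below.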
## Recommendation Summary

- Llama 3.1 405B scores higher on overall quality (88 vs 84).
- Qwen 2.5 72B is cheaper per output token ($0.90/M vs $3.00/M).
- Qwen 2.5 72B has a much smaller memory footprint (145.4 GB vs 810.0 GB in BF16), so it deploys on far fewer GPUs.
- Llama 3.1 405B is stronger at code generation (HumanEval: 61.0 vs 56.0).
- Llama 3.1 405B is better at math reasoning (GSM8K: 96.8 vs 91.6).