Jamba 1.5 Large
AI21 · hybrid · 398B parameters · 256,000 context
Parameters
398B
Context Window
256K tokens
Architecture
Hybrid
Best GPU
B200 NVL (pair)
Cheapest API
$8.00/M
Intelligence Brief
Jamba 1.5 Large is a 398B-parameter hybrid model from AI21 with 64 layers, a hidden dimension of 8,192, and Grouped Query Attention (GQA). It has a 256,000-token context window and supports tool use, structured output, code, math, and multilingual tasks. The cheapest API deployment is via ai21 at $8.00/M output tokens. For self-hosted inference, the top-scoring configuration is B200 NVL (pair) at FP8, which reaches about 280 tok/s at an estimated $19,929/month.
Architecture Details
Memory Requirements
BF16 Weights
796.0 GB
FP8 Weights
398.0 GB
INT4 Weights
199.0 GB
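The weight figures above follow from parameter count times bytes per parameter. The sketch below reproduces them, assuming 2, 1, and 0.5 bytes per parameter for BF16, FP8, and INT4, decimal gigabytes (1 GB = 10^9 bytes), and no quantization overhead; the function name is illustrative and not part of any library.

```python
# Assumed bytes per parameter; quantization metadata (scales, zero-points)
# is not counted here.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


PARAMS = 398e9  # 398B parameters, as listed for Jamba 1.5 Large

for precision in ("bf16", "fp8", "int4"):
    print(f"{precision}: {weight_memory_gb(PARAMS, precision):.1f} GB")
```

This covers weights only; KV-cache and activation memory come on top.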
Fits on (multi-GPU with Tensor Parallelism)
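Multi-GPU configurations use Tensor Parallelism (TP) to shard each layer's weight matrices across GPUs, so every GPU holds a slice of every layer. Because the GPUs exchange partial results at each layer, an NVLink or NVSwitch interconnect is needed for optimal performance.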
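To make the idea concrete, here is a toy column-parallel matrix multiply in plain Python. It is only a sketch: the "GPUs" are list slices, the helper names are made up for illustration, and the final concatenation stands in for the all-gather that a real runtime would perform over the interconnect.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a k-by-n matrix w (list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_gpus):
    """Shard w column-wise into num_gpus equal slices, one per GPU."""
    n = len(w[0])
    assert n % num_gpus == 0, "column count must divide evenly"
    step = n // num_gpus
    return [[row[g * step:(g + 1) * step] for row in w] for g in range(num_gpus)]


def tensor_parallel_matmul(x, w, num_gpus):
    """Each simulated GPU multiplies x by its own shard of w.

    Concatenating the partial outputs plays the role of the all-gather
    step that runs over NVLink or NVSwitch on real hardware.
    """
    shards = split_columns(w, num_gpus)
    partials = [matmul(x, shard) for shard in shards]  # independent per GPU
    return [value for part in partials for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# The sharded result matches the unsharded one.
assert tensor_parallel_matmul(x, w, num_gpus=2) == matmul(x, w)
```

Each shard stores only 1/num_gpus of the weight matrix, which is why per-GPU weight memory drops as the TP degree rises.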
GPU Compatibility Matrix
Jamba 1.5 Large is compatible with 2% of GPU configurations across 41 GPUs at 3 precision levels.
GPU Recommendations
| Configuration | Score | Throughput | Latency (ITL) | Est. TTFT | Cost/Month | Cost/M Tokens |
|---|---|---|---|---|---|---|
| FP8 · 2 GPUs · tensorrt-llm | 100/100 | 280.0 tok/s | 3.6 ms | 1 ms | $19929 | $27.08 |
| FP8 · 4 GPUs · tensorrt-llm | 98/100 | 280.0 tok/s | 3.6 ms | 1 ms | $17044 | $23.16 |
| FP8 · 4 GPUs · tensorrt-llm | 98/100 | 280.0 tok/s | 3.6 ms | 1 ms | $17082 | $23.21 |
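The Cost/M Tokens figures above can be reproduced from Cost/Month and Throughput if one assumes a 730-hour month and continuous generation at the listed throughput. Both are assumptions inferred from the numbers, not stated on the page. The listed ITL likewise matches the reciprocal of throughput, which only holds for a single decoding stream. The function names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumed billing month (8,760 hours / 12)


def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_second: float) -> float:
    """Dollars per million generated tokens at 100% utilization (assumed)."""
    tokens_per_month = tokens_per_second * 3600 * HOURS_PER_MONTH
    return monthly_cost_usd / (tokens_per_month / 1e6)


def inter_token_latency_ms(tokens_per_second: float) -> float:
    """Milliseconds between tokens, assuming a single decoding stream."""
    return 1000.0 / tokens_per_second


for monthly_cost in (19929, 17044, 17082):
    per_million = cost_per_million_tokens(monthly_cost, 280.0)
    print(f"${monthly_cost}/mo -> ${per_million:.2f}/M tokens")

print(f"ITL at 280 tok/s: {inter_token_latency_ms(280.0):.1f} ms")
```

At lower utilization the effective cost per million tokens rises proportionally, so these values are best read as a floor.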
Deployment Options
API Deployment
ai21
$8.00/M
output tokens
Single GPU
Requires multi-GPU setup (398 GB VRAM needed for FP8 weights)
Multi-GPU
B200 NVL (pair) x2
280.0 tok/s
TP · $19929/mo
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| ai21 | $2.00 | $8.00 | Cheapest |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| ai21 (Best Value) | $2.00 | $8.00 | $50 |
Cost per 1,000 Requests
| Request size | Cost per 1,000 requests | Provider |
|---|---|---|
| Short (500 tok) | $2.60 | ai21 |
| Medium (2K tok) | $10.40 | ai21 |
| Long (8K tok) | $32.00 | ai21 |
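Per-request cost depends on how tokens split between input ($2.00/M) and output ($8.00/M), which the page does not state. The sketch below uses assumed splits that happen to reproduce the listed figures: the request size is treated as input tokens, with 200, 800, and 2,000 output tokens respectively. Treat the splits and the function name as hypothetical.

```python
INPUT_PRICE_PER_M = 2.00   # USD per million input tokens (ai21, from the table)
OUTPUT_PRICE_PER_M = 8.00  # USD per million output tokens (ai21, from the table)


def cost_per_1000_requests(input_tokens: int, output_tokens: int) -> float:
    """USD cost of 1,000 identical requests at the prices above."""
    per_request = (input_tokens * INPUT_PRICE_PER_M
                   + output_tokens * OUTPUT_PRICE_PER_M) / 1e6
    return 1000 * per_request


# (input tokens, output tokens): assumed splits, not given by the source.
ASSUMED_SPLITS = {
    "Short": (500, 200),
    "Medium": (2000, 800),
    "Long": (8000, 2000),
}

for label, (tokens_in, tokens_out) in ASSUMED_SPLITS.items():
    print(f"{label}: ${cost_per_1000_requests(tokens_in, tokens_out):.2f}")
```

Because output tokens cost four times as much as input tokens here, generation-heavy workloads will cost more than these figures suggest.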
Performance Estimates
Precision Impact
bf16: 398.0 GB weights/GPU
fp8: 199.0 GB weights/GPU, ~280.0 tok/s
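The per-GPU numbers above are the total weight sizes divided across the two GPUs of the recommended setup. A minimal sketch, assuming weights are sharded evenly with nothing replicated (the function name is illustrative):

```python
def weights_per_gpu_gb(total_weights_gb: float, tensor_parallel_degree: int) -> float:
    """Per-GPU weight memory when weights are sharded evenly across GPUs."""
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be at least 1")
    return total_weights_gb / tensor_parallel_degree


# Total weight sizes for Jamba 1.5 Large as listed under Memory Requirements.
TOTAL_WEIGHTS_GB = {"bf16": 796.0, "fp8": 398.0}

for precision, total_gb in TOTAL_WEIGHTS_GB.items():
    for degree in (2, 4):
        per_gpu = weights_per_gpu_gb(total_gb, degree)
        print(f"{precision}, TP={degree}: {per_gpu:.1f} GB weights/GPU")
```

Real deployments also need headroom on each GPU for the KV-cache and activations, so a shard that barely fits is not enough.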
Similar Models
| Model | Parameters | Architecture | Quality | Cheapest API |
|---|---|---|---|---|
| Llama 4 Maverick | 400B | moe | 84 | from $1.80/M |
| Snowflake Arctic 128x3B | 395B | moe | 50 | |
| Llama 3.1 405B | 405B | dense | 81 | from $3.00/M |
| Nemotron 340B | 340B | dense | 85 | from $4.20/M |
| MiniMax-Text-01 | 456B | moe | 50 | from $5.00/M |
Frequently Asked Questions
How much VRAM does Jamba 1.5 Large need for inference?
Jamba 1.5 Large requires approximately 796.0 GB of VRAM for weights at BF16 precision, 398.0 GB at FP8, or 199.0 GB at INT4 quantization. Additional VRAM is needed for the KV-cache (131,072 bytes per token) and activations (~3.00 GB).
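As a rough worked example, the figures in this answer can be combined into a total VRAM estimate. The sketch assumes a single sequence, a KV-cache footprint that does not change with weight precision, and decimal gigabytes; the function names are illustrative.

```python
KV_BYTES_PER_TOKEN = 131_072  # from the answer above
ACTIVATIONS_GB = 3.0          # from the answer above (approximate)
WEIGHTS_GB = {"bf16": 796.0, "fp8": 398.0, "int4": 199.0}


def kv_cache_gb(num_tokens: int) -> float:
    """KV-cache size in decimal GB for one sequence of num_tokens tokens."""
    return num_tokens * KV_BYTES_PER_TOKEN / 1e9


def total_vram_gb(precision: str, num_tokens: int) -> float:
    """Weights + KV-cache + activations for a single sequence (assumed)."""
    return WEIGHTS_GB[precision] + kv_cache_gb(num_tokens) + ACTIVATIONS_GB


full_context = 256_000  # the model's listed context window
print(f"KV-cache at full context: {kv_cache_gb(full_context):.2f} GB")
print(f"Total at FP8, full context: {total_vram_gb('fp8', full_context):.2f} GB")
```

Under these assumptions a single full-length sequence adds roughly 34 GB on top of the weights; serving several long sequences concurrently multiplies the KV-cache term.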
What is the best GPU for Jamba 1.5 Large?
The top recommended GPU configuration for Jamba 1.5 Large is B200 NVL (pair) x2 using FP8 precision. It achieves approximately 280.0 tokens/sec at an estimated cost of $19,929/month ($27.08/M tokens), with a score of 100/100.
How much does Jamba 1.5 Large inference cost?
Jamba 1.5 Large API inference starts from $2.00/M input tokens and $8.00/M output tokens via ai21. Self-hosted inference costs depend on the GPU configuration; the recommended setups listed above are estimated at $23.16 to $27.08 per million tokens.