Qwen 3 30B-A3B
Alibaba · MoE · 30.5B parameters · 131,072-token context
Parameters
30.5B
Context Window
128K tokens
Architecture
MoE
Best GPU
H200 SXM
Quality Score
70/100
Intelligence Brief
Qwen 3 30B-A3B is a 30.5B-parameter Mixture-of-Experts model from Alibaba (128 experts, 8 active per token — roughly 3B active parameters, hence the "A3B" suffix), featuring Grouped Query Attention (GQA) across 48 layers with a 2,048-dimension hidden state. With a 131,072-token context window, it supports tool use, structured output, code, math, multilingual tasks, and reasoning. On standardized benchmarks it scores MMLU 75, HumanEval 48, and GSM8K 80. For self-hosted inference, the H200 SXM delivers optimal throughput at $2553/month.
Architecture Details
Memory Requirements
BF16 Weights
61.0 GB
FP8 Weights
30.5 GB
INT4 Weights
15.3 GB
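The weight figures above follow directly from parameter count times bytes per parameter. A minimal sketch using standard precision widths (BF16 = 2 bytes, FP8 = 1 byte, INT4 = 0.5 bytes); the INT4 figure rounds slightly differently depending on the exact parameter count used:

```python
# Weight memory = parameter count x bytes per parameter.
PARAMS = 30.5e9  # 30.5B parameters

def weight_gb(bytes_per_param: float) -> float:
    """Return weight memory in decimal GB (1 GB = 1e9 bytes)."""
    return PARAMS * bytes_per_param / 1e9

for name, width in [("bf16", 2.0), ("fp8", 1.0), ("int4", 0.5)]:
    print(f"{name}: {weight_gb(width):.1f} GB")
```

This reproduces the 61.0 GB (BF16) and 30.5 GB (FP8) figures; quantized formats in practice also carry small per-group scale factors not modeled here.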
GPU Compatibility Matrix
Qwen 3 30B-A3B is compatible with 62% of GPU configurations across 41 GPUs at 3 precision levels.
GPU Recommendations
H200 SXM · FP8 · 1 GPU · tensorrt-llm
100/100
score
Throughput
1.1K tok/s
Latency (ITL)
1.0ms
Est. TTFT
0ms
Cost/Month
$2553
Cost/M Tokens
$0.93
FP8 · 1 GPU · tensorrt-llm
100/100
score
Throughput
1.1K tok/s
Latency (ITL)
1.0ms
Est. TTFT
0ms
Cost/Month
$1794
Cost/M Tokens
$0.65
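The Cost/M Tokens figures can be reproduced from monthly GPU cost and sustained throughput, assuming round-the-clock 100% utilization (a best-case assumption; since "1.1K tok/s" is a rounded value, the results land near but not exactly on the listed $0.93 and $0.65):

```python
def cost_per_million(monthly_usd: float, tokens_per_sec: float) -> float:
    """USD per million tokens at 100% utilization over a 30-day month."""
    seconds_per_month = 30 * 24 * 3600
    tokens_per_month = tokens_per_sec * seconds_per_month
    return monthly_usd / (tokens_per_month / 1e6)

print(cost_per_million(2553, 1100))  # ~0.90/M, close to the listed $0.93
print(cost_per_million(1794, 1100))  # ~0.63/M, close to the listed $0.65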
Deployment Options
API Deployment
No API pricing available
Single GPU
H200 SXM
$2553/mo
Min VRAM: 31 GB
Multi-GPU
RTX A6000 x2
878.7 tok/s
TP · $930/mo
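The multi-GPU option works because tensor parallelism (TP) shards the weights roughly evenly across GPUs. A sketch under that even-split assumption, using the RTX A6000's 48 GB of VRAM:

```python
def weights_per_gpu_gb(total_weights_gb: float, tp_degree: int) -> float:
    """Per-GPU weight memory under even tensor-parallel sharding."""
    return total_weights_gb / tp_degree

# BF16 weights (61.0 GB) split across 2x RTX A6000 (48 GB each):
A6000_VRAM_GB = 48
per_gpu = weights_per_gpu_gb(61.0, 2)
print(per_gpu)  # 30.5 GB per GPU
assert per_gpu < A6000_VRAM_GB  # fits, with headroom for KV cache
```

The remaining ~17 GB per GPU goes to the KV cache and activations, which is why the 2x configuration is viable at BF16 even though a single A6000 is not.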
API Pricing Comparison
No API pricing data available for this model.
Performance Estimates
Throughput by GPU
VRAM Breakdown (H200 SXM, FP8)
Precision Impact
bf16
61.0 GB
weights/GPU
fp8
30.5 GB
weights/GPU
~1.1K tok/s
int4
15.3 GB
weights/GPU
Quality Benchmarks
Capabilities
Features
Supported Frameworks
Supported Precisions
Where to Deploy Qwen 3 30B-A3B
Self-Hosted Infrastructure
Similar Models
Qwen 3 32B
32.8B params · dense
Quality: 74
from $0.80/M
JAIS 30B
30B params · dense
Quality: 50
MPT 30B
30B params · dense
Quality: 48
Gemma 4 31B-IT
31B params · dense
Quality: 77
from $0.30/M
Qwen 2.5 32B
32.5B params · dense
Quality: 73
from $0.80/M
Frequently Asked Questions
How much VRAM does Qwen 3 30B-A3B need for inference?
Qwen 3 30B-A3B requires approximately 61.0 GB of VRAM at BF16 precision, 30.5 GB at FP8, or 15.3 GB at INT4 quantization. Additional VRAM is needed for the KV cache (24,576 bytes per token) and activations (~0.5 GB).
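These components combine into a total-VRAM estimate. A sketch using the figures stated above (24,576 bytes of KV cache per token and ~0.5 GB of activations; framework overhead is not modeled):

```python
KV_BYTES_PER_TOKEN = 24_576  # per-token KV-cache size stated above
ACTIVATIONS_GB = 0.5         # activation estimate stated above

def total_vram_gb(weights_gb: float, context_tokens: int) -> float:
    """Weights + KV cache + activations, in decimal GB."""
    kv_gb = KV_BYTES_PER_TOKEN * context_tokens / 1e9
    return weights_gb + kv_gb + ACTIVATIONS_GB

# FP8 weights (30.5 GB) at the full 131,072-token context:
print(round(total_vram_gb(30.5, 131_072), 1))  # ~34.2 GB
```

At shorter contexts the KV cache shrinks proportionally, which is why the single-GPU FP8 listing above quotes a minimum of about 31 GB.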
What is the best GPU for Qwen 3 30B-A3B?
The top recommended GPU for Qwen 3 30B-A3B is the H200 SXM using FP8 precision. It achieves approximately 1.1K tokens/sec at an estimated cost of $2553/month ($0.93/M tokens). Score: 100/100.
How much does Qwen 3 30B-A3B inference cost?
Qwen 3 30B-A3B inference costs vary by provider and GPU setup. Use our calculator for detailed cost estimates across all providers.