GLM-5
Zhipu AI · dense · 200B parameters · 128,000 context
GLM-5 is a 200B-parameter dense model from Zhipu AI with a 128,000-token context window. Its 80 transformer layers use a hidden dimension of 12,288 and Grouped Query Attention (GQA) for higher inference throughput. Based on InferenceBench analysis, the optimal deployment is the B200 SXM (x2) at FP8 precision, achieving approximately 280.0 tokens/second at $11.58 per million tokens.
Architecture Details
Memory Requirements
| Precision | Weights |
|---|---|
| BF16 | 400.0 GB |
| FP8 | 200.0 GB |
| INT4 | 100.0 GB |
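The weight figures above follow directly from the parameter count: bytes per parameter scale 2× (BF16), 1× (FP8), and 0.5× (INT4). A minimal sketch, using decimal GB (1 GB = 1e9 bytes) to match the table:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory: parameter count x bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# GLM-5: 200B parameters
print(weight_memory_gb(200, 2))    # BF16 -> 400.0 GB
print(weight_memory_gb(200, 1))    # FP8  -> 200.0 GB
print(weight_memory_gb(200, 0.5))  # INT4 -> 100.0 GB
```

Note this covers weights only; KV-cache and activation memory (see the FAQ below) come on top.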
GPU Recommendations
| Configuration | Score | Throughput | Cost/Month | Cost/M Tokens |
|---|---|---|---|---|
| FP8 · 2 GPUs · tensorrt-llm | 98/100 | 280.0 tok/s | $8,522 | $11.58 |
| FP8 · 2 GPUs · tensorrt-llm | 98/100 | 280.0 tok/s | $8,541 | $11.61 |
| FP8 · 2 GPUs · tensorrt-llm | 95/100 | 280.0 tok/s | $5,106 | $6.94 |
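The per-million-token costs above can be derived from monthly cost and sustained throughput. A sketch, assuming 730 GPU-hours per month at full utilization (an assumption, but it reproduces the listed figures to the cent):

```python
def cost_per_million_tokens(monthly_cost: float, tokens_per_sec: float,
                            hours_per_month: float = 730.0) -> float:
    # Tokens produced per month at sustained throughput.
    monthly_tokens = tokens_per_sec * 3600 * hours_per_month
    return monthly_cost / (monthly_tokens / 1e6)

print(round(cost_per_million_tokens(8522, 280.0), 2))  # 11.58
print(round(cost_per_million_tokens(8541, 280.0), 2))  # 11.61
print(round(cost_per_million_tokens(5106, 280.0), 2))  # 6.94
```

Real deployments rarely sustain 100% utilization, so effective $/M tokens will usually be higher.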
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| zhipu | $2.00 | $6.00 | Cheapest |
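Since input and output tokens are billed at different rates, the cost of a request is the sum of both. A quick sketch using the zhipu rates from the table:

```python
def api_cost_usd(input_tokens: int, output_tokens: int,
                 in_price: float = 2.00, out_price: float = 6.00) -> float:
    # Prices are $ per million tokens (zhipu rates above).
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example: a 10k-token prompt with a 2k-token completion.
print(round(api_cost_usd(10_000, 2_000), 4))  # 0.032
```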
Frequently Asked Questions
How much VRAM does GLM-5 need for inference?
GLM-5 requires approximately 400.0 GB of VRAM at BF16 precision, 200.0 GB at FP8, or 100.0 GB with INT4 quantization. Additional VRAM is needed for the KV cache (327,680 bytes per token) and activations (~4.00 GB).
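The 327,680 bytes/token figure is consistent with a GQA layout of, for example, 8 KV heads of dimension 128 per layer with an FP16 cache; those head counts are assumptions for illustration, not published specs. A sketch of the arithmetic:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    # Key + value (factor of 2) stored per layer for every token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed GQA shape: 80 layers, 8 KV heads x head_dim 128, FP16 cache.
per_token = kv_cache_bytes_per_token(80, 8, 128, 2)
print(per_token)                    # 327680
# One sequence at the full 128,000-token context:
print(per_token * 128_000 / 1e9)   # ~41.9 GB
```

This is why long-context batches dominate memory planning: a single full-context sequence adds roughly 42 GB on top of the weights.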
What is the best GPU for GLM-5?
The top recommended GPU for GLM-5 is the B200 SXM (x2) using FP8 precision. It achieves approximately 280.0 tokens/sec at an estimated cost of $8522/month ($11.58/M tokens). Score: 98/100.
How much does GLM-5 inference cost?
GLM-5 API inference starts from $2.00/M input tokens and $6.00/M output tokens. Self-hosted inference costs depend on your GPU configuration — use our ROI calculator for a detailed breakdown.
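A rough way to compare the two options is the break-even monthly volume: the token count above which a fixed-cost self-hosted deployment undercuts pay-per-token API pricing. A simplified sketch (it compares against a single API rate and ignores utilization, ops overhead, and the input/output price split):

```python
def break_even_tokens_millions(monthly_gpu_cost: float,
                               api_price_per_m: float) -> float:
    # Millions of tokens per month at which self-hosting and API cost the same.
    return monthly_gpu_cost / api_price_per_m

# Cheapest recommended config ($5,106/month) vs the $6.00/M output rate:
print(round(break_even_tokens_millions(5106, 6.00), 1))  # 851.0 M tokens/month
```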