Whisper Large V3
OpenAI · dense · 1.55B parameters · 448-token context
Parameters
1.55B
Context Window
448 tokens
Architecture
Dense
Best GPU
A4000
Cheapest API
$0.01/M
Intelligence Brief
Whisper Large V3 is a 1.55B-parameter dense model from OpenAI, featuring Multi-Head Attention (MHA) across 32 layers with a hidden dimension of 1,280. Its decoder supports a 448-token context window, and the model handles multilingual speech recognition and translation. The most cost-effective API deployment is via OpenAI at $0.01/M output tokens. For self-hosted inference, the A4000 delivers optimal throughput at $161/month.
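For orientation, the model is available as the Hugging Face checkpoint openai/whisper-large-v3 and runs with the standard transformers ASR pipeline. A minimal sketch (the audio path is a placeholder; device_map="auto" assumes accelerate is installed):

```python
# Minimal local-inference sketch via the Hugging Face transformers ASR pipeline.
# Assumes: transformers, torch, and accelerate installed; "sample.wav" is a placeholder.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # 1.55B dense encoder-decoder
    torch_dtype=torch.bfloat16,       # matches the BF16 sizing below (~3.1 GB of weights)
    device_map="auto",
)

print(asr("sample.wav")["text"])      # any ffmpeg-readable audio file
```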
Architecture Details
Memory Requirements
BF16 Weights
3.1 GB
FP8 Weights
1.6 GB
INT4 Weights
0.8 GB
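These figures follow directly from parameter count times bytes per parameter; a back-of-the-envelope sketch (the page rounds to one decimal):

```python
# Weight-memory estimate: parameter count x bytes per parameter, in GB (1e9 bytes).
PARAMS = 1.55e9  # Whisper Large V3

for name, bytes_per_param in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: {gb:.2f} GB")
# Prints 3.10, 1.55, 0.78 -- matching the 3.1 / 1.6 / 0.8 GB figures above, modulo rounding.
```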
GPU Compatibility Matrix
Whisper Large V3 is compatible with 100% of tested GPU configurations: 41 GPUs at 3 precision levels.
GPU Recommendations
A4000 · BF16 · 1 GPU · vLLM
90/100
score
Throughput
741.3 tok/s
Latency (ITL)
1.3ms
Est. TTFT
0ms
Cost/Month
$161
Cost/M Tokens
$0.08
BF16 · 1 GPU · vLLM
90/100
score
Throughput
1.2K tok/s
Latency (ITL)
0.8ms
Est. TTFT
0ms
Cost/Month
$304
Cost/M Tokens
$0.10
BF16 · 1 GPU · vLLM
90/100
score
Throughput
834.0 tok/s
Latency (ITL)
1.2ms
Est. TTFT
0ms
Cost/Month
$237
Cost/M Tokens
$0.11
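The $/M token figures in these cards follow from monthly cost divided by tokens generated per month; a quick sketch, assuming sustained 24/7 throughput over a 30-day month:

```python
# Cost per million tokens = monthly cost / tokens generated per month.
# Assumes round-the-clock utilization over a 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def cost_per_m_tokens(monthly_usd: float, tok_per_s: float) -> float:
    tokens_per_month = tok_per_s * SECONDS_PER_MONTH
    return monthly_usd / tokens_per_month * 1e6

print(cost_per_m_tokens(161, 741.3))  # ~0.08 (first card)
print(cost_per_m_tokens(304, 1200))   # ~0.10 (second card)
print(cost_per_m_tokens(237, 834.0))  # ~0.11 (third card)
```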
Deployment Options
API Deployment
OpenAI
$0.01/M
output tokens
Single GPU
A4000
$161/mo
Min VRAM: 2 GB
Multi-GPU
A4000
741.3 tok/s
Best available config
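For the API route, transcription goes through the audio endpoint of the OpenAI Python SDK. A minimal sketch (OpenAI exposes its hosted Whisper under the model id whisper-1; the filename here is a placeholder):

```python
# Hosted-API sketch using the OpenAI Python SDK's audio transcription endpoint.
# Assumes OPENAI_API_KEY is set in the environment; "meeting.mp3" is a placeholder.
from openai import OpenAI

client = OpenAI()
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # OpenAI's hosted Whisper model id
        file=audio_file,
    )
print(transcript.text)
```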
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| OpenAI | $0.01 | $0.01 | Cheapest |
Cost Analysis
| Provider | Input $/M | Output $/M | Est. Monthly Cost |
|---|---|---|---|
| OpenAI (Best Value) | $0.01 | $0.01 | $0 |
Cost per 1,000 Requests
Short (500 tok)
$0.00
via OpenAI
Medium (2K tok)
$0.02
via OpenAI
Long (8K tok)
$0.06
via OpenAI
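These per-request figures are easy to reproduce from the per-token price; a sketch (the short and medium figures match after rounding, while the long figure depends on the page's exact input/output assumptions):

```python
# Cost for 1,000 requests = requests * tokens_per_request / 1e6 * price per M tokens.
PRICE_PER_M = 0.01  # USD, the OpenAI rate quoted above

for label, tokens in [("Short", 500), ("Medium", 2_000), ("Long", 8_000)]:
    cost = 1_000 * tokens / 1e6 * PRICE_PER_M
    print(f"{label}: ${cost:.2f} per 1,000 requests")
# Short: $0.01 (shown as $0.00 after rounding), Medium: $0.02, Long: $0.08 by this
# naive formula vs. $0.06 on the page.
```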
Performance Estimates
Throughput by GPU
VRAM Breakdown (A4000, BF16)
Precision Impact
| Precision | Weights/GPU | Est. Throughput |
|---|---|---|
| BF16 | 3.1 GB | ~741.3 tok/s |
| FP8 | 1.6 GB | n/a |
| INT4 | 0.8 GB | n/a |
Capabilities
Features
Multilingual speech recognition
Supported Frameworks
vLLM
Supported Precisions
BF16, FP8, INT4
Similar Models
Whisper Medium
0.769B params · dense
Quality: 50
Qwen 2.5 1.5B
1.5B params · dense
Quality: 50
Qwen 2.5 Coder 1.5B
1.5B params · dense
Quality: 40
DeepSeek R1 Distill 1.5B
1.5B params · dense
Quality: 42
SmolLM2 1.7B
1.7B params · dense
Quality: 50
Frequently Asked Questions
How much VRAM does Whisper Large V3 need for inference?
Whisper Large V3 requires approximately 3.1 GB of VRAM at BF16 precision, 1.6 GB at FP8, or 0.8 GB at INT4 quantization. Additional VRAM is needed for the KV cache (163,840 bytes, about 160 KB, per token) and activations (~0.30 GB).
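The per-token figure is consistent with the usual KV-cache estimate of 2 tensors (K and V) × 2 bytes (BF16) × hidden size × layers; a sketch that also totals VRAM at the full 448-token context:

```python
# Per-token KV cache = 2 (K and V) * bytes per value * hidden dim * num layers.
HIDDEN, LAYERS, BF16_BYTES = 1280, 32, 2
kv_per_token = 2 * BF16_BYTES * HIDDEN * LAYERS
print(kv_per_token)  # 163840 bytes, matching the figure above

# Rough total at the full 448-token decoder context:
weights_gb = 3.1
kv_gb = kv_per_token * 448 / 1e9  # ~0.07 GB
activations_gb = 0.30
print(f"~{weights_gb + kv_gb + activations_gb:.1f} GB total")  # ~3.5 GB
```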
What is the best GPU for Whisper Large V3?
The top recommended GPU for Whisper Large V3 is the A4000 using BF16 precision. It achieves approximately 741.3 tokens/sec at an estimated cost of $161/month ($0.08/M tokens). Score: 90/100.
How much does Whisper Large V3 inference cost?
Whisper Large V3 API inference starts from $0.01/M input tokens and $0.01/M output tokens. Self-hosted inference costs depend on your GPU configuration — use our ROI calculator for a detailed breakdown.
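For a first-order comparison before reaching for the ROI calculator, you can set the fixed monthly GPU bill against the pure per-token API bill; a sketch using the figures on this page:

```python
# Break-even volume: monthly GPU cost / API price per million tokens.
GPU_MONTHLY = 161.0  # A4000, from above
API_PER_M = 0.01     # USD per million output tokens

breakeven_m_tokens = GPU_MONTHLY / API_PER_M
print(f"Break-even at {breakeven_m_tokens:,.0f}M tokens/month")  # 16,100M

# Note: at 741.3 tok/s, one A4000 sustains ~1,921M tokens over a 30-day month,
# well short of 16,100M -- consistent with $0.08/M self-hosted vs. $0.01/M API.
```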