Whisper Medium
OpenAI · dense · 0.769B parameters · 448 context
Parameters
0.769B
Context Window
448 tokens
Architecture
Dense
Best GPU
RTX 4060
Intelligence Brief
Whisper Medium is a 0.769B-parameter dense encoder-decoder model from OpenAI, using multi-head attention (MHA) across 24 layers with a 1,024-dimensional hidden state. Its 448-token context window applies to the decoder's text output, and the model supports multilingual speech recognition and translation. For self-hosted inference, the RTX 4060 delivers optimal throughput at an estimated $209/month.
Architecture Details
Memory Requirements
BF16 Weights
1.5 GB
FP8 Weights
0.8 GB
INT4 Weights
0.4 GB
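The weight figures above follow directly from the parameter count times the bytes per parameter at each precision. A minimal sketch (assuming the standard sizes of 2 bytes for BF16, 1 for FP8, and 0.5 for INT4, and 1 GB = 10^9 bytes):

```python
# Estimate raw weight memory for Whisper Medium (0.769B parameters)
# at the three listed precisions. Real deployments also need VRAM
# for the KV-cache and activations on top of these figures.
PARAMS = 0.769e9

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(params: float, precision: str) -> float:
    """Raw weight footprint in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9

for prec in ("bf16", "fp8", "int4"):
    print(f"{prec}: {weight_memory_gb(PARAMS, prec):.1f} GB")
# bf16: 1.5 GB, fp8: 0.8 GB, int4: 0.4 GB
```

The same arithmetic generalizes to any model: multiply parameters by bytes per parameter for the target precision.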
GPU Compatibility Matrix
Whisper Medium fits on 100% of tested GPU configurations: 41 GPUs at each of 3 precision levels.
GPU Recommendations
BF16 · 1 GPU · vllm
90/100
score
Throughput
907.2 tok/s
Latency (ITL)
1.1ms
Est. TTFT
0ms
Cost/Month
$209
Cost/M Tokens
$0.09
BF16 · 1 GPU · vllm
90/100
score
Throughput
1.5K tok/s
Latency (ITL)
0.7ms
Est. TTFT
0ms
Cost/Month
$85
Cost/M Tokens
$0.02
FP8 · 1 GPU · tensorrt-llm
83/100
score
Throughput
3.5K tok/s
Latency (ITL)
0.3ms
Est. TTFT
0ms
Cost/Month
$4,261
Cost/M Tokens
$0.46
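The cost-per-million-tokens figures in the cards above can be reconstructed from the monthly GPU cost and sustained throughput. A sketch, assuming 24/7 utilization over a 30-day month (an upper bound on token output; lower utilization raises the effective cost proportionally):

```python
# Derive $/M tokens from monthly GPU cost and sustained throughput,
# assuming the GPU runs at full load around the clock (30-day month).
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def cost_per_m_tokens(monthly_usd: float, tok_per_s: float) -> float:
    tokens_per_month = tok_per_s * SECONDS_PER_MONTH
    return monthly_usd / (tokens_per_month / 1e6)

print(cost_per_m_tokens(209, 907.2))    # ~0.09  (first card)
print(cost_per_m_tokens(85, 1500.0))    # ~0.02  (second card)
print(cost_per_m_tokens(4261, 3500.0))  # ~0.47  (third card)
```

The computed values agree with the listed figures to within rounding.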
Deployment Options
API Deployment
No API pricing available
Single GPU
RTX 4060
$209/mo
Min VRAM: 1 GB
Multi-GPU
RTX 4060
907.2 tok/s
Best available config
API Pricing Comparison
No API pricing data available for this model.
Performance Estimates
Throughput by GPU
VRAM Breakdown (RTX 4060, BF16)
Precision Impact
bf16
1.5 GB
weights/GPU
~907.2 tok/s
fp8
0.8 GB
weights/GPU
int4
0.4 GB
weights/GPU
Capabilities
Features
Supported Frameworks
Supported Precisions
Where to Deploy Whisper Medium
Self-Hosted Infrastructure
Similar Models
Whisper Small
0.244B params · dense
Quality: 50
Whisper Large V3
1.55B params · dense
Quality: 50
from $0.01/M
Florence 2 Large
0.77B params · dense
Quality: 50
Parakeet CTC 0.6B
0.6B params · dense
Quality: 50
from $0.03/M
Qwen 3 0.6B
0.6B params · dense
Quality: 50
Frequently Asked Questions
How much VRAM does Whisper Medium need for inference?
Whisper Medium requires approximately 1.5 GB of VRAM at BF16 precision, 0.8 GB at FP8, or 0.4 GB at INT4 quantization. Additional VRAM is needed for the KV-cache (98,304 bytes per token, i.e. 2 × 24 layers × 1,024 hidden dimensions × 2 bytes at BF16) and activations (~0.20 GB).
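The per-token KV-cache figure follows from the architecture details above: each token stores one key and one value vector per layer. A quick check, assuming BF16 (2-byte) cache entries:

```python
# Reproduce the quoted KV-cache cost per token for Whisper Medium's
# MHA decoder: a key and a value vector per layer, per token.
LAYERS = 24
HIDDEN = 1024
BYTES_BF16 = 2

kv_bytes_per_token = 2 * LAYERS * HIDDEN * BYTES_BF16  # K and V
print(kv_bytes_per_token)  # 98304 bytes

# Even the full 448-token context adds only ~44 MB on top of
# the ~1.5 GB of BF16 weights.
print(kv_bytes_per_token * 448 / 1e6)  # ~44 MB
```

This is why the KV-cache is a rounding error for Whisper Medium: the 448-token context caps cache growth, unlike long-context LLMs where the cache can dominate VRAM.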
What is the best GPU for Whisper Medium?
The top recommended GPU for Whisper Medium is the RTX 4060 using BF16 precision. It achieves approximately 907.2 tokens/sec at an estimated cost of $209/month ($0.09/M tokens). Score: 90/100.
How much does Whisper Medium inference cost?
Whisper Medium inference costs vary by provider and GPU setup. Use our calculator for detailed cost estimates across all providers.