Phi 3 Medium 14B
Microsoft · dense · 14B parameters · 131,072 context
Parameters: 14B
Context Window: 128K tokens
Architecture: Dense
Best GPU: A100 40GB SXM
Quality Score: 76/100
Intelligence Brief
Phi 3 Medium 14B is a 14B-parameter dense model from Microsoft, featuring Grouped Query Attention (GQA) with 40 layers and a hidden dimension of 5,120. With a 131,072-token context window, it supports structured output, code, and math. On standardized benchmarks it scores 78 on MMLU, 55 on HumanEval, and 86 on GSM8K. For self-hosted inference, the A100 40GB SXM delivers the best throughput at an estimated $807/month.
Architecture Details
Memory Requirements
BF16 Weights: 28.0 GB
FP8 Weights: 14.0 GB
INT4 Weights: 7.0 GB
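The weight figures above follow directly from parameter count times bytes per parameter. A minimal sketch, assuming exactly 14e9 parameters and decimal GB (1 GB = 1e9 bytes):

```python
# Weight-memory estimate: parameter count x bytes per parameter.
# PARAMS is an assumption (exactly 14e9); real checkpoints vary slightly.
PARAMS = 14e9

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(precision: str, params: float = PARAMS) -> float:
    """Approximate weight footprint in decimal GB for a given precision."""
    return params * BYTES_PER_PARAM[precision] / 1e9

for p in ("bf16", "fp8", "int4"):
    print(f"{p}: {weight_gb(p):.1f} GB")
```

This reproduces the 28.0 / 14.0 / 7.0 GB figures; KV cache and activations come on top of these numbers.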
GPU Compatibility Matrix
Phi 3 Medium 14B is compatible with 82% of GPU configurations across 41 GPUs at 3 precision levels.
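A back-of-envelope check of the compatibility figure, under the assumption (not stated by the source) that a "configuration" is one GPU model at one precision level:

```python
# Hypothetical reading: 41 GPUs x 3 precision levels = 123 configurations,
# so 82% compatibility implies roughly 101 working configurations.
gpus = 41
precisions = 3
compat_rate = 0.82

total_configs = gpus * precisions
compatible = round(total_configs * compat_rate)
print(total_configs, compatible)
```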
GPU Recommendations
BF16 · 1 GPU · vllm · Score: 95/100
Throughput: 299.9 tok/s · Latency (ITL): 3.3ms · Est. TTFT: 1ms · Cost/Month: $807 · Cost/M Tokens: $1.02

BF16 · 1 GPU · vllm · Score: 95/100
Throughput: 148.1 tok/s · Latency (ITL): 6.8ms · Est. TTFT: 1ms · Cost/Month: $465 · Cost/M Tokens: $1.19

BF16 · 1 GPU · vllm · Score: 95/100
Throughput: 134.2 tok/s · Latency (ITL): 7.5ms · Est. TTFT: 1ms · Cost/Month: $399 · Cost/M Tokens: $1.13
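The Cost/M Tokens column can be derived from monthly cost and sustained throughput. A sketch, assuming 100% utilization and an average month of 30.44 days (both are assumptions, not stated by the source):

```python
# Cost per million tokens from monthly GPU cost and sustained throughput.
# Assumes full utilization and a 30.44-day average month.
SECONDS_PER_MONTH = 86400 * 30.44

def cost_per_m_tokens(monthly_usd: float, tok_per_s: float) -> float:
    tokens_per_month = tok_per_s * SECONDS_PER_MONTH
    return monthly_usd / (tokens_per_month / 1e6)

for usd, tps in [(807, 299.9), (465, 148.1), (399, 134.2)]:
    print(f"${usd}/mo @ {tps} tok/s -> ${cost_per_m_tokens(usd, tps):.2f}/M tokens")
```

Under these assumptions the formula reproduces all three listed figures ($1.02, $1.19, $1.13 per million tokens).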
Deployment Options
API Deployment: no API pricing available
Single GPU: A100 40GB SXM · $807/mo · Min VRAM: 14 GB
Multi-GPU: RTX 3090 x2 · 298.7 tok/s · TP · $361/mo
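The two-way RTX 3090 tensor-parallel option works because TP shards the 28.0 GB of BF16 weights across GPUs. A minimal sizing sketch; the 90% usable-VRAM factor is an assumption for illustration (headroom for KV cache and activations), not a figure from the source:

```python
import math

# Minimum GPU count for tensor parallelism over the model weights.
# `usable` is an assumed fraction of VRAM available for weights.
def min_gpus(weights_gb: float, vram_gb: float, usable: float = 0.9) -> int:
    return math.ceil(weights_gb / (vram_gb * usable))

print(min_gpus(28.0, 24.0))  # RTX 3090, 24 GB
print(min_gpus(28.0, 40.0))  # A100 40GB SXM
```

This gives 2x RTX 3090 vs. a single A100 40GB, matching the deployment options above.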
API Pricing Comparison
No API pricing data available for this model.
Performance Estimates
(Charts: throughput by GPU · VRAM breakdown on A100 40GB SXM at BF16)
Precision Impact
bf16: 28.0 GB weights/GPU · ~299.9 tok/s
fp8: 14.0 GB weights/GPU
int4: 7.0 GB weights/GPU
Similar Models
Phi 3 Small 7B · 7B params · dense · Quality: 72
Nekomata 14B · 14B params · dense · Quality: 50
RWKV-6 14B · 14.1B params · hybrid · Quality: 50 · from $0.20/M
Qwen 1.5 MoE A2.7B · 14.3B params · moe · Quality: 50
Phi-4 · 14.7B params · dense · Quality: 73 · from $0.14/M
Frequently Asked Questions
How much VRAM does Phi 3 Medium 14B need for inference?
Phi 3 Medium 14B requires approximately 28.0 GB of VRAM at BF16 precision, 14.0 GB at FP8, or 7.0 GB at INT4 quantization. Additional VRAM is needed for the KV cache (204,800 bytes per token) and activations (~1.50 GB).
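Putting these figures together gives a rough total VRAM budget as a function of context length. A sketch using only the numbers above (28.0 GB BF16 weights, 204,800 bytes/token KV cache, ~1.5 GB activations) with decimal GB assumed; note the arithmetic implies the full 131,072-token context does not fit in 40 GB at BF16, which is where shorter contexts or quantization come in:

```python
# Rough total-VRAM estimate at BF16: weights + KV cache + activations.
# All constants are taken from the figures above; 1 GB = 1e9 bytes assumed.
WEIGHTS_GB = 28.0
KV_BYTES_PER_TOKEN = 204_800
ACTIVATIONS_GB = 1.5

def total_vram_gb(context_tokens: int) -> float:
    kv_gb = KV_BYTES_PER_TOKEN * context_tokens / 1e9
    return WEIGHTS_GB + kv_gb + ACTIVATIONS_GB

print(f"{total_vram_gb(131_072):.1f} GB")  # full 128K context
print(f"{total_vram_gb(8_192):.1f} GB")    # a shorter 8K context
```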
What is the best GPU for Phi 3 Medium 14B?
The top recommended GPU for Phi 3 Medium 14B is the A100 40GB SXM at BF16 precision. It achieves approximately 299.9 tokens/sec at an estimated $807/month ($1.02 per million tokens), with a recommendation score of 95/100.
How much does Phi 3 Medium 14B inference cost?
Phi 3 Medium 14B inference costs vary by provider and GPU setup. Use our calculator for detailed cost estimates across all providers.