Phi-4
Microsoft · dense · 14.7B parameters · 16,384 context
Parameters
14.7B
Context Window
16K tokens
Architecture
Dense
Best GPU
A100 40GB SXM
Cheapest API
$0.14/M
Quality Score
73/100
Intelligence Brief
Phi-4 is a 14.7B-parameter dense model from Microsoft. It uses Grouped Query Attention (GQA), with 40 layers and a hidden dimension of 5,120. It has a 16,384-token context window and supports tool use, structured output, code, math, multilingual text, and reasoning. On standard benchmarks it scores 84.8 on MMLU, 67 on HumanEval, and 93 on GSM8K. The cheapest listed API deployment is Azure at $0.14/M output tokens. For self-hosted inference, the top-ranked GPU is the A100 40GB SXM, at roughly 285.6 tokens/sec for an estimated $807/month.
Architecture Details
Memory Requirements
BF16 Weights
29.4 GB
FP8 Weights
14.7 GB
INT4 Weights
7.3 GB
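The weight figures above appear to follow from multiplying the parameter count by the storage size of each precision. A minimal sketch of that arithmetic, assuming 2, 1, and 0.5 bytes per parameter for BF16, FP8, and INT4, decimal gigabytes, and no quantization overhead (scales, zero points); the function name is illustrative:

```python
# Illustrative estimate: weight memory = parameter count * bytes per parameter.
PARAMS = 14.7e9  # parameter count listed for Phi-4

# Assumed storage cost per parameter; real quantized formats add some overhead.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


weights_gb = {
    name: weight_memory_gb(PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}

for name, gb in weights_gb.items():
    print(f"{name}: {gb:.2f} GB")
```

Under these assumptions the INT4 estimate comes out at 7.35 GB, so the listed 7.3 GB is presumably rounded down.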
GPU Compatibility Matrix
Phi-4 is compatible with 82% of GPU configurations across 41 GPUs at 3 precision levels.
GPU Recommendations
| Configuration | Score | Throughput | Latency (ITL) | Est. TTFT | Cost/Month | Cost/M Tokens |
|---|---|---|---|---|---|---|
| BF16 · 1 GPU · vllm | 95/100 | 285.6 tok/s | 3.5 ms | 1 ms | $807 | $1.07 |
| BF16 · 1 GPU · vllm | 95/100 | 141.1 tok/s | 7.1 ms | 1 ms | $465 | $1.25 |
| BF16 · 1 GPU · vllm | 95/100 | 127.8 tok/s | 7.8 ms | 1 ms | $399 | $1.19 |
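The cost-per-million-token and inter-token latency figures for these configurations can be roughly reproduced from the monthly cost and throughput alone. The sketch below assumes about 730 hours per month at 100% utilization, and treats ITL as the reciprocal of single-stream throughput. The page does not state either convention, so treat both as assumptions that happen to land close to the listed numbers.

```python
HOURS_PER_MONTH = 730  # assumption: average month, GPU busy the whole time


def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_sec: float) -> float:
    """Monthly GPU cost divided by the tokens generated in that month, per 1M."""
    tokens_per_month = tokens_per_sec * 3600 * HOURS_PER_MONTH
    return monthly_cost_usd / tokens_per_month * 1e6


def inter_token_latency_ms(tokens_per_sec: float) -> float:
    """Assumes a single decoding stream, so ITL is just 1 / throughput."""
    return 1000.0 / tokens_per_sec


# (monthly cost in USD, throughput in tok/s) for the three listed configurations
configs = [(807, 285.6), (465, 141.1), (399, 127.8)]

results = [
    (cost_per_million_tokens(cost, tps), inter_token_latency_ms(tps))
    for cost, tps in configs
]

for (cost, tps), (usd_per_m, itl_ms) in zip(configs, results):
    print(f"${cost}/mo at {tps} tok/s: ${usd_per_m:.2f}/M tokens, ITL {itl_ms:.1f} ms")
```

Because the estimate assumes full utilization, real cost per token rises in proportion to idle time: a GPU that is busy half the time costs about twice as much per generated token.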
Deployment Options
API Deployment
azure
$0.14/M
output tokens
Single GPU
A100 40GB SXM
$807/mo
Min VRAM: 15 GB
Multi-GPU
RTX 3090 x2
285.6 tok/s
TP · $361/mo
API Pricing Comparison
| Provider | Input $/M | Output $/M | Badges |
|---|---|---|---|
| azure | $0.07 | $0.14 | Cheapest |
| together | $0.20 | $0.20 | |
Cost Analysis
| Provider | Input $/M | Output $/M | ~Monthly Cost |
|---|---|---|---|
| azure (Best Value) | $0.07 | $0.14 | $1 |
| together | $0.20 | $0.20 | $2 |
Cost per 1,000 Requests
Short (500 tok)
$0.06
via azure
Medium (2K tok)
$0.25
via azure
Long (8K tok)
$0.84
via azure
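The page does not say how each request's tokens are split between input and output, so the per-1,000-request figures cannot be reproduced exactly. As a rough sanity check, the sketch below computes the lower bound (all tokens billed as input) and the upper bound (all billed as output) at the listed Azure prices. The function name and the reading of "2K" and "8K" as 2,000 and 8,000 tokens are assumptions.

```python
INPUT_PRICE_PER_M = 0.07   # USD per 1M input tokens (listed Azure price)
OUTPUT_PRICE_PER_M = 0.14  # USD per 1M output tokens (listed Azure price)


def cost_per_1k_requests(input_tokens: int, output_tokens: int) -> float:
    """USD cost of 1,000 requests with the given per-request token counts."""
    per_request = (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1e6
    return per_request * 1000


# Total tokens per request; the input/output split is unknown, so bound it.
request_sizes = {"Short": 500, "Medium": 2000, "Long": 8000}

bounds = {
    name: (cost_per_1k_requests(total, 0), cost_per_1k_requests(0, total))
    for name, total in request_sizes.items()
}

for name, (low, high) in bounds.items():
    print(f"{name}: between ${low:.3f} and ${high:.3f} per 1,000 requests")
```

The listed $0.06, $0.25, and $0.84 all fall inside these bounds, which is consistent with each request mixing input and output tokens.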
Performance Estimates
Precision Impact
| Precision | Weights/GPU | Est. Throughput |
|---|---|---|
| BF16 | 29.4 GB | ~285.6 tok/s |
| FP8 | 14.7 GB | |
| INT4 | 7.3 GB | |
Similar Models
Phi 3.5 MoE
41.9B params · moe
Quality: 74
Qwen 2.5 Coder 14B
14.7B params · dense
Quality: 50
from $0.30/M
Qwen 2.5 14B
14.8B params · dense
Quality: 76
from $0.30/M
DeepSeek R1 Distill 14B
14.8B params · dense
Quality: 88
from $0.30/M
Nemotron 15B
15B params · dense
Quality: 72
from $0.30/M
Frequently Asked Questions
How much VRAM does Phi-4 need for inference?
Phi-4 requires approximately 29.4 GB of VRAM for weights at BF16 precision, 14.7 GB at FP8, or 7.3 GB at INT4 quantization. Additional VRAM is needed for the KV cache (204,800 bytes per token, which is about 3.36 GB for a single sequence at the full 16,384-token context) and for activations (~1.50 GB).
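The per-token KV-cache figure can be checked with the usual GQA formula: 2 (keys and values) * layers * KV heads * head dimension * bytes per value. In the sketch below the 40 layers come from this page, while the 10 KV heads and head dimension of 128 are assumptions chosen because they match the quoted 204,800 bytes per token with a BF16 cache; the page itself does not state them.

```python
NUM_LAYERS = 40       # stated on this page
NUM_KV_HEADS = 10     # assumption: not stated on this page
HEAD_DIM = 128        # assumption: not stated on this page
BYTES_PER_VALUE = 2   # assumption: KV cache stored in BF16

WEIGHTS_GB = 29.4     # BF16 weights, from this page
ACTIVATIONS_GB = 1.5  # activation estimate, from this page


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def total_vram_gb(context_tokens: int, num_sequences: int = 1) -> float:
    """Weights + KV cache + activations, in decimal gigabytes."""
    kv_gb = kv_bytes_per_token() * context_tokens * num_sequences / 1e9
    return WEIGHTS_GB + kv_gb + ACTIVATIONS_GB


print(kv_bytes_per_token())
print(f"{total_vram_gb(16_384):.2f} GB for one full-context sequence")
```

Under these assumptions a single full-context sequence at BF16 needs a little over 34 GB, which fits on a 40 GB card but leaves limited room for batching several long sequences at once.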
What is the best GPU for Phi-4?
The top recommended GPU for Phi-4 is the A100 40GB SXM using BF16 precision. It achieves approximately 285.6 tokens/sec at an estimated cost of $807/month ($1.07/M tokens). Score: 95/100.
How much does Phi-4 inference cost?
Phi-4 API inference starts from $0.07/M input tokens and $0.14/M output tokens. Self-hosted inference costs depend on your GPU configuration; use our ROI calculator for a detailed breakdown.