Florence 2 Large
Microsoft · dense · 0.77B parameters · 2,048 context
Parameters
0.77B
Context Window
2K tokens
Architecture
Dense
Best GPU
RTX 4060
Intelligence Brief
Florence 2 Large is a 0.77B-parameter dense model from Microsoft, featuring Multi-Head Attention (MHA) with 24 layers and a 1,024-dimensional hidden state. With a 2,048-token context window, it supports vision and structured output. For self-hosted inference, the RTX 4060 delivers optimal throughput at $209/month.
Architecture Details
Memory Requirements
BF16 Weights
1.5 GB
FP8 Weights
0.8 GB
INT4 Weights
0.4 GB
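The weight figures above follow directly from parameter count times bytes per parameter at each precision. A minimal sketch (assuming decimal GB, i.e. 10^9 bytes, which matches the table):

```python
# Approximate weight memory for Florence 2 Large (0.77B parameters):
# memory ≈ parameter_count × bytes_per_parameter.
PARAMS = 0.77e9

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9  # decimal gigabytes
    print(f"{precision}: {gb:.1f} GB")
# → BF16: 1.5 GB, FP8: 0.8 GB, INT4: 0.4 GB
```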
GPU Compatibility Matrix
Florence 2 Large is compatible with all 41 tested GPUs at all 3 precision levels (100% of configurations).
GPU Recommendations
BF16 · 1 GPU · vllm
90/100
score
Throughput
906.0 tok/s
Latency (ITL)
1.1ms
Est. TTFT
0ms
Cost/Month
$209
Cost/M Tokens
$0.09
BF16 · 1 GPU · vllm
90/100
score
Throughput
1.5K tok/s
Latency (ITL)
0.7ms
Est. TTFT
0ms
Cost/Month
$85
Cost/M Tokens
$0.02
FP8 · 1 GPU · tensorrt-llm
83/100
score
Throughput
3.5K tok/s
Latency (ITL)
0.3ms
Est. TTFT
0ms
Cost/Month
$4,261
Cost/M Tokens
$0.46
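The cost-per-million-tokens figures above can be derived from monthly cost and throughput, assuming full 24/7 utilization (an assumption; real-world utilization is lower, so effective cost per token is higher):

```python
# Rough $/M-tokens at full utilization:
# cost_per_M = monthly_cost / (tok/s × seconds_per_month / 1e6)
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def cost_per_m_tokens(monthly_cost_usd: float, tok_per_s: float) -> float:
    tokens_per_month_m = tok_per_s * SECONDS_PER_MONTH / 1e6  # millions of tokens
    return monthly_cost_usd / tokens_per_month_m

# RTX 4060 BF16 figures from above: $209/month at 906 tok/s
print(f"${cost_per_m_tokens(209, 906):.2f}/M")  # → $0.09/M
```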
Deployment Options
API Deployment
No API pricing available
Single GPU
RTX 4060
$209/mo
Min VRAM: 1 GB
Multi-GPU
RTX 4060
906.0 tok/s
Best available config
API Pricing Comparison
No API pricing data available for this model.
Performance Estimates
Throughput by GPU
VRAM Breakdown (RTX 4060, BF16)
Precision Impact
bf16
1.5 GB
weights/GPU
~906.0 tok/s
fp8
0.8 GB
weights/GPU
int4
0.4 GB
weights/GPU
Capabilities
Features
Supported Frameworks
Supported Precisions
Where to Deploy Florence 2 Large
Self-Hosted Infrastructure
Similar Models
Whisper Medium
0.769B params · dense
Quality: 50
Parakeet CTC 0.6B
0.6B params · dense
Quality: 50
from $0.03/M
Qwen 3 0.6B
0.6B params · dense
Quality: 50
Jina Embeddings v3
0.57B params · dense
Quality: 50
from $0.01/M
BGE M3
0.568B params · dense
Quality: 50
from $0.01/M
Frequently Asked Questions
How much VRAM does Florence 2 Large need for inference?
Florence 2 Large requires approximately 1.5 GB of VRAM at BF16 precision, 0.8 GB at FP8, or 0.4 GB at INT4 quantization. Additional VRAM is needed for the KV-cache (98,304 bytes per token) and activations (~0.20 GB).
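The 98,304 bytes-per-token KV-cache figure is consistent with the architecture stated above (24 layers, 1,024 hidden dims, MHA) cached at BF16. A sketch of the arithmetic:

```python
# KV-cache bytes per token for an MHA model with a BF16 cache:
# 2 (K and V) × layers × hidden_dim × bytes_per_element.
LAYERS, HIDDEN, BYTES_BF16 = 24, 1024, 2

kv_bytes_per_token = 2 * LAYERS * HIDDEN * BYTES_BF16
print(kv_bytes_per_token)  # → 98304

# KV cache for a full 2,048-token context (on top of weights and activations):
kv_total_gb = kv_bytes_per_token * 2048 / 1e9
print(f"{kv_total_gb:.2f} GB")  # → 0.20 GB
```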
What is the best GPU for Florence 2 Large?
The top recommended GPU for Florence 2 Large is the RTX 4060 using BF16 precision. It achieves approximately 906.0 tokens/sec at an estimated cost of $209/month ($0.09/M tokens). Score: 90/100.
How much does Florence 2 Large inference cost?
Florence 2 Large inference costs vary by provider and GPU setup. Use our calculator for detailed cost estimates across all providers.