DINOv2 ViT-g/14 (NVIDIA-optimized)

Vision Embedding

NVIDIA · DINOv2 · v2.0 · released

About

DINOv2 is a self-supervised vision foundation model that produces strong frozen image features, which transfer to many downstream tasks without task-specific fine-tuning. The ViT-g/14 variant has 1.1B parameters and, by published linear-probe benchmarks, is currently the strongest open self-supervised vision encoder.

Intended use: Frozen feature extraction for downstream vision tasks (classification, segmentation, depth estimation, instance retrieval), as a vision tower for vision-language models, or as a backbone for fine-tuning on small labeled datasets.
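As one illustration of the retrieval use case, the sketch below ranks gallery items by cosine similarity to a query embedding. It is a minimal example, not the card's or NVIDIA's method: the helper names `cosine` and `nearest` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the 1,536-dimensional embeddings a frozen encoder would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)

# Toy embeddings (assumption): real ones would come from the frozen encoder.
gallery = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.7, 0.6, 0.1],
    "car": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

ranking = nearest(query, gallery)
print(ranking)  # ['cat', 'dog', 'car']
```

Because the encoder stays frozen, the gallery embeddings can be computed once and reused for every query.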

Architecture

Type: encoder
Parameters: 1.1B
Layers: 40
Hidden dim: 1,536

Self-supervised vision transformer (ViT-g/14), trained without labels using DINOv2 (version 2 of DINO, self-distillation with no labels) on the curated LVD-142M image corpus. The model uses a 14×14-pixel patch size and 40 transformer blocks. It produces per-patch and CLS token embeddings that transfer to classification, segmentation, depth estimation, and retrieval without fine-tuning. Originally published by Meta FAIR; NVIDIA serves an inference-optimized variant via NIM with FP8 quantization on H100.
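To make the patch arithmetic concrete, the sketch below computes how many tokens the encoder emits for a square input. The 224-pixel input size is an assumption chosen for illustration (the card does not state the served resolution), and `vit_token_count` is a hypothetical helper name.

```python
PATCH_SIZE = 14     # from the card: 14x14 patches
HIDDEN_DIM = 1536   # from the card: hidden dimension

def vit_token_count(image_size: int, patch_size: int = PATCH_SIZE) -> tuple[int, int]:
    """Return (patch_tokens, total_tokens) for a square image.

    total_tokens adds one CLS token to the patch tokens.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    patch_tokens = per_side * per_side
    return patch_tokens, patch_tokens + 1

# Assumed 224x224 input: 16 patches per side.
patch_tokens, total_tokens = vit_token_count(224)
output_shape = (total_tokens, HIDDEN_DIM)
print(patch_tokens, total_tokens, output_shape)  # 256 257 (257, 1536)
```

Under that assumption, each image yields 256 patch embeddings plus one CLS embedding, each of width 1,536.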

Memory

Weights (BF16): 2.20 GB
Weights (FP8): 1.10 GB
Activation estimate: 0.60 GB
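The weight figures follow from parameter count times bytes per parameter, using decimal gigabytes (10^9 bytes). The sketch below reproduces them; adding the activation estimate to get a rough per-replica footprint is an assumption of this example, not a figure stated on the card, and the function names are hypothetical.

```python
PARAMS = 1.1e9                           # from the card: 1.1B parameters
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}  # storage width per parameter
ACTIVATION_GB = 0.60                     # from the card: activation estimate

def weight_gb(params: float, dtype: str) -> float:
    """Weight memory in decimal GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9

def footprint_gb(params: float, dtype: str, activation_gb: float = ACTIVATION_GB) -> float:
    """Rough footprint: weights plus the activation estimate (assumed additive)."""
    return weight_gb(params, dtype) + activation_gb

for dtype in ("bf16", "fp8"):
    print(f"{dtype}: weights {weight_gb(PARAMS, dtype):.2f} GB, "
          f"approx. total {footprint_gb(PARAMS, dtype):.2f} GB")
```

Real deployments also need headroom for the runtime, workspace buffers, and batching, so these totals are best read as lower bounds.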

Pricing

Free (open weights)

Self-host on your own GPU. Because there is no API price, the calculator on the hardware page reports cost in GPU-hours instead.

Provenance

License: apache-2.0
Last verified: 2026-06-25
Tags: vision, self-supervised, embeddings, open-weight, vit, feature-extraction