DINOv2 ViT-g/14 (NVIDIA-optimized)
Vision Embedding · NVIDIA · DINOv2 · v2.0 · released
About
DINOv2 is a self-supervised vision foundation model that produces strong frozen image features, which transfer to many downstream tasks without task-specific fine-tuning. The ViT-g/14 variant has 1.1B parameters and, by published linear-probe benchmarks, is currently the strongest open self-supervised vision encoder.
Intended use: Frozen feature extraction for downstream vision tasks (classification, segmentation, depth estimation, instance retrieval), as a vision tower for vision-language models, or as a backbone for fine-tuning on small labeled datasets.
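A minimal sketch of frozen feature extraction, assuming the upstream Hugging Face checkpoint `facebook/dinov2-giant` and the `transformers`, `torch`, and `Pillow` packages. It loads the original Meta weights rather than the NVIDIA NIM variant, and `extract_features` and the image path are hypothetical names chosen for illustration.

```python
MODEL_ID = "facebook/dinov2-giant"  # Hugging Face ID listed under Provenance


def extract_features(image_path, model_id=MODEL_ID):
    """Return (cls_embedding, patch_embeddings) for one image.

    The encoder stays frozen: eval mode, no gradients. Hypothetical helper,
    not part of any library.
    """
    # Heavy dependencies are imported lazily so that defining this helper
    # does not require them to be installed.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, 1 + num_patches, hidden_dim);
    # index 0 along the token axis is the CLS token, the rest are patches.
    tokens = outputs.last_hidden_state[0]
    return tokens[0], tokens[1:]


# Example call (downloads the checkpoint on first use):
# cls_vec, patch_vecs = extract_features("example.jpg")
```

The CLS vector is the usual input to a linear classifier or retrieval index, while the per-patch vectors suit dense tasks such as segmentation and depth estimation.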
Architecture
- Type: encoder
- Parameters: 1.1B
- Layers: 40
- Hidden dim: 1,536
Self-supervised Vision Transformer (ViT-g/14) with 40 transformer blocks and a 14×14-pixel patch size. Trained without labels using the DINOv2 method (DINO stands for self-distillation with no labels) on the curated LVD-142M image corpus. Produces per-patch and CLS token embeddings that transfer to classification, segmentation, depth estimation, and retrieval without fine-tuning. Originally published by Meta FAIR; NVIDIA serves an inference-optimized variant via NIM with FP8 quantization on H100.
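To make the output layout concrete, the sketch below works out how many tokens the encoder emits for a given input size, using the 14-pixel patch size and 1,536 hidden dimension stated above. The 224 and 518 pixel inputs are illustrative assumptions, and the helper names are hypothetical.

```python
PATCH_SIZE = 14    # pixels per patch side, from the architecture description
HIDDEN_DIM = 1536  # embedding width, from the architecture table


def token_count(height, width, patch=PATCH_SIZE):
    """Number of output tokens: one CLS token plus one token per patch."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return 1 + (height // patch) * (width // patch)


def output_shape(height, width):
    """Shape of the token embedding matrix for a single image."""
    return (token_count(height, width), HIDDEN_DIM)


print(output_shape(224, 224))  # (257, 1536): 16 x 16 patches + CLS
print(output_shape(518, 518))  # (1370, 1536): 37 x 37 patches + CLS
```

Token count grows with the square of the image side, so higher-resolution inputs raise activation memory and latency accordingly.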
Memory
- Weights (BF16): 2.20 GB
- Weights (FP8): 1.10 GB
- Activation estimate: 0.60 GB
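The weight figures follow from parameter count times bytes per parameter. A short sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes) and assuming the 0.60 GB activation estimate applies unchanged to both precisions:

```python
PARAMS = 1_100_000_000   # 1.1B parameters, from the architecture table
ACTIVATION_GB = 0.60     # activation estimate from the memory table
BYTES_PER_PARAM = {"BF16": 2, "FP8": 1}


def weight_gb(params, bytes_per_param):
    """Weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    weights = weight_gb(PARAMS, nbytes)
    # Assumption: the activation estimate is the same for both precisions.
    total = weights + ACTIVATION_GB
    print(f"{name}: weights {weights:.2f} GB, with activations {total:.2f} GB")
```

Under these assumptions the BF16 footprint is roughly 2.80 GB and the FP8 footprint roughly 1.70 GB, before any framework or batch-size overhead.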
Pricing
Free (open weights)
Self-host on your own GPU. Because there is no API price, the calculator reports GPU-hour cost on the hardware page instead.
Provenance
- Source: huggingface.co
- License: apache-2.0
- Hugging Face: facebook/dinov2-giant
- Last verified: 2026-06-25