DINOv2 ViT-g/14 (NVIDIA-optimized)

Vision Embedding

NVIDIA · DINOv2 · v2.0 · released 2023-04-14

About

DINOv2 is a self-supervised vision foundation model that produces strong frozen image features, which transfer to many downstream tasks without task-specific fine-tuning. The ViT-g/14 variant has 1.1B parameters and, by published linear-probe benchmarks, is among the strongest open self-supervised vision encoders.

Intended use: Frozen feature extraction for downstream vision tasks (classification, segmentation, depth estimation, instance retrieval), as a vision tower for vision-language models, or as a backbone for fine-tuning on small labeled datasets.
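
A minimal sketch of frozen feature extraction, assuming the original Meta checkpoint listed under Provenance (facebook/dinov2-giant) loaded through the Hugging Face transformers library, not the NVIDIA NIM FP8 variant. The function name extract_features and its arguments are hypothetical and chosen for illustration; torch, transformers, and Pillow are assumed to be installed.

```python
def extract_features(image_path, model_id="facebook/dinov2-giant"):
    """Return (cls_embedding, patch_embeddings) for one image.

    Hypothetical helper: the name and signature are illustrative only.
    Heavy dependencies are imported here so they load only when called.
    """
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()  # frozen: inference only

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():  # no gradients needed for a frozen backbone
        outputs = model(**inputs)

    hidden = outputs.last_hidden_state   # (1, 1 + num_patches, hidden_dim)
    cls_embedding = hidden[:, 0]         # global image descriptor
    patch_embeddings = hidden[:, 1:]     # one vector per image patch
    return cls_embedding, patch_embeddings
```

The CLS embedding is the usual input for classification or retrieval, while the per-patch embeddings suit dense tasks such as segmentation and depth estimation.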

Architecture

Type: encoder
Parameters: 1.1B
Layers: 40
Hidden dim: 1,536

Self-supervised vision transformer (ViT-g/14), trained without labels using DINOv2 (the second version of DINO, self-distillation with no labels) on the LVD-142M curated image corpus. Uses 14×14-pixel patches and 40 transformer blocks. Produces one CLS token embedding plus per-patch embeddings that transfer to classification, segmentation, depth estimation, and retrieval without fine-tuning. Originally published by Meta FAIR; NVIDIA serves an inference-optimized variant via NIM with FP8 quantization on H100.
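
To make the patch size concrete, the sketch below counts the tokens the encoder emits for a given input resolution, assuming non-overlapping 14×14-pixel patches and a single CLS token as described above. The helper name token_count is hypothetical, and the example resolutions are illustrative inputs, not values taken from this card.

```python
def token_count(height, width, patch=14, cls_tokens=1):
    """Number of output tokens for an image of the given size.

    Assumes non-overlapping square patches and that both sides are
    exact multiples of the patch size.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    patches = (height // patch) * (width // patch)
    return patches + cls_tokens


if __name__ == "__main__":
    # 224 / 14 = 16 patches per side: 16 * 16 patches plus 1 CLS token
    print(token_count(224, 224))
```

Each token is a vector of the hidden dimension (1,536), so the output for one image has shape (token count, 1536).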

Memory

Weights (BF16): 2.20 GB
Weights (FP8): 1.10 GB
Activation estimate: 0.60 GB
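
The weight figures above follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes), 2 bytes per BF16 value, and 1 byte per FP8 value; the names weight_gb and total_gb are hypothetical, and the activation figure is simply the estimate quoted above.

```python
PARAMS = 1_100_000_000               # 1.1B parameters, from the card
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}
ACTIVATION_GB = 0.60                 # activation estimate quoted in the card


def weight_gb(params, dtype):
    """Weight memory in decimal gigabytes for the given precision."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def total_gb(dtype, params=PARAMS, activation_gb=ACTIVATION_GB):
    """Weights plus the card's activation estimate, in decimal gigabytes."""
    return weight_gb(params, dtype) + activation_gb


if __name__ == "__main__":
    for dtype in ("bf16", "fp8"):
        print(f"{dtype}: weights {weight_gb(PARAMS, dtype):.2f} GB, "
              f"total {total_gb(dtype):.2f} GB")
```

This counts only weights and the quoted activation estimate; framework overhead and batch-size effects are not modeled.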

Pricing

Free — open weights

Self-host on your own GPU. Because there is no API price, the calculator shows the GPU-hour cost on the hardware page instead.

Provenance

Source: huggingface.co
License: apache-2.0
Hugging Face: facebook/dinov2-giant
Last verified: 2026-06-25

vision · self-supervised · embeddings · open-weight · vit · feature-extraction