NV-CLIP
Vision Embedding · NVIDIA · NV-CLIP · v1.0 · released
About
NV-CLIP is NVIDIA's tuned variant of OpenAI CLIP, packaged as an NVIDIA NIM container for production embedding workloads. It is optimized for H100 and L40S GPUs using TensorRT-LLM and INT4/FP8 quantization recipes.
Intended use: cross-modal retrieval, zero-shot image classification, multimodal RAG (image→text similarity), and use as a vision tower in downstream multimodal LLMs. Served via NIM.
Architecture
- Type: encoder
- Parameters: 428M
- Layers: 24
- Hidden dim: 1,024
Dual-encoder CLIP variant: a ViT-L/14 image encoder and a separate 12-layer text transformer, trained contrastively on ~5B image-text pairs (LAION-derived plus NVIDIA-curated). It produces aligned 1024-dim image and text embeddings suitable for cross-modal retrieval, zero-shot classification, and use as a frozen feature extractor for downstream multimodal models. The image encoder accepts 224×224 RGB input; the text encoder accepts up to 77 BPE tokens.
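Because image and text embeddings share one aligned space, zero-shot classification reduces to comparing an image embedding against the embeddings of candidate label prompts. The following is a minimal sketch of that comparison, not the NIM API: the toy 4-dim vectors stand in for real 1024-dim embeddings, the function names are hypothetical, and the softmax temperature of 0.01 is an assumed value rather than a documented NV-CLIP setting.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def zero_shot_classify(image_emb, label_embs, temperature=0.01):
    """Return softmax probabilities over label prompts for one image embedding.

    temperature is an assumption for illustration; smaller values sharpen
    the distribution toward the most similar label.
    """
    sims = [cosine_similarity(image_emb, emb) for emb in label_embs.values()]
    logits = [s / temperature for s in sims]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(label_embs, exps)}

# Toy stand-ins; real inputs would be embeddings returned by the model.
image_emb = [0.9, 0.1, 0.0, 0.1]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0, 0.0],
}

probs = zero_shot_classify(image_emb, label_embs)
best_label = max(probs, key=probs.get)
print(best_label)
```

The same `cosine_similarity` ranking applies to cross-modal retrieval: embed the query once, score it against every stored embedding, and sort by score.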
Memory
- Weights (BF16): 0.86 GB
- Weights (FP8): 0.43 GB
- Activation estimate: 0.30 GB
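The weight figures follow from the parameter count multiplied by bytes per parameter. The sketch below reproduces them, assuming decimal gigabytes (1 GB = 1e9 bytes), which matches the listed values. Adding the activation estimate to the weights for a rough total is an assumption made here for illustration; the source does not state the batch size behind the activation figure.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

PARAMS = 428_000_000  # parameter count listed under Architecture

bf16_gb = weight_memory_gb(PARAMS, 2)  # BF16 stores 2 bytes per parameter
fp8_gb = weight_memory_gb(PARAMS, 1)   # FP8 stores 1 byte per parameter
activation_gb = 0.30                   # activation estimate as listed

print(f"BF16 weights: {bf16_gb:.2f} GB")
print(f"FP8 weights:  {fp8_gb:.2f} GB")
# Rough total only; real usage also depends on batch size and runtime overhead.
print(f"BF16 weights + activations: {bf16_gb + activation_gb:.2f} GB")
```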
Pricing
$0.040 per 1M input characters
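As a rough illustration of the per-character rate, the sketch below estimates the cost of embedding a batch of text. The function name is hypothetical, and counting characters with Python's `len` is an assumption: the source does not say how characters are counted or how image inputs are billed.

```python
PRICE_PER_MILLION_CHARS = 0.040  # USD, from the listed price

def embedding_cost_usd(texts):
    """Estimate the cost of embedding a list of strings at the listed rate."""
    total_chars = sum(len(t) for t in texts)  # assumed counting method
    return total_chars / 1_000_000 * PRICE_PER_MILLION_CHARS

# Example: 10,000 captions of 50 characters each (500,000 characters total).
captions = ["x" * 50] * 10_000
print(f"${embedding_cost_usd(captions):.4f}")
```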
Provenance
- Source: catalog.ngc.nvidia.com
- License: nvidia-open-model-license
- Last verified: 2026-06-25