NV-CLIP

Vision Embedding

NVIDIA · NV-CLIP · v1.0 · released

About

NV-CLIP is NVIDIA's tuned variant of OpenAI CLIP, packaged as an NVIDIA NIM container for production embedding workloads. It is optimized for H100 and L40S GPUs using TensorRT-LLM and INT4/FP8 quantization recipes.

Intended use: cross-modal retrieval, zero-shot image classification, multimodal RAG (image→text similarity), and use as a vision tower in downstream multimodal LLMs. Served via NIM.

Architecture

Type: encoder
Parameters: 428M
Layers: 24
Hidden dim: 1,024

NV-CLIP is a dual-encoder CLIP variant: a ViT-L/14 image encoder and a separate 12-layer text transformer, trained contrastively on ~5B image-text pairs (LAION-derived plus NVIDIA-curated). It produces aligned 1024-dim image and text embeddings suitable for cross-modal retrieval, zero-shot classification, and use as a frozen feature extractor for downstream multimodal models. The image encoder accepts 224×224 RGB input; the text encoder accepts up to 77 BPE tokens.
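Because image and text embeddings share one space, zero-shot classification reduces to picking the label prompt whose text embedding has the highest cosine similarity to the image embedding. The sketch below illustrates only that scoring step. It does not call the model: the vectors are random stand-ins, and the names `cosine_similarity`, `zero_shot_classify`, and the label prompts are hypothetical.

```python
import math
import random

EMBED_DIM = 1024  # embedding width stated in this card


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, text_embs, labels):
    """Return the best-matching label and the per-label similarity scores."""
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], scores


# Stand-in data (assumption): in practice these vectors would come from
# the image and text encoders, not from a random number generator.
rng = random.Random(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in labels]

# Fake "image" embedding: the second label's vector plus small noise.
image_emb = [x + 0.1 * rng.gauss(0, 1) for x in text_embs[1]]

label, scores = zero_shot_classify(image_emb, text_embs, labels)
print(label, [round(s, 3) for s in scores])
```

The stand-in image vector is built close to the second prompt, so that prompt should score near 1.0 while the unrelated prompts score near 0. With real embeddings the gaps are usually much smaller, which is why scores are often passed through a softmax before being read as class probabilities.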

Memory

Weights (BF16): 0.86 GB
Weights (FP8): 0.43 GB
Activation estimate: 0.30 GB
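The weight figures are consistent with parameter count times bytes per parameter. A minimal check, assuming decimal gigabytes (10^9 bytes), 2 bytes per BF16 parameter, and 1 byte per FP8 parameter; the helper name `weight_memory_gb` is hypothetical:

```python
PARAMS = 428_000_000  # 428M parameters, as listed under Architecture

# Storage width per parameter for each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(num_params, dtype):
    """Weight footprint in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bf16", "fp8"):
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.2f} GB")
```

This covers weights only. The activation estimate depends on batch size and input shape, so it is not derived here.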

Pricing

$0.040 per 1M input characters
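As a rough budgeting aid, the listed rate can be turned into a cost estimate for text inputs. This is a sketch under the assumption that billing is strictly linear in input characters; the card does not say how image inputs are metered, so they are not modeled. The function name `estimate_cost_usd` is hypothetical.

```python
PRICE_PER_MILLION_CHARS_USD = 0.040  # rate listed above


def estimate_cost_usd(num_chars):
    """Estimated cost in USD for a given number of input characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1_000_000 * PRICE_PER_MILLION_CHARS_USD


# Example: 25 million characters of text to embed.
print(f"${estimate_cost_usd(25_000_000):.2f}")
```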

Provenance

License: nvidia-open-model-license
Last verified: 2026-06-25
Tags: vision, embeddings, clip, multimodal, retrieval, zero-shot, nim