# Advanced Multimodal Development with CLIP

## Overview
CLIP (Contrastive Language-Image Pre-Training) is OpenAI's foundational multimodal model that bridges visual and textual understanding through contrastive learning on 400 million image-text pairs. CLIP's key innovation is learning a shared embedding space where images and text descriptions can be directly compared via cosine similarity, enabling zero-shot image classification, semantic image search, content moderation, and cross-modal retrieval without task-specific training data. This guide covers advanced CLIP usage from efficient embedding pipelines through fine-tuning with OpenCLIP, production-scale image search systems, and integration with downstream models like LLaVA and Stable Diffusion.
## When to Use
- Zero-shot image classification: You want to classify images into arbitrary categories without any labeled training data
- Semantic image search: You need to search an image corpus using natural language queries
- Cross-modal retrieval: You need to find images matching text descriptions or text matching image content
- Content moderation: You want to detect NSFW, violent, or otherwise inappropriate visual content using text-defined categories
- Image similarity: You need to compute semantic similarity between images (by comparing their embeddings)
- Feature extraction for downstream models: You need visual features as input to other models (LLaVA, image generation guidance)
- Multi-label image tagging: You want to assign multiple relevant labels to images based on text descriptions
Choose alternatives when: you need image generation (use Stable Diffusion), detailed image captioning (use BLIP-2), visual question answering with long responses (use LLaVA), or fine-grained spatial understanding (use grounding models like GroundingDINO).
## Quick Start

```bash
# Install OpenAI's CLIP
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision Pillow

# Or install OpenCLIP (community version with more models)
pip install open-clip-torch
```

```python
import torch
import clip
from PIL import Image

# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Zero-shot classification
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
labels = ["a photograph of a dog", "a photograph of a cat", "a photograph of a car"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize for cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Compute similarity
similarity = (image_features @ text_features.T).squeeze(0)
probs = similarity.softmax(dim=-1)
for label, prob in zip(labels, probs):
    print(f"{label}: {prob.item():.1%}")
```
## Core Concepts

### CLIP Architecture Deep Dive

```
# CLIP consists of two parallel encoders trained jointly

# Vision Encoder: processes an image into an embedding vector
# Options: ResNet (RN50, RN101) or Vision Transformer (ViT-B/32, ViT-B/16, ViT-L/14)
#
# Image (224x224) --> Patch Embedding (for ViT) --> Transformer Layers
#   --> [CLS] token --> Projection --> Embedding (512/768d)
#
# Text Encoder: processes text into an embedding vector
# Architecture: Transformer with causal (GPT-style) attention
#
# Text --> BPE Tokenize --> Transformer Layers
#   --> [EOS] token --> Projection --> Embedding (512/768d)
#
# Training objective: contrastive loss (InfoNCE)
# For a batch of N image-text pairs:
# - N correct pairs should have high similarity
# - N^2 - N incorrect pairs should have low similarity
# A learned temperature parameter scales the similarity scores:
# logits = (image_features @ text_features.T) * exp(temperature)
```
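The objective sketched in the comments above fits in a few lines of PyTorch. Below is a minimal, self-contained sketch of the symmetric InfoNCE loss on random stand-in features; the `logit_scale` initialization of log(1/0.07) follows the CLIP paper, and the function names are illustrative, not part of any library API.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs."""
    # Normalize so the inner product equals cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # N x N similarity matrix, scaled by the learned temperature
    logits_per_image = logit_scale.exp() * image_features @ text_features.T
    logits_per_text = logits_per_image.T
    # The i-th image matches the i-th text: targets are the diagonal
    labels = torch.arange(image_features.shape[0])
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2

# Toy batch: 8 pairs of 512-d random features
torch.manual_seed(0)
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
logit_scale = torch.tensor(2.6593)  # log(1/0.07), CLIP's initialization
loss = clip_contrastive_loss(img, txt, logit_scale)
```

Random features give a loss near log(N); perfectly matched features drive it toward zero, which is exactly the gradient signal that pulls the two embedding spaces together.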
### Model Variants and Selection

```python
import clip

# List all available models
print(clip.available_models())
# ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64',
#  'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']

# Model comparison
models = {
    "ViT-B/32": {
        "parameters": "151M", "embedding_dim": 512, "image_resolution": 224,
        "speed": "fastest", "quality": "good",
        "use_case": "Real-time search, rapid prototyping"
    },
    "ViT-B/16": {
        "parameters": "150M", "embedding_dim": 512, "image_resolution": 224,
        "speed": "medium", "quality": "better",
        "use_case": "Balanced speed/quality"
    },
    "ViT-L/14": {
        "parameters": "428M", "embedding_dim": 768, "image_resolution": 224,
        "speed": "slower", "quality": "best",
        "use_case": "Maximum accuracy, offline processing"
    },
    "ViT-L/14@336px": {
        "parameters": "428M", "embedding_dim": 768, "image_resolution": 336,
        "speed": "slowest", "quality": "highest",
        "use_case": "Fine detail recognition"
    }
}

# OpenCLIP provides even more variants
import open_clip
print(open_clip.list_pretrained())  # 100+ model/dataset combinations

# Load an OpenCLIP model (better quality than the original CLIP)
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-H-14', pretrained='laion2b_s32b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-H-14')
```
## Production-Scale Image Search

```python
import json
import numpy as np
import torch
import clip
import faiss
from PIL import Image
from pathlib import Path


class CLIPImageSearch:
    """Production-ready image search engine using CLIP + FAISS."""

    def __init__(self, model_name="ViT-L/14", device="cuda"):
        self.device = device
        self.model, self.preprocess = clip.load(model_name, device=device)
        self.embedding_dim = 768 if "L/14" in model_name else 512
        # FAISS index for fast similarity search:
        # inner product equals cosine similarity after normalization
        self.index = faiss.IndexFlatIP(self.embedding_dim)
        self.image_paths = []

    def index_images(self, image_dir, batch_size=64):
        """Index all images in a directory."""
        paths = list(Path(image_dir).glob("**/*.jpg")) + \
                list(Path(image_dir).glob("**/*.png"))
        all_embeddings = []
        for i in range(0, len(paths), batch_size):
            batch_paths = paths[i:i + batch_size]
            images = []
            valid_paths = []
            for path in batch_paths:
                try:
                    img = self.preprocess(Image.open(path)).unsqueeze(0)
                    images.append(img)
                    valid_paths.append(str(path))
                except Exception:
                    continue
            if images:
                batch = torch.cat(images).to(self.device)
                with torch.no_grad():
                    embeddings = self.model.encode_image(batch)
                    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
                all_embeddings.append(embeddings.cpu().numpy())
                self.image_paths.extend(valid_paths)
        if all_embeddings:
            embeddings_np = np.vstack(all_embeddings).astype('float32')
            self.index.add(embeddings_np)
        print(f"Indexed {self.index.ntotal} images")

    def search_by_text(self, query, top_k=10):
        """Search images using a text query."""
        text = clip.tokenize([query]).to(self.device)
        with torch.no_grad():
            text_embedding = self.model.encode_text(text)
            text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
        query_np = text_embedding.cpu().numpy().astype('float32')
        scores, indices = self.index.search(query_np, top_k)
        return [
            {"path": self.image_paths[idx], "score": float(score)}
            for score, idx in zip(scores[0], indices[0])
            if idx < len(self.image_paths)
        ]

    def search_by_image(self, image_path, top_k=10):
        """Search for visually similar images."""
        image = self.preprocess(Image.open(image_path)).unsqueeze(0).to(self.device)
        with torch.no_grad():
            image_embedding = self.model.encode_image(image)
            image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)
        query_np = image_embedding.cpu().numpy().astype('float32')
        scores, indices = self.index.search(query_np, top_k)
        return [
            {"path": self.image_paths[idx], "score": float(score)}
            for score, idx in zip(scores[0], indices[0])
            if idx < len(self.image_paths)
        ]

    def save_index(self, path):
        """Save FAISS index and metadata to disk."""
        faiss.write_index(self.index, f"{path}/clip.index")
        with open(f"{path}/paths.json", "w") as f:
            json.dump(self.image_paths, f)

    def load_index(self, path):
        """Load FAISS index and metadata from disk."""
        self.index = faiss.read_index(f"{path}/clip.index")
        with open(f"{path}/paths.json") as f:
            self.image_paths = json.load(f)


# Usage
search_engine = CLIPImageSearch(model_name="ViT-L/14")
search_engine.index_images("images/", batch_size=64)
results = search_engine.search_by_text("a sunset over the mountains")
```
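`IndexFlatIP` over L2-normalized vectors is exact inner-product search, i.e. cosine-similarity ranking. For small collections, or environments where FAISS is unavailable, the same retrieval is a few numpy lines. A minimal sketch, with random unit vectors standing in for CLIP embeddings:

```python
import numpy as np

def topk_cosine(query, corpus, k=5):
    """Exact top-k inner-product search over L2-normalized vectors,
    equivalent to what faiss.IndexFlatIP computes on the same data."""
    scores = corpus @ query            # (N,) cosine similarities
    idx = np.argsort(-scores)[:k]      # indices of the k best matches
    return idx, scores[idx]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 512)).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = corpus[42].copy()              # a known corpus vector as the query
idx, scores = topk_cosine(query, corpus, k=5)
```

Past roughly 100K vectors, swap this brute-force scan for a FAISS approximate index (e.g. an IVF index) as recommended in Best Practices.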
## Content Moderation

```python
import torch
import clip
from PIL import Image


class CLIPContentModerator:
    """CLIP-based content moderation with customizable categories."""

    def __init__(self, device="cuda"):
        self.device = device
        self.model, self.preprocess = clip.load("ViT-L/14", device=device)

        # Define moderation categories with detailed descriptions
        self.categories = {
            "safe": [
                "a safe, appropriate photograph",
                "a normal everyday scene",
                "a family-friendly image"
            ],
            "nsfw": [
                "an explicit or sexual image",
                "inappropriate adult content",
                "nudity"
            ],
            "violence": [
                "a violent or graphic scene",
                "gore or injury",
                "weapons being used aggressively"
            ],
            "hate": [
                "a hateful or discriminatory symbol",
                "extremist propaganda",
                "offensive graffiti or text"
            ]
        }

        # Pre-compute one averaged text embedding per category
        self.category_embeddings = {}
        for cat, descriptions in self.categories.items():
            tokens = clip.tokenize(descriptions).to(device)
            with torch.no_grad():
                embeddings = self.model.encode_text(tokens)
                embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
            self.category_embeddings[cat] = embeddings.mean(dim=0, keepdim=True)
            self.category_embeddings[cat] /= self.category_embeddings[cat].norm()

    def moderate(self, image_path, threshold=0.3):
        """Check an image against all moderation categories."""
        image = self.preprocess(Image.open(image_path)).unsqueeze(0).to(self.device)
        with torch.no_grad():
            image_embedding = self.model.encode_image(image)
            image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)

        results = {}
        for cat, cat_embedding in self.category_embeddings.items():
            similarity = (image_embedding @ cat_embedding.T).item()
            results[cat] = similarity

        # Flag any non-safe category whose score exceeds the threshold
        flagged_categories = [
            cat for cat, score in results.items()
            if cat != "safe" and score > threshold
        ]
        return {
            "scores": results,
            "flagged": len(flagged_categories) > 0,
            "flagged_categories": flagged_categories
        }
```
## Fine-Tuning CLIP with OpenCLIP

```python
import open_clip
import torch
from torch.utils.data import DataLoader
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model for fine-tuning
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k'
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model = model.to(device)


# Custom dataset
class ImageTextDataset(torch.utils.data.Dataset):
    def __init__(self, data, preprocess, tokenizer):
        self.data = data  # List of {"image_path": ..., "caption": ...}
        self.preprocess = preprocess
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        image = self.preprocess(Image.open(item["image_path"]))
        text = self.tokenizer([item["caption"]])[0]
        return image, text


# Training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.1)
dataset = ImageTextDataset(training_data, preprocess, tokenizer)  # training_data: your list of dicts
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

model.train()
for epoch in range(5):
    for images, texts in loader:
        images, texts = images.to(device), texts.to(device)
        # open_clip's forward returns normalized features and the
        # logit scale already exponentiated -- do not call .exp() again
        image_features, text_features, logit_scale = model(images, texts)

        # Contrastive loss (InfoNCE)
        logits_per_image = logit_scale * image_features @ text_features.T
        logits_per_text = logits_per_image.T
        labels = torch.arange(len(images)).to(device)
        loss = (
            torch.nn.functional.cross_entropy(logits_per_image, labels) +
            torch.nn.functional.cross_entropy(logits_per_text, labels)
        ) / 2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch}: loss={loss.item():.4f}")
```
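A common remedy when full fine-tuning degrades quality (see Troubleshooting) is to freeze the vision tower and train only the text side. The pattern depends only on `requires_grad`, so it is shown here on a tiny stand-in two-tower module; with a real OpenCLIP model the vision tower is the `model.visual` submodule (attribute name assumed, verify against your open_clip version).

```python
import torch
import torch.nn as nn

# Stand-in two-tower model; with a real OpenCLIP model, substitute
# model.visual / the text transformer for these toy layers.
model = nn.ModuleDict({
    "visual": nn.Linear(32, 8),   # plays the role of the vision encoder
    "text": nn.Linear(16, 8),     # plays the role of the text encoder
})

# Freeze the vision tower
for p in model["visual"].parameters():
    p.requires_grad = False

# Hand the optimizer only the parameters that still train
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-6)
```

Filtering the parameter list before building the optimizer also avoids wasting optimizer state (AdamW moments) on frozen weights.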
## Batch Processing for Efficiency

```python
def batch_encode_images(model, preprocess, image_paths, batch_size=64, device="cuda"):
    """Efficiently encode many images in batches."""
    all_embeddings = []
    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i:i + batch_size]
        images = []
        for path in batch_paths:
            try:
                img = preprocess(Image.open(path))
                images.append(img)
            except Exception as e:
                print(f"Skipping {path}: {e}")
                # Placeholder image keeps output indices aligned with input paths
                images.append(preprocess(Image.new("RGB", (224, 224))))
        batch = torch.stack(images).to(device)
        with torch.no_grad(), torch.cuda.amp.autocast():
            embeddings = model.encode_image(batch)
            embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
        all_embeddings.append(embeddings.cpu())
    return torch.cat(all_embeddings, dim=0)
```
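Once encoded, embeddings should be cached rather than recomputed (see Best Practices). A minimal persistence sketch using numpy and JSON — the file-name suffixes and helper names are arbitrary choices for this example:

```python
import json
import os
import tempfile
import numpy as np

def save_embeddings(embeddings, paths, prefix):
    """Persist an (N, D) float32 embedding matrix and its N image paths."""
    np.save(f"{prefix}_embeddings.npy", embeddings.astype("float32"))
    with open(f"{prefix}_paths.json", "w") as f:
        json.dump(paths, f)

def load_embeddings(prefix):
    """Load the matrix and paths back, checking they are still in sync."""
    embeddings = np.load(f"{prefix}_embeddings.npy")
    with open(f"{prefix}_paths.json") as f:
        paths = json.load(f)
    assert len(paths) == embeddings.shape[0], "paths/embeddings out of sync"
    return embeddings, paths

# Round-trip with toy data
prefix = os.path.join(tempfile.mkdtemp(), "clip_cache")
emb = np.random.default_rng(0).normal(size=(3, 512)).astype("float32")
save_embeddings(emb, ["a.jpg", "b.jpg", "c.jpg"], prefix)
emb2, paths2 = load_embeddings(prefix)
```

For very large corpora, a serialized FAISS index (as in `CLIPImageSearch.save_index`) serves the same purpose and also persists the search structure.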
## Configuration Reference

**OpenAI CLIP models:**
| Model | Params | Embed Dim | Resolution | Encode Time (GPU) | Quality |
|---|---|---|---|---|---|
| RN50 | 102M | 1024 | 224 | ~15ms | Good |
| ViT-B/32 | 151M | 512 | 224 | ~10ms | Better |
| ViT-B/16 | 150M | 512 | 224 | ~20ms | Better+ |
| ViT-L/14 | 428M | 768 | 224 | ~40ms | Best |
| ViT-L/14@336 | 428M | 768 | 336 | ~60ms | Highest |

**OpenCLIP models (zero-shot ImageNet top-1 accuracy):**

| OpenCLIP Model | Training Data | Params | Zero-shot ImageNet Acc |
|---|---|---|---|
| ViT-B-32 (laion2b) | LAION-2B | 151M | 66.6% |
| ViT-L-14 (laion2b) | LAION-2B | 428M | 75.3% |
| ViT-H-14 (laion2b) | LAION-2B | 986M | 78.0% |
| ViT-G-14 (laion2b) | LAION-2B | 1.8B | 80.1% |
| EVA02-E-14+ | Merged-2B+ | 4.4B | 82.0% |

**Resource requirements:**

| Operation | VRAM (ViT-B/32) | VRAM (ViT-L/14) | Throughput |
|---|---|---|---|
| Image encoding | ~1 GB | ~2 GB | ~200 img/s (GPU) |
| Text encoding | ~0.5 GB | ~1 GB | ~1000 text/s (GPU) |
| Similarity compute | Negligible | Negligible | Instant |
| Fine-tuning | ~8 GB | ~16 GB | ~50 img/s |
## Best Practices
- **Always normalize embeddings before computing similarity.** CLIP embeddings must be L2-normalized; raw embeddings give incorrect similarity scores. Apply `embedding / embedding.norm(dim=-1, keepdim=True)`.
- **Use prompt engineering for better zero-shot accuracy.** Instead of bare labels ("dog"), use templates: "a photograph of a {label}" or "a centered satellite photo of {label}". This matches CLIP's training data distribution.
- **Use FAISS for large-scale similarity search.** For collections over 100K images, replace brute-force similarity with FAISS approximate nearest neighbor search for orders-of-magnitude faster retrieval.
- **Batch encode for throughput.** Process images in batches of 32-128 rather than one at a time. Batch encoding is 10-50x faster due to GPU parallelism.
- **Use OpenCLIP for the best quality.** OpenCLIP models trained on LAION-2B significantly outperform the original OpenAI CLIP models. ViT-H-14 from OpenCLIP is the recommended default for quality.
- **Cache all computed embeddings.** Store embeddings in FAISS indexes or numpy files. Re-computing embeddings is the most expensive operation in any CLIP pipeline.
- **Use multiple text descriptions per category.** For content moderation and classification, average embeddings from 3-5 descriptions per category to reduce noise and improve accuracy.
- **Enable mixed precision for encoding.** Use `torch.cuda.amp.autocast()` during encoding for a 30-50% speedup with negligible quality impact.
- **Consider SigLIP for better-calibrated scores.** SigLIP uses a sigmoid loss instead of softmax, producing scores that are more naturally interpretable as probabilities without needing a softmax over all candidates.
- **Pre-filter with CLIP, verify with a VLM.** For high-stakes applications, use CLIP for fast initial filtering, then verify top candidates with a more capable model like LLaVA for detailed analysis.
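The prompt-templating and multi-description practices above combine into one pattern: build several prompt variants per label, encode each, and average the normalized embeddings. Since the averaging math is independent of the encoder, the sketch below uses a hypothetical `encode` callable standing in for CLIP's text encoder; the template strings are examples, not a fixed set.

```python
import numpy as np

TEMPLATES = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "a blurry photo of a {}.",
]

def ensemble_label_embedding(label, encode):
    """Average the normalized embeddings of several prompt variants,
    then re-normalize. `encode` maps a string to a 1-D vector
    (in practice: CLIP's text encoder)."""
    vecs = []
    for t in TEMPLATES:
        v = np.asarray(encode(t.format(label)), dtype="float32")
        vecs.append(v / np.linalg.norm(v))
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)

# Stand-in encoder: a deterministic random vector per string
_cache = {}
def fake_encode(s):
    if s not in _cache:
        _cache[s] = np.random.default_rng(abs(hash(s)) % 2**32).normal(size=64)
    return _cache[s]

emb = ensemble_label_embedding("dog", fake_encode)
```

Re-normalizing after the mean matters: the average of unit vectors is shorter than unit length, and downstream cosine comparisons assume unit-norm inputs.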
## Troubleshooting
**All similarity scores are near zero or near one**
Embeddings are not normalized. Add `features = features / features.norm(dim=-1, keepdim=True)` after encoding.

**Zero-shot accuracy is poor**
Use more descriptive text labels. Try prompt templates like "a photo of a {label}". Use a larger model (ViT-L/14 instead of ViT-B/32). Consider OpenCLIP models for better quality.

**FAISS search returns wrong results**
Ensure you are using `IndexFlatIP` (inner product) for normalized embeddings, not `IndexFlatL2` (Euclidean distance). Verify embeddings are float32.

**Out of memory when encoding large batches**
Reduce the batch size from 64 to 16 or 32. Use a `torch.no_grad()` context. Enable mixed precision with `torch.cuda.amp.autocast()`.

**Fine-tuned CLIP quality degraded**
Use a very low learning rate (1e-6 to 1e-7) to avoid catastrophic forgetting. Freeze the vision encoder and fine-tune only the text encoder. Use more training data.

**Content moderation has false positives**
Add more nuanced "safe" descriptions. Increase the moderation threshold. Use ensemble scoring across multiple description variants per category.

**Text tokenization truncates long descriptions**
CLIP's text encoder has a 77-token maximum. Split long descriptions into multiple shorter ones and average their embeddings. For long-form text matching, consider models designed for longer text.
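The split-and-average fix for the 77-token limit can be sketched as below. A word count is used as a rough stand-in for tokens so the example stays self-contained; production code should count actual BPE tokens (e.g. via `clip.tokenize`) against the real limit, and the `encode` callable is a hypothetical stand-in for CLIP's text encoder.

```python
import numpy as np

def chunk_text(text, max_words=50):
    """Split a long description into chunks short enough to tokenize.
    Word count is a rough proxy; real code should count BPE tokens
    against CLIP's 77-token limit."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def encode_long_text(text, encode, max_words=50):
    """Encode each chunk, then average the normalized vectors and re-normalize."""
    vecs = [np.asarray(encode(c), dtype="float32")
            for c in chunk_text(text, max_words)]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)

# 120 words -> chunks of 50, 50, and 20 words
chunks = chunk_text("word " * 120, max_words=50)
emb = encode_long_text("word " * 120, lambda s: np.ones(8))
```

Averaging chunk embeddings loses word order across chunks, so this is a workaround for retrieval, not a substitute for a long-context text encoder.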