
Comprehensive Multimodal Module

Comprehensive skill covering PyTorch, multimodal libraries, audio, and generation. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · AI research · v1.0.0 · MIT

Comprehensive Multimodal AI Development Module

Overview

Multimodal AI has evolved from a research curiosity to a production-ready capability, with models that jointly understand text, images, audio, and video reaching deployment readiness in 2025. This comprehensive module covers the full landscape of multimodal AI development: from understanding the core architectural patterns (contrastive learning, visual instruction tuning, cross-attention fusion) through practical implementation with frameworks like HuggingFace Transformers, LLaVA, and CLIP, to production deployment with optimized inference pipelines. Whether you are building image-text search, visual question answering, document understanding, or multimodal agents, this guide provides the foundation.

When to Use

  • Visual question answering: You need models that can answer natural language questions about images or documents
  • Image-text search and retrieval: You want to build semantic search across images using text queries (or vice versa)
  • Document understanding: You need to extract, summarize, or reason over documents with mixed text, tables, and figures
  • Content moderation: You want to classify images or videos using natural language category descriptions
  • Multimodal agents: You are building AI agents that can perceive and reason about visual information alongside text
  • Image captioning and description: You need to generate natural language descriptions of visual content
  • Cross-modal embedding: You want unified embeddings that place images and text in the same vector space

Choose alternatives when: you only need text-only language models, image-only classification (use standard CNNs/ViTs), or audio-only processing (use Whisper/speech models directly).

Quick Start

```bash
# Install core dependencies
pip install transformers torch torchvision Pillow accelerate

# For CLIP-based tasks
pip install git+https://github.com/openai/CLIP.git

# For LLaVA-based visual reasoning
pip install llava
```
```python
# Quick zero-shot image classification with CLIP
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
labels = ["a dog", "a cat", "a car", "a building"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.1%}")
```

Core Concepts

Multimodal Architecture Patterns

There are three dominant approaches to building multimodal models:

1. Contrastive Learning (CLIP, SigLIP) Trains separate encoders for each modality and aligns them in a shared embedding space.

```python
# CLIP architecture (simplified)
class ContrastiveMultimodal:
    """
    Image Encoder (ViT)        --> Image Embedding --|
                                                     |--> Cosine Similarity
    Text Encoder (Transformer) --> Text Embedding  --|

    Training: maximize similarity of matching pairs,
    minimize similarity of non-matching pairs.
    """
    def __init__(self):
        self.image_encoder = VisionTransformer()  # ViT-B/32
        self.text_encoder = TextTransformer()     # 12-layer transformer
        self.temperature = nn.Parameter(torch.ones([]) * 0.07)

    def forward(self, images, texts):
        image_features = self.image_encoder(images)
        text_features = self.text_encoder(texts)
        # Normalize embeddings
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        # Compute similarity matrix
        logits = (image_features @ text_features.T) / self.temperature
        return logits
```
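The training objective described in the docstring can be written out explicitly. Below is a minimal, runnable sketch of the symmetric contrastive (InfoNCE-style) loss that CLIP-style models optimize, assuming both encoders emit same-dimension embeddings; the function name `clip_contrastive_loss` is illustrative, not a CLIP API:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    # L2-normalize both embedding sets so dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise similarity matrix, scaled by temperature
    logits = image_features @ text_features.T / temperature
    # Matching pairs sit on the diagonal
    targets = torch.arange(logits.size(0))
    # Cross-entropy in both directions: image->text (rows) and text->image (columns)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```

Each image must pick out its own caption among all captions in the batch, and vice versa, which is why large batch sizes help contrastive pretraining.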

2. Visual Instruction Tuning (LLaVA, InternVL) Connects a pretrained vision encoder to a pretrained LLM through a projection layer.

```python
# LLaVA architecture (simplified)
class VisualInstructionModel:
    """
    Image --> Vision Encoder (CLIP ViT) --> Projection --> [Image Tokens]
                                                                |
    Text Prompt --> Tokenizer --> [Text Tokens] + [Image Tokens] --> LLM --> Response
    """
    def __init__(self):
        self.vision_encoder = CLIPVisionModel()  # Frozen or fine-tuned
        self.projection = nn.Linear(1024, 4096)  # Bridge vision to LLM
        self.llm = LlamaForCausalLM()            # Language model

    def forward(self, image, text_prompt):
        # Encode image into patch embeddings
        vision_features = self.vision_encoder(image)     # [1, 576, 1024]
        # Project to LLM embedding space
        image_tokens = self.projection(vision_features)  # [1, 576, 4096]
        # Concatenate with text tokens and generate
        text_tokens = self.llm.embed_tokens(text_prompt)
        combined = torch.cat([image_tokens, text_tokens], dim=1)
        return self.llm.generate(inputs_embeds=combined)
```

3. Cross-Attention Fusion (Flamingo, Llama 3.2 Vision) Injects visual information into the LLM via cross-attention layers interspersed between self-attention layers.

```python
# Cross-attention fusion (simplified)
class CrossAttentionFusion:
    """
    Image --> Vision Encoder --> Perceiver Resampler --> Visual Tokens (fixed count)
                                                              |
    LLM Layer N: Self-Attention --> Cross-Attention(visual) --> MLP
    """
    def __init__(self):
        self.vision_encoder = VisionTransformer()
        self.perceiver = PerceiverResampler(num_queries=64)  # Fixed output size
        # Cross-attention layers injected into every 4th LLM layer
```
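To make the cross-attention step concrete, here is a runnable sketch of a single Flamingo-style gated cross-attention layer in plain PyTorch. The zero-initialized tanh gate follows the Flamingo recipe (the pretrained LLM's behavior is unchanged at the start of training); the class name and dimensions are illustrative, not the actual Flamingo or Llama 3.2 code:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Text hidden states attend to visual tokens; a learned tanh gate
    (initialized to zero) controls how much visual signal is injected."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: no visual signal at init
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_hidden, visual_tokens):
        # Queries come from text; keys/values come from the visual tokens
        attended, _ = self.cross_attn(self.norm(text_hidden),
                                      visual_tokens, visual_tokens)
        # Gated residual: the block is an identity mapping until the gate opens
        return text_hidden + torch.tanh(self.gate) * attended
```

Because the gate starts closed, training can begin from a frozen LLM without destabilizing its language ability.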

Building an Image Search Engine

```python
import torch
import clip
from PIL import Image
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Step 1: Index images
def build_image_index(image_paths):
    embeddings = []
    for path in image_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            embedding = model.encode_image(image)
            embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        embeddings.append(embedding.cpu().numpy())
    return np.vstack(embeddings)

image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # Your image collection
index = build_image_index(image_paths)

# Step 2: Search with text query
def search(query, index, image_paths, top_k=5):
    text_input = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text_embedding = model.encode_text(text_input)
        text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    similarities = (text_embedding.cpu().numpy() @ index.T).squeeze()
    top_indices = similarities.argsort()[::-1][:top_k]
    return [(image_paths[i], similarities[i]) for i in top_indices]

results = search("sunset over the ocean", index, image_paths)
for path, score in results:
    print(f"{path}: {score:.3f}")
```

Visual Question Answering with LLaVA

```python
import torch
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
from PIL import Image

# Load LLaVA model (v1.6 checkpoints use the LLaVA-NeXT classes)
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Ask a question about an image
image = Image.open("chart.png")
prompt = "USER: <image>\nDescribe this chart and explain the key trends.\nASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

output = model.generate(**inputs, max_new_tokens=300)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
```

Document Understanding Pipeline

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load a document-understanding model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Process a document image
document = Image.open("invoice.png")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": document},
        {"type": "text", "text": "Extract all line items with quantities and prices from this invoice."}
    ]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[document], return_tensors="pt").to("cuda")

output = model.generate(**inputs, max_new_tokens=500)
result = processor.batch_decode(output, skip_special_tokens=True)
print(result[0])
```

Multimodal Embeddings with Vector Databases

```python
import chromadb
import clip
import torch
from PIL import Image

# Create embedding functions using CLIP
model, preprocess = clip.load("ViT-B/32")

def get_image_embedding(image_path):
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)
    return embedding.squeeze().numpy().tolist()

def get_text_embedding(text):
    tokens = clip.tokenize([text])
    with torch.no_grad():
        embedding = model.encode_text(tokens)
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)
    return embedding.squeeze().numpy().tolist()

# Store in ChromaDB
client = chromadb.Client()
collection = client.create_collection("multimodal_search")

# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # Your image collection
for i, path in enumerate(image_paths):
    collection.add(
        embeddings=[get_image_embedding(path)],
        metadatas=[{"type": "image", "path": path}],
        ids=[f"img_{i}"]
    )

# Search with text
results = collection.query(
    query_embeddings=[get_text_embedding("a red sports car")],
    n_results=5
)
```

Configuration Reference

| Model | Type | Parameters | Best For |
|-------|------|------------|----------|
| CLIP ViT-B/32 | Contrastive | 151M | Fast search, classification |
| CLIP ViT-L/14 | Contrastive | 428M | High-quality embeddings |
| SigLIP | Contrastive | 400M | Better calibrated scores |
| LLaVA-1.6-7B | Instruction tuned | 7B | Visual QA, description |
| LLaVA-1.6-13B | Instruction tuned | 13B | Complex visual reasoning |
| Qwen2-VL-7B | Instruction tuned | 7B | Document understanding |
| InternVL2-8B | Instruction tuned | 8B | General multimodal |
| Llama-3.2-11B-Vision | Cross-attention | 11B | Production deployment |
| Task | Recommended Model | VRAM | Latency |
|------|-------------------|------|---------|
| Image search/classification | CLIP ViT-B/32 | 2 GB | ~20 ms |
| Content moderation | CLIP ViT-L/14 | 4 GB | ~50 ms |
| Visual QA (basic) | LLaVA-7B | 16 GB | ~2 s |
| Document extraction | Qwen2-VL-7B | 16 GB | ~3 s |
| Complex reasoning | LLaVA-13B | 32 GB | ~4 s |

Best Practices

  1. Always normalize embeddings for similarity search: CLIP and similar models produce raw embeddings that must be L2-normalized before computing cosine similarity. Skipping normalization produces incorrect rankings.

  2. Use descriptive text labels for zero-shot classification: "a photograph of a golden retriever dog" works much better than "dog" as a CLIP text query. Prompt engineering matters for multimodal models.

  3. Cache embeddings for large-scale search: Computing CLIP embeddings is expensive. Store pre-computed embeddings in a vector database (ChromaDB, FAISS, Pinecone) and only compute query embeddings at search time.

  4. Choose the right architecture for your latency budget: CLIP gives millisecond responses for classification and search. LLaVA-style models take seconds but provide richer, generative answers.

  5. Preprocess images with the model's native preprocessor: Every multimodal model has specific image preprocessing (resize, normalize, crop). Always use the provided preprocess function, never custom PIL transforms.

  6. Use batch processing for throughput: Process multiple images or queries in a single forward pass for 5-10x throughput improvement over sequential processing.

  7. Start with frozen encoders, then fine-tune: When building custom multimodal systems, first train only the projection layer with frozen vision and language encoders, then optionally fine-tune end-to-end.

  8. Evaluate on task-specific benchmarks: General multimodal benchmarks (VQAv2, MMMU) may not correlate with your specific task. Build a domain-specific evaluation set.

  9. Consider model quantization for deployment: 4-bit quantized LLaVA models run on 8GB VRAM with minimal quality degradation, enabling deployment on consumer GPUs.

  10. Monitor for hallucination in generative models: LLaVA and similar models can hallucinate objects not present in images. Validate critical outputs and consider ensemble approaches for high-stakes applications.
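
Practices 1 and 6 combine naturally in an indexing loop. Below is a hedged sketch of batched, normalized embedding computation; `model_encode` stands in for any encoder such as CLIP's `model.encode_image`, and the helper name and default batch size are illustrative:

```python
import torch

@torch.no_grad()
def encode_images_batched(model_encode, images, batch_size=32):
    """Encode a list of preprocessed image tensors in batches and
    L2-normalize the results for cosine-similarity search."""
    chunks = []
    for start in range(0, len(images), batch_size):
        # One forward pass per batch instead of per image (best practice 6)
        batch = torch.stack(images[start:start + batch_size])
        emb = model_encode(batch)
        # Normalize before storing (best practice 1)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        chunks.append(emb)
    return torch.cat(chunks)
```

Storing only normalized embeddings means a plain dot product at query time already yields cosine similarity.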

Troubleshooting

CLIP similarity scores are all near zero: embeddings are not normalized. Always divide by the L2 norm: embedding = embedding / embedding.norm(dim=-1, keepdim=True).

LLaVA gives generic responses ignoring the image: verify the image token <image> is in the correct position in the prompt. Check that the image is properly preprocessed and passed to the model.

Out of memory loading multimodal model: use 4-bit quantization (load_in_4bit=True in from_pretrained()). Use device_map="auto" to split across GPU and CPU. Consider a smaller model variant.

Poor zero-shot classification accuracy: use more descriptive text labels. Try prompt templates: "a photo of {label}", "a centered photo of {label}". Use the larger CLIP model (ViT-L/14 vs ViT-B/32).

Image search returns irrelevant results: ensure images and queries are encoded with the same model. Verify embedding dimensions match. Check that the similarity metric is cosine (not Euclidean distance).

Document extraction missing table content: use models specifically trained for document understanding (Qwen2-VL, InternVL). Increase image resolution in preprocessing. Split large documents into pages and process individually.

Slow inference with large multimodal models: enable Flash Attention if available. Use FP16 or BF16 precision. Consider CLIP for tasks that do not require generation (classification, search).
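
The prompt-template fix for poor zero-shot accuracy can be automated by averaging the embeddings of several templates per label, as in OpenAI's zero-shot evaluation recipe. A sketch under stated assumptions: `encode_text` stands in for something like `lambda prompts: model.encode_text(clip.tokenize(prompts))`, and the function name and template list are illustrative:

```python
import torch
import torch.nn.functional as F

TEMPLATES = [
    "a photo of a {}",
    "a centered photo of a {}",
    "a photograph of a {}",
]

def ensemble_label_embedding(encode_text, label, templates=TEMPLATES):
    """Average normalized text embeddings over several prompt templates,
    then re-normalize. This usually beats a single bare label."""
    prompts = [t.format(label) for t in templates]
    embs = encode_text(prompts)        # -> [num_templates, dim]
    embs = F.normalize(embs, dim=-1)   # normalize each template embedding
    mean = embs.mean(dim=0)            # average over templates
    return F.normalize(mean, dim=0)    # unit-norm ensemble embedding
```

Classify an image by comparing its normalized embedding against one ensemble embedding per label and taking the argmax.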
