
Comprehensive Multimodal Segment Anything

An all-in-one skill covering foundation models and image segmentation. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill | Cliptics | AI research | v1.0.0 | MIT

Segment Anything Model (SAM) -- Comprehensive Multimodal Segmentation

Overview

A comprehensive skill for zero-shot image segmentation using Meta AI's Segment Anything Model (SAM). SAM enables segmenting any object in any image without task-specific training, using flexible prompt types including points, bounding boxes, and masks. Trained on the SA-1B dataset containing over 1.1 billion masks from 11 million images, SAM delivers state-of-the-art segmentation quality across domains -- from natural photos to medical imaging, satellite imagery, and microscopy. This skill covers the original SAM, SAM 2 for video segmentation, and integration with both the native library and HuggingFace Transformers.

When to Use

  • Segmenting any object in images without task-specific training or fine-tuning
  • Building interactive annotation and labeling tools with click-based prompts
  • Generating high-quality training data for downstream vision models
  • Processing medical, satellite, or domain-specific images with zero-shot transfer
  • Creating automatic segmentation masks for entire images
  • Building object cutout tools, background removal, or compositing pipelines
  • Combining with text-based detectors (GroundingDINO) for text-prompted segmentation

Quick Start

```bash
# Install SAM from GitHub
pip install git+https://github.com/facebookresearch/segment-anything.git

# Required dependencies
pip install opencv-python pycocotools matplotlib

# Download checkpoint (ViT-H -- most accurate, 2.4 GB)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```
```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

# Create predictor and set image
predictor = SamPredictor(sam)
image = cv2.imread("photo.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Segment with a single point click
input_point = np.array([[500, 375]])
input_label = np.array([1])  # 1 = foreground
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # Returns 3 candidate masks
)
best_mask = masks[np.argmax(scores)]
print(f"Best mask IoU score: {scores.max():.3f}")
```

Core Concepts

Architecture Overview

SAM uses a three-component design that separates heavyweight image encoding from lightweight prompt processing:

Input Image ──► Image Encoder (ViT) ──► Image Embeddings (computed once)
                                              │
Prompts (points/boxes/masks) ──► Prompt Encoder ──► Prompt Embeddings
                                              │
                               Image + Prompt Embeddings
                                              │
                                       Mask Decoder (lightweight transformer)
                                              │
                                    Output Masks + IoU Scores

The image encoder runs once per image and produces reusable embeddings. Multiple prompt queries can then be answered efficiently without re-encoding the image.
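This encode-once, prompt-many split is the key to interactive latency. It can be sketched with a stand-in encoder (the class below is purely illustrative of the pattern, not SAM's API; `SamPredictor` does the equivalent internally via `set_image`):

```python
import numpy as np

class CachedEncoder:
    """Mimics SamPredictor's set_image/predict split: the expensive
    encoding runs once; every prompt reuses the cached embedding."""
    def __init__(self):
        self.embedding = None
        self.encode_calls = 0

    def set_image(self, image):
        self.encode_calls += 1          # heavyweight step, run once per image
        self.embedding = image.mean()   # stand-in for ViT image features

    def predict(self, point):
        assert self.embedding is not None, "call set_image first"
        return self.embedding + point.sum()  # stand-in for the mask decoder

enc = CachedEncoder()
enc.set_image(np.ones((4, 4)))
for pt in [np.array([1, 2]), np.array([3, 4])]:
    enc.predict(pt)
print(enc.encode_calls)  # 1 -- image encoded once despite two prompts
```

In the real predictor the same holds: one `set_image` call, then arbitrarily many `predict` calls at interactive speed.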

Model Variants

| Model | Checkpoint | Parameters | Size | Relative Speed | Accuracy |
|-------|------------|------------|------|----------------|----------|
| ViT-H | sam_vit_h_4b8939.pth | 636M | 2.4 GB | Slowest | Best |
| ViT-L | sam_vit_l_0b3195.pth | 308M | 1.2 GB | Medium | Very Good |
| ViT-B | sam_vit_b_01ec64.pth | 91M | 375 MB | Fastest | Good |

Prompt Types

| Prompt Type | Input Format | Best Use Case |
|-------------|--------------|---------------|
| Foreground Point | (x, y) with label 1 | Clicking on an object to select it |
| Background Point | (x, y) with label 0 | Excluding unwanted regions |
| Bounding Box | [x1, y1, x2, y2] | Selecting larger or ambiguous objects |
| Previous Mask | Low-res logits from prior prediction | Iterative refinement of results |
| Combined | Any mix of points, boxes, masks | Precise multi-cue segmentation |

Point Prompt Segmentation

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)

# Single foreground point
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# Multiple points: 2 foreground + 1 background for precision
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375], [520, 390], [300, 200]]),
    point_labels=np.array([1, 1, 0]),
    multimask_output=False,  # Single mask when prompts are unambiguous
)
```

Bounding Box Prompts

```python
# Box prompt: [x_min, y_min, x_max, y_max]
input_box = np.array([425, 600, 700, 875])
masks, scores, logits = predictor.predict(
    box=input_box,
    multimask_output=False,
)
```

Combined Prompts for Precision

```python
# Combine box + point for maximum control
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    box=np.array([400, 300, 700, 600]),
    multimask_output=False,
)
```

Iterative Mask Refinement

```python
# First pass: coarse segmentation
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# Second pass: refine using previous mask logits + additional points
best_logit = logits[np.argmax(scores)]
masks_refined, scores_refined, _ = predictor.predict(
    point_coords=np.array([[500, 375], [550, 400]]),
    point_labels=np.array([1, 0]),        # Add background exclusion
    mask_input=best_logit[None, :, :],    # Feed prior mask back in
    multimask_output=False,
)
```

Automatic Mask Generation

Generate segmentation masks for every object in an image without manual prompts:

```python
from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,               # Density of point grid (32x32 = 1024 points)
    pred_iou_thresh=0.88,             # Minimum predicted IoU quality
    stability_score_thresh=0.95,      # Mask stability threshold
    crop_n_layers=1,                  # Multi-scale cropping layers
    crop_n_points_downscale_factor=2,
    min_mask_region_area=100,         # Remove masks smaller than 100 pixels
)
masks = mask_generator.generate(image_rgb)

# Each mask dict contains:
# - "segmentation": np.ndarray boolean mask
# - "bbox": [x, y, w, h]
# - "area": pixel count
# - "predicted_iou": model confidence score
# - "stability_score": robustness under perturbation
# - "point_coords": the generating sample point

# Sort by area and filter
large_masks = sorted(masks, key=lambda m: m["area"], reverse=True)
high_quality = [m for m in masks if m["predicted_iou"] > 0.92]
```
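The sort-and-filter step generalizes into a small reusable helper. The function name and the synthetic records below are our own; only the dict keys follow the generator's documented output format:

```python
import numpy as np

def filter_masks(masks, min_iou=0.88, min_stability=0.95, min_area=100):
    """Keep mask records meeting all quality thresholds, sorted largest-first.

    Operates on the list-of-dicts format SamAutomaticMaskGenerator returns.
    """
    kept = [
        m for m in masks
        if m["predicted_iou"] >= min_iou
        and m["stability_score"] >= min_stability
        and m["area"] >= min_area
    ]
    return sorted(kept, key=lambda m: m["area"], reverse=True)

# Synthetic records mimicking the generator's output keys
demo = [
    {"predicted_iou": 0.95, "stability_score": 0.97, "area": 5000},
    {"predicted_iou": 0.80, "stability_score": 0.99, "area": 9000},  # low IoU
    {"predicted_iou": 0.93, "stability_score": 0.96, "area": 50},    # tiny fragment
    {"predicted_iou": 0.90, "stability_score": 0.96, "area": 12000},
]
print([m["area"] for m in filter_masks(demo)])  # [12000, 5000]
```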

HuggingFace Transformers Integration

```python
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to("cuda")

image = Image.open("photo.jpg")

# Point-prompted segmentation
input_points = [[[450, 600]]]  # Batch of point sets
inputs = processor(image, input_points=input_points, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)

# Post-process to original image resolution
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)

# Box-prompted segmentation
input_boxes = [[[400, 300, 700, 600]]]
inputs = processor(image, input_boxes=input_boxes, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
```

Text-Prompted Segmentation with GroundingDINO

Combine SAM with a text-based detector for open-vocabulary segmentation:

```python
import cv2
import numpy as np
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Step 1: Detect objects with a text prompt using GroundingDINO
dino_model = load_model(
    "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)
image_source, image_transformed = load_image("photo.jpg")
boxes, logits, phrases = predict(
    model=dino_model,
    image=image_transformed,
    caption="cat . dog . person",
    box_threshold=0.35,
    text_threshold=0.25,
)

# Step 2: Segment detected boxes with SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)
image_rgb = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image_rgb)

# GroundingDINO returns normalized (cx, cy, w, h) boxes; SAM expects
# pixel-space (x1, y1, x2, y2)
h, w, _ = image_rgb.shape
boxes_xyxy = box_convert(
    boxes * torch.tensor([w, h, w, h]), in_fmt="cxcywh", out_fmt="xyxy"
).numpy()
for box_px in boxes_xyxy.astype(int):
    masks, scores, _ = predictor.predict(box=box_px, multimask_output=False)
    # Use masks for downstream tasks
```
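The coordinate handoff is the usual stumbling block in this pipeline: GroundingDINO emits boxes as normalized (cx, cy, w, h) centers, while SAM's predictor wants pixel-space (x1, y1, x2, y2) corners. A dependency-free numpy version of that conversion (the helper name is ours, chosen for illustration):

```python
import numpy as np

def dino_boxes_to_sam(boxes_cxcywh, img_w, img_h):
    """Convert normalized (cx, cy, w, h) boxes to pixel (x1, y1, x2, y2)."""
    b = np.asarray(boxes_cxcywh, dtype=float) * np.array(
        [img_w, img_h, img_w, img_h], dtype=float
    )
    cx, cy, w, h = b[:, 0], b[:, 1], b[:, 2], b[:, 3]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

# One detection covering the center half of a 1000x800 image
boxes = dino_boxes_to_sam([[0.5, 0.5, 0.5, 0.5]], img_w=1000, img_h=800)
print(boxes.astype(int).tolist())  # [[250, 200, 750, 600]]
```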

ONNX Export for Edge Deployment

```python
import torch
from segment_anything.utils.onnx import SamOnnxModel

# Export mask decoder to ONNX (lightweight, ~15 MB)
onnx_model = SamOnnxModel(sam, return_single_mask=True)
dummy_inputs = {
    "image_embeddings": torch.randn(1, 256, 64, 64),
    "point_coords": torch.randint(0, 1024, (1, 2, 2), dtype=torch.float),
    "point_labels": torch.randint(0, 2, (1, 2), dtype=torch.float),
    "mask_input": torch.randn(1, 1, 256, 256),
    "has_mask_input": torch.tensor([1], dtype=torch.float),
    "orig_im_size": torch.tensor([1024, 1024], dtype=torch.float),
}
torch.onnx.export(
    onnx_model,
    tuple(dummy_inputs.values()),
    "sam_decoder.onnx",
    input_names=list(dummy_inputs.keys()),
    output_names=["masks", "iou_predictions", "low_res_masks"],
)
```

Configuration Reference

SamPredictor.predict Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| point_coords | np.ndarray | None | Nx2 array of (x, y) point coordinates |
| point_labels | np.ndarray | None | N-length array, 1=foreground, 0=background |
| box | np.ndarray | None | Length-4 array [x1, y1, x2, y2] |
| mask_input | np.ndarray | None | 1x256x256 low-res mask from prior prediction |
| multimask_output | bool | True | Return 3 candidate masks (True) or 1 (False) |
| return_logits | bool | False | Return raw logits instead of binary masks |
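Shape mismatches in these arguments are a common source of confusing errors. A small pre-flight check against the conventions in the table (the function is a hypothetical convenience of ours, not part of the SAM API):

```python
import numpy as np

def validate_prompts(point_coords=None, point_labels=None, box=None):
    """Raise ValueError if prompt arrays don't match SamPredictor.predict's
    documented shapes: Nx2 points, N labels in {0, 1}, length-4 box."""
    if (point_coords is None) != (point_labels is None):
        raise ValueError("point_coords and point_labels must be given together")
    if point_coords is not None:
        pts, lbls = np.asarray(point_coords), np.asarray(point_labels)
        if pts.ndim != 2 or pts.shape[1] != 2:
            raise ValueError(f"point_coords must be Nx2, got {pts.shape}")
        if lbls.shape != (pts.shape[0],):
            raise ValueError(f"point_labels must have length {pts.shape[0]}")
        if not np.isin(lbls, [0, 1]).all():
            raise ValueError("labels must be 0 (background) or 1 (foreground)")
    if box is not None and np.asarray(box).shape != (4,):
        raise ValueError("box must be a length-4 [x1, y1, x2, y2] array")

validate_prompts(point_coords=[[500, 375]], point_labels=[1],
                 box=[400, 300, 700, 600])  # passes silently
```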

SamAutomaticMaskGenerator Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| points_per_side | int | 32 | Grid sampling density per side |
| points_per_batch | int | 64 | Batch size for point processing |
| pred_iou_thresh | float | 0.88 | Minimum predicted IoU to keep mask |
| stability_score_thresh | float | 0.95 | Minimum stability score |
| stability_score_offset | float | 1.0 | Offset for stability calculation |
| crop_n_layers | int | 0 | Number of multi-scale crop layers |
| crop_n_points_downscale_factor | int | 1 | Point reduction per crop layer |
| min_mask_region_area | int | 0 | Remove masks below this pixel area |
| output_mode | str | "binary_mask" | "binary_mask", "uncompressed_rle", or "coco_rle" |
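The stability score that `stability_score_thresh` and `stability_score_offset` govern measures how robust a mask is to threshold perturbation: threshold the low-res logits at offset above and below the binarization cutoff and take the IoU of the two resulting masks. Since the tighter mask is a subset of the looser one, the IoU reduces to a ratio of areas. A numpy sketch of that definition (our own re-implementation for illustration, not the library code):

```python
import numpy as np

def stability_score(mask_logits, mask_threshold=0.0, offset=1.0):
    """IoU between masks obtained by thresholding the logits at
    (threshold + offset) and (threshold - offset)."""
    tight = (mask_logits > mask_threshold + offset).sum()
    loose = (mask_logits > mask_threshold - offset).sum()
    return tight / loose if loose > 0 else 0.0

# Confident logits (far from the cutoff) barely change -> score of 1
confident = np.where(np.eye(8, dtype=bool), 10.0, -10.0)
print(stability_score(confident))  # 1.0
```

Masks whose boundaries flicker as the threshold moves score low and are discarded by `stability_score_thresh`.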

Best Practices

  1. Compute image embeddings once -- Call predictor.set_image() a single time per image, then run multiple prompt queries against the cached embeddings for interactive workflows.
  2. Use multimask_output=True for ambiguous prompts -- When a single point could match multiple objects, get three candidates and pick the highest-scoring one. Switch to multimask_output=False when prompts are specific (box + points).
  3. Start with ViT-B for prototyping -- The 375 MB ViT-B model is 3-4x faster than ViT-H and sufficient for initial development. Upgrade to ViT-H only when accuracy is critical.
  4. Combine prompts for precision -- A bounding box plus a foreground point consistently outperforms either prompt type alone, especially for irregularly shaped objects.
  5. Filter automatic masks aggressively -- Raise pred_iou_thresh and stability_score_thresh to reduce noisy small masks. Use min_mask_region_area to discard tiny fragments.
  6. Use iterative refinement for difficult objects -- Feed the best logits from a first prediction back as mask_input along with additional corrective points for complex shapes.
  7. Leverage ONNX export for deployment -- Export only the lightweight mask decoder (~15 MB) to ONNX for browser or edge deployment. Pre-compute image embeddings server-side.
  8. Pair with GroundingDINO for text prompts -- SAM itself has no text understanding. Combine it with GroundingDINO or OWLv2 for open-vocabulary segmentation driven by natural language.
  9. Apply GPU memory management -- For batch processing large images, move the model to GPU only during inference and clear CUDA cache between images using torch.cuda.empty_cache().
  10. Consider SAM 2 for video -- If your use case involves video or temporal consistency, use SAM 2 which extends the architecture with memory-based tracking across frames.
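Practice 2's "pick the highest-scoring candidate" relies on SAM's own IoU predictions; when a reference annotation is available, candidates can also be ranked by plain mask IoU. A minimal sketch (the helper and the synthetic masks are ours):

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks of equal shape."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# Pick the candidate closest to a reference mask
reference = np.zeros((10, 10), bool)
reference[2:8, 2:8] = True
candidates = [np.zeros((10, 10), bool) for _ in range(3)]
candidates[0][0:4, 0:4] = True    # partial overlap
candidates[1][2:8, 2:8] = True    # exact match
candidates[2][5:10, 5:10] = True  # partial overlap
best = max(range(3), key=lambda i: mask_iou(candidates[i], reference))
print(best)  # 1
```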

Troubleshooting

Model runs out of GPU memory with large images: SAM resizes images internally to 1024x1024. If memory is still tight, use ViT-B instead of ViT-H, or process on CPU for non-interactive workloads.

Automatic mask generator produces too many overlapping masks: Increase stability_score_thresh to 0.97 and pred_iou_thresh to 0.92. Reduce points_per_side from 32 to 16 for coarser coverage.

Point prompt selects the wrong object: Add a background point (label=0) on the unwanted object to exclude it. Alternatively, provide a bounding box around the intended target.

ONNX export fails with custom model modifications: The ONNX exporter expects the standard SAM architecture. If you have modified layers, trace with torch.jit.trace first or manually adapt the SamOnnxModel wrapper.

HuggingFace Transformers gives different results than native SAM: The Transformers implementation normalizes inputs differently. Ensure you use SamProcessor for preprocessing and post_process_masks for output conversion to match native behavior.
