Computer Vision Strategist

All-in-one agent covering computer vision and image processing. Includes structured workflows, validation checks, and reusable patterns for data/AI work.

AgentCliptics · data-ai · v1.0.0 · MIT

An agent for designing and implementing production-grade computer vision systems, covering model architecture selection, training pipelines, inference optimization, and deployment strategies for image and video processing applications.

When to Use This Agent

Choose Computer Vision Strategist when:

  • Designing computer vision pipelines for detection, classification, or segmentation
  • Selecting model architectures for specific vision tasks and constraints
  • Optimizing inference performance for edge devices or high-throughput servers
  • Building training pipelines with proper data augmentation and validation
  • Implementing real-time video processing or image analysis systems

Consider alternatives when:

  • Working with NLP or text-based AI models (use an NLP agent)
  • Doing general data science without vision components (use a data science agent)
  • Building web UIs with image uploads but no CV processing (use a frontend agent)

Quick Start

```yaml
# .claude/agents/computer-vision-strategist.yml
name: Computer Vision Strategist
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior computer vision engineer. Design production-grade vision
  systems covering model selection, training, optimization, and deployment.
  Prioritize inference speed, accuracy trade-offs, and operational simplicity.
```

Example invocation:

claude --agent computer-vision-strategist "Design a real-time object detection pipeline for retail shelf monitoring that runs on edge devices with 4GB RAM and no GPU"

Core Concepts

Task-Model Selection Matrix

| Task | Recommended Models | Speed/Accuracy Trade-off |
| --- | --- | --- |
| Image Classification | EfficientNet, ConvNeXt, ViT | EfficientNet-B0 (fast) → ViT-L (accurate) |
| Object Detection | YOLOv8, RT-DETR, DINO | YOLOv8n (fast) → DINO (accurate) |
| Semantic Segmentation | DeepLabV3+, SegFormer | SegFormer-B0 (fast) → SegFormer-B5 (accurate) |
| Instance Segmentation | Mask R-CNN, YOLACT, SAM | YOLACT (fast) → SAM (versatile) |
| Pose Estimation | MediaPipe, RTMPose | MediaPipe (edge) → RTMPose-L (accurate) |
| OCR | PaddleOCR, TrOCR | PaddleOCR (fast) → TrOCR (accurate) |
| Video Action | SlowFast, VideoMAE | SlowFast-R50 (fast) → VideoMAE-L (accurate) |
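
As a rough illustration, the matrix above can be encoded as a lookup table. The `MODEL_MATRIX` dict and `select_model` helper below are purely illustrative sketches of how an agent might pick a starting point, not part of any real library:

```python
# Sketch: encode the task-model matrix as a lookup keyed by task and
# speed/accuracy priority. Model names mirror the table above.
MODEL_MATRIX = {
    "classification": {"fast": "EfficientNet-B0", "accurate": "ViT-L"},
    "detection":      {"fast": "YOLOv8n",         "accurate": "DINO"},
    "semantic_seg":   {"fast": "SegFormer-B0",    "accurate": "SegFormer-B5"},
    "instance_seg":   {"fast": "YOLACT",          "accurate": "SAM"},
    "pose":           {"fast": "MediaPipe",       "accurate": "RTMPose-L"},
    "ocr":            {"fast": "PaddleOCR",       "accurate": "TrOCR"},
    "video_action":   {"fast": "SlowFast-R50",    "accurate": "VideoMAE-L"},
}

def select_model(task: str, priority: str = "fast") -> str:
    """Return a starting-point model for a task and speed/accuracy priority."""
    try:
        return MODEL_MATRIX[task][priority]
    except KeyError:
        raise ValueError(f"unknown task/priority: {task}/{priority}")

print(select_model("detection", "fast"))           # YOLOv8n
print(select_model("classification", "accurate"))  # ViT-L
```

A real selection would also weigh hardware constraints (edge vs. server) and licensing, which the table deliberately leaves out.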

Training Pipeline Architecture

Data Ingestion → Validation → Augmentation → Training → Evaluation
      │              │            │              │           │
  Label QA      Schema check  Albumentations   Checkpoints  Metrics
  Class balance  Corruption   Random crop/flip  Early stop   mAP/F1
  Split verify   Duplicates   Mosaic/MixUp     LR schedule  Confusion
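
A minimal sketch of the augmentation stage, using plain NumPy in place of Albumentations to show what random crop plus horizontal flip do to an image array (the array sizes and the 0.5 flip probability here are illustrative):

```python
import numpy as np

def augment(image: np.ndarray, crop: int, rng: np.random.Generator) -> np.ndarray:
    """Random crop to crop×crop, then horizontal flip with p=0.5.
    A toy stand-in for Albumentations' RandomCrop + HorizontalFlip."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # flip along the width axis
    return patch

rng = np.random.default_rng(0)
img = np.arange(64 * 64 * 3, dtype=np.uint8).reshape(64, 64, 3)
out = augment(img, crop=48, rng=rng)
print(out.shape)  # (48, 48, 3)
```

In a real pipeline the same random parameters must also be applied to labels (boxes, masks), which is exactly what Albumentations handles for you.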

Inference Optimization Path

PyTorch Model (baseline)
    ↓ Export
ONNX Model (2-3x faster)
    ↓ Quantize
INT8 ONNX (2x faster, <1% accuracy loss)
    ↓ Platform-specific
TensorRT (NVIDIA) / CoreML (Apple) / TFLite (Mobile)
    ↓ Additional
Batching + async preprocessing + result caching
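
The last step above (batching plus result caching) can be sketched in plain Python. Here `run_model` is a hypothetical placeholder for a real batched inference call (e.g. an ONNX Runtime session), and the content-hash cache key is one possible choice:

```python
import hashlib

def run_model(batch):
    """Placeholder for a real batched inference call."""
    return [len(item) for item in batch]  # dummy "predictions"

class CachedBatcher:
    """Accumulate requests into micro-batches and cache results by content hash."""
    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.cache = {}

    def predict(self, images):
        keys = [hashlib.sha256(img).hexdigest() for img in images]
        # Deduplicate uncached inputs so each unique image runs at most once.
        misses = {}
        for k, img in zip(keys, images):
            if k not in self.cache and k not in misses:
                misses[k] = img
        items = list(misses.items())
        # Run only the misses, in chunks of batch_size.
        for i in range(0, len(items), self.batch_size):
            chunk = items[i:i + self.batch_size]
            preds = run_model([img for _, img in chunk])
            for (k, _), pred in zip(chunk, preds):
                self.cache[k] = pred
        return [self.cache[k] for k in keys]

b = CachedBatcher(batch_size=2)
print(b.predict([b"aa", b"bbb", b"aa"]))  # [2, 3, 2]; "aa" computed once
```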

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| framework | Deep learning framework | PyTorch |
| export_format | Model export target | ONNX |
| input_resolution | Default input image size | 640×640 |
| batch_size | Inference batch size | 1 |
| quantization | Quantization strategy | FP16 |
| augmentation_lib | Data augmentation library | Albumentations |
| tracking | Experiment tracking tool | Weights & Biases |

Best Practices

  1. Profile before optimizing. Measure where time is actually spent in your pipeline: data loading, preprocessing, model inference, postprocessing, or network transfer. Often preprocessing or postprocessing dominates total latency while engineers focus on making inference faster. Use profiling tools like PyTorch Profiler or NVIDIA Nsight to find real bottlenecks before applying optimizations.
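
A minimal way to see where time goes before reaching for heavier tools like the PyTorch Profiler; the stage names and workloads below are made up for illustration:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

# Hypothetical pipeline: preprocessing often dominates, not inference.
with stage("preprocess"):
    sum(i * i for i in range(200_000))  # stand-in for decode/resize
with stage("inference"):
    sum(i for i in range(50_000))       # stand-in for the model call

for name, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {secs * 1000:.1f} ms")
```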

  2. Start with pretrained models and fine-tune. Training from scratch requires 10-100x more data and compute than fine-tuning. Use models pretrained on ImageNet, COCO, or domain-specific datasets. Fine-tune with a low learning rate (1e-4 to 1e-5) and freeze early layers initially. Unfreeze gradually if more adaptation is needed. This approach works even with a few hundred labeled images.
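
The freeze-then-unfreeze schedule can be expressed as a simple mapping from epoch to trainable layer groups. The group names and the three-epoch threshold below are illustrative, not tied to any framework:

```python
# Layer groups from input to head; earlier groups hold generic features.
GROUPS = ["stem", "stage1", "stage2", "stage3", "head"]

def trainable_groups(epoch: int, unfreeze_every: int = 3) -> list:
    """Start with only the head trainable, then unfreeze one earlier
    group every `unfreeze_every` epochs (gradual unfreezing)."""
    n = 1 + epoch // unfreeze_every  # head + unfrozen backbone groups
    return GROUPS[-min(n, len(GROUPS)):]

print(trainable_groups(0))   # ['head']
print(trainable_groups(3))   # ['stage3', 'head']
print(trainable_groups(12))  # ['stem', 'stage1', 'stage2', 'stage3', 'head']
```

In PyTorch this schedule would translate to toggling `requires_grad` on the corresponding parameter groups at the start of each epoch.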

  3. Build your data pipeline to be the fastest component. Model inference should be the bottleneck, not data loading. Use memory-mapped datasets, multithreaded data loading, and prefetching to keep the GPU fed. Decode images on CPU while the GPU processes the previous batch. A pipeline that starves the GPU wastes expensive compute resources.
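
A sketch of the prefetching idea with a background loader thread and a bounded queue; the `load` function is a stand-in for real image decoding, and the queue depth of 4 is an arbitrary example:

```python
import queue
import threading

def load(index):
    """Stand-in for decoding one image from disk."""
    return index * index

def prefetching_loader(indices, depth=4):
    """Yield loaded items while a background thread keeps the queue full,
    so the consumer (the GPU in a real pipeline) is never starved."""
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def worker():
        for i in indices:
            q.put(load(i))
        q.put(SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        yield item

print(list(prefetching_loader(range(5))))  # [0, 1, 4, 9, 16]
```

PyTorch's `DataLoader` with `num_workers > 0` and `prefetch_factor` implements the same pattern with worker processes instead of a thread.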

  4. Validate on data that matches production conditions. Test sets with studio-quality images will overestimate performance on user-uploaded photos. Include challenging conditions in your validation set: varying lighting, motion blur, partial occlusion, unusual angles, and low resolution. If your production images come from specific cameras, include samples from those exact cameras.

  5. Version your datasets alongside your models. When model performance changes, you need to know whether the data or the model changed. Use DVC or a similar tool to version datasets with the same rigor as code. Track data splits, annotation versions, and augmentation configurations. Reproducible training requires reproducible data.

Common Issues

Model accuracy drops when moving from validation to production. This domain gap typically stems from differences in image quality, lighting, scale, or class distribution between training data and real-world inputs. Address it by collecting and labeling a representative sample of production data, applying domain-appropriate augmentations during training, and monitoring production prediction distributions against validation distributions to catch drift early.
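
One lightweight way to compare production prediction distributions against validation is a population-stability-index style check, sketched here in NumPy (the synthetic score distributions and the usual PSI thresholds are illustrative):

```python
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two score samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e, _ = np.histogram(expected, bins=edges)
    o, _ = np.histogram(observed, bins=edges)
    e = np.clip(e / e.sum(), 1e-6, None)
    o = np.clip(o / o.sum(), 1e-6, None)
    return float(np.sum((o - e) * np.log(o / e)))

rng = np.random.default_rng(0)
val = rng.normal(0.7, 0.1, 5000)    # validation confidence scores
prod = rng.normal(0.5, 0.15, 5000)  # shifted production scores
print(psi(val, val))   # 0.0
print(psi(val, prod) > 0.25)  # drift flagged
```

Running this periodically over model confidence scores (or per-class prediction rates) catches domain shift before accuracy metrics, which require labels, become available.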

Inference is too slow for real-time requirements. Work through the optimization path systematically: export to ONNX, apply FP16 quantization, then platform-specific optimization (TensorRT for NVIDIA GPUs). If still too slow, reduce input resolution (halving resolution gives roughly 4x speedup), use a smaller model variant, or implement temporal tricks for video (run detection every Nth frame and track between detections).
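
The every-Nth-frame trick can be sketched as follows; `detect` and `track` are hypothetical placeholders for a real detector and tracker:

```python
def detect(frame):
    """Placeholder for a full detector pass (expensive)."""
    return [f"box@{frame}"]

def track(prev_boxes, frame):
    """Placeholder for a cheap tracker updating previous boxes."""
    return prev_boxes

def process_video(frames, detect_every=5):
    """Run the detector every Nth frame; track in between."""
    boxes, out = [], []
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            boxes = detect(frame)        # expensive, 1/N of frames
        else:
            boxes = track(boxes, frame)  # cheap, the rest
        out.append(boxes)
    return out

results = process_video(range(7), detect_every=3)
print(results[0], results[2], results[3])  # ['box@0'] ['box@0'] ['box@3']
```

Choosing `detect_every` trades latency for drift: trackers accumulate error between detections, so fast-moving scenes need a smaller N.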

Training loss decreases but validation accuracy stagnates. This classic overfitting pattern in vision models is often caused by insufficient data augmentation or too large a model for the dataset size. Apply stronger augmentations (random erasing, cutout, mixup), use a smaller model backbone, add dropout or weight decay, and verify your validation set doesn't leak into training. If the dataset is genuinely small, consider few-shot learning approaches or synthetic data generation.
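
Mixup, mentioned above, blends pairs of images and their one-hot labels with a randomly drawn weight; a minimal NumPy sketch (the beta parameter 0.2 is a common choice, not prescriptive):

```python
import numpy as np

def mixup(x, y, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner:
    x' = lam*x + (1-lam)*x[perm], and the same for one-hot labels y."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm], lam

rng = np.random.default_rng(0)
x = rng.random((8, 32, 32, 3))        # batch of images
y = np.eye(4)[rng.integers(0, 4, 8)]  # one-hot labels, 4 classes
xm, ym, lam = mixup(x, y, rng=rng)
print(xm.shape, ym.shape)  # (8, 32, 32, 3) (8, 4)
```

Because labels are blended too, the loss must accept soft targets (e.g. cross-entropy against probability vectors rather than class indices).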
