Computer Vision Strategist

All-in-one agent covering computer vision and image processing. Includes structured workflows, validation checks, and reusable patterns for data/AI work.

AgentCliptics · data-ai · v1.0.0 · MIT

An agent for designing and implementing production-grade computer vision systems, covering model architecture selection, training pipelines, inference optimization, and deployment strategies for image and video processing applications.

When to Use This Agent

Choose Computer Vision Strategist when:

  • Designing computer vision pipelines for detection, classification, or segmentation
  • Selecting model architectures for specific vision tasks and constraints
  • Optimizing inference performance for edge devices or high-throughput servers
  • Building training pipelines with proper data augmentation and validation
  • Implementing real-time video processing or image analysis systems

Consider alternatives when:

  • Working with NLP or text-based AI models (use an NLP agent)
  • Doing general data science without vision components (use a data science agent)
  • Building web UIs with image uploads but no CV processing (use a frontend agent)

Quick Start

```yaml
# .claude/agents/computer-vision-strategist.yml
name: Computer Vision Strategist
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior computer vision engineer. Design production-grade vision
  systems covering model selection, training, optimization, and deployment.
  Prioritize inference speed, accuracy trade-offs, and operational simplicity.
```

Example invocation:

claude --agent computer-vision-strategist "Design a real-time object detection pipeline for retail shelf monitoring that runs on edge devices with 4GB RAM and no GPU"

Core Concepts

Task-Model Selection Matrix

| Task | Recommended Models | Speed/Accuracy Trade-off |
| --- | --- | --- |
| Image Classification | EfficientNet, ConvNeXt, ViT | EfficientNet-B0 (fast) → ViT-L (accurate) |
| Object Detection | YOLOv8, RT-DETR, DINO | YOLOv8n (fast) → DINO (accurate) |
| Semantic Segmentation | DeepLabV3+, SegFormer | SegFormer-B0 (fast) → SegFormer-B5 (accurate) |
| Instance Segmentation | Mask R-CNN, YOLACT, SAM | YOLACT (fast) → SAM (versatile) |
| Pose Estimation | MediaPipe, RTMPose | MediaPipe (edge) → RTMPose-L (accurate) |
| OCR | PaddleOCR, TrOCR | PaddleOCR (fast) → TrOCR (accurate) |
| Video Action | SlowFast, VideoMAE | SlowFast-R50 (fast) → VideoMAE-L (accurate) |
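
As a rough illustration, the matrix above can be encoded as a lookup table. The `MODEL_MATRIX` dict and `select_model` helper below are purely illustrative sketches of how an agent might pick a starting point, not part of any real library:

```python
# Sketch: encode the task-model matrix as a lookup keyed by task and
# speed/accuracy priority. Model names mirror the table above.
MODEL_MATRIX = {
    "classification": {"fast": "EfficientNet-B0", "accurate": "ViT-L"},
    "detection":      {"fast": "YOLOv8n",         "accurate": "DINO"},
    "semantic_seg":   {"fast": "SegFormer-B0",    "accurate": "SegFormer-B5"},
    "instance_seg":   {"fast": "YOLACT",          "accurate": "SAM"},
    "pose":           {"fast": "MediaPipe",       "accurate": "RTMPose-L"},
    "ocr":            {"fast": "PaddleOCR",       "accurate": "TrOCR"},
    "video_action":   {"fast": "SlowFast-R50",    "accurate": "VideoMAE-L"},
}

def select_model(task: str, priority: str = "fast") -> str:
    """Return a starting-point model for a task and speed/accuracy priority."""
    try:
        return MODEL_MATRIX[task][priority]
    except KeyError:
        raise ValueError(f"unknown task/priority: {task}/{priority}")

print(select_model("detection", "fast"))           # YOLOv8n
print(select_model("classification", "accurate"))  # ViT-L
```

A real selection would also weigh hardware constraints (edge vs. server) and licensing, which the table deliberately leaves out.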

Training Pipeline Architecture

Data Ingestion → Validation → Augmentation → Training → Evaluation
      │              │            │              │           │
  Label QA      Schema check  Albumentations   Checkpoints  Metrics
  Class balance  Corruption   Random crop/flip  Early stop   mAP/F1
  Split verify   Duplicates   Mosaic/MixUp     LR schedule  Confusion
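
A minimal sketch of the augmentation stage, using plain NumPy in place of Albumentations to show what random crop plus horizontal flip do to an image array (the array sizes and the 0.5 flip probability here are illustrative):

```python
import numpy as np

def augment(image: np.ndarray, crop: int, rng: np.random.Generator) -> np.ndarray:
    """Random crop to crop×crop, then horizontal flip with p=0.5.
    A toy stand-in for Albumentations' RandomCrop + HorizontalFlip."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # flip along the width axis
    return patch

rng = np.random.default_rng(0)
img = np.arange(64 * 64 * 3, dtype=np.uint8).reshape(64, 64, 3)
out = augment(img, crop=48, rng=rng)
print(out.shape)  # (48, 48, 3)
```

In a real pipeline the same random parameters must also be applied to labels (boxes, masks), which is exactly what Albumentations handles for you.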

Inference Optimization Path

PyTorch Model (baseline)
    ↓ Export
ONNX Model (2-3x faster)
    ↓ Quantize
INT8 ONNX (2x faster, <1% accuracy loss)
    ↓ Platform-specific
TensorRT (NVIDIA) / CoreML (Apple) / TFLite (Mobile)
    ↓ Additional
Batching + async preprocessing + result caching
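
The last step above (batching plus result caching) can be sketched in plain Python. Here `run_model` is a hypothetical placeholder for a real batched inference call (e.g. an ONNX Runtime session), and the content-hash cache key is one possible choice:

```python
import hashlib

def run_model(batch):
    """Placeholder for a real batched inference call."""
    return [len(item) for item in batch]  # dummy "predictions"

class CachedBatcher:
    """Accumulate requests into micro-batches and cache results by content hash."""
    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.cache = {}

    def predict(self, images):
        keys = [hashlib.sha256(img).hexdigest() for img in images]
        # Deduplicate uncached inputs so each unique image runs at most once.
        misses = {}
        for k, img in zip(keys, images):
            if k not in self.cache and k not in misses:
                misses[k] = img
        items = list(misses.items())
        # Run only the misses, in chunks of batch_size.
        for i in range(0, len(items), self.batch_size):
            chunk = items[i:i + self.batch_size]
            preds = run_model([img for _, img in chunk])
            for (k, _), pred in zip(chunk, preds):
                self.cache[k] = pred
        return [self.cache[k] for k in keys]

b = CachedBatcher(batch_size=2)
print(b.predict([b"aa", b"bbb", b"aa"]))  # [2, 3, 2]; "aa" computed once
```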

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| framework | Deep learning framework | PyTorch |
| export_format | Model export target | ONNX |
| input_resolution | Default input image size | 640×640 |
| batch_size | Inference batch size | 1 |
| quantization | Quantization strategy | FP16 |
| augmentation_lib | Data augmentation library | Albumentations |
| tracking | Experiment tracking tool | Weights & Biases |

Best Practices

  1. Profile before optimizing. Measure where time is actually spent in your pipeline: data loading, preprocessing, model inference, postprocessing, or network transfer. Often preprocessing or postprocessing dominates total latency while engineers focus on making inference faster. Use profiling tools like PyTorch Profiler or NVIDIA Nsight to find real bottlenecks before applying optimizations.
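
A minimal way to see where time goes before reaching for heavier tools like the PyTorch Profiler; the stage names and workloads below are made up for illustration:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

# Hypothetical pipeline: preprocessing often dominates, not inference.
with stage("preprocess"):
    sum(i * i for i in range(200_000))  # stand-in for decode/resize
with stage("inference"):
    sum(i for i in range(50_000))       # stand-in for the model call

for name, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {secs * 1000:.1f} ms")
```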

  2. Start with pretrained models and fine-tune. Training from scratch requires 10-100x more data and compute than fine-tuning. Use models pretrained on ImageNet, COCO, or domain-specific datasets. Fine-tune with a low learning rate (1e-4 to 1e-5) and freeze early layers initially. Unfreeze gradually if more adaptation is needed. This approach works even with a few hundred labeled images.
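
The freeze-then-unfreeze schedule can be expressed as a simple mapping from epoch to trainable layer groups. The group names and the three-epoch threshold below are illustrative, not tied to any framework:

```python
# Layer groups from input to head; earlier groups hold generic features.
GROUPS = ["stem", "stage1", "stage2", "stage3", "head"]

def trainable_groups(epoch: int, unfreeze_every: int = 3) -> list:
    """Start with only the head trainable, then unfreeze one earlier
    group every `unfreeze_every` epochs (gradual unfreezing)."""
    n = 1 + epoch // unfreeze_every  # head + unfrozen backbone groups
    return GROUPS[-min(n, len(GROUPS)):]

print(trainable_groups(0))   # ['head']
print(trainable_groups(3))   # ['stage3', 'head']
print(trainable_groups(12))  # ['stem', 'stage1', 'stage2', 'stage3', 'head']
```

In PyTorch this schedule would translate to toggling `requires_grad` on the corresponding parameter groups at the start of each epoch.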

  3. Build your data pipeline to be the fastest component. Model inference should be the bottleneck, not data loading. Use memory-mapped datasets, multithreaded data loading, and prefetching to keep the GPU fed. Decode images on CPU while the GPU processes the previous batch. A pipeline that starves the GPU wastes expensive compute resources.
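
A sketch of the prefetching idea with a background loader thread and a bounded queue; the `load` function is a stand-in for real image decoding, and the queue depth of 4 is an arbitrary example:

```python
import queue
import threading

def load(index):
    """Stand-in for decoding one image from disk."""
    return index * index

def prefetching_loader(indices, depth=4):
    """Yield loaded items while a background thread keeps the queue full,
    so the consumer (the GPU in a real pipeline) is never starved."""
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def worker():
        for i in indices:
            q.put(load(i))
        q.put(SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        yield item

print(list(prefetching_loader(range(5))))  # [0, 1, 4, 9, 16]
```

PyTorch's `DataLoader` with `num_workers > 0` and `prefetch_factor` implements the same pattern with worker processes instead of a thread.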

  4. Validate on data that matches production conditions. Test sets with studio-quality images will overestimate performance on user-uploaded photos. Include challenging conditions in your validation set: varying lighting, motion blur, partial occlusion, unusual angles, and low resolution. If your production images come from specific cameras, include samples from those exact cameras.

  5. Version your datasets alongside your models. When model performance changes, you need to know whether the data or the model changed. Use DVC or a similar tool to version datasets with the same rigor as code. Track data splits, annotation versions, and augmentation configurations. Reproducible training requires reproducible data.

Common Issues

Model accuracy drops when moving from validation to production. This domain gap typically stems from differences in image quality, lighting, scale, or class distribution between training data and real-world inputs. Address it by collecting and labeling a representative sample of production data, applying domain-appropriate augmentations during training, and monitoring production prediction distributions against validation distributions to catch drift early.
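
One lightweight way to compare production prediction distributions against validation is a population-stability-index style check, sketched here in NumPy (the synthetic score distributions and the usual PSI thresholds are illustrative):

```python
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two score samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e, _ = np.histogram(expected, bins=edges)
    o, _ = np.histogram(observed, bins=edges)
    e = np.clip(e / e.sum(), 1e-6, None)
    o = np.clip(o / o.sum(), 1e-6, None)
    return float(np.sum((o - e) * np.log(o / e)))

rng = np.random.default_rng(0)
val = rng.normal(0.7, 0.1, 5000)    # validation confidence scores
prod = rng.normal(0.5, 0.15, 5000)  # shifted production scores
print(psi(val, val))   # 0.0
print(psi(val, prod) > 0.25)  # drift flagged
```

Running this periodically over model confidence scores (or per-class prediction rates) catches domain shift before accuracy metrics, which require labels, become available.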

Inference is too slow for real-time requirements. Work through the optimization path systematically: export to ONNX, apply FP16 quantization, then platform-specific optimization (TensorRT for NVIDIA GPUs). If still too slow, reduce input resolution (halving resolution gives roughly 4x speedup), use a smaller model variant, or implement temporal tricks for video (run detection every Nth frame and track between detections).
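
The every-Nth-frame trick can be sketched as follows; `detect` and `track` are hypothetical placeholders for a real detector and tracker:

```python
def detect(frame):
    """Placeholder for a full detector pass (expensive)."""
    return [f"box@{frame}"]

def track(prev_boxes, frame):
    """Placeholder for a cheap tracker updating previous boxes."""
    return prev_boxes

def process_video(frames, detect_every=5):
    """Run the detector every Nth frame; track in between."""
    boxes, out = [], []
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            boxes = detect(frame)        # expensive, 1/N of frames
        else:
            boxes = track(boxes, frame)  # cheap, the rest
        out.append(boxes)
    return out

results = process_video(range(7), detect_every=3)
print(results[0], results[2], results[3])  # ['box@0'] ['box@0'] ['box@3']
```

Choosing `detect_every` trades latency for drift: trackers accumulate error between detections, so fast-moving scenes need a smaller N.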

Training loss decreases but validation accuracy stagnates. This classic overfitting pattern in vision models is often caused by insufficient data augmentation or too large a model for the dataset size. Apply stronger augmentations (random erasing, cutout, mixup), use a smaller model backbone, add dropout or weight decay, and verify your validation set doesn't leak into training. If the dataset is genuinely small, consider few-shot learning approaches or synthetic data generation.
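
Mixup, mentioned above, blends pairs of images and their one-hot labels with a randomly drawn weight; a minimal NumPy sketch (the beta parameter 0.2 is a common choice, not prescriptive):

```python
import numpy as np

def mixup(x, y, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner:
    x' = lam*x + (1-lam)*x[perm], and the same for one-hot labels y."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm], lam

rng = np.random.default_rng(0)
x = rng.random((8, 32, 32, 3))        # batch of images
y = np.eye(4)[rng.integers(0, 4, 8)]  # one-hot labels, 4 classes
xm, ym, lam = mixup(x, y, rng=rng)
print(xm.shape, ym.shape)  # (8, 32, 32, 3) (8, 4)
```

Because labels are blended too, the loss must accept soft targets (e.g. cross-entropy against probability vectors rather than class indices).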
