Computer Vision Strategist
An agent for designing and implementing production-grade computer vision systems, covering model architecture selection, training pipelines, inference optimization, and deployment strategies for image and video processing applications.
When to Use This Agent
Choose Computer Vision Strategist when:
- Designing computer vision pipelines for detection, classification, or segmentation
- Selecting model architectures for specific vision tasks and constraints
- Optimizing inference performance for edge devices or high-throughput servers
- Building training pipelines with proper data augmentation and validation
- Implementing real-time video processing or image analysis systems
Consider alternatives when:
- Working with NLP or text-based AI models (use an NLP agent)
- Doing general data science without vision components (use a data science agent)
- Building web UIs with image uploads but no CV processing (use a frontend agent)
Quick Start
```yaml
# .claude/agents/computer-vision-strategist.yml
name: Computer Vision Strategist
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior computer vision engineer. Design production-grade
  vision systems covering model selection, training, optimization, and
  deployment. Prioritize inference speed, accuracy trade-offs, and
  operational simplicity.
```
Example invocation:
```bash
claude --agent computer-vision-strategist "Design a real-time object detection pipeline for retail shelf monitoring that runs on edge devices with 4GB RAM and no GPU"
```
Core Concepts
Task-Model Selection Matrix
| Task | Recommended Models | Speed/Accuracy Trade-off |
|---|---|---|
| Image Classification | EfficientNet, ConvNeXt, ViT | EfficientNet-B0 (fast) → ViT-L (accurate) |
| Object Detection | YOLOv8, RT-DETR, DINO | YOLOv8n (fast) → DINO (accurate) |
| Semantic Segmentation | DeepLabV3+, SegFormer | SegFormer-B0 (fast) → B5 (accurate) |
| Instance Segmentation | Mask R-CNN, YOLACT, SAM | YOLACT (fast) → SAM (versatile) |
| Pose Estimation | MediaPipe, RTMPose | MediaPipe (edge) → RTMPose-L (accurate) |
| OCR | PaddleOCR, TrOCR | PaddleOCR (fast) → TrOCR (accurate) |
| Video Action | SlowFast, VideoMAE | SlowFast-R50 (fast) → VideoMAE-L (accurate) |
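The matrix above can be encoded as a small lookup helper so a pipeline can pick a starting-point architecture programmatically. The `recommend_model` function and the dictionary below are illustrative conveniences that mirror the table, not part of any library's API.

```python
# Illustrative lookup encoding the task/model matrix above. Model names
# and the fast/accurate split mirror the table; the helper itself is a
# hypothetical convenience, not a library API.
MODEL_MATRIX = {
    "classification": {"fast": "EfficientNet-B0", "accurate": "ViT-L"},
    "detection": {"fast": "YOLOv8n", "accurate": "DINO"},
    "semantic_segmentation": {"fast": "SegFormer-B0", "accurate": "SegFormer-B5"},
    "instance_segmentation": {"fast": "YOLACT", "accurate": "SAM"},
    "pose": {"fast": "MediaPipe", "accurate": "RTMPose-L"},
    "ocr": {"fast": "PaddleOCR", "accurate": "TrOCR"},
    "video_action": {"fast": "SlowFast-R50", "accurate": "VideoMAE-L"},
}

def recommend_model(task: str, priority: str = "fast") -> str:
    """Return a starting-point model for a task and speed/accuracy priority."""
    try:
        return MODEL_MATRIX[task][priority]
    except KeyError:
        raise ValueError(f"unknown task/priority: {task}/{priority}")
```

Treat the result as a default to benchmark against, not a final choice; the right model still depends on your latency budget and hardware.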
Training Pipeline Architecture
Data Ingestion → Validation → Augmentation → Training → Evaluation

| Stage | Checks / Techniques |
|---|---|
| Data Ingestion | Label QA, class balance, split verification |
| Validation | Schema check, corruption scan, duplicate detection |
| Augmentation | Albumentations, random crop/flip, Mosaic/MixUp |
| Training | Checkpoints, early stopping, LR schedule |
| Evaluation | Metrics, mAP/F1, confusion matrix |
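Two of the validation-stage checks above can be sketched in a few lines: duplicate detection via content hashing and a class-balance check. This is a minimal stdlib sketch; the in-memory `samples` mapping and the 10:1 imbalance threshold are assumptions for illustration.

```python
# Minimal sketch of the Validation stage: duplicate detection via
# content hashing and a class-balance check. The in-memory sample
# mapping and the imbalance threshold are illustrative assumptions.
import hashlib
from collections import Counter

def find_duplicates(samples: dict) -> list:
    """Return (path_a, path_b) pairs whose image bytes hash identically."""
    seen = {}
    dupes = []
    for path, data in samples.items():
        h = hashlib.sha256(data).hexdigest()
        if h in seen:
            dupes.append((seen[h], path))
        else:
            seen[h] = path
    return dupes

def class_imbalance(labels: list, max_ratio: float = 10.0) -> bool:
    """True if the most common class outnumbers the rarest by > max_ratio."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values()) > max_ratio
```

Running these before training is cheap insurance: duplicates that straddle the train/val split silently inflate validation metrics.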
Inference Optimization Path
```
PyTorch model (baseline)
  ↓ export
ONNX model (2-3x faster)
  ↓ quantize
INT8 ONNX (2x faster, <1% accuracy loss)
  ↓ platform-specific
TensorRT (NVIDIA) / CoreML (Apple) / TFLite (mobile)
  ↓ additional
Batching + async preprocessing + result caching
```
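The final "batching" step is framework-agnostic and worth sketching: group incoming frames into fixed-size batches so one model call amortizes per-call overhead. In this sketch `batch_infer` is a stand-in for any exported model's batched forward pass, not a real API.

```python
# Sketch of the batching step: group frames into fixed-size batches so
# one model call amortizes per-call overhead. batch_infer is a stand-in
# for any exported model's batched forward pass.
from typing import Callable, Iterable, Iterator, List

def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

def run_batched(frames: Iterable, batch_infer: Callable, batch_size: int = 8) -> list:
    results = []
    for batch in batched(frames, batch_size):
        results.extend(batch_infer(batch))
    return results
```

For latency-sensitive services, cap how long a partial batch may wait before it is flushed; throughput gains are worthless if they blow the per-request deadline.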
Configuration
| Parameter | Description | Default |
|---|---|---|
| framework | Deep learning framework | PyTorch |
| export_format | Model export target | ONNX |
| input_resolution | Default input image size | 640×640 |
| batch_size | Inference batch size | 1 |
| quantization | Quantization strategy | FP16 |
| augmentation_lib | Data augmentation library | Albumentations |
| tracking | Experiment tracking tool | Weights & Biases |
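One way to carry these parameters through a pipeline is a typed config object. The defaults below mirror the table; the `CVConfig` class itself is an illustrative convention, not part of any agent framework.

```python
# The configuration table as a typed config object. Defaults mirror the
# table; the class is an illustrative convention, not a framework API.
from dataclasses import dataclass

@dataclass
class CVConfig:
    framework: str = "PyTorch"
    export_format: str = "ONNX"
    input_resolution: tuple = (640, 640)
    batch_size: int = 1
    quantization: str = "FP16"
    augmentation_lib: str = "Albumentations"
    tracking: str = "Weights & Biases"

# Override only what differs from the defaults:
edge_cfg = CVConfig(batch_size=8, quantization="INT8")
```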
Best Practices
- Profile before optimizing. Measure where time is actually spent in your pipeline: data loading, preprocessing, model inference, postprocessing, or network transfer. Often preprocessing or postprocessing dominates total latency while engineers focus on making inference faster. Use profiling tools like PyTorch Profiler or NVIDIA Nsight to find real bottlenecks before applying optimizations.
- Start with pretrained models and fine-tune. Training from scratch requires 10-100x more data and compute than fine-tuning. Use models pretrained on ImageNet, COCO, or domain-specific datasets. Fine-tune with a low learning rate (1e-4 to 1e-5) and freeze early layers initially. Unfreeze gradually if more adaptation is needed. This approach works even with a few hundred labeled images.
- Build your data pipeline to be the fastest component. Model inference should be the bottleneck, not data loading. Use memory-mapped datasets, multithreaded data loading, and prefetching to keep the GPU fed. Decode images on CPU while the GPU processes the previous batch. A pipeline that starves the GPU wastes expensive compute resources.
- Validate on data that matches production conditions. Test sets with studio-quality images will overestimate performance on user-uploaded photos. Include challenging conditions in your validation set: varying lighting, motion blur, partial occlusion, unusual angles, and low resolution. If your production images come from specific cameras, include samples from those exact cameras.
- Version your datasets alongside your models. When model performance changes, you need to know whether the data or the model changed. Use DVC or a similar tool to version datasets with the same rigor as code. Track data splits, annotation versions, and augmentation configurations. Reproducible training requires reproducible data.
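The "profile before optimizing" practice above can start simpler than a full profiler: a stage timer that attributes wall-clock time to each pipeline step. This sketch measures host-side wall time only; for GPU kernel timing you still want PyTorch Profiler or Nsight as noted above.

```python
# Minimal stage timer in the spirit of "profile before optimizing":
# wrap each pipeline stage to see where wall-clock time actually goes.
# Host-side wall time only; use PyTorch Profiler or Nsight for GPU work.
import time
from collections import defaultdict
from contextlib import contextmanager

STAGE_TIMES = defaultdict(float)

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        STAGE_TIMES[name] += time.perf_counter() - start

# Stand-in workloads for illustration:
with stage("preprocess"):
    time.sleep(0.01)
with stage("inference"):
    time.sleep(0.03)

slowest = max(STAGE_TIMES, key=STAGE_TIMES.get)
```

Accumulating times across many frames (rather than timing one) smooths out scheduler noise and catches stages that are cheap per call but hot in aggregate.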
Common Issues
Model accuracy drops when moving from validation to production. This domain gap typically stems from differences in image quality, lighting, scale, or class distribution between training data and real-world inputs. Address it by collecting and labeling a representative sample of production data, applying domain-appropriate augmentations during training, and monitoring production prediction distributions against validation distributions to catch drift early.
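The distribution monitoring suggested above can be sketched with a simple metric: total variation distance between the validation-time and production-time class frequencies. The 0.2 alert threshold below is an arbitrary illustration to tune per application, not a standard value.

```python
# Sketch of drift monitoring: compare production prediction frequencies
# against the validation distribution via total variation distance
# (0 = identical, 1 = disjoint). The alert threshold is illustrative.
from collections import Counter

def class_distribution(labels: list) -> dict:
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def distribution_drift(reference: list, production: list) -> float:
    ref, prod = class_distribution(reference), class_distribution(production)
    classes = set(ref) | set(prod)
    return 0.5 * sum(abs(ref.get(c, 0.0) - prod.get(c, 0.0)) for c in classes)

def drift_alert(reference: list, production: list, threshold: float = 0.2) -> bool:
    return distribution_drift(reference, production) > threshold
```

Run this over sliding windows of production predictions so a gradual shift trips the alert before accuracy visibly degrades.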
Inference is too slow for real-time requirements. Work through the optimization path systematically: export to ONNX, apply FP16 quantization, then platform-specific optimization (TensorRT for NVIDIA GPUs). If still too slow, reduce input resolution (halving resolution gives roughly 4x speedup), use a smaller model variant, or implement temporal tricks for video (run detection every Nth frame and track between detections).
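The every-Nth-frame trick above is easy to get subtly wrong, so here is a minimal sketch. `detect` stands in for any detector; a production system would update the cached boxes with a tracker (e.g. a Kalman-filter tracker) on skipped frames instead of reusing them verbatim.

```python
# Sketch of the temporal trick: run the expensive detector every Nth
# frame and reuse the last result in between. A real system would track
# (not just reuse) boxes on the skipped frames.
from typing import Callable, Iterable, Iterator, Tuple

def detect_every_nth(frames: Iterable, detect: Callable, n: int = 5) -> Iterator[Tuple[int, object]]:
    last = None
    for i, frame in enumerate(frames):
        if i % n == 0:
            last = detect(frame)   # fresh detection
        yield i, last              # cached result on skipped frames

# Verify the detector only fires on every Nth frame:
calls = []
def fake_detect(frame):
    calls.append(frame)
    return f"boxes@{frame}"

results = list(detect_every_nth(range(10), fake_detect, n=5))
```

With n=5 this cuts detector invocations 5x at the cost of up to four frames of box staleness, which is usually acceptable for slow-moving scenes like shelf monitoring.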
Training loss decreases but validation accuracy stagnates. This classic overfitting pattern in vision models is often caused by insufficient data augmentation or too large a model for the dataset size. Apply stronger augmentations (random erasing, cutout, mixup), use a smaller model backbone, add dropout or weight decay, and verify your validation set doesn't leak into training. If the dataset is genuinely small, consider few-shot learning approaches or synthetic data generation.
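Of the stronger augmentations listed above, mixup is the least obvious, so here is its core idea in isolation: blend two samples and their one-hot labels with a Beta-distributed weight. Pure-Python lists are used for clarity; real pipelines apply this to batched tensors.

```python
# Sketch of mixup: a convex combination of two (sample, one-hot label)
# pairs with a Beta(alpha, alpha)-distributed mixing weight. Pure-Python
# for clarity; real pipelines do this on batched tensors.
import random

def mixup(x1: list, y1: list, x2: list, y2: list, alpha: float = 0.4):
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

random.seed(0)  # for a reproducible example
x, y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

The blended soft labels are what regularize the model: it can no longer be rewarded for fully confident predictions on interpolated inputs.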