Guide Model Navigator
An autonomous agent that helps developers select the right AI model for their use case — comparing capabilities, benchmarks, pricing, and integration requirements across open-source and commercial model ecosystems.
When to Use This Agent
Choose Guide Model Navigator when:
- You need to select an AI model and are overwhelmed by the choices (100s of models on HuggingFace, multiple commercial APIs)
- You want objective benchmark comparisons for your specific task type
- You need to balance quality, cost, latency, and deployment constraints
- You are evaluating open-source vs commercial models for your use case
Consider alternatives when:
- You have already selected a model and need deployment help (use an LLM architect agent)
- You need to train a custom model from scratch (use an ML engineer agent)
- You are comparing non-AI tools or services
Quick Start
```yaml
# .claude/agents/model-navigator.yml
name: guide-model-navigator
description: Select the optimal AI model for your use case
agent_prompt: |
  You are a Model Navigator. Help users select the right AI model.

  Evaluation process:
  1. Understand the task: classification, generation, embedding, etc.
  2. Define constraints: budget, latency, privacy, deployment env
  3. Shortlist candidates from commercial and open-source options
  4. Compare on relevant benchmarks and real-world performance
  5. Recommend primary + backup models with justification
  6. Provide integration quickstart code

  Always consider: self-hosted options for privacy-sensitive use cases,
  API options for speed-to-market, and emerging models that may be underpriced.
```
Example invocation:
```
claude "We need an embedding model for semantic search over 2M technical documents. Self-hosted preferred, budget $500/month for GPU."
```
Sample recommendation:
```
Model Recommendation — Technical Document Embeddings
────────────────────────────────────────────────

Primary: BGE-large-en-v1.5 (BAAI)
  Dimensions: 1024
  MTEB Score: 63.98 (top-5 for retrieval)
  Inference: ~250 docs/sec on A10G
  Memory: 1.3GB (fits single GPU)
  License: MIT (commercial use OK)

Alternative: E5-mistral-7b-instruct
  Dimensions: 4096
  MTEB Score: 66.63 (higher quality)
  Inference: ~80 docs/sec on A10G
  Memory: 14GB (needs larger GPU)
  Consideration: 3x slower, slightly better quality

NOT Recommended:
  OpenAI text-embedding-3-large: $0.13/1M tokens
  At 2M docs ≈ $520 initial + $100/month for updates
  Comparable quality, but data leaves your infrastructure

Deployment: BGE-large on single A10G ($0.60/hr spot)
  Initial embedding: 2M docs × 4 chunks avg = 8M chunks
  Time: ~9 hours (one-time), Cost: ~$5.40
  Monthly re-embedding budget: well within $500
```
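The deployment math in the recommendation above can be checked in a few lines. The throughput and spot-pricing figures come straight from the recommendation and are estimates, not guarantees:

```python
# Back-of-envelope cost check for the BGE-large deployment above.
# Assumptions (from the recommendation): 2M docs x 4 chunks each,
# ~250 chunks/sec on an A10G, $0.60/hr spot pricing.
chunks = 2_000_000 * 4           # 8M chunks total
throughput = 250                 # chunks/sec on A10G
hours = chunks / throughput / 3600
cost = hours * 0.60

print(f"{hours:.1f} hours, ${cost:.2f}")  # → 8.9 hours, $5.33
```

This confirms the quoted "~9 hours, ~$5.40" figure and shows how far under the $500/month budget a one-time embedding pass sits.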
Core Concepts
Model Selection Decision Tree
| Task Type | Budget | Privacy | Recommended Category |
|---|---|---|---|
| Text Generation | < $100/mo | Flexible | Commercial API (Claude, GPT) |
| Text Generation | > $1K/mo | Strict | Self-hosted (Llama, Mistral) |
| Embeddings | Any | Flexible | API (OpenAI, Cohere) |
| Embeddings | Any | Strict | Self-hosted (BGE, E5, GTE) |
| Classification | Low | Any | Fine-tuned small model |
| Image Generation | Any | Flexible | API (DALL-E, Stable Diffusion) |
| Code Generation | Any | Strict | Self-hosted (CodeLlama, DeepSeek) |
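The decision tree above can be sketched as a lookup function. The thresholds and category strings simply mirror the table; they are illustrative defaults, not authoritative rules:

```python
def recommend_category(task: str, monthly_budget: float, privacy: str) -> str:
    """Mirror of the decision-tree table; values are illustrative."""
    if task == "embedding":
        return "Self-hosted (BGE, E5, GTE)" if privacy == "strict" else "API (OpenAI, Cohere)"
    if task == "generation":
        # Strict privacy or a large budget favors self-hosting
        if privacy == "strict" or monthly_budget > 1000:
            return "Self-hosted (Llama, Mistral)"
        return "Commercial API (Claude, GPT)"
    if task == "classification":
        return "Fine-tuned small model"
    if task == "image":
        return "API (DALL-E, Stable Diffusion)"
    if task == "code":
        return "Self-hosted (CodeLlama, DeepSeek)"
    raise ValueError(f"unknown task: {task}")
```

In practice these branches should be tuned to your own constraints; the table's "Any" budget rows, for example, collapse to the privacy check alone.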
Benchmark Comparison Framework
```python
# Model evaluation script
import time


def evaluate_model(model, test_dataset, metrics):
    results = {"quality": {}, "performance": {}, "cost": {}}

    # Run predictions and time the whole pass
    predictions = []
    start = time.time()
    for sample in test_dataset:
        predictions.append(model.predict(sample["input"]))
    total_time = time.time() - start

    # Quality metrics (calculate_accuracy / calculate_f1 defined elsewhere)
    results["quality"]["accuracy"] = calculate_accuracy(predictions, test_dataset)
    results["quality"]["f1"] = calculate_f1(predictions, test_dataset)

    # Performance metrics
    results["performance"]["total_time"] = total_time
    results["performance"]["per_sample_ms"] = total_time / len(test_dataset) * 1000
    results["performance"]["throughput"] = len(test_dataset) / total_time

    # Cost estimation
    results["cost"]["per_1k_requests"] = model.estimate_cost(1000)
    results["cost"]["monthly_at_10k_daily"] = model.estimate_cost(10000 * 30)

    return results
```
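The performance portion of this framework can be isolated as a small self-contained helper. The `time_batch` name and the toy `predict` callable below are illustrative, not part of any library:

```python
import time


def time_batch(predict, samples):
    """Measure wall-clock latency and throughput for any predict callable."""
    start = time.perf_counter()
    outputs = [predict(s) for s in samples]
    elapsed = time.perf_counter() - start
    return {
        "per_sample_ms": elapsed / len(samples) * 1000,
        "throughput": len(samples) / elapsed,  # samples per second
        "outputs": outputs,
    }


# Toy stand-in for a real model call:
stats = time_batch(lambda s: s.upper(), ["a", "b", "c"])
```

Swapping the lambda for a real API client or local-inference call gives comparable per-sample latency numbers across candidate models.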
Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| `taskType` | string | `"generation"` | Task: generation, embedding, classification, code |
| `deploymentEnv` | string | `"cloud"` | Environment: cloud, on-premise, edge |
| `privacyLevel` | string | `"standard"` | Privacy: standard, strict, air-gapped |
| `monthlyBudget` | number | `500` | Maximum monthly cost in USD |
| `latencyTarget` | number | `1000` | Target latency in ms per request |
| `includeBenchmarks` | boolean | `true` | Run benchmark comparisons |
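A hypothetical config combining these options might look like the following; the file path and exact schema are assumptions that mirror the Quick Start format:

```yaml
# .claude/agents/model-navigator.config.yml (illustrative)
taskType: embedding        # generation | embedding | classification | code
deploymentEnv: on-premise  # cloud | on-premise | edge
privacyLevel: strict       # standard | strict | air-gapped
monthlyBudget: 500         # USD
latencyTarget: 200         # ms per request
includeBenchmarks: true
```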
Best Practices
- **Test with YOUR data, not just public benchmarks** — MMLU and HumanEval scores indicate general capability, but your specific use case may differ significantly. Create a 50-100 sample evaluation set from your actual data and compare models on it. A model that scores 5% lower on public benchmarks may score 10% higher on your domain.
- **Factor in total cost of ownership, not just per-token price** — Self-hosted models have no per-token cost but require GPU infrastructure, DevOps time, and maintenance. A $0.50/hr GPU running 24/7 costs $360/month before engineering time. Compare total cost including infrastructure, setup time, and ongoing maintenance.
- **Always have a fallback model** — No single provider guarantees 100% uptime. Select a backup model from a different provider that produces acceptable (not perfect) output, and test the fallback monthly to ensure it still works with your current prompts and output parsers.
- **Re-evaluate models every 3-6 months** — The AI model landscape changes rapidly. A model that was best-in-class six months ago may be outperformed by a new release that costs half as much. Schedule quarterly model reviews to ensure you are not overpaying or missing quality improvements.
- **Start with the smallest model that meets your quality bar** — Larger models are not always better for your specific task. A fine-tuned 7B model often outperforms a general-purpose 70B model on domain-specific tasks. Test small models first and scale up only if quality is insufficient.
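The total-cost-of-ownership advice above reduces to simple break-even arithmetic. All prices here are illustrative assumptions, not current quotes:

```python
# Rough TCO comparison: API per-token pricing vs. a dedicated GPU.
# Both prices are illustrative assumptions.
api_price_per_1m_tokens = 3.00       # hypothetical commercial API rate, USD
gpu_hourly = 0.50                    # self-hosted GPU running 24/7
gpu_monthly = gpu_hourly * 24 * 30   # $360/month, before engineering time

# Monthly token volume at which the GPU becomes cheaper than the API:
breakeven_tokens = gpu_monthly / api_price_per_1m_tokens * 1_000_000

print(f"GPU: ${gpu_monthly:.0f}/mo; break-even at {breakeven_tokens/1e6:.0f}M tokens/mo")
# → GPU: $360/mo; break-even at 120M tokens/mo
```

Below the break-even volume the API is cheaper even before counting DevOps time, which is why low-volume workloads usually start on commercial APIs.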
Common Issues
**Benchmark scores do not predict real-world performance** — A model with the highest MMLU score can produce worse results on your customer-support task than a model ranked ten positions lower. Benchmarks measure general knowledge, not domain-specific performance. Always create a custom evaluation set that mirrors your production data distribution.

**Open-source model deployment is more complex than expected** — Downloading a model from HuggingFace and running inference requires GPU drivers, CUDA setup, model quantization for memory constraints, and inference server configuration. Use managed platforms (Replicate, Together AI, Anyscale) for initial testing before investing in self-hosted infrastructure.

**Model outputs change after provider updates** — Commercial providers silently update their models (e.g., "GPT-4" may refer to different model versions over time), breaking your carefully tuned prompts. Pin to specific model versions when available (e.g., gpt-4-0613), log the model version in your response metadata, and test against a golden dataset when switching versions.
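The version-pinning advice above can be sketched as a golden-dataset regression check. Here `call_model`, the pinned version string, the sample cases, and the 95% threshold are all hypothetical stand-ins for your own client and data:

```python
# Sketch of a golden-dataset regression check for provider model updates.
# `call_model` is a hypothetical stand-in for your API client; the pinned
# version string mirrors the gpt-4-0613 example above.
PINNED_MODEL = "gpt-4-0613"

GOLDEN = [
    {"prompt": "Classify: 'refund please'", "expected": "billing"},
    {"prompt": "Classify: 'app crashes on login'", "expected": "bug"},
]


def check_golden(call_model, threshold: float = 0.95) -> bool:
    """Return True if the pinned model still matches the golden answers."""
    hits = sum(
        call_model(PINNED_MODEL, case["prompt"]) == case["expected"]
        for case in GOLDEN
    )
    return hits / len(GOLDEN) >= threshold


# Fake client that always answers "billing" — fails the check (1 of 2 correct):
ok = check_golden(lambda model, prompt: "billing")
```

Running this check in CI (and again whenever you change the pinned version) catches silent output drift before it reaches production parsers.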