
Guide Model Navigator

Boost productivity with this model evaluation and benchmarking specialist. Includes structured workflows, validation checks, and reusable patterns for AI specialists.

AgentCliptics · AI specialists · v1.0.0 · MIT


An autonomous agent that helps developers select the right AI model for their use case — comparing capabilities, benchmarks, pricing, and integration requirements across open-source and commercial model ecosystems.

When to Use This Agent

Choose Guide Model Navigator when:

  • You need to select an AI model and are overwhelmed by the choices (100s of models on HuggingFace, multiple commercial APIs)
  • You want objective benchmark comparisons for your specific task type
  • You need to balance quality, cost, latency, and deployment constraints
  • You are evaluating open-source vs commercial models for your use case

Consider alternatives when:

  • You have already selected a model and need deployment help (use an LLM architect agent)
  • You need to train a custom model from scratch (use an ML engineer agent)
  • You are comparing non-AI tools or services

Quick Start

```yaml
# .claude/agents/model-navigator.yml
name: guide-model-navigator
description: Select the optimal AI model for your use case
agent_prompt: |
  You are a Model Navigator. Help users select the right AI model.

  Evaluation process:
  1. Understand the task: classification, generation, embedding, etc.
  2. Define constraints: budget, latency, privacy, deployment env
  3. Shortlist candidates from commercial and open-source options
  4. Compare on relevant benchmarks and real-world performance
  5. Recommend primary + backup models with justification
  6. Provide integration quickstart code

  Always consider: self-hosted options for privacy-sensitive use cases,
  API options for speed-to-market, and emerging models that may be
  underpriced.
```

Example invocation:

```bash
claude "We need an embedding model for semantic search over 2M technical documents. Self-hosted preferred, budget $500/month for GPU."
```

Sample recommendation:

Model Recommendation — Technical Document Embeddings
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Primary: BGE-large-en-v1.5 (BAAI)
  Dimensions: 1024
  MTEB Score: 63.98 (top-5 for retrieval)
  Inference: ~250 docs/sec on A10G
  Memory: 1.3GB (fits single GPU)
  License: MIT (commercial use OK)

Alternative: E5-mistral-7b-instruct
  Dimensions: 4096
  MTEB Score: 66.63 (higher quality)
  Inference: ~80 docs/sec on A10G
  Memory: 14GB (needs larger GPU)
  Consideration: 3x slower, slightly better quality

NOT Recommended:
  OpenAI text-embedding-3-large: $0.13/1M tokens
    At 2M docs ≈ $520 initial + $100/month for updates
    Comparable quality but data leaves your infrastructure

Deployment: BGE-large on single A10G ($0.60/hr spot)
  Initial embedding: 2M docs × 4 chunks avg = 8M chunks
  Time: ~9 hours (one-time), Cost: ~$5.40
  Monthly re-embedding budget: well within $500
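The deployment arithmetic above can be checked directly. A minimal sketch using the throughput, chunk count, and spot price from the recommendation:

```python
# Back-of-envelope check of the BGE-large deployment estimate above.
DOCS = 2_000_000
CHUNKS_PER_DOC = 4        # average, per the recommendation
THROUGHPUT = 250          # chunks/sec on an A10G
SPOT_PRICE = 0.60         # USD per hour, spot

chunks = DOCS * CHUNKS_PER_DOC           # 8,000,000 chunks
hours = chunks / THROUGHPUT / 3600       # roughly 9 hours
cost = hours * SPOT_PRICE                # roughly $5.40

print(f"{chunks:,} chunks, {hours:.1f} h, ${cost:.2f}")
```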

Core Concepts

Model Selection Decision Tree

| Task Type | Budget | Privacy | Recommended Category |
|---|---|---|---|
| Text Generation | < $100/mo | Flexible | Commercial API (Claude, GPT) |
| Text Generation | > $1K/mo | Strict | Self-hosted (Llama, Mistral) |
| Embeddings | Any | Flexible | API (OpenAI, Cohere) |
| Embeddings | Any | Strict | Self-hosted (BGE, E5, GTE) |
| Classification | Low | Any | Fine-tuned small model |
| Image Generation | Any | Flexible | API (DALL-E, Stable Diffusion) |
| Code Generation | Any | Strict | Self-hosted (CodeLlama, DeepSeek) |
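The decision tree can be expressed as a simple lookup. A sketch mirroring the table rows; the function and dictionary names are illustrative, not part of the agent's API:

```python
# Sketch of the decision tree above as a lookup table. "any" in a key
# matches any user value for that field.
RECOMMENDATIONS = {
    ("generation", "low", "flexible"): "Commercial API (Claude, GPT)",
    ("generation", "high", "strict"): "Self-hosted (Llama, Mistral)",
    ("embeddings", "any", "flexible"): "API (OpenAI, Cohere)",
    ("embeddings", "any", "strict"): "Self-hosted (BGE, E5, GTE)",
    ("classification", "low", "any"): "Fine-tuned small model",
    ("image", "any", "flexible"): "API (DALL-E, Stable Diffusion)",
    ("code", "any", "strict"): "Self-hosted (CodeLlama, DeepSeek)",
}

def recommend(task: str, budget: str, privacy: str) -> str:
    for (t, b, p), category in RECOMMENDATIONS.items():
        if t == task and b in (budget, "any") and p in (privacy, "any"):
            return category
    return "No match -- gather more requirements"
```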

Benchmark Comparison Framework

```python
# Model evaluation script
import time

def evaluate_model(model, test_dataset, metrics):
    results = {"quality": {}, "performance": {}, "cost": {}}

    # Quality metrics: run predictions and time the full pass
    predictions = []
    start = time.time()
    for sample in test_dataset:
        pred = model.predict(sample["input"])
        predictions.append(pred)
    total_time = time.time() - start

    # Calculate metrics (helpers assumed to be defined elsewhere)
    results["quality"]["accuracy"] = calculate_accuracy(predictions, test_dataset)
    results["quality"]["f1"] = calculate_f1(predictions, test_dataset)

    # Performance metrics
    results["performance"]["total_time"] = total_time
    results["performance"]["per_sample_ms"] = (total_time / len(test_dataset)) * 1000
    results["performance"]["throughput"] = len(test_dataset) / total_time

    # Cost estimation
    results["cost"]["per_1k_requests"] = model.estimate_cost(1000)
    results["cost"]["monthly_at_10k_daily"] = model.estimate_cost(10000 * 30)

    return results
```
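The script above assumes `calculate_accuracy` and `calculate_f1` helpers. Minimal sketches for a binary-label task, illustrative only:

```python
# Illustrative implementations of the metric helpers assumed by the
# evaluation script, for a binary "positive"/"negative" label task.
def calculate_accuracy(predictions, test_dataset):
    pairs = zip(predictions, test_dataset)
    return sum(p == s["label"] for p, s in pairs) / len(test_dataset)

def calculate_f1(predictions, test_dataset, positive="positive"):
    pairs = list(zip(predictions, test_dataset))
    tp = sum(p == positive and s["label"] == positive for p, s in pairs)
    fp = sum(p == positive and s["label"] != positive for p, s in pairs)
    fn = sum(p != positive and s["label"] == positive for p, s in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```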

Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| taskType | string | "generation" | Task: generation, embedding, classification, code |
| deploymentEnv | string | "cloud" | Environment: cloud, on-premise, edge |
| privacyLevel | string | "standard" | Privacy: standard, strict, air-gapped |
| monthlyBudget | number | 500 | Maximum monthly cost in USD |
| latencyTarget | number | 1000 | Target latency in ms per request |
| includeBenchmarks | boolean | true | Run benchmark comparisons |
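One way to consume these options is a defaults-plus-overrides merge. A sketch with the defaults from the table; the `load_config` helper is illustrative, not part of the agent:

```python
# Defaults mirror the configuration table above; load_config is an
# illustrative sketch of merging user overrides with validation.
DEFAULTS = {
    "taskType": "generation",
    "deploymentEnv": "cloud",
    "privacyLevel": "standard",
    "monthlyBudget": 500,
    "latencyTarget": 1000,
    "includeBenchmarks": True,
}

def load_config(overrides):
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

config = load_config({"taskType": "embedding", "privacyLevel": "strict"})
```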

Best Practices

  1. Test with YOUR data, not just public benchmarks — MMLU and HumanEval scores indicate general capability, but your specific use case may differ significantly. Create a 50-100 sample evaluation set from your actual data and compare models on it. A model that scores 5% lower on benchmarks may score 10% higher on your domain.

  2. Factor in total cost of ownership, not just per-token price — Self-hosted models have no per-token cost but require GPU infrastructure, devops time, and maintenance. A $0.50/hr GPU running 24/7 is $360/month before engineering time. Compare total cost including infrastructure, setup time, and ongoing maintenance.

  3. Always have a fallback model — No single provider guarantees 100% uptime. Select a backup model from a different provider that produces acceptable (not perfect) output. Test the fallback monthly to ensure it still works with your current prompts and output parsers.

  4. Re-evaluate models every 3-6 months — The AI model landscape changes rapidly. A model that was best-in-class 6 months ago may be outperformed by a new release that costs half as much. Schedule quarterly model reviews to ensure you are not overpaying or missing quality improvements.

  5. Start with the smallest model that meets your quality bar — Larger models are not always better for your specific task. A fine-tuned 7B model often outperforms a general-purpose 70B model on domain-specific tasks. Test small models first and only scale up if quality is insufficient.
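The total-cost comparison in point 2 can be made concrete. The GPU price comes from the text; the token volume and API rate below are illustrative assumptions:

```python
# Total-cost-of-ownership sketch (best practice 2): a $0.50/hr GPU
# running 24/7 vs a per-token API. The $3/1M API rate and 50M-token
# monthly volume are illustrative assumptions, not quoted prices.
HOURS_PER_MONTH = 24 * 30

def self_hosted_monthly(gpu_hourly, devops_hours=0, devops_rate=0):
    return gpu_hourly * HOURS_PER_MONTH + devops_hours * devops_rate

def api_monthly(tokens_per_month, price_per_million):
    return tokens_per_month / 1_000_000 * price_per_million

gpu = self_hosted_monthly(0.50)      # $360/month before engineering time
api = api_monthly(50_000_000, 3.00)  # $150/month at this volume
```

At low volume the API wins outright; the self-hosted option only pays off once token spend (plus privacy requirements) outgrows the fixed infrastructure cost.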

Common Issues

Benchmark scores do not predict real-world performance — A model with the highest MMLU score may still produce worse results on your customer support task than a model ranked ten positions lower. Benchmarks measure general knowledge, not domain-specific performance. Always create a custom evaluation set that mirrors your production data distribution.

Open-source model deployment is more complex than expected — Downloading a model from HuggingFace and running inference requires GPU drivers, CUDA setup, model quantization for memory constraints, and inference server configuration. Use managed platforms (Replicate, Together AI, Anyscale) for initial testing before investing in self-hosted infrastructure.

Model outputs change after provider updates — Commercial providers silently update their models (e.g., "GPT-4" may refer to different model versions over time), breaking your carefully tuned prompts. Pin to specific model versions when available (e.g., gpt-4-0613), log model version in your response metadata, and test against a golden dataset when switching versions.
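The pin-log-test pattern above can be sketched in a few lines. `call_model` stands in for whichever client you use; the model ID and golden case are illustrative:

```python
# Sketch of version pinning plus a golden-dataset regression check.
# `call_model` is a placeholder for your provider client; it is assumed
# to return (reply_text, reported_model_version).
PINNED_MODEL = "gpt-4-0613"  # pin an explicit version, never a bare alias

GOLDEN_SET = [
    {"prompt": "Classify: 'refund not received'", "expected": "billing"},
]

def regression_check(call_model):
    failures = []
    for case in GOLDEN_SET:
        reply, version = call_model(PINNED_MODEL, case["prompt"])
        # Flag both wrong answers and silent version drift.
        if reply != case["expected"] or version != PINNED_MODEL:
            failures.append(case["prompt"])
    return failures
```

Run this check on a schedule and before any deliberate version switch; an empty failure list is your signal that prompts and parsers still hold.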
