# Guide LLM Architect

An autonomous agent that designs production-ready LLM systems: selecting models, implementing RAG pipelines, optimizing inference costs, and deploying with safety guardrails, monitoring, and auto-scaling.
## When to Use This Agent
Choose Guide LLM Architect when:
- You are building an application powered by LLMs and need architecture guidance
- You need to choose between fine-tuning, RAG, and prompt engineering approaches
- Inference costs are growing and you need optimization strategies
- You want production deployment with safety filters, monitoring, and fallbacks
Consider alternatives when:
- You need to train a model from scratch (use an ML engineer agent)
- Your focus is purely on prompt crafting (use a prompt engineer agent)
- You need general backend architecture without LLM components
## Quick Start

```yaml
# .claude/agents/llm-architect.yml
name: guide-llm-architect
description: Design and deploy production LLM systems
agent_prompt: |
  You are an LLM Architect. When designing LLM systems:
  1. Analyze requirements: latency, accuracy, cost, scale
  2. Recommend model selection (hosted API vs self-hosted)
  3. Design the retrieval pipeline (RAG, fine-tuning, or hybrid)
  4. Implement safety: content filtering, injection defense, output validation
  5. Set up monitoring: latency, cost, quality, drift
  6. Plan scaling: auto-scale, caching, batching strategies
  Always start with the simplest approach that meets requirements.
  Hosted APIs before self-hosted. RAG before fine-tuning.
```
Example invocation:
```bash
claude "Design an LLM architecture for a customer support chatbot handling 10K conversations/day with our product documentation"
```
Sample architecture output:
```
LLM Architecture - Customer Support Chatbot
===========================================

Recommended Approach: RAG + Hosted API

Model:    Claude 3.5 Sonnet (best quality/cost for support)
Fallback: GPT-4o-mini (for cost reduction on simple queries)

RAG Pipeline:
  Documents -> Chunking (512 tokens, 50 overlap)
            -> Embedding (text-embedding-3-small)
            -> Vector Store (Pinecone, 1536 dims)
            -> Retrieval (top-5 + reranking with Cohere)
            -> Generation (Claude with retrieved context)

Cost Estimate: $850/month at 10K conversations/day
  Embedding:  $15/month   (one-time + incremental)
  Vector DB:  $70/month   (Pinecone starter)
  LLM calls:  $720/month  (avg 2K tokens/conversation)
  Reranking:  $45/month

Architecture:
  +--------+     +-----------+     +------------+
  |  User  | --> |  Gateway  | --> |   Router   |
  +--------+     |  (safety) |     |  (simple/  |
                 +-----------+     |   complex) |
                                   +-----+------+
                                         |
                            +------------+------------+
                            v                         v
                     +------------+            +------------+
                     |  GPT-4o-   |            |  Claude +  |
                     |    mini    |            |    RAG     |
                     |  (simple)  |            |  (complex) |
                     +------------+            +------------+
```
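The router stage in the diagram above can be sketched as a rule-based classifier. The regex patterns, word-count cutoff, and model identifiers below are illustrative assumptions, not fixed recommendations:

```typescript
// Illustrative rule-based router: cheap heuristics decide which model handles
// a query. A small ML classifier could replace the regex rules later.
type Route = { model: string; useRag: boolean };

const SIMPLE_PATTERNS = [
  /\b(hours|location|phone|email|address)\b/i,
  /\b(reset|change) (my )?password\b/i,
];

function routeQuery(query: string): Route {
  const isShort = query.trim().split(/\s+/).length <= 8;
  const matchesSimple = SIMPLE_PATTERNS.some((p) => p.test(query));
  if (matchesSimple && isShort) {
    return { model: "gpt-4o-mini", useRag: false }; // cheap path, no retrieval
  }
  return { model: "claude-3-5-sonnet", useRag: true }; // full RAG path
}
```

Misrouted queries fail safe here: anything the rules do not recognize falls through to the capable model.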
## Core Concepts

### Decision Framework
| Requirement | Use RAG | Use Fine-Tuning | Use Prompt Engineering |
|---|---|---|---|
| Domain knowledge | Company docs, FAQs | Specialized terminology, style | General knowledge |
| Update frequency | Daily/weekly changes | Rarely changes | Static instructions |
| Data volume | 100s-1000s of docs | 1000s of examples | 5-20 examples |
| Latency budget | +200-500ms acceptable | Same as base model | Fastest |
| Cost | Per-query retrieval cost | One-time training cost | No extra cost |
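As a rough illustration, the decision framework above could be codified as a helper function. The thresholds are illustrative defaults, not prescriptive cutoffs:

```typescript
// Hypothetical decision helper codifying the table above: frequently-updated
// or large knowledge bases favor RAG; stable domains with ample labeled data
// favor fine-tuning; everything else starts with prompt engineering.
type Approach = "rag" | "fine-tuning" | "prompt-only";

interface Requirements {
  docCount: number;        // size of the knowledge base
  updatesPerMonth: number; // how often the knowledge changes
  labeledExamples: number; // training examples available
}

function pickApproach(r: Requirements): Approach {
  if (r.docCount >= 100 || r.updatesPerMonth >= 4) return "rag";
  if (r.labeledExamples >= 1000 && r.updatesPerMonth < 1) return "fine-tuning";
  return "prompt-only";
}
```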
### RAG Architecture Patterns

```typescript
// Production RAG pipeline
class RAGPipeline {
  constructor(
    private vectorStore: VectorStore,
    private llm: LLMClient,
    private reranker: Reranker
  ) {}

  async query(userQuery: string, conversationHistory: Message[]): Promise<string> {
    // 1. Query expansion for better retrieval
    const expandedQueries = await this.expandQuery(userQuery, conversationHistory);

    // 2. Retrieve from vector store
    const candidates = await this.vectorStore.search(expandedQueries, { topK: 20 });

    // 3. Rerank for relevance
    const reranked = await this.reranker.rerank(userQuery, candidates, { topK: 5 });

    // 4. Build context with source attribution
    const context = this.buildContext(reranked);

    // 5. Generate with retrieved context
    const response = await this.llm.generate({
      system: SYSTEM_PROMPT,
      messages: [
        ...conversationHistory,
        { role: "user", content: `Context:\n${context}\n\nQuestion: ${userQuery}` }
      ],
      temperature: 0.3 // Lower for factual accuracy
    });

    return response;
  }

  private buildContext(documents: Document[]): string {
    return documents
      .map((doc, i) => `[Source ${i + 1}: ${doc.title}]\n${doc.content}`)
      .join('\n\n');
  }
}
```
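The chunking step from the pipeline spec (512 tokens, 50-token overlap) can be sketched with a sliding window. Tokens are approximated by whitespace-separated words here; a real pipeline would use the embedding model's tokenizer:

```typescript
// Sliding-window chunker: each chunk shares `overlap` words with the previous
// one so that sentences cut at a boundary remain retrievable from both sides.
function chunkDocument(text: string, chunkSize = 512, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // final chunk reached the end
  }
  return chunks;
}
```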
### Cost Optimization Strategies

| Technique | Savings | Trade-off |
|---|---|---|
| Query routing (simple/complex model split) | 40-60% | Routing accuracy |
| Semantic caching | 20-40% | Cache misses |
| Prompt compression | 15-25% | Minor quality loss |
| Batch processing | 10-30% | Latency |
| Output length limits | 10-20% | Truncation |
| 4-bit quantization (self-hosted) | 50-75% | Slight quality reduction |
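The savings from query routing can be sanity-checked with a back-of-envelope cost model. The per-1K-token prices and the simple/complex split below are placeholder assumptions, not vendor quotes:

```typescript
// Blended monthly cost under query routing: a fraction of traffic goes to a
// cheap model, the rest to a capable one. All figures are illustrative.
interface CostModel {
  conversationsPerDay: number;
  tokensPerConversation: number;
  simpleShare: number;       // fraction routed to the cheap model (0..1)
  cheapPricePer1K: number;   // USD per 1K tokens
  capablePricePer1K: number; // USD per 1K tokens
}

function monthlyLlmCost(m: CostModel): number {
  const monthlyTokens = m.conversationsPerDay * 30 * m.tokensPerConversation;
  const blendedPrice =
    m.simpleShare * m.cheapPricePer1K +
    (1 - m.simpleShare) * m.capablePricePer1K;
  return (monthlyTokens / 1000) * blendedPrice;
}
```

With illustrative prices of $0.0006 vs $0.003 per 1K tokens and half the traffic routed cheap, the monthly bill for 10K conversations/day at 2K tokens each drops from $1,800 to $1,080, in line with the 40-60% range in the table.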
## Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| `approach` | string | `"rag"` | Architecture: `rag`, `fine-tuning`, `prompt-only`, `hybrid` |
| `primaryModel` | string | `"claude-sonnet"` | Primary LLM for generation |
| `fallbackModel` | string | `"gpt-4o-mini"` | Cost-effective fallback |
| `vectorStore` | string | `"pinecone"` | Vector DB: `pinecone`, `qdrant`, `weaviate`, `pgvector` |
| `latencyTarget` | number | `2000` | Max response time in ms |
| `monthlyBudget` | number | `1000` | Monthly cost ceiling in USD |
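A typed configuration object mirroring the table above might look like this; the field names and defaults follow the table, and the merge-with-defaults loader is an illustrative pattern:

```typescript
// Configuration type and defaults matching the option table above.
interface ArchitectConfig {
  approach: "rag" | "fine-tuning" | "prompt-only" | "hybrid";
  primaryModel: string;
  fallbackModel: string;
  vectorStore: "pinecone" | "qdrant" | "weaviate" | "pgvector";
  latencyTarget: number; // ms
  monthlyBudget: number; // USD
}

const DEFAULTS: ArchitectConfig = {
  approach: "rag",
  primaryModel: "claude-sonnet",
  fallbackModel: "gpt-4o-mini",
  vectorStore: "pinecone",
  latencyTarget: 2000,
  monthlyBudget: 1000,
};

// Shallow-merge user overrides onto the defaults.
function loadConfig(overrides: Partial<ArchitectConfig> = {}): ArchitectConfig {
  return { ...DEFAULTS, ...overrides };
}
```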
## Best Practices

- **Start with hosted APIs and move to self-hosted only when needed.** Self-hosted models (vLLM, TGI) save money at scale but require GPU infrastructure expertise. Start with Claude/GPT APIs to validate your product, then migrate to self-hosted when monthly API costs exceed $5K-10K and latency requirements are tight.
- **Implement semantic caching for repeated queries.** In support chatbots, 30-40% of queries are semantically similar. Cache embeddings of previous queries and their responses. When a new query has >0.95 cosine similarity to a cached query, return the cached response without calling the LLM.
- **Route queries by complexity to different models.** Use a small classifier (or even regex rules) to route simple queries ("What are your hours?") to a cheap model (GPT-4o-mini) and complex queries ("Compare pricing plans for my use case") to a capable model (Claude Sonnet). This cuts costs 40-60% with minimal quality loss.
- **Never trust LLM output without validation.** Implement output validators for structured responses (JSON schema validation), factual claims (cross-reference with retrieved documents), and safety (content filter for harmful/off-topic content). Log all validation failures for continuous improvement.
- **Monitor cost per conversation, not just per token.** A single conversation may span 10-20 LLM calls with growing context windows. Track the full cost per conversation to identify expensive patterns (long conversations, excessive tool calls, context window bloat) and set per-conversation cost limits.
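The semantic-caching practice above can be sketched as a linear scan over stored embeddings. The embedding function is assumed to live elsewhere; the 0.95 threshold follows the practice described and is tunable:

```typescript
// Minimal semantic cache: store (embedding, response) pairs and serve a cached
// response when a new query's embedding is close enough. A production version
// would use an approximate-nearest-neighbor index instead of a linear scan.
interface CacheEntry { embedding: number[]; response: string }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.95) {}

  lookup(queryEmbedding: number[]): string | null {
    for (const e of this.entries) {
      if (cosine(queryEmbedding, e.embedding) >= this.threshold) {
        return e.response; // cache hit: skip the LLM call entirely
      }
    }
    return null; // cache miss: call the LLM, then store() the result
  }

  store(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }
}
```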
## Common Issues
**RAG retrieves irrelevant documents, causing hallucination.** The vector search returns documents that are semantically similar but topically irrelevant, and the LLM incorporates incorrect information. Add a reranking step (Cohere Rerank, cross-encoder) after vector search to filter by actual relevance. Set a minimum relevance threshold and return "I don't have information about that" when no documents pass the threshold.
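A minimal sketch of that threshold-and-fallback step, assuming the reranker returns a relevance score per document (the score scale and cutoff are illustrative):

```typescript
// Filter reranked documents by a minimum relevance score; return null when
// nothing passes so the caller can answer "I don't have information about
// that" instead of generating from bad context.
interface Scored { content: string; relevance: number }

function answerableContext(reranked: Scored[], minRelevance = 0.5): string | null {
  const passing = reranked.filter((d) => d.relevance >= minRelevance);
  return passing.length > 0
    ? passing.map((d) => d.content).join("\n\n")
    : null;
}
```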
**Context window fills up in multi-turn conversations.** Long conversations exceed the model's context window, causing truncation of early messages or errors. Implement conversation summarization: after every 10 turns, summarize the conversation history into a condensed format and replace the full history with the summary plus the last 3 turns.
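The summarize-and-keep-recent policy can be sketched as follows; the summarizer is passed in as a function because in practice it would be an LLM call:

```typescript
// Replace older history with a summary once the conversation grows past
// `everyTurns` turns, keeping the last `keepRecent` turns verbatim.
interface Msg { role: "user" | "assistant" | "system"; content: string }

function compactHistory(
  history: Msg[],
  summarize: (msgs: Msg[]) => string,
  everyTurns = 10,
  keepRecent = 3
): Msg[] {
  // One turn = a user message plus an assistant reply.
  if (history.length < everyTurns * 2) return history;
  const recent = history.slice(-keepRecent * 2);
  const older = history.slice(0, -keepRecent * 2);
  return [
    { role: "system", content: `Conversation summary: ${summarize(older)}` },
    ...recent,
  ];
}
```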
**Inference latency spikes during traffic peaks.** API rate limits or self-hosted GPU contention causes response times to jump from 2s to 15s during peak hours. Implement request queuing with timeout handling, show users a "thinking" indicator, and use streaming responses so users see output appearing immediately rather than waiting for the complete response.
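Timeout handling around a slow LLM call can be sketched as a race against a timer; the fallback value is a placeholder for whatever the UI should show (a retry prompt, a queued-request notice, etc.):

```typescript
// Race the LLM call against a timer; whichever settles first wins. The
// fallback keeps the UI responsive when the backend is saturated.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  const timer = new Promise<T>((resolve) => setTimeout(() => resolve(fallback), ms));
  return Promise.race([work, timer]);
}
```

Note that the underlying request keeps running after the timeout fires; a fuller version would also cancel it (e.g. via `AbortController`) to avoid wasted spend.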