
By AgentCliptics · AI specialists · v1.0.0 · MIT License

Guide LLM Architect

An autonomous agent that designs production-ready LLM systems — selecting models, implementing RAG pipelines, optimizing inference costs, and deploying with safety guardrails, monitoring, and auto-scaling.

When to Use This Agent

Choose Guide LLM Architect when:

  • You are building an application powered by LLMs and need architecture guidance
  • You need to choose between fine-tuning, RAG, and prompt engineering approaches
  • Inference costs are growing and you need optimization strategies
  • You want production deployment with safety filters, monitoring, and fallbacks

Consider alternatives when:

  • You need to train a model from scratch (use an ML engineer agent)
  • Your focus is purely on prompt crafting (use a prompt engineer agent)
  • You need general backend architecture without LLM components

Quick Start

```yaml
# .claude/agents/llm-architect.yml
name: guide-llm-architect
description: Design and deploy production LLM systems
agent_prompt: |
  You are an LLM Architect. When designing LLM systems:
  1. Analyze requirements: latency, accuracy, cost, scale
  2. Recommend model selection (hosted API vs self-hosted)
  3. Design the retrieval pipeline (RAG, fine-tuning, or hybrid)
  4. Implement safety: content filtering, injection defense, output validation
  5. Set up monitoring: latency, cost, quality, drift
  6. Plan scaling: auto-scale, caching, batching strategies
  Always start with the simplest approach that meets requirements.
  Hosted APIs before self-hosted. RAG before fine-tuning.
```

Example invocation:

claude "Design an LLM architecture for a customer support chatbot handling 10K conversations/day with our product documentation"

Sample architecture output:

LLM Architecture — Customer Support Chatbot
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Recommended Approach: RAG + Hosted API

Model: Claude 3.5 Sonnet (best quality/cost for support)
  Fallback: GPT-4o-mini (for cost reduction on simple queries)

RAG Pipeline:
  Documents → Chunking (512 tokens, 50 overlap)
    → Embedding (text-embedding-3-small)
    → Vector Store (Pinecone, 1536 dims)
    → Retrieval (top-5 + reranking with Cohere)
    → Generation (Claude with retrieved context)

Cost Estimate: $850/month at 10K conversations/day
  Embedding: $15/month (one-time + incremental)
  Vector DB: $70/month (Pinecone starter)
  LLM calls: $720/month (avg 2K tokens/conversation)
  Reranking: $45/month

Architecture:
  ┌─────────┐    ┌──────────┐    ┌───────────┐
  │  User   │───→│ Gateway  │───→│ Router    │
  └─────────┘    │ (safety) │    │ (simple/  │
                 └──────────┘    │  complex) │
                                 └─────┬─────┘
                           ┌───────────┴──────────┐
                     ┌─────▼─────┐          ┌─────▼─────┐
                     │ GPT-4o-   │          │ Claude +  │
                     │ mini      │          │ RAG       │
                     │ (simple)  │          │ (complex) │
                     └───────────┘          └───────────┘

Core Concepts

Decision Framework

| Requirement | Use RAG | Use Fine-Tuning | Use Prompt Engineering |
|---|---|---|---|
| Domain knowledge | Company docs, FAQs | Specialized terminology, style | General knowledge |
| Update frequency | Daily/weekly changes | Rarely changes | Static instructions |
| Data volume | 100s-1000s of docs | 1000s of examples | 5-20 examples |
| Latency budget | +200-500ms acceptable | Same as base model | Fastest |
| Cost | Per-query retrieval cost | One-time training cost | No extra cost |
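As a rough illustration, the decision framework above can be encoded as a small helper. The field names and thresholds below are illustrative assumptions (drawn loosely from the table), not part of this template:

```typescript
// Maps rough project requirements to an architecture choice, mirroring
// the decision table: frequently-changing document knowledge -> RAG,
// stable style/terminology with ample labeled data -> fine-tuning,
// everything else -> prompt engineering.
interface Requirements {
  knowledgeChangesOften: boolean; // daily/weekly doc updates
  labeledExamples: number;        // training examples available
  docCount: number;               // retrievable documents
}

function chooseApproach(req: Requirements): "rag" | "fine-tuning" | "prompt-only" {
  if (req.knowledgeChangesOften && req.docCount >= 100) return "rag";
  if (!req.knowledgeChangesOften && req.labeledExamples >= 1000) return "fine-tuning";
  return "prompt-only";
}
```

A real decision also weighs latency and cost budgets from the table; this sketch captures only the knowledge/data axes.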

RAG Architecture Patterns

```typescript
// Production RAG pipeline
class RAGPipeline {
  constructor(
    private vectorStore: VectorStore,
    private llm: LLMClient,
    private reranker: Reranker
  ) {}

  async query(userQuery: string, conversationHistory: Message[]): Promise<string> {
    // 1. Query expansion for better retrieval
    const expandedQueries = await this.expandQuery(userQuery, conversationHistory);

    // 2. Retrieve from vector store
    const candidates = await this.vectorStore.search(expandedQueries, { topK: 20 });

    // 3. Rerank for relevance
    const reranked = await this.reranker.rerank(userQuery, candidates, { topK: 5 });

    // 4. Build context with source attribution
    const context = this.buildContext(reranked);

    // 5. Generate with retrieved context
    const response = await this.llm.generate({
      system: SYSTEM_PROMPT,
      messages: [
        ...conversationHistory,
        { role: "user", content: `Context:\n${context}\n\nQuestion: ${userQuery}` }
      ],
      temperature: 0.3 // Lower for factual accuracy
    });

    return response;
  }

  private buildContext(documents: Document[]): string {
    return documents
      .map((doc, i) => `[Source ${i + 1}: ${doc.title}]\n${doc.content}`)
      .join('\n\n');
  }
}
```

Cost Optimization Strategies

Cost Reduction Techniques:
┌────────────────────────┬─────────────┬──────────────┐
│ Technique              │ Savings     │ Trade-off    │
├────────────────────────┼─────────────┼──────────────┤
│ Query routing (simple/ │ 40-60%      │ Routing      │
│ complex model split)   │             │ accuracy     │
│ Semantic caching       │ 20-40%      │ Cache misses │
│ Prompt compression     │ 15-25%      │ Minor quality│
│ Batch processing       │ 10-30%      │ Latency      │
│ Output length limits   │ 10-20%      │ Truncation   │
│ 4-bit quantization     │ 50-75%      │ Slight qual. │
│ (self-hosted)          │             │ reduction    │
└────────────────────────┴─────────────┴──────────────┘
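The first technique in the table, query routing, can be sketched as a rule-based classifier. The keyword patterns, word-count cutoff, and model names below are placeholder assumptions; a production router might use a small trained classifier instead:

```typescript
// Minimal rule-based router: send short, FAQ-like queries to a cheap
// model and everything else to the capable model with RAG attached.
type Route = { model: string; useRag: boolean };

const SIMPLE_PATTERNS = [
  /\b(hours|open|closed)\b/i,
  /\b(phone|email|contact)\b/i,
  /\b(where|location|address)\b/i,
];

function routeQuery(query: string): Route {
  const isShort = query.split(/\s+/).length <= 12;
  const matchesFaq = SIMPLE_PATTERNS.some((p) => p.test(query));
  if (isShort && matchesFaq) {
    return { model: "gpt-4o-mini", useRag: false }; // cheap path
  }
  return { model: "claude-sonnet", useRag: true }; // capable path + RAG
}
```

For example, "What are your hours?" takes the cheap path, while a multi-clause pricing comparison falls through to the capable model. Logging routing decisions lets you audit the "routing accuracy" trade-off noted in the table.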

Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| approach | string | "rag" | Architecture: rag, fine-tuning, prompt-only, hybrid |
| primaryModel | string | "claude-sonnet" | Primary LLM for generation |
| fallbackModel | string | "gpt-4o-mini" | Cost-effective fallback |
| vectorStore | string | "pinecone" | Vector DB: pinecone, qdrant, weaviate, pgvector |
| latencyTarget | number | 2000 | Max response time in ms |
| monthlyBudget | number | 1000 | Monthly cost ceiling in USD |

Best Practices

  1. Start with hosted APIs and move to self-hosted only when needed — Self-hosted models (vLLM, TGI) save money at scale but require GPU infrastructure expertise. Start with Claude/GPT APIs to validate your product, then migrate to self-hosted when monthly API costs exceed $5K-10K and latency requirements are tight.

  2. Implement semantic caching for repeated queries — In support chatbots, 30-40% of queries are semantically similar. Cache embeddings of previous queries and their responses. When a new query has >0.95 cosine similarity to a cached query, return the cached response without calling the LLM.

  3. Route queries by complexity to different models — Use a small classifier (or even regex rules) to route simple queries ("What are your hours?") to a cheap model (GPT-4o-mini) and complex queries ("Compare pricing plans for my use case") to a capable model (Claude Sonnet). This cuts costs 40-60% with minimal quality loss.

  4. Never trust LLM output without validation — Implement output validators for structured responses (JSON schema validation), factual claims (cross-reference with retrieved documents), and safety (content filter for harmful/off-topic content). Log all validation failures for continuous improvement.

  5. Monitor cost per conversation, not just per token — A single conversation may span 10-20 LLM calls with growing context windows. Track the full cost per conversation to identify expensive patterns (long conversations, excessive tool calls, context window bloat) and set per-conversation cost limits.
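To make the semantic-caching practice concrete, here is a minimal sketch. It assumes query embeddings are computed elsewhere (the embedding API call is omitted) and uses the 0.95 cosine-similarity threshold mentioned above:

```typescript
// Semantic cache sketch: store (embedding, response) pairs and return a
// cached response when a new query's embedding is close enough to a
// previous one, skipping the LLM call entirely.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  private entries: { embedding: number[]; response: string }[] = [];
  constructor(private threshold = 0.95) {}

  // Linear scan for clarity; a production cache would use the vector
  // store's nearest-neighbor search instead.
  lookup(embedding: number[]): string | null {
    for (const e of this.entries) {
      if (cosineSimilarity(embedding, e.embedding) >= this.threshold) {
        return e.response; // cache hit
      }
    }
    return null; // cache miss: call the LLM, then store()
  }

  store(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }
}
```

Remember to invalidate or expire cached entries when the underlying documentation changes, or the cache will serve stale answers.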

Common Issues

RAG retrieves irrelevant documents, causing hallucination — The vector search returns documents that are semantically similar but topically irrelevant, and the LLM incorporates incorrect information. Add a reranking step (Cohere Rerank, cross-encoder) after vector search to filter by actual relevance. Set a minimum relevance threshold and return "I don't have information about that" when no documents pass the threshold.
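A minimal sketch of that threshold check, assuming reranker scores normalized to a 0-1 range and an illustrative 0.5 cutoff (tune both against your own reranker's score distribution):

```typescript
// Filter reranked documents by a minimum relevance score. A null return
// signals the caller to answer with NO_INFO_REPLY instead of generating
// from weak context.
interface ScoredDoc { title: string; content: string; score: number }

const NO_INFO_REPLY = "I don't have information about that.";

function selectContext(reranked: ScoredDoc[], minScore = 0.5): ScoredDoc[] | null {
  const relevant = reranked.filter((d) => d.score >= minScore);
  return relevant.length > 0 ? relevant : null;
}
```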

Context window fills up in multi-turn conversations — Long conversations exceed the model's context window, causing truncation of early messages or errors. Implement conversation summarization: after every 10 turns, summarize the conversation history into a condensed format and replace the full history with the summary plus the last 3 turns.
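The summarization strategy above can be sketched as follows; summarize() stands in for an LLM call (it would be async in production), and the constant names are illustrative:

```typescript
// Summarization-based history compaction: once the history reaches
// SUMMARIZE_EVERY turns, everything except the last KEEP_RECENT turns
// is collapsed into a single system message.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string }

const SUMMARIZE_EVERY = 10;
const KEEP_RECENT = 3;

function compactHistory(
  history: ChatMessage[],
  summarize: (msgs: ChatMessage[]) => string
): ChatMessage[] {
  if (history.length < SUMMARIZE_EVERY) return history; // nothing to do yet
  const older = history.slice(0, history.length - KEEP_RECENT);
  const recent = history.slice(-KEEP_RECENT);
  return [
    { role: "system", content: `Summary of earlier conversation: ${summarize(older)}` },
    ...recent,
  ];
}
```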

Inference latency spikes during traffic peaks — API rate limits or self-hosted GPU contention causes response times to jump from 2s to 15s during peak hours. Implement request queuing with timeout handling, show users a "thinking" indicator, and use streaming responses so users see output appearing immediately rather than waiting for the complete response.
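One building block for the queuing-with-timeout approach is a generic timeout wrapper around the LLM call, so slow requests fail fast and the caller can queue, retry, or fall back. This is a sketch; the timeout value is chosen by the caller:

```typescript
// Reject a promise that takes longer than `ms` milliseconds, so the
// caller can fall back instead of hanging during a latency spike.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`LLM call exceeded ${ms}ms`)),
      ms
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}
```

Streaming is complementary: the timeout should then apply to time-to-first-token rather than to the full response.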
