# Guide LLM Architect

An autonomous agent that designs production-ready LLM systems: selecting models, implementing RAG pipelines, optimizing inference costs, and deploying with safety guardrails, monitoring, and auto-scaling.
## When to Use This Agent
Choose Guide LLM Architect when:
- You are building an application powered by LLMs and need architecture guidance
- You need to choose between fine-tuning, RAG, and prompt engineering approaches
- Inference costs are growing and you need optimization strategies
- You want production deployment with safety filters, monitoring, and fallbacks
Consider alternatives when:
- You need to train a model from scratch (use an ML engineer agent)
- Your focus is purely on prompt crafting (use a prompt engineer agent)
- You need general backend architecture without LLM components
## Quick Start

```yaml
# .claude/agents/llm-architect.yml
name: guide-llm-architect
description: Design and deploy production LLM systems
agent_prompt: |
  You are an LLM Architect. When designing LLM systems:
  1. Analyze requirements: latency, accuracy, cost, scale
  2. Recommend model selection (hosted API vs self-hosted)
  3. Design the retrieval pipeline (RAG, fine-tuning, or hybrid)
  4. Implement safety: content filtering, injection defense, output validation
  5. Set up monitoring: latency, cost, quality, drift
  6. Plan scaling: auto-scale, caching, batching strategies
  Always start with the simplest approach that meets requirements.
  Hosted APIs before self-hosted. RAG before fine-tuning.
```
Example invocation:
```bash
claude "Design an LLM architecture for a customer support chatbot handling 10K conversations/day with our product documentation"
```
Sample architecture output:
```
LLM Architecture - Customer Support Chatbot
===========================================

Recommended Approach: RAG + Hosted API

Model:    Claude 3.5 Sonnet (best quality/cost for support)
Fallback: GPT-4o-mini (for cost reduction on simple queries)

RAG Pipeline:
  Documents -> Chunking (512 tokens, 50 overlap)
            -> Embedding (text-embedding-3-small)
            -> Vector Store (Pinecone, 1536 dims)
            -> Retrieval (top-5 + reranking with Cohere)
            -> Generation (Claude with retrieved context)

Cost Estimate: $850/month at 10K conversations/day
  Embedding:  $15/month   (one-time + incremental)
  Vector DB:  $70/month   (Pinecone starter)
  LLM calls:  $720/month  (avg 2K tokens/conversation)
  Reranking:  $45/month

Architecture:
  +--------+     +-----------+     +------------+
  |  User  | --> |  Gateway  | --> |   Router   |
  +--------+     |  (safety) |     |  (simple/  |
                 +-----------+     |   complex) |
                                   +-----+------+
                                         |
                            +------------+------------+
                            v                         v
                     +------------+            +------------+
                     |  GPT-4o-   |            |  Claude +  |
                     |    mini    |            |    RAG     |
                     |  (simple)  |            |  (complex) |
                     +------------+            +------------+
```
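The router stage in the diagram above can be sketched as a rule-based classifier. The regex patterns, word-count cutoff, and model identifiers below are illustrative assumptions, not fixed recommendations:

```typescript
// Illustrative rule-based router: cheap heuristics decide which model handles
// a query. A small ML classifier could replace the regex rules later.
type Route = { model: string; useRag: boolean };

const SIMPLE_PATTERNS = [
  /\b(hours|location|phone|email|address)\b/i,
  /\b(reset|change) (my )?password\b/i,
];

function routeQuery(query: string): Route {
  const isShort = query.trim().split(/\s+/).length <= 8;
  const matchesSimple = SIMPLE_PATTERNS.some((p) => p.test(query));
  if (matchesSimple && isShort) {
    return { model: "gpt-4o-mini", useRag: false }; // cheap path, no retrieval
  }
  return { model: "claude-3-5-sonnet", useRag: true }; // full RAG path
}
```

Misrouted queries fail safe here: anything the rules do not recognize falls through to the capable model.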
## Core Concepts

### Decision Framework
| Requirement | Use RAG | Use Fine-Tuning | Use Prompt Engineering |
|---|---|---|---|
| Domain knowledge | Company docs, FAQs | Specialized terminology, style | General knowledge |
| Update frequency | Daily/weekly changes | Rarely changes | Static instructions |
| Data volume | 100s-1000s of docs | 1000s of examples | 5-20 examples |
| Latency budget | +200-500ms acceptable | Same as base model | Fastest |
| Cost | Per-query retrieval cost | One-time training cost | No extra cost |
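As a rough illustration, the decision framework above could be codified as a helper function. The thresholds are illustrative defaults, not prescriptive cutoffs:

```typescript
// Hypothetical decision helper codifying the table above: frequently-updated
// or large knowledge bases favor RAG; stable domains with ample labeled data
// favor fine-tuning; everything else starts with prompt engineering.
type Approach = "rag" | "fine-tuning" | "prompt-only";

interface Requirements {
  docCount: number;        // size of the knowledge base
  updatesPerMonth: number; // how often the knowledge changes
  labeledExamples: number; // training examples available
}

function pickApproach(r: Requirements): Approach {
  if (r.docCount >= 100 || r.updatesPerMonth >= 4) return "rag";
  if (r.labeledExamples >= 1000 && r.updatesPerMonth < 1) return "fine-tuning";
  return "prompt-only";
}
```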
### RAG Architecture Patterns

```typescript
// Production RAG pipeline
class RAGPipeline {
  constructor(
    private vectorStore: VectorStore,
    private llm: LLMClient,
    private reranker: Reranker
  ) {}

  async query(userQuery: string, conversationHistory: Message[]): Promise<string> {
    // 1. Query expansion for better retrieval
    const expandedQueries = await this.expandQuery(userQuery, conversationHistory);

    // 2. Retrieve from vector store
    const candidates = await this.vectorStore.search(expandedQueries, { topK: 20 });

    // 3. Rerank for relevance
    const reranked = await this.reranker.rerank(userQuery, candidates, { topK: 5 });

    // 4. Build context with source attribution
    const context = this.buildContext(reranked);

    // 5. Generate with retrieved context
    const response = await this.llm.generate({
      system: SYSTEM_PROMPT,
      messages: [
        ...conversationHistory,
        { role: "user", content: `Context:\n${context}\n\nQuestion: ${userQuery}` }
      ],
      temperature: 0.3 // Lower for factual accuracy
    });

    return response;
  }

  private buildContext(documents: Document[]): string {
    return documents
      .map((doc, i) => `[Source ${i + 1}: ${doc.title}]\n${doc.content}`)
      .join('\n\n');
  }
}
```
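The chunking step from the pipeline spec (512 tokens, 50-token overlap) can be sketched with a sliding window. Tokens are approximated by whitespace-separated words here; a real pipeline would use the embedding model's tokenizer:

```typescript
// Sliding-window chunker: each chunk shares `overlap` words with the previous
// one so that sentences cut at a boundary remain retrievable from both sides.
function chunkDocument(text: string, chunkSize = 512, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // final chunk reached the end
  }
  return chunks;
}
```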
### Cost Optimization Strategies

| Technique | Savings | Trade-off |
|---|---|---|
| Query routing (simple/complex model split) | 40-60% | Routing accuracy |
| Semantic caching | 20-40% | Cache misses |
| Prompt compression | 15-25% | Minor quality loss |
| Batch processing | 10-30% | Latency |
| Output length limits | 10-20% | Truncation |
| 4-bit quantization (self-hosted) | 50-75% | Slight quality reduction |
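The savings from query routing can be sanity-checked with a back-of-envelope cost model. The per-1K-token prices and the simple/complex split below are placeholder assumptions, not vendor quotes:

```typescript
// Blended monthly cost under query routing: a fraction of traffic goes to a
// cheap model, the rest to a capable one. All figures are illustrative.
interface CostModel {
  conversationsPerDay: number;
  tokensPerConversation: number;
  simpleShare: number;       // fraction routed to the cheap model (0..1)
  cheapPricePer1K: number;   // USD per 1K tokens
  capablePricePer1K: number; // USD per 1K tokens
}

function monthlyLlmCost(m: CostModel): number {
  const monthlyTokens = m.conversationsPerDay * 30 * m.tokensPerConversation;
  const blendedPrice =
    m.simpleShare * m.cheapPricePer1K +
    (1 - m.simpleShare) * m.capablePricePer1K;
  return (monthlyTokens / 1000) * blendedPrice;
}
```

With illustrative prices of $0.0006 vs $0.003 per 1K tokens and half the traffic routed cheap, the monthly bill for 10K conversations/day at 2K tokens each drops from $1,800 to $1,080, in line with the 40-60% range in the table.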
## Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| `approach` | string | `"rag"` | Architecture: `rag`, `fine-tuning`, `prompt-only`, `hybrid` |
| `primaryModel` | string | `"claude-sonnet"` | Primary LLM for generation |
| `fallbackModel` | string | `"gpt-4o-mini"` | Cost-effective fallback |
| `vectorStore` | string | `"pinecone"` | Vector DB: `pinecone`, `qdrant`, `weaviate`, `pgvector` |
| `latencyTarget` | number | `2000` | Max response time in ms |
| `monthlyBudget` | number | `1000` | Monthly cost ceiling in USD |
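A typed configuration object mirroring the table above might look like this; the field names and defaults follow the table, and the merge-with-defaults loader is an illustrative pattern:

```typescript
// Configuration type and defaults matching the option table above.
interface ArchitectConfig {
  approach: "rag" | "fine-tuning" | "prompt-only" | "hybrid";
  primaryModel: string;
  fallbackModel: string;
  vectorStore: "pinecone" | "qdrant" | "weaviate" | "pgvector";
  latencyTarget: number; // ms
  monthlyBudget: number; // USD
}

const DEFAULTS: ArchitectConfig = {
  approach: "rag",
  primaryModel: "claude-sonnet",
  fallbackModel: "gpt-4o-mini",
  vectorStore: "pinecone",
  latencyTarget: 2000,
  monthlyBudget: 1000,
};

// Shallow-merge user overrides onto the defaults.
function loadConfig(overrides: Partial<ArchitectConfig> = {}): ArchitectConfig {
  return { ...DEFAULTS, ...overrides };
}
```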
## Best Practices

- **Start with hosted APIs and move to self-hosted only when needed.** Self-hosted models (vLLM, TGI) save money at scale but require GPU infrastructure expertise. Start with Claude/GPT APIs to validate your product, then migrate to self-hosted when monthly API costs exceed $5K-10K and latency requirements are tight.
- **Implement semantic caching for repeated queries.** In support chatbots, 30-40% of queries are semantically similar. Cache embeddings of previous queries and their responses. When a new query has >0.95 cosine similarity to a cached query, return the cached response without calling the LLM.
- **Route queries by complexity to different models.** Use a small classifier (or even regex rules) to route simple queries ("What are your hours?") to a cheap model (GPT-4o-mini) and complex queries ("Compare pricing plans for my use case") to a capable model (Claude Sonnet). This cuts costs 40-60% with minimal quality loss.
- **Never trust LLM output without validation.** Implement output validators for structured responses (JSON schema validation), factual claims (cross-reference with retrieved documents), and safety (content filter for harmful/off-topic content). Log all validation failures for continuous improvement.
- **Monitor cost per conversation, not just per token.** A single conversation may span 10-20 LLM calls with growing context windows. Track the full cost per conversation to identify expensive patterns (long conversations, excessive tool calls, context window bloat) and set per-conversation cost limits.
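The semantic-caching practice above can be sketched as a linear scan over stored embeddings. The embedding function is assumed to live elsewhere; the 0.95 threshold follows the practice described and is tunable:

```typescript
// Minimal semantic cache: store (embedding, response) pairs and serve a cached
// response when a new query's embedding is close enough. A production version
// would use an approximate-nearest-neighbor index instead of a linear scan.
interface CacheEntry { embedding: number[]; response: string }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.95) {}

  lookup(queryEmbedding: number[]): string | null {
    for (const e of this.entries) {
      if (cosine(queryEmbedding, e.embedding) >= this.threshold) {
        return e.response; // cache hit: skip the LLM call entirely
      }
    }
    return null; // cache miss: call the LLM, then store() the result
  }

  store(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }
}
```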
## Common Issues
**RAG retrieves irrelevant documents, causing hallucination.** The vector search returns documents that are semantically similar but topically irrelevant, and the LLM incorporates incorrect information. Add a reranking step (Cohere Rerank, cross-encoder) after vector search to filter by actual relevance. Set a minimum relevance threshold and return "I don't have information about that" when no documents pass the threshold.
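A minimal sketch of that threshold-and-fallback step, assuming the reranker returns a relevance score per document (the score scale and cutoff are illustrative):

```typescript
// Filter reranked documents by a minimum relevance score; return null when
// nothing passes so the caller can answer "I don't have information about
// that" instead of generating from bad context.
interface Scored { content: string; relevance: number }

function answerableContext(reranked: Scored[], minRelevance = 0.5): string | null {
  const passing = reranked.filter((d) => d.relevance >= minRelevance);
  return passing.length > 0
    ? passing.map((d) => d.content).join("\n\n")
    : null;
}
```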
**Context window fills up in multi-turn conversations.** Long conversations exceed the model's context window, causing truncation of early messages or errors. Implement conversation summarization: after every 10 turns, summarize the conversation history into a condensed format and replace the full history with the summary plus the last 3 turns.
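The summarize-and-keep-recent policy can be sketched as follows; the summarizer is passed in as a function because in practice it would be an LLM call:

```typescript
// Replace older history with a summary once the conversation grows past
// `everyTurns` turns, keeping the last `keepRecent` turns verbatim.
interface Msg { role: "user" | "assistant" | "system"; content: string }

function compactHistory(
  history: Msg[],
  summarize: (msgs: Msg[]) => string,
  everyTurns = 10,
  keepRecent = 3
): Msg[] {
  // One turn = a user message plus an assistant reply.
  if (history.length < everyTurns * 2) return history;
  const recent = history.slice(-keepRecent * 2);
  const older = history.slice(0, -keepRecent * 2);
  return [
    { role: "system", content: `Conversation summary: ${summarize(older)}` },
    ...recent,
  ];
}
```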
**Inference latency spikes during traffic peaks.** API rate limits or self-hosted GPU contention causes response times to jump from 2s to 15s during peak hours. Implement request queuing with timeout handling, show users a "thinking" indicator, and use streaming responses so users see output appearing immediately rather than waiting for the complete response.
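Timeout handling around a slow LLM call can be sketched as a race against a timer; the fallback value is a placeholder for whatever the UI should show (a retry prompt, a queued-request notice, etc.):

```typescript
// Race the LLM call against a timer; whichever settles first wins. The
// fallback keeps the UI responsive when the backend is saturated.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  const timer = new Promise<T>((resolve) => setTimeout(() => resolve(fallback), ms));
  return Promise.race([work, timer]);
}
```

Note that the underlying request keeps running after the timeout fires; a fuller version would also cancel it (e.g. via `AbortController`) to avoid wasted spend.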