AI Product Toolkit
Production-ready guide for shipping LLM-powered features — covering prompt engineering for products, cost optimization, safety systems, evaluation frameworks, and the gap between demos and production.
When to Use
Use this toolkit when:
- Adding AI/LLM features to an existing product
- Building an AI-native application from scratch
- Optimizing LLM costs for production scale
- Building evaluation and safety systems for AI features
Not for:
- Pure ML research without product context
- Fine-tuning or training models (see post-training tools)
- Non-LLM AI (computer vision, recommendation systems)
Quick Start
Production LLM Integration
```python
import anthropic
from tenacity import retry, stop_after_attempt, wait_exponential

client = anthropic.Anthropic()

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def generate_response(user_input, context):
    """Production-grade LLM call with retry and error handling."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You are a helpful product assistant for [product name].",
        messages=[
            {"role": "user", "content": f"Context: {context}\n\nUser: {user_input}"}
        ],
    )
    return response.content[0].text

# With input validation
def handle_user_query(user_input):
    if len(user_input) > 10000:
        return "Please keep your message under 10,000 characters."
    if not user_input.strip():
        return "Please enter a message."
    response = generate_response(user_input, get_context(user_input))
    return validate_output(response)
```
Cost Optimization
```python
class LLMCostOptimizer:
    MODEL_TIERS = {
        "simple": {"model": "claude-haiku-4-5-20251001", "cost_per_1k": 0.00025},
        "moderate": {"model": "claude-sonnet-4-20250514", "cost_per_1k": 0.003},
        "complex": {"model": "claude-opus-4-20250514", "cost_per_1k": 0.015},
    }

    def route_query(self, query, complexity_score):
        """Route to the cheapest model that can handle the task."""
        if complexity_score < 0.3:
            return self.MODEL_TIERS["simple"]
        elif complexity_score < 0.7:
            return self.MODEL_TIERS["moderate"]
        else:
            return self.MODEL_TIERS["complex"]
```
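The router assumes a `complexity_score` is already available, but doesn't say where it comes from. One hedged option — a hypothetical heuristic, not part of the toolkit — scores queries by length and reasoning-style keywords; a production system would more likely use a small classifier:

```python
def estimate_complexity(query: str) -> float:
    """Heuristic complexity score in [0, 1] (hypothetical sketch).

    Longer queries and reasoning-style cue phrases push the score up.
    """
    q = query.lower()
    score = min(len(query) / 1000, 0.4)  # length contributes up to 0.4
    hard_cues = ("analyze", "compare", "explain why", "step by step", "design")
    score += 0.2 * sum(cue in q for cue in hard_cues)  # each cue adds 0.2
    return min(score, 1.0)
```

With the thresholds above, a short factual question stays on the cheap tier, while a multi-cue analytical request routes to the complex tier.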
Core Concepts
Demo vs Production Gap
| Aspect | Demo | Production |
|---|---|---|
| Input quality | Clean, expected | Messy, adversarial, edge cases |
| Latency | Doesn't matter | < 2 seconds expected |
| Cost | Negligible | $$$$ at scale |
| Errors | Manual retry | Graceful degradation required |
| Safety | Trusted users | Adversarial users |
| Monitoring | None | Logs, alerts, dashboards |
Evaluation Framework
| Metric | Measurement | Target |
|---|---|---|
| Task completion | User successfully completes intent | > 85% |
| Factual accuracy | Correct information in response | > 95% |
| Latency P95 | 95th percentile response time | < 3s |
| Cost per query | Average API cost | < $0.01 |
| Safety rate | Responses passing safety checks | > 99.9% |
| User satisfaction | Thumbs up/down ratio | > 80% positive |
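The metrics in the table can be computed from logged interactions. A minimal sketch, assuming a hypothetical log schema where each record carries `completed`, `latency_ms`, and `cost_usd`:

```python
from statistics import quantiles

def latency_p95(latencies_ms):
    """95th-percentile latency from per-request timings (needs >= 2 samples)."""
    return quantiles(latencies_ms, n=100)[94]  # 95th percentile cut point

def score_eval_run(records):
    """Aggregate logged interactions into evaluation metrics.

    `records` is a hypothetical schema: dicts with `completed` (bool),
    `latency_ms` (float), and `cost_usd` (float) per query.
    """
    n = len(records)
    return {
        "task_completion": sum(r["completed"] for r in records) / n,
        "latency_p95_ms": latency_p95([r["latency_ms"] for r in records]),
        "cost_per_query": sum(r["cost_usd"] for r in records) / n,
    }
```

Factual accuracy and safety rate need labeled or rule-checked data and don't reduce to simple aggregates, which is why they're omitted here.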
Configuration
| Parameter | Description |
|---|---|
| primary_model | Default LLM model |
| fallback_model | Backup if primary fails |
| max_tokens | Response length limit |
| temperature | Creativity vs determinism |
| timeout_ms | Maximum response time |
| retry_count | Attempts before fallback |
| rate_limit | Requests per minute per user |
| cost_budget | Monthly spending limit |
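As a concrete instance, the parameters above might be collected into a single config object. The values here are illustrative defaults, not recommendations from the toolkit:

```python
# Hypothetical configuration instantiating the parameters above.
LLM_CONFIG = {
    "primary_model": "claude-sonnet-4-20250514",
    "fallback_model": "claude-haiku-4-5-20251001",
    "max_tokens": 1024,
    "temperature": 0.3,    # lower = more deterministic
    "timeout_ms": 10_000,
    "retry_count": 3,
    "rate_limit": 60,      # requests per minute per user
    "cost_budget": 500.0,  # USD per month
}
```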
Best Practices
- Ship with the simplest model that works — upgrade only when users need more capability
- Build evaluation before features — you can't improve what you can't measure
- Add safety from day one — retrofitting safety is harder and more expensive
- Use model tiering — route simple queries to cheaper models
- Design for graceful degradation — show a helpful fallback when the LLM fails
- Log everything — prompts, responses, latency, and user feedback for continuous improvement
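The graceful-degradation practice above can be sketched as a thin wrapper; this is illustrative, and a real system would also log the exception and emit a metric in the handler:

```python
def with_fallback(primary_call, user_input, fallback_message):
    """Graceful-degradation wrapper (illustrative sketch).

    Tries the LLM call; on any failure, returns a static fallback
    instead of surfacing an error to the user.
    """
    try:
        return primary_call(user_input)
    except Exception:
        # In production: log the exception and increment a failure metric here.
        return fallback_message
```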
Common Issues
LLM costs are too high: Implement model tiering (route 60-70% of queries to cheaper models). Add response caching for repeated queries. Reduce max_tokens where possible.
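The caching suggestion above could look like the following exact-match sketch, keyed on a hash of the normalized prompt. Production caches usually add a TTL and often semantic (embedding-based) matching, neither of which is shown here:

```python
import hashlib

class ResponseCache:
    """Exact-match response cache (illustrative sketch)."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Normalize so trivially different phrasings of the same prompt collide.
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get_or_call(self, prompt, generate):
        """Return a cached response, calling `generate(prompt)` only on a miss."""
        key = self._key(prompt)
        if key not in self._store:
            self._store[key] = generate(prompt)
        return self._store[key]
```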
Latency is too high for UX: Use streaming responses. Switch to faster models (Haiku for simple tasks). Pre-compute responses for predictable queries.
Users find inaccurate responses: Add retrieval (RAG) to ground responses in your data. Implement output validation. Add confidence scoring and show caveats for low-confidence answers.
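The output-validation step mentioned above (and the `validate_output` hook in the Quick Start) might look like this minimal pass — a hypothetical sketch; real validators also check claims against retrieved sources:

```python
def validate_output(response: str, banned_phrases=("as an ai language model",)) -> str:
    """Minimal output-validation pass (illustrative sketch).

    Rejects empty responses, truncates oversized ones at a word
    boundary, and replaces responses containing boilerplate phrases.
    """
    if not response.strip():
        return "Sorry, I couldn't generate a response. Please try again."
    if len(response) > 8000:
        response = response[:8000].rsplit(" ", 1)[0] + "..."
    if any(p in response.lower() for p in banned_phrases):
        return "Sorry, I couldn't generate a useful response. Please rephrase."
    return response
```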