LLM Observability Platform -- Comprehensive Monitoring and Tracing
Overview
A comprehensive skill for building and operating observability infrastructure for LLM-powered applications. LLM observability platforms provide end-to-end visibility into AI system behavior -- capturing traces of every LLM call, retrieval operation, embedding generation, and tool invocation. This skill covers the observability landscape including open-source tools (Langfuse, OpenLLMetry, Phoenix), commercial platforms (LangSmith, Datadog LLM Observability, Galileo), and standards-based approaches using OpenTelemetry. It provides patterns for tracing, monitoring, evaluation, cost tracking, and production alerting across any LLM framework.
When to Use
- Setting up tracing and monitoring for LLM applications in development or production
- Comparing observability platforms to select the right tool for your stack
- Implementing OpenTelemetry-based observability for vendor-neutral tracing
- Building custom dashboards for LLM metrics (latency, token usage, cost, error rates)
- Debugging complex agent workflows with multi-step traces
- Tracking evaluation metrics and model quality over time
- Implementing cost monitoring and optimization for LLM API spending
Quick Start
Option 1: Langfuse (Open Source)
```bash
pip install langfuse
```

```python
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com",  # or self-hosted URL
)
client = OpenAI()

@observe()
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content
    # Track generation metadata
    langfuse_context.update_current_observation(
        model="gpt-4o",
        usage={"input": response.usage.prompt_tokens, "output": response.usage.completion_tokens},
        metadata={"temperature": 0.7},
    )
    return answer

result = answer_question("What is observability?")
langfuse.flush()
```
Option 2: OpenTelemetry with OpenLLMetry
```bash
pip install traceloop-sdk opentelemetry-exporter-otlp
```

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow
from openai import OpenAI

# Initialize with two lines -- instruments OpenAI, Anthropic, LangChain, etc.
Traceloop.init(
    app_name="my_llm_app",
    api_endpoint="http://localhost:4318",  # OTLP collector endpoint
)

client = OpenAI()

@workflow(name="qa_pipeline")
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```
Option 3: Arize Phoenix (Local)
```bash
pip install arize-phoenix openinference-instrumentation-openai
```

```python
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace as trace_api
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

# Launch Phoenix UI locally
session = px.launch_app()

# Set up OpenTelemetry tracing, exporting spans to Phoenix's local OTLP endpoint
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace_api.set_tracer_provider(tracer_provider)

# Auto-instrument OpenAI
OpenAIInstrumentor().instrument()

# All OpenAI calls are now traced and visible in the Phoenix UI
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
Core Concepts
The Four Pillars of LLM Observability
| Pillar | What It Captures | Why It Matters |
|---|---|---|
| Tracing | Input/output of every LLM call, retriever, tool | Debug failures, understand execution flow |
| Evaluation | Quality scores, correctness, relevance metrics | Measure and improve output quality |
| Monitoring | Latency, error rates, token usage, cost | Detect issues and track SLAs |
| Feedback | User ratings, annotations, human labels | Build ground truth for improvement |
Platform Comparison
| Platform | Type | Self-Host | OTEL | Pricing | Best For |
|---|---|---|---|---|---|
| Langfuse | Open Source | Yes | Yes | Free (self) / Usage-based | Teams wanting full control |
| LangSmith | Commercial | Enterprise | Yes | Free tier + usage-based | LangChain ecosystem |
| Phoenix | Open Source | Yes (local) | Yes | Free | Local development, debugging |
| OpenLLMetry | Open Source | N/A | Yes | Free | Vendor-neutral OTEL tracing |
| Datadog LLM | Commercial | No | Yes | Per-host pricing | Existing Datadog customers |
| Galileo | Commercial | No | No | Enterprise | Evaluation-first workflows |
| Opik | Open Source | Yes | Yes | Free (self) / Usage-based | Fast iteration, benchmarking |
Trace Anatomy
A trace represents a single end-to-end request through your LLM application:
```
Trace: "user_query_12345"
├── Span: "router" (chain) - 50ms
│   └── Span: "classify_intent" (llm) - 200ms
│       ├── Input: "How do I reset my password?"
│       ├── Output: "account_support"
│       ├── Model: gpt-4o-mini
│       └── Tokens: 45 input, 3 output
├── Span: "retriever" (retriever) - 80ms
│   ├── Query: "password reset instructions"
│   ├── Documents: 5 retrieved
│   └── Relevance scores: [0.95, 0.89, 0.82, 0.71, 0.65]
└── Span: "generator" (llm) - 1200ms
    ├── Input: context + question
    ├── Output: "To reset your password, go to..."
    ├── Model: gpt-4o
    └── Tokens: 850 input, 120 output
```
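The hierarchy above can be modeled with plain Python dataclasses. This is a toy sketch for reasoning about how spans nest and how per-span metrics roll up to the trace level, not any platform's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Toy span: name, kind (chain/llm/retriever), duration, attributes, children."""
    name: str
    kind: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def total_llm_tokens(self) -> int:
        """Sum input + output tokens over this span and all descendants."""
        own = self.attributes.get("input_tokens", 0) + self.attributes.get("output_tokens", 0)
        return own + sum(c.total_llm_tokens() for c in self.children)

# Rebuild the example trace from the diagram
trace = Span("user_query_12345", "trace", 0, children=[
    Span("router", "chain", 50, children=[
        Span("classify_intent", "llm", 200,
             {"model": "gpt-4o-mini", "input_tokens": 45, "output_tokens": 3}),
    ]),
    Span("retriever", "retriever", 80, {"documents": 5}),
    Span("generator", "llm", 1200,
         {"model": "gpt-4o", "input_tokens": 850, "output_tokens": 120}),
])

print(trace.total_llm_tokens())  # 45 + 3 + 850 + 120 = 1018
```

Aggregations like this (total tokens, total cost, slowest span) are exactly what observability backends compute over stored traces.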
OpenTelemetry Integration
Custom OTEL Setup
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure the OTEL resource that identifies this service
resource = Resource.create({
    "service.name": "llm-chatbot",
    "service.version": "2.1.0",
    "deployment.environment": "production",
})

# Set up tracing with an OTLP exporter
tracer_provider = TracerProvider(resource=resource)
otlp_exporter = OTLPSpanExporter(endpoint="http://collector:4317", insecure=True)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(tracer_provider)

tracer = trace.get_tracer("llm-chatbot")

# Manual span creation for custom logic
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("llm.model", "gpt-4o")
    span.set_attribute("llm.token_count.prompt", 150)
    span.set_attribute("llm.token_count.completion", 80)
    span.set_attribute("llm.cost", 0.0035)
    response = call_llm(prompt)  # call_llm / prompt are placeholders for your application code
    span.set_attribute("llm.response.finish_reason", "stop")
    span.set_status(trace.StatusCode.OK)
```
Semantic Conventions for GenAI
OpenTelemetry defines standard attribute names for GenAI observability:
| Attribute | Type | Description |
|---|---|---|
| `gen_ai.system` | string | AI system name (e.g., "openai") |
| `gen_ai.request.model` | string | Model identifier |
| `gen_ai.request.temperature` | float | Sampling temperature |
| `gen_ai.request.max_tokens` | int | Max tokens requested |
| `gen_ai.response.finish_reasons` | string[] | Why generation stopped |
| `gen_ai.usage.prompt_tokens` | int | Input token count |
| `gen_ai.usage.completion_tokens` | int | Output token count |
| `gen_ai.response.id` | string | Provider response ID |
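In code, applying these conventions amounts to setting a flat attribute dict on the LLM span. A small helper like the following (hypothetical, not part of any SDK) keeps the attribute names consistent across your codebase:

```python
from __future__ import annotations

def genai_attributes(
    system: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    temperature: float | None = None,
    finish_reasons: list[str] | None = None,
) -> dict:
    """Build a span-attribute dict following the OTEL GenAI semantic conventions."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.prompt_tokens": prompt_tokens,
        "gen_ai.usage.completion_tokens": completion_tokens,
    }
    # Only include optional attributes when they were actually set on the request
    if temperature is not None:
        attrs["gen_ai.request.temperature"] = temperature
    if finish_reasons is not None:
        attrs["gen_ai.response.finish_reasons"] = finish_reasons
    return attrs

attrs = genai_attributes("openai", "gpt-4o", 150, 80, temperature=0.7, finish_reasons=["stop"])
# Apply to an OTEL span with: span.set_attributes(attrs)
```

Note that the convention names have evolved across spec versions, so check the version your backend expects before standardizing on one set.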
Cost Tracking
```python
# Model pricing (per 1K tokens, approximate -- verify against current provider price lists)
MODEL_COSTS = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
    "claude-3-5-haiku": {"input": 0.0008, "output": 0.004},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    costs = MODEL_COSTS.get(model, {"input": 0, "output": 0})
    return (input_tokens / 1000 * costs["input"]) + (output_tokens / 1000 * costs["output"])

# Track in Langfuse
@observe()
def tracked_generation(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    cost = calculate_cost(model, response.usage.prompt_tokens, response.usage.completion_tokens)
    langfuse_context.update_current_observation(
        model=model,
        usage={"input": response.usage.prompt_tokens, "output": response.usage.completion_tokens},
        metadata={"cost_usd": cost},
    )
    return response.choices[0].message.content
```
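Once each trace carries a cost, the next step is aggregating by dimension to see who or what drives spend. A stdlib-only sketch (the request records and pricing subset here are hypothetical; it re-inlines `calculate_cost` so the snippet stands alone):

```python
from collections import defaultdict

# Subset of the pricing table above (per 1K tokens, approximate)
MODEL_COSTS = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    costs = MODEL_COSTS.get(model, {"input": 0, "output": 0})
    return input_tokens / 1000 * costs["input"] + output_tokens / 1000 * costs["output"]

def aggregate_cost(requests: list[dict], key: str) -> dict:
    """Sum cost per value of `key` (e.g. 'user' or 'model') over request records."""
    totals: dict = defaultdict(float)
    for r in requests:
        totals[r[key]] += calculate_cost(r["model"], r["input_tokens"], r["output_tokens"])
    return dict(totals)

requests = [
    {"user": "alice", "model": "gpt-4o", "input_tokens": 1000, "output_tokens": 500},
    {"user": "alice", "model": "gpt-4o-mini", "input_tokens": 2000, "output_tokens": 1000},
    {"user": "bob", "model": "gpt-4o", "input_tokens": 4000, "output_tokens": 2000},
]
by_user = aggregate_cost(requests, "user")
print(by_user)  # alice ≈ $0.0084, bob ≈ $0.03
```

In practice the observability backend does this aggregation for you when cost is attached as structured metadata; this shows what that computation looks like.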
Alerting and Monitoring
```python
# Example: Prometheus metrics for LLM monitoring
import time

from prometheus_client import Counter, Gauge, Histogram

llm_requests_total = Counter("llm_requests_total", "Total LLM requests", ["model", "status"])
llm_latency_seconds = Histogram("llm_latency_seconds", "LLM call latency", ["model"])
llm_token_usage = Counter("llm_token_usage_total", "Token usage", ["model", "direction"])
llm_cost_dollars = Counter("llm_cost_dollars_total", "LLM API cost", ["model"])
llm_error_rate = Gauge("llm_error_rate", "Rolling error rate", ["model"])

def monitored_llm_call(prompt: str, model: str = "gpt-4o") -> str:
    start = time.time()
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        llm_requests_total.labels(model=model, status="success").inc()
        llm_token_usage.labels(model=model, direction="input").inc(response.usage.prompt_tokens)
        llm_token_usage.labels(model=model, direction="output").inc(response.usage.completion_tokens)
        cost = calculate_cost(model, response.usage.prompt_tokens, response.usage.completion_tokens)
        llm_cost_dollars.labels(model=model).inc(cost)
        return response.choices[0].message.content
    except Exception:
        llm_requests_total.labels(model=model, status="error").inc()
        raise
    finally:
        llm_latency_seconds.labels(model=model).observe(time.time() - start)
```
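The `llm_error_rate` gauge above needs a rolling window to be meaningful. One stdlib-only way to compute it (a sketch; the window size of 100 is an arbitrary choice) keeps the last N outcomes in a bounded deque:

```python
from collections import deque

class RollingErrorRate:
    """Track the error fraction over the last `window` calls."""

    def __init__(self, window: int = 100):
        self.outcomes: deque[bool] = deque(maxlen=window)  # True means the call errored

    def record(self, error: bool) -> None:
        self.outcomes.append(error)

    def rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

tracker = RollingErrorRate(window=4)
for error in (False, False, True, False):
    tracker.record(error)
print(tracker.rate())  # 0.25

# Inside monitored_llm_call, after each request:
#   tracker.record(error=...)
#   llm_error_rate.labels(model=model).set(tracker.rate())
```

Alternatively, derive the rate in PromQL from the counters alone (e.g. a ratio of `rate()` expressions over `llm_requests_total`), which avoids application-side state.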
Best Practices
- Trace every LLM interaction -- Instrument all LLM calls, retrieval operations, and tool invocations so that no part of the pipeline is a black box.
- Use OpenTelemetry for vendor neutrality -- Build on OTEL standards so you can switch between backends (Langfuse, Jaeger, Datadog) without changing application code.
- Track cost per request -- Attach token counts and calculated cost to every trace. Aggregate by user, feature, or model to identify cost optimization opportunities.
- Implement production sampling -- Trace 100% in development but 10-25% in production to balance observability cost with coverage.
- Build evaluation pipelines early -- Set up automated quality evaluation before production launch, not after quality issues are reported.
- Capture user feedback as ground truth -- Wire thumbs-up/down signals to your observability platform to build labeled datasets for evaluation and fine-tuning.
- Set up latency and error alerts -- Configure alerts for P95 latency spikes, error rate increases, and cost anomalies using your monitoring stack.
- Redact PII before tracing -- Strip personal information from inputs and outputs before they reach the observability backend. Compliance is non-negotiable.
- Use structured metadata for filtering -- Attach consistent metadata (user tier, feature name, model version) to traces for powerful segmentation and analysis.
- Compare models and prompts with experiments -- Use evaluation datasets to systematically compare model versions, prompt changes, and pipeline modifications before deploying.
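The production-sampling bullet above can be made concrete. OpenTelemetry's `TraceIdRatioBased` sampler keeps a deterministic fraction of traces by thresholding a value derived from the trace ID; the same idea in plain Python (a sketch of the principle, not the SDK's actual implementation):

```python
import hashlib

def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces by hashing the trace ID.

    The same trace ID always gets the same decision, so every span of a
    given trace is kept or dropped together -- the key property that makes
    ratio sampling safe for distributed traces.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

kept = sum(should_sample(f"trace-{i}", 0.25) for i in range(10_000))
print(kept / 10_000)  # roughly 0.25
```

In a real deployment you would configure the SDK sampler (e.g. `ParentBased(TraceIdRatioBased(0.1))`) rather than hand-rolling this, but the determinism property is the thing to verify when picking a sampler.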
Troubleshooting
Traces not appearing in the dashboard:
Check that the OTLP exporter endpoint is reachable and that API keys are correctly set. Use BatchSpanProcessor in production (not SimpleSpanProcessor) and call flush() before process exit.
High observability costs in production:
Implement sampling to trace only a fraction of requests. Use LANGSMITH_TRACING_SAMPLING_RATE or equivalent. For Langfuse, filter by user segment or request type.
Missing spans in complex agent traces:
Ensure all functions in the call chain are instrumented. Auto-instrumentation libraries only cover supported frameworks -- custom logic needs manual span creation.
Token counts not appearing in traces:
Some auto-instrumentation libraries do not capture token usage for streaming responses. Use the response object's usage field or estimate tokens with tiktoken.
OTEL collector dropping spans:
Increase the batch processor queue size and timeout. Check collector memory limits and ensure the exporter retry policy is configured for transient failures.