
Advanced Observability Platform

Battle-tested skill for open-source observability platforms. Includes structured workflows, validation checks, and reusable patterns for AI research.


LLM Observability Platform -- Comprehensive Monitoring and Tracing

Overview

A comprehensive skill for building and operating observability infrastructure for LLM-powered applications. LLM observability platforms provide end-to-end visibility into AI system behavior -- capturing traces of every LLM call, retrieval operation, embedding generation, and tool invocation. This skill covers the observability landscape including open-source tools (Langfuse, OpenLLMetry, Phoenix), commercial platforms (LangSmith, Datadog LLM Observability, Galileo), and standards-based approaches using OpenTelemetry. It provides patterns for tracing, monitoring, evaluation, cost tracking, and production alerting across any LLM framework.

When to Use

  • Setting up tracing and monitoring for LLM applications in development or production
  • Comparing observability platforms to select the right tool for your stack
  • Implementing OpenTelemetry-based observability for vendor-neutral tracing
  • Building custom dashboards for LLM metrics (latency, token usage, cost, error rates)
  • Debugging complex agent workflows with multi-step traces
  • Tracking evaluation metrics and model quality over time
  • Implementing cost monitoring and optimization for LLM API spending

Quick Start

Option 1: Langfuse (Open Source)

pip install langfuse
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com",  # or self-hosted URL
)
client = OpenAI()

@observe()
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content
    # Track generation metadata
    langfuse_context.update_current_observation(
        model="gpt-4o",
        usage={"input": response.usage.prompt_tokens, "output": response.usage.completion_tokens},
        metadata={"temperature": 0.7},
    )
    return answer

result = answer_question("What is observability?")
langfuse.flush()

Option 2: OpenTelemetry with OpenLLMetry

pip install traceloop-sdk opentelemetry-exporter-otlp
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task
from openai import OpenAI

# Initialize with two lines -- instruments OpenAI, Anthropic, LangChain, etc.
Traceloop.init(
    app_name="my_llm_app",
    api_endpoint="http://localhost:4318",  # OTLP collector endpoint
)

client = OpenAI()

@workflow(name="qa_pipeline")
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

Option 3: Arize Phoenix (Local)

pip install arize-phoenix openinference-instrumentation-openai opentelemetry-exporter-otlp
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace as trace_api
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

# Launch Phoenix UI locally
session = px.launch_app()

# Set up OpenTelemetry tracing, exporting spans to the local Phoenix collector
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace_api.set_tracer_provider(tracer_provider)

# Auto-instrument OpenAI
OpenAIInstrumentor().instrument()

# All OpenAI calls are now traced and visible in Phoenix UI
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

Core Concepts

The Four Pillars of LLM Observability

| Pillar | What It Captures | Why It Matters |
| --- | --- | --- |
| Tracing | Input/output of every LLM call, retriever, tool | Debug failures, understand execution flow |
| Evaluation | Quality scores, correctness, relevance metrics | Measure and improve output quality |
| Monitoring | Latency, error rates, token usage, cost | Detect issues and track SLAs |
| Feedback | User ratings, annotations, human labels | Build ground truth for improvement |
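
The feedback pillar amounts to turning raw thumbs-up/down signals into labels you can evaluate against. A minimal framework-agnostic sketch (the `FeedbackStore` helper is our own illustration, not a platform API; real platforms expose similar score/annotation endpoints):

```python
from collections import defaultdict

class FeedbackStore:
    """Collect thumbs-up/down signals per trace to build ground-truth labels."""

    def __init__(self):
        self._scores = defaultdict(list)

    def record(self, trace_id: str, value: int) -> None:
        # value: 1 for thumbs-up, 0 for thumbs-down
        self._scores[trace_id].append(value)

    def label(self, trace_id: str, threshold: float = 0.5) -> str:
        # Majority vote over all recorded signals for this trace
        votes = self._scores[trace_id]
        return "good" if sum(votes) / len(votes) >= threshold else "bad"

store = FeedbackStore()
store.record("trace-1", 1)
store.record("trace-1", 1)
store.record("trace-1", 0)
print(store.label("trace-1"))  # majority thumbs-up
```

The resulting labels can then be attached to traces as scores and exported as an evaluation dataset.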

Platform Comparison

| Platform | Type | Self-Host | OTEL | Pricing | Best For |
| --- | --- | --- | --- | --- | --- |
| Langfuse | Open Source | Yes | Yes | Free (self) / Usage-based | Teams wanting full control |
| LangSmith | Commercial | Enterprise | Yes | Free tier + usage-based | LangChain ecosystem |
| Phoenix | Open Source | Yes (local) | Yes | Free | Local development, debugging |
| OpenLLMetry | Open Source | N/A | Yes | Free | Vendor-neutral OTEL tracing |
| Datadog LLM | Commercial | No | Yes | Per-host pricing | Existing Datadog customers |
| Galileo | Commercial | No | No | Enterprise | Evaluation-first workflows |
| Opik | Open Source | Yes | Yes | Free (self) / Usage-based | Fast iteration, benchmarking |

Trace Anatomy

A trace represents a single end-to-end request through your LLM application:

Trace: "user_query_12345"
├── Span: "router" (chain) - 50ms
│   └── Span: "classify_intent" (llm) - 200ms
│       ├── Input: "How do I reset my password?"
│       ├── Output: "account_support"
│       ├── Model: gpt-4o-mini
│       └── Tokens: 45 input, 3 output
├── Span: "retriever" (retriever) - 80ms
│   ├── Query: "password reset instructions"
│   ├── Documents: 5 retrieved
│   └── Relevance scores: [0.95, 0.89, 0.82, 0.71, 0.65]
└── Span: "generator" (llm) - 1200ms
    ├── Input: context + question
    ├── Output: "To reset your password, go to..."
    ├── Model: gpt-4o
    └── Tokens: 850 input, 120 output

OpenTelemetry Integration

Custom OTEL Setup

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure OTEL resource
resource = Resource.create({
    "service.name": "llm-chatbot",
    "service.version": "2.1.0",
    "deployment.environment": "production",
})

# Set up tracing with OTLP exporter
tracer_provider = TracerProvider(resource=resource)
otlp_exporter = OTLPSpanExporter(endpoint="http://collector:4317", insecure=True)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(tracer_provider)

tracer = trace.get_tracer("llm-chatbot")

# Manual span creation for custom logic
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("llm.model", "gpt-4o")
    span.set_attribute("llm.token_count.prompt", 150)
    span.set_attribute("llm.token_count.completion", 80)
    span.set_attribute("llm.cost", 0.0035)
    response = call_llm(prompt)
    span.set_attribute("llm.response.finish_reason", "stop")
    span.set_status(trace.StatusCode.OK)

Semantic Conventions for GenAI

OpenTelemetry defines standard attribute names for GenAI observability:

| Attribute | Type | Description |
| --- | --- | --- |
| gen_ai.system | string | AI system name (e.g., "openai") |
| gen_ai.request.model | string | Model identifier |
| gen_ai.request.temperature | float | Sampling temperature |
| gen_ai.request.max_tokens | int | Max tokens requested |
| gen_ai.response.finish_reasons | string[] | Why generation stopped |
| gen_ai.usage.prompt_tokens | int | Input token count |
| gen_ai.usage.completion_tokens | int | Output token count |
| gen_ai.response.id | string | Provider response ID |
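
One way to apply these conventions consistently is a small mapping helper that converts provider response fields into an attribute dict for `span.set_attributes()`. A sketch (the `genai_attributes` helper is our own illustration, not part of the OTel SDK):

```python
def genai_attributes(system: str, model: str, usage: dict, finish_reasons: list) -> dict:
    """Map provider response fields onto OTel GenAI semantic-convention keys."""
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.prompt_tokens": usage["prompt_tokens"],
        "gen_ai.usage.completion_tokens": usage["completion_tokens"],
        "gen_ai.response.finish_reasons": finish_reasons,
    }

attrs = genai_attributes(
    "openai",
    "gpt-4o",
    {"prompt_tokens": 150, "completion_tokens": 80},
    ["stop"],
)
# attrs can now be passed to span.set_attributes(attrs)
```

Centralizing the mapping in one helper keeps attribute names uniform across every instrumented call site.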

Cost Tracking

# Model pricing (per 1K tokens, approximate)
MODEL_COSTS = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
    "claude-3-5-haiku": {"input": 0.0008, "output": 0.004},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    costs = MODEL_COSTS.get(model, {"input": 0, "output": 0})
    return (input_tokens / 1000 * costs["input"]) + (output_tokens / 1000 * costs["output"])

# Track in Langfuse
@observe()
def tracked_generation(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    cost = calculate_cost(model, response.usage.prompt_tokens, response.usage.completion_tokens)
    langfuse_context.update_current_observation(
        model=model,
        usage={"input": response.usage.prompt_tokens, "output": response.usage.completion_tokens},
        metadata={"cost_usd": cost},
    )
    return response.choices[0].message.content

Alerting and Monitoring

# Example: Prometheus metrics for LLM monitoring
import time

from prometheus_client import Counter, Histogram, Gauge

llm_requests_total = Counter("llm_requests_total", "Total LLM requests", ["model", "status"])
llm_latency_seconds = Histogram("llm_latency_seconds", "LLM call latency", ["model"])
llm_token_usage = Counter("llm_token_usage_total", "Token usage", ["model", "direction"])
llm_cost_dollars = Counter("llm_cost_dollars_total", "LLM API cost", ["model"])
llm_error_rate = Gauge("llm_error_rate", "Rolling error rate", ["model"])

def monitored_llm_call(prompt: str, model: str = "gpt-4o") -> str:
    start = time.time()
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        llm_requests_total.labels(model=model, status="success").inc()
        llm_token_usage.labels(model=model, direction="input").inc(response.usage.prompt_tokens)
        llm_token_usage.labels(model=model, direction="output").inc(response.usage.completion_tokens)
        cost = calculate_cost(model, response.usage.prompt_tokens, response.usage.completion_tokens)
        llm_cost_dollars.labels(model=model).inc(cost)
        return response.choices[0].message.content
    except Exception:
        llm_requests_total.labels(model=model, status="error").inc()
        raise
    finally:
        llm_latency_seconds.labels(model=model).observe(time.time() - start)

Best Practices

  1. Trace every LLM interaction -- Instrument all LLM calls, retrieval operations, and tool invocations so that no part of the pipeline is a black box.
  2. Use OpenTelemetry for vendor neutrality -- Build on OTEL standards so you can switch between backends (Langfuse, Jaeger, Datadog) without changing application code.
  3. Track cost per request -- Attach token counts and calculated cost to every trace. Aggregate by user, feature, or model to identify cost optimization opportunities.
  4. Implement production sampling -- Trace 100% in development but 10-25% in production to balance observability cost with coverage.
  5. Build evaluation pipelines early -- Set up automated quality evaluation before production launch, not after quality issues are reported.
  6. Capture user feedback as ground truth -- Wire thumbs-up/down signals to your observability platform to build labeled datasets for evaluation and fine-tuning.
  7. Set up latency and error alerts -- Configure alerts for P95 latency spikes, error rate increases, and cost anomalies using your monitoring stack.
  8. Redact PII before tracing -- Strip personal information from inputs and outputs before they reach the observability backend. Compliance is non-negotiable.
  9. Use structured metadata for filtering -- Attach consistent metadata (user tier, feature name, model version) to traces for powerful segmentation and analysis.
  10. Compare models and prompts with experiments -- Use evaluation datasets to systematically compare model versions, prompt changes, and pipeline modifications before deploying.
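
Production sampling (practice 4) works best when it is deterministic, so a given request is either traced in full or not at all. A hash-based sketch (`should_trace` is our own helper; in OpenTelemetry you would typically configure `ParentBased(TraceIdRatioBased(rate))` on the tracer provider instead):

```python
import hashlib

def should_trace(request_id: str, rate: float = 0.15) -> bool:
    """Deterministic sampling: the same request id always gets the same decision."""
    # Hash the id into one of 10,000 buckets and trace the lowest `rate` fraction
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

# Over many requests the sampled fraction converges to the configured rate
sampled = sum(should_trace(f"req-{i}") for i in range(10_000))
```

Keying the decision on a stable id (trace id, request id) keeps all spans of one request together, avoiding partially sampled traces.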

Troubleshooting

Traces not appearing in the dashboard: Check that the OTLP exporter endpoint is reachable and that API keys are correctly set. Use BatchSpanProcessor in production (not SimpleSpanProcessor) and call flush() before process exit.

High observability costs in production: Implement sampling to trace only a fraction of requests. Use LANGSMITH_TRACING_SAMPLING_RATE or equivalent. For Langfuse, filter by user segment or request type.

Missing spans in complex agent traces: Ensure all functions in the call chain are instrumented. Auto-instrumentation libraries only cover supported frameworks -- custom logic needs manual span creation.

Token counts not appearing in traces: Some auto-instrumentation libraries do not capture token usage for streaming responses. Use the response object's usage field or estimate tokens with tiktoken.

OTEL collector dropping spans: Increase the batch processor queue size and timeout. Check collector memory limits and ensure the exporter retry policy is configured for transient failures.
