LLM Observability Platform -- Comprehensive Monitoring and Tracing
Overview
A comprehensive skill for building and operating observability infrastructure for LLM-powered applications. LLM observability platforms provide end-to-end visibility into AI system behavior -- capturing traces of every LLM call, retrieval operation, embedding generation, and tool invocation. This skill covers the observability landscape including open-source tools (Langfuse, OpenLLMetry, Phoenix), commercial platforms (LangSmith, Datadog LLM Observability, Galileo), and standards-based approaches using OpenTelemetry. It provides patterns for tracing, monitoring, evaluation, cost tracking, and production alerting across any LLM framework.
When to Use
- Setting up tracing and monitoring for LLM applications in development or production
- Comparing observability platforms to select the right tool for your stack
- Implementing OpenTelemetry-based observability for vendor-neutral tracing
- Building custom dashboards for LLM metrics (latency, token usage, cost, error rates)
- Debugging complex agent workflows with multi-step traces
- Tracking evaluation metrics and model quality over time
- Implementing cost monitoring and optimization for LLM API spending
Quick Start
Option 1: Langfuse (Open Source)
```bash
pip install langfuse
```

```python
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com",  # or self-hosted URL
)
client = OpenAI()

@observe()
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content
    # Track generation metadata
    langfuse_context.update_current_observation(
        model="gpt-4o",
        usage={"input": response.usage.prompt_tokens, "output": response.usage.completion_tokens},
        metadata={"temperature": 0.7},
    )
    return answer

result = answer_question("What is observability?")
langfuse.flush()
```
Option 2: OpenTelemetry with OpenLLMetry
```bash
pip install traceloop-sdk opentelemetry-exporter-otlp
```

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow
from openai import OpenAI

# Initialize with two lines -- instruments OpenAI, Anthropic, LangChain, etc.
Traceloop.init(
    app_name="my_llm_app",
    api_endpoint="http://localhost:4318",  # OTLP collector endpoint
)

client = OpenAI()

@workflow(name="qa_pipeline")
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```
Option 3: Arize Phoenix (Local)
```bash
pip install arize-phoenix openinference-instrumentation-openai
```

```python
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace as trace_api
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

# Launch Phoenix UI locally
session = px.launch_app()

# Set up OpenTelemetry tracing, exporting spans to Phoenix's local OTLP endpoint
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace_api.set_tracer_provider(tracer_provider)

# Auto-instrument OpenAI
OpenAIInstrumentor().instrument()

# All OpenAI calls are now traced and visible in the Phoenix UI
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
Core Concepts
The Four Pillars of LLM Observability
| Pillar | What It Captures | Why It Matters |
|---|---|---|
| Tracing | Input/output of every LLM call, retriever, tool | Debug failures, understand execution flow |
| Evaluation | Quality scores, correctness, relevance metrics | Measure and improve output quality |
| Monitoring | Latency, error rates, token usage, cost | Detect issues and track SLAs |
| Feedback | User ratings, annotations, human labels | Build ground truth for improvement |
Platform Comparison
| Platform | Type | Self-Host | OTEL | Pricing | Best For |
|---|---|---|---|---|---|
| Langfuse | Open Source | Yes | Yes | Free (self) / Usage-based | Teams wanting full control |
| LangSmith | Commercial | Enterprise | Yes | Free tier + usage-based | LangChain ecosystem |
| Phoenix | Open Source | Yes (local) | Yes | Free | Local development, debugging |
| OpenLLMetry | Open Source | N/A | Yes | Free | Vendor-neutral OTEL tracing |
| Datadog LLM | Commercial | No | Yes | Per-host pricing | Existing Datadog customers |
| Galileo | Commercial | No | No | Enterprise | Evaluation-first workflows |
| Opik | Open Source | Yes | Yes | Free (self) / Usage-based | Fast iteration, benchmarking |
Trace Anatomy
A trace represents a single end-to-end request through your LLM application:
```
Trace: "user_query_12345"
├── Span: "router" (chain) - 50ms
│   └── Span: "classify_intent" (llm) - 200ms
│       ├── Input: "How do I reset my password?"
│       ├── Output: "account_support"
│       ├── Model: gpt-4o-mini
│       └── Tokens: 45 input, 3 output
├── Span: "retriever" (retriever) - 80ms
│   ├── Query: "password reset instructions"
│   ├── Documents: 5 retrieved
│   └── Relevance scores: [0.95, 0.89, 0.82, 0.71, 0.65]
└── Span: "generator" (llm) - 1200ms
    ├── Input: context + question
    ├── Output: "To reset your password, go to..."
    ├── Model: gpt-4o
    └── Tokens: 850 input, 120 output
```
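The hierarchy above can be modeled with plain Python dataclasses. This is a toy sketch for reasoning about how spans nest and how per-span metrics roll up to the trace level, not any platform's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Toy span: name, kind (chain/llm/retriever), duration, attributes, children."""
    name: str
    kind: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def total_llm_tokens(self) -> int:
        """Sum input + output tokens over this span and all descendants."""
        own = self.attributes.get("input_tokens", 0) + self.attributes.get("output_tokens", 0)
        return own + sum(c.total_llm_tokens() for c in self.children)

# Rebuild the example trace from the diagram
trace = Span("user_query_12345", "trace", 0, children=[
    Span("router", "chain", 50, children=[
        Span("classify_intent", "llm", 200,
             {"model": "gpt-4o-mini", "input_tokens": 45, "output_tokens": 3}),
    ]),
    Span("retriever", "retriever", 80, {"documents": 5}),
    Span("generator", "llm", 1200,
         {"model": "gpt-4o", "input_tokens": 850, "output_tokens": 120}),
])

print(trace.total_llm_tokens())  # 45 + 3 + 850 + 120 = 1018
```

Aggregations like this (total tokens, total cost, slowest span) are exactly what observability backends compute over stored traces.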
OpenTelemetry Integration
Custom OTEL Setup
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Configure the OTEL resource that identifies this service
resource = Resource.create({
    "service.name": "llm-chatbot",
    "service.version": "2.1.0",
    "deployment.environment": "production",
})

# Set up tracing with an OTLP exporter
tracer_provider = TracerProvider(resource=resource)
otlp_exporter = OTLPSpanExporter(endpoint="http://collector:4317", insecure=True)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(tracer_provider)

tracer = trace.get_tracer("llm-chatbot")

# Manual span creation for custom logic
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("llm.model", "gpt-4o")
    span.set_attribute("llm.token_count.prompt", 150)
    span.set_attribute("llm.token_count.completion", 80)
    span.set_attribute("llm.cost", 0.0035)
    response = call_llm(prompt)  # call_llm / prompt are placeholders for your application code
    span.set_attribute("llm.response.finish_reason", "stop")
    span.set_status(trace.StatusCode.OK)
```
Semantic Conventions for GenAI
OpenTelemetry defines standard attribute names for GenAI observability:
| Attribute | Type | Description |
|---|---|---|
| `gen_ai.system` | string | AI system name (e.g., "openai") |
| `gen_ai.request.model` | string | Model identifier |
| `gen_ai.request.temperature` | float | Sampling temperature |
| `gen_ai.request.max_tokens` | int | Max tokens requested |
| `gen_ai.response.finish_reasons` | string[] | Why generation stopped |
| `gen_ai.usage.prompt_tokens` | int | Input token count |
| `gen_ai.usage.completion_tokens` | int | Output token count |
| `gen_ai.response.id` | string | Provider response ID |
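In code, applying these conventions amounts to setting a flat attribute dict on the LLM span. A small helper like the following (hypothetical, not part of any SDK) keeps the attribute names consistent across your codebase:

```python
from __future__ import annotations

def genai_attributes(
    system: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    temperature: float | None = None,
    finish_reasons: list[str] | None = None,
) -> dict:
    """Build a span-attribute dict following the OTEL GenAI semantic conventions."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.prompt_tokens": prompt_tokens,
        "gen_ai.usage.completion_tokens": completion_tokens,
    }
    # Only include optional attributes when they were actually set on the request
    if temperature is not None:
        attrs["gen_ai.request.temperature"] = temperature
    if finish_reasons is not None:
        attrs["gen_ai.response.finish_reasons"] = finish_reasons
    return attrs

attrs = genai_attributes("openai", "gpt-4o", 150, 80, temperature=0.7, finish_reasons=["stop"])
# Apply to an OTEL span with: span.set_attributes(attrs)
```

Note that the convention names have evolved across spec versions, so check the version your backend expects before standardizing on one set.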
Cost Tracking
```python
# Model pricing (per 1K tokens, approximate -- verify against current provider price lists)
MODEL_COSTS = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
    "claude-3-5-haiku": {"input": 0.0008, "output": 0.004},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    costs = MODEL_COSTS.get(model, {"input": 0, "output": 0})
    return (input_tokens / 1000 * costs["input"]) + (output_tokens / 1000 * costs["output"])

# Track in Langfuse
@observe()
def tracked_generation(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    cost = calculate_cost(model, response.usage.prompt_tokens, response.usage.completion_tokens)
    langfuse_context.update_current_observation(
        model=model,
        usage={"input": response.usage.prompt_tokens, "output": response.usage.completion_tokens},
        metadata={"cost_usd": cost},
    )
    return response.choices[0].message.content
```
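Once each trace carries a cost, the next step is aggregating by dimension to see who or what drives spend. A stdlib-only sketch (the request records and pricing subset here are hypothetical; it re-inlines `calculate_cost` so the snippet stands alone):

```python
from collections import defaultdict

# Subset of the pricing table above (per 1K tokens, approximate)
MODEL_COSTS = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    costs = MODEL_COSTS.get(model, {"input": 0, "output": 0})
    return input_tokens / 1000 * costs["input"] + output_tokens / 1000 * costs["output"]

def aggregate_cost(requests: list[dict], key: str) -> dict:
    """Sum cost per value of `key` (e.g. 'user' or 'model') over request records."""
    totals: dict = defaultdict(float)
    for r in requests:
        totals[r[key]] += calculate_cost(r["model"], r["input_tokens"], r["output_tokens"])
    return dict(totals)

requests = [
    {"user": "alice", "model": "gpt-4o", "input_tokens": 1000, "output_tokens": 500},
    {"user": "alice", "model": "gpt-4o-mini", "input_tokens": 2000, "output_tokens": 1000},
    {"user": "bob", "model": "gpt-4o", "input_tokens": 4000, "output_tokens": 2000},
]
by_user = aggregate_cost(requests, "user")
print(by_user)  # alice ≈ $0.0084, bob ≈ $0.03
```

In practice the observability backend does this aggregation for you when cost is attached as structured metadata; this shows what that computation looks like.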
Alerting and Monitoring
```python
# Example: Prometheus metrics for LLM monitoring
import time

from prometheus_client import Counter, Gauge, Histogram

llm_requests_total = Counter("llm_requests_total", "Total LLM requests", ["model", "status"])
llm_latency_seconds = Histogram("llm_latency_seconds", "LLM call latency", ["model"])
llm_token_usage = Counter("llm_token_usage_total", "Token usage", ["model", "direction"])
llm_cost_dollars = Counter("llm_cost_dollars_total", "LLM API cost", ["model"])
llm_error_rate = Gauge("llm_error_rate", "Rolling error rate", ["model"])

def monitored_llm_call(prompt: str, model: str = "gpt-4o") -> str:
    start = time.time()
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        llm_requests_total.labels(model=model, status="success").inc()
        llm_token_usage.labels(model=model, direction="input").inc(response.usage.prompt_tokens)
        llm_token_usage.labels(model=model, direction="output").inc(response.usage.completion_tokens)
        cost = calculate_cost(model, response.usage.prompt_tokens, response.usage.completion_tokens)
        llm_cost_dollars.labels(model=model).inc(cost)
        return response.choices[0].message.content
    except Exception:
        llm_requests_total.labels(model=model, status="error").inc()
        raise
    finally:
        llm_latency_seconds.labels(model=model).observe(time.time() - start)
```
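The `llm_error_rate` gauge above needs a rolling window to be meaningful. One stdlib-only way to compute it (a sketch; the window size of 100 is an arbitrary choice) keeps the last N outcomes in a bounded deque:

```python
from collections import deque

class RollingErrorRate:
    """Track the error fraction over the last `window` calls."""

    def __init__(self, window: int = 100):
        self.outcomes: deque[bool] = deque(maxlen=window)  # True means the call errored

    def record(self, error: bool) -> None:
        self.outcomes.append(error)

    def rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

tracker = RollingErrorRate(window=4)
for error in (False, False, True, False):
    tracker.record(error)
print(tracker.rate())  # 0.25

# Inside monitored_llm_call, after each request:
#   tracker.record(error=...)
#   llm_error_rate.labels(model=model).set(tracker.rate())
```

Alternatively, derive the rate in PromQL from the counters alone (e.g. a ratio of `rate()` expressions over `llm_requests_total`), which avoids application-side state.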
Best Practices
- Trace every LLM interaction -- Instrument all LLM calls, retrieval operations, and tool invocations so that no part of the pipeline is a black box.
- Use OpenTelemetry for vendor neutrality -- Build on OTEL standards so you can switch between backends (Langfuse, Jaeger, Datadog) without changing application code.
- Track cost per request -- Attach token counts and calculated cost to every trace. Aggregate by user, feature, or model to identify cost optimization opportunities.
- Implement production sampling -- Trace 100% in development but 10-25% in production to balance observability cost with coverage.
- Build evaluation pipelines early -- Set up automated quality evaluation before production launch, not after quality issues are reported.
- Capture user feedback as ground truth -- Wire thumbs-up/down signals to your observability platform to build labeled datasets for evaluation and fine-tuning.
- Set up latency and error alerts -- Configure alerts for P95 latency spikes, error rate increases, and cost anomalies using your monitoring stack.
- Redact PII before tracing -- Strip personal information from inputs and outputs before they reach the observability backend. Compliance is non-negotiable.
- Use structured metadata for filtering -- Attach consistent metadata (user tier, feature name, model version) to traces for powerful segmentation and analysis.
- Compare models and prompts with experiments -- Use evaluation datasets to systematically compare model versions, prompt changes, and pipeline modifications before deploying.
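The production-sampling bullet above can be made concrete. OpenTelemetry's `TraceIdRatioBased` sampler keeps a deterministic fraction of traces by thresholding a value derived from the trace ID; the same idea in plain Python (a sketch of the principle, not the SDK's actual implementation):

```python
import hashlib

def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces by hashing the trace ID.

    The same trace ID always gets the same decision, so every span of a
    given trace is kept or dropped together -- the key property that makes
    ratio sampling safe for distributed traces.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

kept = sum(should_sample(f"trace-{i}", 0.25) for i in range(10_000))
print(kept / 10_000)  # roughly 0.25
```

In a real deployment you would configure the SDK sampler (e.g. `ParentBased(TraceIdRatioBased(0.1))`) rather than hand-rolling this, but the determinism property is the thing to verify when picking a sampler.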
Troubleshooting
Traces not appearing in the dashboard:
Check that the OTLP exporter endpoint is reachable and that API keys are correctly set. Use BatchSpanProcessor in production (not SimpleSpanProcessor) and call flush() before process exit.
High observability costs in production:
Implement sampling to trace only a fraction of requests. Use LANGSMITH_TRACING_SAMPLING_RATE or equivalent. For Langfuse, filter by user segment or request type.
Missing spans in complex agent traces:
Ensure all functions in the call chain are instrumented. Auto-instrumentation libraries only cover supported frameworks -- custom logic needs manual span creation.
Token counts not appearing in traces:
Some auto-instrumentation libraries do not capture token usage for streaming responses. Use the response object's usage field or estimate tokens with tiktoken.
OTEL collector dropping spans:
Increase the batch processor queue size and timeout. Check collector memory limits and ensure the exporter retry policy is configured for transient failures.