LangSmith -- LLM Observability and Evaluation Platform
Overview
A comprehensive skill for debugging, evaluating, and monitoring language model applications using LangSmith. LangSmith provides end-to-end observability for LLM-powered systems -- capturing detailed traces of every LLM call, retrieval operation, and tool invocation across your application. It enables systematic evaluation against curated datasets, production monitoring with cost and latency tracking, and collaborative prompt engineering through the Prompt Hub. LangSmith integrates natively with LangChain but also works as a standalone tracing platform for OpenAI, Anthropic, and any custom LLM pipeline through the @traceable decorator and client wrappers.
When to Use
- Debugging LLM application issues by inspecting full execution traces
- Evaluating model outputs systematically against labeled datasets
- Monitoring production LLM systems for latency, errors, token usage, and cost
- Building regression test suites for AI features before deployment
- Collaborating on prompt engineering with version-controlled prompts
- Comparing model performance across experiments and configurations
- Tracing complex agent workflows with tool calls, retrieval, and multi-step reasoning
Quick Start
```shell
# Install
pip install langsmith

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
```
```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def answer_question(question: str) -> str:
    """Every call to this function is automatically traced to LangSmith."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Call the function -- trace appears in LangSmith dashboard
result = answer_question("What is retrieval-augmented generation?")
```
Core Concepts
Runs and Traces
A run is a single unit of execution -- an LLM call, a retriever query, or a tool invocation. Runs are organized into hierarchical traces that show the full execution flow of a request.
```python
from langsmith import traceable

@traceable(run_type="chain")
def process_query(query: str) -> str:
    """Parent run: the full pipeline."""
    context = retrieve_context(query)         # Child run 1
    answer = generate_answer(query, context)  # Child run 2
    return answer

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list[str]:
    """Traced as a retriever run."""
    return vector_store.similarity_search(query, k=5)

@traceable(run_type="llm")
def generate_answer(query: str, context: list[str]) -> str:
    """Traced as an LLM run with inputs/outputs captured."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```
Automatic Tracing with Client Wrappers
```python
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap the OpenAI client for automatic tracing
client = wrap_openai(OpenAI())

# All calls are traced without decorators
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
Projects and Organization
```python
import os

from langsmith import traceable, tracing_context

# Set project globally
os.environ["LANGSMITH_PROJECT"] = "production-chatbot"

# Or per-function
@traceable(project_name="experiment-v2")
def experimental_pipeline(query: str) -> str:
    ...

# Or with a context manager
with tracing_context(
    project_name="a-b-test",
    tags=["variant-a", "gpt-4o"],
    metadata={"version": "2.1", "team": "search"},
):
    result = process_query("How does RAG work?")
```
Datasets and Evaluation
Creating Datasets
```python
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    "qa-evaluation-set",
    description="Golden QA pairs for regression testing",
)

# Add labeled examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "Explain gradient descent."},
        {"question": "What is a transformer?"},
    ],
    outputs=[
        {"answer": "A high-level, general-purpose programming language."},
        {"answer": "An optimization algorithm that iteratively adjusts parameters."},
        {"answer": "A neural network architecture based on self-attention."},
    ],
    dataset_id=dataset.id,
)

# Create examples from production traces
runs = client.list_runs(project_name="production-chatbot", limit=50)
for run in runs:
    if run.feedback_stats and run.feedback_stats.get("correctness", {}).get("avg", 0) > 0.8:
        client.create_example(
            inputs=run.inputs,
            outputs=run.outputs,
            dataset_id=dataset.id,
        )
```
Running Evaluations
```python
from langsmith import evaluate
from openai import OpenAI

client = OpenAI()

# Define your target function
def my_chatbot(inputs: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": inputs["question"]}],
    )
    return {"answer": response.choices[0].message.content}

# Define custom evaluators
def answer_relevance(run, example) -> dict:
    """Check if the answer addresses the question."""
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    # Simple keyword overlap score
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    overlap = len(pred_words & ref_words) / max(len(ref_words), 1)
    return {"key": "relevance", "score": min(overlap * 2, 1.0)}

def answer_length(run, example) -> dict:
    """Check answer is not too short or too long."""
    length = len(run.outputs["answer"])
    score = 1.0 if 50 < length < 500 else 0.5
    return {"key": "length_check", "score": score}

# Run evaluation
results = evaluate(
    my_chatbot,
    data="qa-evaluation-set",
    evaluators=[answer_relevance, answer_length],
    experiment_prefix="gpt-4o-baseline",
    max_concurrency=4,
)

# Summarize per-evaluator scores (requires pandas)
df = results.to_pandas()
print(df[["feedback.relevance", "feedback.length_check"]].mean())
```
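The keyword-overlap heuristic inside `answer_relevance` can be exercised on its own, without LangSmith. A minimal sketch (the doubling of the overlap ratio is the same arbitrary scaling used above, not a LangSmith convention):

```python
def keyword_overlap_score(prediction: str, reference: str) -> float:
    """Fraction of reference words present in the prediction, doubled and capped at 1.0."""
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    overlap = len(pred_words & ref_words) / max(len(ref_words), 1)
    return min(overlap * 2, 1.0)

# Covering half the reference vocabulary already saturates the score at 1.0
score = keyword_overlap_score(
    "a neural network architecture",
    "a neural network architecture based on self-attention",
)
print(score)  # 1.0
```

Because the score saturates quickly, it is best treated as a cheap smoke test; the LLM-as-judge evaluators below are better for nuanced grading.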
LLM-as-Judge Evaluation
```python
from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Use LLM-based evaluators for nuanced assessment
results = evaluate(
    my_chatbot,
    data="qa-evaluation-set",
    evaluators=[
        LangChainStringEvaluator("qa"),      # QA correctness
        LangChainStringEvaluator("cot_qa"),  # Chain-of-thought QA
        LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),  # Overall helpfulness
    ],
    experiment_prefix="gpt-4o-llm-judge",
)
```
Client API
```python
from langsmith import Client

client = Client()

# List recent runs with filters
runs = list(client.list_runs(
    project_name="production-chatbot",
    filter='and(eq(status, "success"), gt(latency, "5s"))',
    limit=100,
))

# Attach human feedback to a run
client.create_feedback(
    run_id=runs[0].id,
    key="correctness",
    score=0.9,
    comment="Accurate answer but missing one detail.",
)

# Read run details
run = client.read_run(run_id="run-uuid-here")
latency = (run.end_time - run.start_time).total_seconds()
print(f"Latency: {latency:.2f}s, Tokens: {run.total_tokens}")
```
Advanced Tracing
Sanitizing Sensitive Data
```python
from langsmith import traceable

def redact_pii(inputs: dict) -> dict:
    """Remove sensitive fields before logging."""
    sanitized = inputs.copy()
    for key in ["password", "ssn", "credit_card", "api_key"]:
        if key in sanitized:
            sanitized[key] = "[REDACTED]"
    return sanitized

@traceable(process_inputs=redact_pii)
def authenticate_user(username: str, password: str) -> bool:
    # password is redacted in LangSmith traces
    return check_credentials(username, password)
```
Manual Run Management
```python
from langsmith import trace

with trace(
    name="custom_retrieval",
    run_type="retriever",
    inputs={"query": "machine learning basics"},
) as run:
    results = perform_search("machine learning basics")
    run.end(outputs={"documents": results, "count": len(results)})
```
Production Sampling
```python
import os

# Trace only 10% of production requests to reduce cost
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"
```
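The sampling decision is made once per trace, so a sampled trace keeps all of its child runs intact. A minimal sketch of that semantics (illustrative only, not LangSmith internals):

```python
import random

def keep_trace(sampling_rate: float, rng: random.Random) -> bool:
    """Decide once at the trace root; every child run inherits the decision."""
    return rng.random() < sampling_rate

# At a 0.1 rate, roughly 10% of traces are kept end-to-end
rng = random.Random(42)
kept = sum(keep_trace(0.1, rng) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")
```

This is why sampled production traces remain complete in the dashboard: the rate controls how many traces are captured, not how much of each trace.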
LangChain Integration
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# With LANGSMITH_TRACING=true, all LangChain runs are automatically traced
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful coding assistant."),
    ("user", "{question}"),
])
chain = prompt | llm | StrOutputParser()

# Full chain trace visible in LangSmith: prompt -> LLM -> parser
response = chain.invoke({"question": "How do I read a CSV in Python?"})
```
Configuration Reference
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `LANGSMITH_API_KEY` | Yes | -- | Your LangSmith API key |
| `LANGSMITH_TRACING` | Yes | `false` | Enable tracing (`true`/`false`) |
| `LANGSMITH_PROJECT` | No | `"default"` | Project name for traces |
| `LANGSMITH_ENDPOINT` | No | `https://api.smith.langchain.com` | API endpoint URL |
| `LANGSMITH_TRACING_SAMPLING_RATE` | No | `1.0` | Fraction of traces to capture |
@traceable Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `run_type` | `str` | `"chain"` | Run type: `"chain"`, `"llm"`, `"retriever"`, `"tool"` |
| `name` | `str` | Function name | Display name in trace viewer |
| `project_name` | `str` | env default | Override project for this function |
| `tags` | `list[str]` | `None` | Tags for filtering and organization |
| `metadata` | `dict` | `None` | Arbitrary metadata attached to the run |
| `process_inputs` | callable | `None` | Transform inputs before logging |
| `process_outputs` | callable | `None` | Transform outputs before logging |
Best Practices
- Enable tracing in all environments -- Use `LANGSMITH_TRACING=true` in development for full visibility, and sampling (10-25%) in production to balance cost and observability.
- Use `@traceable` on every pipeline stage -- Decorate retrieval, LLM calls, post-processing, and tool functions individually so traces show the full hierarchy.
- Build evaluation datasets from production -- Curate high-quality examples from real user interactions marked with positive feedback, rather than only synthetic data.
- Run evaluations before deployment -- Set up automated evaluation in CI/CD pipelines to catch regressions before they reach production.
- Attach human feedback to runs -- Use `create_feedback` to capture thumbs-up/down signals from users and annotators, building a labeled dataset over time.
- Redact sensitive data with `process_inputs` -- Always sanitize PII, credentials, and proprietary content before it reaches the tracing backend.
- Use projects to separate concerns -- Create distinct projects for production, staging, experiments, and evaluation to keep traces organized and filterable.
- Compare experiments systematically -- Use `experiment_prefix` in evaluations to create named experiments that can be compared side-by-side in the dashboard.
- Tag traces for segmentation -- Add tags like `["premium-user", "search-query"]` to enable filtering and analysis of specific request types.
- Monitor token usage and cost -- Track `total_tokens` and latency in production traces to detect cost spikes and performance degradation early.
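The CI/CD evaluation gate mentioned in the practices above can be as simple as comparing aggregate scores against a stored baseline with a tolerance. A hedged sketch with hypothetical metric names and thresholds:

```python
def passes_regression_gate(
    scores: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.05,
) -> bool:
    """Fail the build if any metric drops more than `tolerance` below its baseline."""
    return all(
        scores.get(key, 0.0) >= base - tolerance
        for key, base in baseline.items()
    )

baseline = {"relevance": 0.82, "length_check": 0.90}
print(passes_regression_gate({"relevance": 0.80, "length_check": 0.91}, baseline))  # True: within tolerance
print(passes_regression_gate({"relevance": 0.70, "length_check": 0.91}, baseline))  # False: relevance regressed
```

In practice the `scores` dict would come from the aggregate output of an `evaluate` run, and a `False` result would fail the pipeline step.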
Troubleshooting
Traces not appearing in dashboard:
Verify LANGSMITH_TRACING=true is set and LANGSMITH_API_KEY is valid. Check network connectivity to api.smith.langchain.com. Traces are batched and may take 5-10 seconds to appear.
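A quick sanity check of the tracing configuration can be expressed as a small helper; this is a sketch that only inspects the environment mapping passed in, so pass `os.environ` in real use:

```python
def tracing_config_problems(env: dict) -> list:
    """Return human-readable problems with a LangSmith tracing configuration."""
    problems = []
    if env.get("LANGSMITH_TRACING", "").lower() != "true":
        problems.append("LANGSMITH_TRACING is not set to 'true'")
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is missing or empty")
    return problems

# Shown with an explicit dict; in a real script use tracing_config_problems(os.environ)
print(tracing_config_problems({"LANGSMITH_TRACING": "true", "LANGSMITH_API_KEY": "lsv2-..."}))  # []
```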
Missing child runs in trace hierarchy:
Ensure child functions are also decorated with @traceable. Functions called within a traced parent are only captured if they are also traceable.
High tracing overhead in production:
Set LANGSMITH_TRACING_SAMPLING_RATE to 0.1 (10%) to reduce volume. The sampling is applied per-trace, so sampled traces are still complete.
Evaluation results differ between runs: If using LLM-based evaluators, set a fixed temperature (0.0) for deterministic evaluation. Custom evaluators should be deterministic by design.
Dataset creation fails with large examples: LangSmith limits individual example sizes. For large documents, store them externally and reference by URL or ID in the dataset inputs.
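For the large-document workaround, one pattern is to keep the document body in your own store and put only a stable identifier in the dataset example. A hedged sketch (the in-memory `doc_store` is a stand-in for S3, a database, or any external storage):

```python
import hashlib

doc_store = {}  # stand-in for external storage keyed by document ID

def store_document(text: str) -> str:
    """Store a large document externally and return a stable ID for dataset inputs."""
    doc_id = hashlib.sha256(text.encode()).hexdigest()[:16]
    doc_store[doc_id] = text
    return doc_id

# The dataset example references the document by ID instead of embedding it
doc_id = store_document("...very large source document...")
example_inputs = {"question": "Summarize the report.", "doc_id": doc_id}
print(example_inputs)
```

Your target function would then resolve `doc_id` back to the full text at evaluation time, keeping each LangSmith example small.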