LangSmith -- LLM Observability and Evaluation Platform
Overview
A comprehensive skill for debugging, evaluating, and monitoring language model applications using LangSmith. LangSmith provides end-to-end observability for LLM-powered systems -- capturing detailed traces of every LLM call, retrieval operation, and tool invocation across your application. It enables systematic evaluation against curated datasets, production monitoring with cost and latency tracking, and collaborative prompt engineering through the Prompt Hub. LangSmith integrates natively with LangChain but also works as a standalone tracing platform for OpenAI, Anthropic, and any custom LLM pipeline through the @traceable decorator and client wrappers.
When to Use
- Debugging LLM application issues by inspecting full execution traces
- Evaluating model outputs systematically against labeled datasets
- Monitoring production LLM systems for latency, errors, token usage, and cost
- Building regression test suites for AI features before deployment
- Collaborating on prompt engineering with version-controlled prompts
- Comparing model performance across experiments and configurations
- Tracing complex agent workflows with tool calls, retrieval, and multi-step reasoning
Quick Start
```shell
# Install
pip install langsmith

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
```
```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def answer_question(question: str) -> str:
    """Every call to this function is automatically traced to LangSmith."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Call the function -- trace appears in LangSmith dashboard
result = answer_question("What is retrieval-augmented generation?")
```
Core Concepts
Runs and Traces
A run is a single unit of execution -- an LLM call, a retriever query, or a tool invocation. Runs are organized into hierarchical traces that show the full execution flow of a request.
```python
from langsmith import traceable

@traceable(run_type="chain")
def process_query(query: str) -> str:
    """Parent run: the full pipeline."""
    context = retrieve_context(query)         # Child run 1
    answer = generate_answer(query, context)  # Child run 2
    return answer

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list[str]:
    """Traced as a retriever run."""
    return vector_store.similarity_search(query, k=5)

@traceable(run_type="llm")
def generate_answer(query: str, context: list[str]) -> str:
    """Traced as an LLM run with inputs/outputs captured."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```
Automatic Tracing with Client Wrappers
```python
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap the OpenAI client for automatic tracing
client = wrap_openai(OpenAI())

# All calls are traced without decorators
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
Projects and Organization
```python
import os

from langsmith import traceable, tracing_context

# Set project globally
os.environ["LANGSMITH_PROJECT"] = "production-chatbot"

# Or per-function
@traceable(project_name="experiment-v2")
def experimental_pipeline(query: str) -> str:
    ...

# Or with a context manager
with tracing_context(
    project_name="a-b-test",
    tags=["variant-a", "gpt-4o"],
    metadata={"version": "2.1", "team": "search"},
):
    result = process_query("How does RAG work?")
```
Datasets and Evaluation
Creating Datasets
```python
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    "qa-evaluation-set",
    description="Golden QA pairs for regression testing",
)

# Add labeled examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "Explain gradient descent."},
        {"question": "What is a transformer?"},
    ],
    outputs=[
        {"answer": "A high-level, general-purpose programming language."},
        {"answer": "An optimization algorithm that iteratively adjusts parameters."},
        {"answer": "A neural network architecture based on self-attention."},
    ],
    dataset_id=dataset.id,
)

# Create examples from production traces
runs = client.list_runs(project_name="production-chatbot", limit=50)
for run in runs:
    if run.feedback_stats and run.feedback_stats.get("correctness", {}).get("avg", 0) > 0.8:
        client.create_example(
            inputs=run.inputs,
            outputs=run.outputs,
            dataset_id=dataset.id,
        )
```
Running Evaluations
```python
from langsmith import evaluate
from openai import OpenAI

client = OpenAI()

# Define your target function
def my_chatbot(inputs: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": inputs["question"]}],
    )
    return {"answer": response.choices[0].message.content}

# Define custom evaluators
def answer_relevance(run, example) -> dict:
    """Check if the answer addresses the question."""
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    # Simple keyword overlap score
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    overlap = len(pred_words & ref_words) / max(len(ref_words), 1)
    return {"key": "relevance", "score": min(overlap * 2, 1.0)}

def answer_length(run, example) -> dict:
    """Check answer is not too short or too long."""
    length = len(run.outputs["answer"])
    score = 1.0 if 50 < length < 500 else 0.5
    return {"key": "length_check", "score": score}

# Run evaluation
results = evaluate(
    my_chatbot,
    data="qa-evaluation-set",
    evaluators=[answer_relevance, answer_length],
    experiment_prefix="gpt-4o-baseline",
    max_concurrency=4,
)

# Summarize per-evaluator scores (requires pandas)
df = results.to_pandas()
print(df[["feedback.relevance", "feedback.length_check"]].mean())
```
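The keyword-overlap heuristic inside `answer_relevance` can be exercised on its own, without LangSmith. A minimal sketch (the doubling of the overlap ratio is the same arbitrary scaling used above, not a LangSmith convention):

```python
def keyword_overlap_score(prediction: str, reference: str) -> float:
    """Fraction of reference words present in the prediction, doubled and capped at 1.0."""
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    overlap = len(pred_words & ref_words) / max(len(ref_words), 1)
    return min(overlap * 2, 1.0)

# Covering half the reference vocabulary already saturates the score at 1.0
score = keyword_overlap_score(
    "a neural network architecture",
    "a neural network architecture based on self-attention",
)
print(score)  # 1.0
```

Because the score saturates quickly, it is best treated as a cheap smoke test; the LLM-as-judge evaluators below are better for nuanced grading.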
LLM-as-Judge Evaluation
```python
from langsmith import evaluate
from langsmith.evaluation import LangChainStringEvaluator

# Use LLM-based evaluators for nuanced assessment
results = evaluate(
    my_chatbot,
    data="qa-evaluation-set",
    evaluators=[
        LangChainStringEvaluator("qa"),      # QA correctness
        LangChainStringEvaluator("cot_qa"),  # Chain-of-thought QA
        LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),  # Overall helpfulness
    ],
    experiment_prefix="gpt-4o-llm-judge",
)
```
Client API
```python
from langsmith import Client

client = Client()

# List recent runs with filters
runs = list(client.list_runs(
    project_name="production-chatbot",
    filter='and(eq(status, "success"), gt(latency, "5s"))',
    limit=100,
))

# Attach human feedback to a run
client.create_feedback(
    run_id=runs[0].id,
    key="correctness",
    score=0.9,
    comment="Accurate answer but missing one detail.",
)

# Read run details
run = client.read_run(run_id="run-uuid-here")
latency = (run.end_time - run.start_time).total_seconds()
print(f"Latency: {latency:.2f}s, Tokens: {run.total_tokens}")
```
Advanced Tracing
Sanitizing Sensitive Data
```python
from langsmith import traceable

def redact_pii(inputs: dict) -> dict:
    """Remove sensitive fields before logging."""
    sanitized = inputs.copy()
    for key in ["password", "ssn", "credit_card", "api_key"]:
        if key in sanitized:
            sanitized[key] = "[REDACTED]"
    return sanitized

@traceable(process_inputs=redact_pii)
def authenticate_user(username: str, password: str) -> bool:
    # password is redacted in LangSmith traces
    return check_credentials(username, password)
```
Manual Run Management
```python
from langsmith import trace

with trace(
    name="custom_retrieval",
    run_type="retriever",
    inputs={"query": "machine learning basics"},
) as run:
    results = perform_search("machine learning basics")
    run.end(outputs={"documents": results, "count": len(results)})
```
Production Sampling
```python
import os

# Trace only 10% of production requests to reduce cost
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"
```
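The sampling decision is made once per trace, so a sampled trace keeps all of its child runs intact. A minimal sketch of that semantics (illustrative only, not LangSmith internals):

```python
import random

def keep_trace(sampling_rate: float, rng: random.Random) -> bool:
    """Decide once at the trace root; every child run inherits the decision."""
    return rng.random() < sampling_rate

# At a 0.1 rate, roughly 10% of traces are kept end-to-end
rng = random.Random(42)
kept = sum(keep_trace(0.1, rng) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")
```

This is why sampled production traces remain complete in the dashboard: the rate controls how many traces are captured, not how much of each trace.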
LangChain Integration
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# With LANGSMITH_TRACING=true, all LangChain runs are automatically traced
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful coding assistant."),
    ("user", "{question}"),
])
chain = prompt | llm | StrOutputParser()

# Full chain trace visible in LangSmith: prompt -> LLM -> parser
response = chain.invoke({"question": "How do I read a CSV in Python?"})
```
Configuration Reference
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `LANGSMITH_API_KEY` | Yes | -- | Your LangSmith API key |
| `LANGSMITH_TRACING` | Yes | `false` | Enable tracing (`true`/`false`) |
| `LANGSMITH_PROJECT` | No | `"default"` | Project name for traces |
| `LANGSMITH_ENDPOINT` | No | `https://api.smith.langchain.com` | API endpoint URL |
| `LANGSMITH_TRACING_SAMPLING_RATE` | No | `1.0` | Fraction of traces to capture |
@traceable Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `run_type` | `str` | `"chain"` | Run type: `"chain"`, `"llm"`, `"retriever"`, `"tool"` |
| `name` | `str` | Function name | Display name in trace viewer |
| `project_name` | `str` | env default | Override project for this function |
| `tags` | `list[str]` | `None` | Tags for filtering and organization |
| `metadata` | `dict` | `None` | Arbitrary metadata attached to the run |
| `process_inputs` | callable | `None` | Transform inputs before logging |
| `process_outputs` | callable | `None` | Transform outputs before logging |
Best Practices
- Enable tracing in all environments -- Use `LANGSMITH_TRACING=true` in development for full visibility, and sampling (10-25%) in production to balance cost and observability.
- Use `@traceable` on every pipeline stage -- Decorate retrieval, LLM calls, post-processing, and tool functions individually so traces show the full hierarchy.
- Build evaluation datasets from production -- Curate high-quality examples from real user interactions marked with positive feedback, rather than only synthetic data.
- Run evaluations before deployment -- Set up automated evaluation in CI/CD pipelines to catch regressions before they reach production.
- Attach human feedback to runs -- Use `create_feedback` to capture thumbs-up/down signals from users and annotators, building a labeled dataset over time.
- Redact sensitive data with `process_inputs` -- Always sanitize PII, credentials, and proprietary content before it reaches the tracing backend.
- Use projects to separate concerns -- Create distinct projects for production, staging, experiments, and evaluation to keep traces organized and filterable.
- Compare experiments systematically -- Use `experiment_prefix` in evaluations to create named experiments that can be compared side-by-side in the dashboard.
- Tag traces for segmentation -- Add tags like `["premium-user", "search-query"]` to enable filtering and analysis of specific request types.
- Monitor token usage and cost -- Track `total_tokens` and latency in production traces to detect cost spikes and performance degradation early.
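The CI/CD evaluation gate mentioned in the practices above can be as simple as comparing aggregate scores against a stored baseline with a tolerance. A hedged sketch with hypothetical metric names and thresholds:

```python
def passes_regression_gate(
    scores: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.05,
) -> bool:
    """Fail the build if any metric drops more than `tolerance` below its baseline."""
    return all(
        scores.get(key, 0.0) >= base - tolerance
        for key, base in baseline.items()
    )

baseline = {"relevance": 0.82, "length_check": 0.90}
print(passes_regression_gate({"relevance": 0.80, "length_check": 0.91}, baseline))  # True: within tolerance
print(passes_regression_gate({"relevance": 0.70, "length_check": 0.91}, baseline))  # False: relevance regressed
```

In practice the `scores` dict would come from the aggregate output of an `evaluate` run, and a `False` result would fail the pipeline step.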
Troubleshooting
Traces not appearing in dashboard:
Verify LANGSMITH_TRACING=true is set and LANGSMITH_API_KEY is valid. Check network connectivity to api.smith.langchain.com. Traces are batched and may take 5-10 seconds to appear.
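A quick sanity check of the tracing configuration can be expressed as a small helper; this is a sketch that only inspects the environment mapping passed in, so pass `os.environ` in real use:

```python
def tracing_config_problems(env: dict) -> list:
    """Return human-readable problems with a LangSmith tracing configuration."""
    problems = []
    if env.get("LANGSMITH_TRACING", "").lower() != "true":
        problems.append("LANGSMITH_TRACING is not set to 'true'")
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is missing or empty")
    return problems

# Shown with an explicit dict; in a real script use tracing_config_problems(os.environ)
print(tracing_config_problems({"LANGSMITH_TRACING": "true", "LANGSMITH_API_KEY": "lsv2-..."}))  # []
```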
Missing child runs in trace hierarchy:
Ensure child functions are also decorated with @traceable. Functions called within a traced parent are only captured if they are also traceable.
High tracing overhead in production:
Set LANGSMITH_TRACING_SAMPLING_RATE to 0.1 (10%) to reduce volume. The sampling is applied per-trace, so sampled traces are still complete.
Evaluation results differ between runs: If using LLM-based evaluators, set a fixed temperature (0.0) for deterministic evaluation. Custom evaluators should be deterministic by design.
Dataset creation fails with large examples: LangSmith limits individual example sizes. For large documents, store them externally and reference by URL or ID in the dataset inputs.
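For the large-document workaround, one pattern is to keep the document body in your own store and put only a stable identifier in the dataset example. A hedged sketch (the in-memory `doc_store` is a stand-in for S3, a database, or any external storage):

```python
import hashlib

doc_store = {}  # stand-in for external storage keyed by document ID

def store_document(text: str) -> str:
    """Store a large document externally and return a stable ID for dataset inputs."""
    doc_id = hashlib.sha256(text.encode()).hexdigest()[:16]
    doc_store[doc_id] = text
    return doc_id

# The dataset example references the document by ID instead of embedding it
doc_id = store_document("...very large source document...")
example_inputs = {"question": "Summarize the report.", "doc_id": doc_id}
print(example_inputs)
```

Your target function would then resolve `doc_id` back to the full text at evaluation time, keeping each LangSmith example small.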