
LangChain - Build LLM Applications with Agents and RAG

Overview

LangChain is the most widely adopted framework for building applications powered by large language models. With 119,000+ GitHub stars and 500+ integrations, it provides the connective tissue between LLMs, tools, data sources, and memory systems. At its core, LangChain abstracts the common patterns of LLM application development -- prompt templating, chain composition, agent reasoning, retrieval-augmented generation, and conversational memory -- into a composable, provider-agnostic API.

LangChain matters because it eliminates the boilerplate of LLM application development. Swapping from OpenAI to Anthropic to a local Ollama model requires changing one line. Building a ReAct agent with tool calling takes fewer than 10 lines. Adding RAG over your documents is a well-paved path with dozens of vector store integrations. And when you need observability, LangSmith gives you full traces of every chain execution, tool call, and token.

The ecosystem splits into three packages: langchain-core (interfaces and base abstractions), langchain (chains, agents, and retrieval logic), and langchain-community (500+ third-party integrations). Provider-specific packages like langchain-openai and langchain-anthropic provide optimized model bindings.
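As a quick orientation to that layout (import paths can shift between releases, so treat this as illustrative):

# langchain-core: interfaces and base abstractions
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# langchain: chains, agents, and retrieval logic
from langchain.agents import AgentExecutor, create_tool_calling_agent

# langchain-community: third-party integrations
from langchain_community.document_loaders import WebBaseLoader

# Provider packages: optimized model bindings
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI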

When to Use

  • Building agents that reason about which tools to call (ReAct pattern, function calling)
  • Implementing RAG pipelines over documents, web pages, code, or databases
  • Prototyping LLM applications quickly with swappable model providers
  • Creating chatbots with conversation memory (buffer, summary, or window-based)
  • Getting structured output from LLMs (Pydantic models, JSON schemas)
  • Streaming agent execution steps to a frontend in real time
  • Running production deployments that need LangSmith tracing for debugging and monitoring
  • Combining multiple data sources (PDFs, CSVs, APIs, databases) into a single queryable interface

Quick Start

Installation

# Core + Anthropic (recommended for Claude Code users)
pip install -U langchain langchain-anthropic

# Or with OpenAI
pip install -U langchain langchain-openai

# Common extras
pip install langchain-community   # 500+ integrations
pip install langchain-chroma      # Chroma vector store
pip install langchain-pinecone    # Pinecone vector store

# Set API keys
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

Hello World -- LLM Call

from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929")
response = llm.invoke("What are the three laws of thermodynamics? Be concise.")
print(response.content)

Hello World -- Agent with Tools

from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_anthropic import ChatAnthropic
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate

@tool
def get_stock_price(ticker: str) -> str:
    """Get the current stock price for a given ticker symbol."""
    # In production, call a real API
    prices = {"AAPL": 198.50, "GOOGL": 175.30, "MSFT": 425.80}
    price = prices.get(ticker.upper())
    if price is not None:
        return f"{ticker.upper()}: ${price}"
    return f"Ticker {ticker} not found."

@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression. Input: valid Python math expression."""
    # Note: eval() is unsafe on untrusted input; restrict or sandbox it in production
    try:
        return str(eval(expression))
    except Exception as e:
        return f"Error: {e}"

# Build the agent
llm = ChatAnthropic(model="claude-sonnet-4-5-20250929")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a financial analyst. Use tools to answer questions accurately."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, [get_stock_price, calculate], prompt)
executor = AgentExecutor(agent=agent, tools=[get_stock_price, calculate], verbose=True)

result = executor.invoke({"input": "What is AAPL's stock price times 100 shares?"})
print(result["output"])

Core Concepts

1. Models -- Provider Abstraction

LangChain provides a unified interface across LLM providers. Swap models by changing one line:

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI

# All share the same interface
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0.7)
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.7)

# Streaming (works with any provider)
for chunk in llm.stream("Explain recursion in 3 sentences"):
    print(chunk.content, end="", flush=True)

# Batch processing
responses = llm.batch([
    "Summarize quantum computing",
    "Explain neural networks",
    "Describe blockchain",
])

2. Prompt Templates

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Simple template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert {role}. Respond in {language}."),
    ("human", "{question}"),
])

# With conversation history (a separate name, so it doesn't shadow `prompt` below)
agent_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful coding assistant."),
    MessagesPlaceholder("history"),           # Injected conversation history
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),  # For agent reasoning
])

# Invoke with variables
chain = prompt | llm
result = chain.invoke({
    "role": "physicist",
    "language": "English",
    "question": "What is dark matter?",
})

3. Chains -- Composable Pipelines

LangChain Expression Language (LCEL) lets you compose chains with the | operator:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Chain: prompt -> LLM -> parse output
summarize_chain = (
    ChatPromptTemplate.from_template("Summarize this in 3 bullet points:\n{text}")
    | llm
    | StrOutputParser()
)
result = summarize_chain.invoke({"text": "Long article text here..."})
print(result)  # Plain string output

# Chain composition -- pipe one chain into another
translate_chain = (
    ChatPromptTemplate.from_template("Translate to {language}:\n{text}")
    | llm
    | StrOutputParser()
)

# Summarize, then translate
full_chain = (
    summarize_chain
    | (lambda summary: {"text": summary, "language": "Spanish"})
    | translate_chain
)

4. Agents -- Autonomous Tool-Using Reasoning

Agents use the ReAct (Reasoning + Acting) pattern: the LLM decides which tool to call, observes the result, and continues reasoning until it can answer the question.

from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate

@tool
def search_documentation(query: str) -> str:
    """Search project documentation for relevant information."""
    # In production: vector store lookup
    return f"Documentation results for: {query}"

@tool
def run_sql_query(query: str) -> str:
    """Execute a read-only SQL query against the analytics database."""
    # In production: actual database connection
    return "Query returned 42 rows. Top result: revenue=$1.2M"

@tool
def send_slack_message(channel: str, message: str) -> str:
    """Send a message to a Slack channel."""
    return f"Message sent to #{channel}"

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a data analyst assistant. Use tools to find answers and communicate results."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

tools = [search_documentation, run_sql_query, send_slack_message]
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=10)

# The agent will: 1) query the DB, 2) analyze results, 3) send to Slack
result = executor.invoke({
    "input": "What was last quarter's revenue? Post the result to #analytics"
})

5. Memory -- Conversation State

from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationChain

# Keep the last 10 exchanges (20 messages)
memory = ConversationBufferWindowMemory(k=10)
conversation = ConversationChain(llm=llm, memory=memory, verbose=True)

conversation.predict(input="My name is Alice and I work on the backend team.")
conversation.predict(input="What team do I work on?")
# Output: "You mentioned you work on the backend team."

# Other memory types:
from langchain.memory import (
    ConversationBufferMemory,         # Keep everything (unbounded)
    ConversationSummaryMemory,        # Summarize older messages
    ConversationSummaryBufferMemory,  # Hybrid: recent messages + summary
    ConversationTokenBufferMemory,    # Bounded by token count
)

RAG (Retrieval-Augmented Generation)

Complete RAG Pipeline

from langchain_community.document_loaders import (
    WebBaseLoader, PyPDFLoader, DirectoryLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA

# Step 1: Load documents from multiple sources
web_docs = WebBaseLoader([
    "https://docs.python.org/3/tutorial/classes.html",
    "https://docs.python.org/3/tutorial/errors.html",
]).load()
pdf_docs = PyPDFLoader("technical_spec.pdf").load()
all_docs = web_docs + pdf_docs

# Step 2: Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(all_docs)
print(f"Split into {len(chunks)} chunks")

# Step 3: Embed and store in a vector database
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db",
)

# Step 4: Create a retriever
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance (diverse results)
    search_kwargs={"k": 4, "fetch_k": 10},
)

# Step 5: Build the QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Or "map_reduce", "refine", "map_rerank"
    retriever=retriever,
    return_source_documents=True,
)

# Step 6: Query
result = qa_chain.invoke({"query": "How do Python classes handle inheritance?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"  Source: {doc.metadata.get('source', 'unknown')}")

Conversational RAG (chatbot over documents)

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer",
)
chat_qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True,
)

# Multi-turn conversation
chat_qa.invoke({"question": "What is Python's GIL?"})
chat_qa.invoke({"question": "How does it affect multithreading?"})  # Remembers context
chat_qa.invoke({"question": "What are alternatives?"})              # Builds on previous answers

Advanced Patterns

Structured Output

from pydantic import BaseModel, Field
from typing import List

class CodeReview(BaseModel):
    file_path: str = Field(description="Path to the reviewed file")
    issues: List[str] = Field(description="List of identified issues")
    severity: str = Field(description="Overall severity: critical, high, medium, low")
    suggestions: List[str] = Field(description="Improvement suggestions")

# LLM returns structured data
structured_llm = llm.with_structured_output(CodeReview)
review = structured_llm.invoke(
    "Review this code: def add(a,b): return a+b  # no type hints, no docstring"
)
print(review.file_path)  # str
print(review.issues)     # List[str]
print(review.severity)   # str

Parallel Tool Execution

# Modern LLMs can call multiple tools simultaneously;
# LangChain's tool-calling agents handle this automatically.
from langchain.tools import tool

@tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Sunny, 72F in {city}"

@tool
def get_population(city: str) -> str:
    """Get population of a city."""
    return f"{city} population: 2.1 million"

# When asked "Compare Paris and London", the agent can call
# get_weather("Paris"), get_weather("London"),
# get_population("Paris"), get_population("London")
# in parallel if the LLM supports it.

Streaming Agent Steps

# Stream individual steps as the agent reasons
async for event in executor.astream_events(
    {"input": "Research and summarize recent AI news"},
    version="v2",
):
    if event["event"] == "on_tool_start":
        print(f"Calling tool: {event['name']}")
    elif event["event"] == "on_tool_end":
        print(f"Tool result: {event['data']['output'][:100]}...")
    elif event["event"] == "on_llm_stream":
        print(event["data"]["chunk"].content, end="")

LangSmith Observability

import os

# Enable tracing (all chains and agents are traced automatically)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__your_api_key"
os.environ["LANGCHAIN_PROJECT"] = "production-chatbot"

# Every invoke() is now traced
result = executor.invoke({"input": "What is the quarterly revenue?"})

# View traces at https://smith.langchain.com
# See: latency per step, token usage, tool calls, full prompt/response

Vector Store Integrations

Vector Store | Type | Install | Best For
Chroma | Local/embedded | pip install langchain-chroma | Development, small datasets
FAISS | Local/in-memory | pip install langchain-community[faiss] | Fast similarity search
Pinecone | Cloud (managed) | pip install langchain-pinecone | Production, scalability
Weaviate | Self-hosted/cloud | pip install langchain-weaviate | Hybrid search
Qdrant | Self-hosted/cloud | pip install langchain-qdrant | Filtering + vector search
pgvector | PostgreSQL extension | pip install langchain-postgres | Existing Postgres infra
# Chroma (local development)
from langchain_chroma import Chroma
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./db")

# FAISS (fast, in-memory)
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local("faiss_index")
# Newer versions may require allow_dangerous_deserialization=True here
vectorstore = FAISS.load_local("faiss_index", embeddings)

# Pinecone (production)
from langchain_pinecone import PineconeVectorStore
vectorstore = PineconeVectorStore.from_documents(docs, embeddings, index_name="prod")

Document Loaders and Text Splitters

# Loaders for every source type
from langchain_community.document_loaders import (
    WebBaseLoader,               # Web pages
    PyPDFLoader,                 # PDF files
    CSVLoader,                   # CSV files
    TextLoader,                  # Plain text
    JSONLoader,                  # JSON files
    GitLoader,                   # Git repositories
    NotionDirectoryLoader,       # Notion exports
    UnstructuredMarkdownLoader,  # Markdown files
)

# Text splitters for different content types
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,  # General text (recommended default)
    PythonCodeTextSplitter,          # Python source code
    MarkdownTextSplitter,            # Markdown documents
    HTMLSectionSplitter,             # HTML documents
)

# Semantic chunking (split by meaning, not character count)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
splitter = SemanticChunker(OpenAIEmbeddings())

Configuration Reference

Environment Variables

Variable | Purpose
OPENAI_API_KEY | OpenAI models and embeddings
ANTHROPIC_API_KEY | Anthropic Claude models
GOOGLE_API_KEY | Google Gemini models
LANGCHAIN_TRACING_V2 | Enable LangSmith tracing (set to "true")
LANGCHAIN_API_KEY | LangSmith API key
LANGCHAIN_PROJECT | LangSmith project name
LANGCHAIN_ENDPOINT | LangSmith API endpoint

Performance Benchmarks

Operation | Typical Latency | Notes
Simple LLM call | 1-3s | Varies by provider and model
Agent with 1 tool call | 3-5s | ReAct reasoning overhead
Agent with 3 parallel tools | 4-7s | Parallel execution helps
RAG retrieval + answer | 1-3s | Vector search (~200ms) + LLM call
Embed 1,000 documents | 10-30s | Batch embedding recommended
Streaming, first token | 200-500ms | Much better perceived latency

Best Practices

  1. Start with create_tool_calling_agent. It is the simplest, most reliable agent factory in LangChain. Use it unless you have a specific reason to use a different agent type.

  2. Enable streaming for all user-facing applications. Streaming reduces perceived latency from seconds to milliseconds for the first token. Use astream_events for fine-grained control.

  3. Use LCEL (| operator) over legacy chains. LCEL is the modern way to compose LangChain components. Legacy classes like LLMChain still work but are not actively developed.

  4. Optimize RAG chunk size. Start with 800-1000 characters and 150-200 character overlap. Too small loses context; too large wastes tokens. Test with your actual queries.

  5. Enable LangSmith in every environment. Even in development, traces are invaluable for debugging. In production, they give you latency, cost, and quality metrics.

  6. Handle tool errors gracefully. Tools fail. Wrap tool functions in try/except and return descriptive error strings instead of raising exceptions, so the agent can reason about the error (see the first sketch after this list).

  7. Use MMR retrieval over simple similarity. Maximum Marginal Relevance (search_type="mmr") returns diverse results instead of 4 near-duplicate chunks.

  8. Cache embeddings aggressively. Embedding the same documents repeatedly is expensive. Persist your vector store to disk and reload on startup.

  9. Set max_iterations on agents. Without a limit, a confused agent can loop indefinitely. Set it to 10-15 for most use cases.

  10. Version your prompts. Store prompt templates in files or a prompt registry (LangSmith Hub). Track changes over time to correlate prompt edits with quality changes (see the second sketch after this list).
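A minimal sketch of practice 6, assuming a hypothetical rate API (the endpoint and tool name are illustrative, not part of LangChain):

import requests
from langchain.tools import tool

@tool
def fetch_exchange_rate(pair: str) -> str:
    """Get the current exchange rate for a currency pair like 'USD/EUR'."""
    try:
        # Hypothetical endpoint -- substitute your real rate API
        resp = requests.get(f"https://api.example.com/rates/{pair}", timeout=5)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as e:
        # Return the failure as text so the agent can reason about it
        # (retry, pick another tool, or tell the user) instead of crashing
        return f"Error fetching rate for {pair}: {e}"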
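And a minimal sketch of practice 10, loading a template from a version-controlled file (the file path is illustrative):

from pathlib import Path
from langchain_core.prompts import ChatPromptTemplate

# prompts/summarize_v2.txt lives in git, so prompt edits show up in diffs
template_text = Path("prompts/summarize_v2.txt").read_text()
prompt = ChatPromptTemplate.from_template(template_text)
chain = prompt | llm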

Troubleshooting

Agent keeps calling the same tool in a loop:

# Set max_iterations to break the loop
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=10,           # Force stop after 10 iterations
    handle_parsing_errors=True,  # Gracefully handle malformed tool calls
)

RAG returns irrelevant chunks:

# Increase fetch_k and use MMR for diversity
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5},
)
# Also try: smaller chunk_size, more chunk_overlap, a better splitter

"Could not parse LLM output" errors:

# Enable error handling in the executor
executor = AgentExecutor(
    agent=agent,
    tools=tools,
    handle_parsing_errors=True,  # Converts parsing errors to agent observations
)

Slow embedding on large document sets:

# Use batch embedding with concurrency
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    chunk_size=1000,  # Batch size for API calls
)

# Also: persist the vectorstore to avoid re-embedding
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./db")

Memory growing too large:

# Use windowed memory instead of an unbounded buffer
from langchain.memory import ConversationBufferWindowMemory
memory = ConversationBufferWindowMemory(k=5)  # Keep last 5 exchanges

# Or token-limited memory
from langchain.memory import ConversationTokenBufferMemory
memory = ConversationTokenBufferMemory(llm=llm, max_token_limit=2000)

LangChain vs LangGraph

Dimension | LangChain | LangGraph
Abstraction level | High (opinionated defaults) | Low (explicit control)
Lines to start | <10 | ~30-50
Agent pattern | ReAct via AgentExecutor | Custom state machines
Cyclic workflows | Not supported | Native
Human-in-the-loop | Basic (tool approval) | Advanced (checkpoints, branching)
Multi-agent | Limited | Native sub-graphs
State management | Memory classes | Typed state objects
When to use | Quick prototyping, standard RAG | Complex production workflows

Use LangChain when you want to move fast. Use LangGraph when you need fine-grained control over a stateful, potentially cyclic workflow.
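For a concrete feel of that trade-off, here is a minimal LangGraph sketch of a cyclic workflow -- the kind of loop AgentExecutor cannot express. It assumes langgraph is installed; the node logic is illustrative and API details may vary by version:

from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    draft: str
    revisions: int

def write(state: State) -> State:
    # Illustrative node: revise the draft and count the pass
    return {"draft": state["draft"] + " ...revised", "revisions": state["revisions"] + 1}

def should_continue(state: State) -> str:
    # Cycle back to "write" until three revisions have been made
    return "write" if state["revisions"] < 3 else END

graph = StateGraph(State)
graph.add_node("write", write)
graph.set_entry_point("write")
graph.add_conditional_edges("write", should_continue)
app = graph.compile()
print(app.invoke({"draft": "Intro", "revisions": 0}))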
