
LlamaIndex Agents Kit

A comprehensive skill for building data-driven LLM applications with the LlamaIndex framework. Includes structured workflows, validation checks, and reusable patterns for AI research.


LlamaIndex - Data Framework for LLM Applications

Overview

LlamaIndex is the leading framework for connecting large language models with your data. While other frameworks focus on general agent orchestration or chain composition, LlamaIndex is purpose-built for one thing: making your data queryable by LLMs. It provides a complete pipeline from data ingestion (300+ connectors on LlamaHub) through indexing, retrieval, and response synthesis, with first-class support for RAG (Retrieval-Augmented Generation) patterns.

LlamaIndex matters because RAG is the most practical way to make LLMs useful over private, domain-specific data without fine-tuning. The framework handles the hard parts: intelligent document chunking, embedding management, vector store abstraction, retrieval strategies (similarity, keyword, hybrid), response synthesis modes (compact, tree_summarize, refine), and evaluation metrics to ensure your system actually works. You can go from a folder of documents to a working Q&A system in 5 lines of code, then progressively customize every layer as your requirements grow.

The framework is organized as a modular package ecosystem: llama-index-core provides the base abstractions, and specific integrations (LLMs, embeddings, vector stores, data loaders) are installed as separate packages. This keeps your dependency tree lean.

When to Use

  • Building RAG applications that answer questions over private documents
  • Need document Q&A over PDFs, web pages, databases, APIs, or code repositories
  • Ingesting data from many heterogeneous sources (300+ connectors via LlamaHub)
  • Creating knowledge bases that ground LLM responses in factual data
  • Building chatbots that reference enterprise documentation
  • Need structured data extraction from unstructured documents
  • Evaluating RAG quality with built-in relevancy and faithfulness metrics
  • Building multi-modal RAG (images + text + tables)
  • Want the simplest path from "I have documents" to "I can query them"

Quick Start

Installation

```bash
# Full starter package (includes OpenAI integration)
pip install llama-index

# Or minimal install with specific providers
pip install llama-index-core
pip install llama-index-llms-anthropic        # For Claude
pip install llama-index-llms-openai           # For GPT
pip install llama-index-embeddings-openai     # Embeddings
pip install llama-index-vector-stores-chroma  # Vector store

# Set API keys
export OPENAI_API_KEY="sk-..."
# Or for Anthropic:
export ANTHROPIC_API_KEY="sk-ant-..."
```

5-Line RAG

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# 1. Load all documents from a directory
documents = SimpleDirectoryReader("./data").load_data()

# 2. Build the index (chunks, embeds, and stores)
index = VectorStoreIndex.from_documents(documents)

# 3. Query
response = index.as_query_engine().query("What is the main topic of these documents?")
print(response)
```

Production-Ready RAG (with persistence)

```python
import os

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

if os.path.exists(PERSIST_DIR):
    # Load existing index from disk
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
    print("Loaded existing index.")
else:
    # Build new index and persist
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    print(f"Built index from {len(documents)} documents and saved to {PERSIST_DIR}.")

# Query with configuration
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",
    streaming=True,
)
response = query_engine.query("Summarize the key findings")
for text in response.response_gen:
    print(text, end="", flush=True)
```

Core Concepts

1. Data Connectors (Loaders)

LlamaIndex loads data from virtually any source into a normalized Document format.

```python
from llama_index.core import SimpleDirectoryReader, Document

# Local files (PDF, DOCX, TXT, MD, CSV, images, etc.)
documents = SimpleDirectoryReader(
    "./data",
    recursive=True,                         # Traverse subdirectories
    required_exts=[".pdf", ".md", ".txt"],  # Filter by extension
    filename_as_id=True,                    # Use filename as doc ID
).load_data()

# Web pages
from llama_index.readers.web import SimpleWebPageReader
documents = SimpleWebPageReader(html_to_text=True).load_data([
    "https://docs.python.org/3/tutorial/classes.html",
    "https://docs.python.org/3/tutorial/errors.html",
])

# GitHub repository
from llama_index.readers.github import GithubRepositoryReader
documents = GithubRepositoryReader(
    owner="run-llama",
    repo="llama_index",
    filter_file_extensions=[".py", ".md"],
    verbose=True,
).load_data(branch="main")

# Database
from llama_index.readers.database import DatabaseReader
reader = DatabaseReader(sql_database_uri="postgresql://user:pass@localhost/db")
documents = reader.load_data(query="SELECT title, content FROM articles WHERE published = true")

# Manual document creation
doc = Document(
    text="This is custom content.",
    metadata={"source": "manual", "category": "tutorial", "date": "2025-06-15"},
)
```

2. Indices -- Data Structures for Retrieval

Indices organize your documents for efficient querying. Each index type optimizes for different access patterns.

```python
from llama_index.core import VectorStoreIndex, SummaryIndex, TreeIndex, KeywordTableIndex

# VectorStoreIndex (most common -- semantic similarity search)
vector_index = VectorStoreIndex.from_documents(documents)

# SummaryIndex (formerly ListIndex -- scans all nodes sequentially)
# Good for summarization tasks over entire corpus
summary_index = SummaryIndex.from_documents(documents)

# TreeIndex (hierarchical summarization)
# Good for multi-level summarization
tree_index = TreeIndex.from_documents(documents)

# KeywordTableIndex (keyword-based retrieval)
# Good for precise keyword matching
keyword_index = KeywordTableIndex.from_documents(documents)

# Persist any index
vector_index.storage_context.persist(persist_dir="./vector_storage")
summary_index.storage_context.persist(persist_dir="./summary_storage")

# Load from disk
from llama_index.core import load_index_from_storage, StorageContext
storage = StorageContext.from_defaults(persist_dir="./vector_storage")
loaded_index = load_index_from_storage(storage)
```

3. Query Engines -- Ask Questions

Query engines combine retrieval and response synthesis into a single queryable interface.

```python
# Basic query engine
query_engine = index.as_query_engine()
response = query_engine.query("What are the main conclusions?")

# Configurable query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,              # Retrieve top 5 chunks
    response_mode="tree_summarize",  # Synthesis strategy
    verbose=True,                    # Show retrieval details
)

# Response modes:
# "compact"          - Stuff as many chunks as fit into one LLM call (default)
# "tree_summarize"   - Hierarchically summarize chunks
# "refine"           - Iteratively refine answer with each chunk
# "simple_summarize" - Simple concatenation and summarize
# "no_text"          - Return retrieved nodes without LLM synthesis
# "accumulate"       - Get separate answer per chunk

# Streaming
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Explain the architecture")
for token in response.response_gen:
    print(token, end="", flush=True)

# Access source nodes (for citations)
response = query_engine.query("What is the system design?")
print(response)
for node in response.source_nodes:
    print(f"  Score: {node.score:.3f}")
    print(f"  Source: {node.metadata.get('file_name', 'unknown')}")
    print(f"  Text: {node.text[:100]}...")
```

4. Retrievers -- Fine-Grained Control

When you need more control over what gets retrieved:

```python
# Vector retriever (default)
retriever = index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("machine learning algorithms")
for node in nodes:
    print(f"Score: {node.score:.3f} | {node.text[:80]}...")

# Metadata filtering
from llama_index.core.vector_stores import MetadataFilters, MetadataFilter

filters = MetadataFilters(filters=[
    MetadataFilter(key="category", value="tutorial"),
    MetadataFilter(key="difficulty", value="beginner"),
])
retriever = index.as_retriever(
    similarity_top_k=5,
    filters=filters,
)

# Custom retriever
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore

class HybridRetriever(BaseRetriever):
    """Combines vector search with keyword matching."""

    def __init__(self, vector_retriever, keyword_retriever):
        self.vector_retriever = vector_retriever
        self.keyword_retriever = keyword_retriever
        super().__init__()

    def _retrieve(self, query_bundle):
        vector_nodes = self.vector_retriever.retrieve(query_bundle)
        keyword_nodes = self.keyword_retriever.retrieve(query_bundle)
        # Merge and deduplicate
        seen = set()
        merged = []
        for node in vector_nodes + keyword_nodes:
            if node.node.node_id not in seen:
                seen.add(node.node.node_id)
                merged.append(node)
        return sorted(merged, key=lambda x: x.score or 0, reverse=True)[:5]
```

Agents with Tools

LlamaIndex agents combine RAG with tool calling for complex reasoning tasks.

Basic Agent

```python
from llama_index.core.agent import FunctionAgent
from llama_index.llms.openai import OpenAI
from llama_index.core.tools import FunctionTool

def search_codebase(query: str) -> str:
    """Search the codebase for functions matching the query."""
    # In production: actual code search
    return f"Found 3 functions matching '{query}': parse_config(), validate_config(), load_config()"

def run_tests(test_path: str) -> str:
    """Run tests at the given path and return results."""
    return f"Running tests at {test_path}: 12 passed, 0 failed"

def create_pull_request(title: str, description: str) -> str:
    """Create a GitHub pull request."""
    return f"Created PR: '{title}' - {description}"

# Wrap plain functions as tools
tools = [
    FunctionTool.from_defaults(fn=search_codebase),
    FunctionTool.from_defaults(fn=run_tests),
    FunctionTool.from_defaults(fn=create_pull_request),
]

# Create agent
llm = OpenAI(model="gpt-4o")
agent = FunctionAgent.from_tools(tools, llm=llm, verbose=True)

response = agent.chat(
    "Find all config-related functions, run their tests, "
    "and create a PR summarizing the test results."
)
print(response)
```

RAG Agent (documents + tools)

```python
from llama_index.core.tools import QueryEngineTool, FunctionTool

# Create indices for different document sets
api_docs_index = VectorStoreIndex.from_documents(api_docs)
architecture_index = VectorStoreIndex.from_documents(arch_docs)

# Wrap each index as a tool
api_tool = QueryEngineTool.from_defaults(
    query_engine=api_docs_index.as_query_engine(),
    name="api_documentation",
    description="Search API documentation for endpoint details, request/response formats, and authentication.",
)
arch_tool = QueryEngineTool.from_defaults(
    query_engine=architecture_index.as_query_engine(),
    name="architecture_docs",
    description="Search architecture documentation for system design, data flow, and component relationships.",
)

# Agent can search both document sets + use custom tools
# (plain functions must be wrapped as FunctionTools before being passed in)
agent = FunctionAgent.from_tools(
    [
        api_tool,
        arch_tool,
        FunctionTool.from_defaults(fn=search_codebase),
        FunctionTool.from_defaults(fn=run_tests),
    ],
    llm=llm,
    verbose=True,
    system_prompt=(
        "You are a senior developer assistant. Use the documentation tools "
        "to find information, and the codebase tools to verify implementation details."
    ),
)

response = agent.chat("How does the authentication flow work? Check the API docs and architecture docs.")
```

Advanced RAG Patterns

Chat Engine (multi-turn conversation)

```python
# Condense + Context mode: condenses follow-up questions with chat history,
# then retrieves fresh context for each turn
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    verbose=True,
)

r1 = chat_engine.chat("What is the system architecture?")
print(r1)

r2 = chat_engine.chat("How does the caching layer work?")  # Builds on r1
print(r2)

r3 = chat_engine.chat("What are its failure modes?")  # Refers to caching
print(r3)

# Reset conversation
chat_engine.reset()
```

Structured Output

```python
from pydantic import BaseModel, Field
from typing import List
from llama_index.core.output_parsers import PydanticOutputParser

class DocumentSummary(BaseModel):
    title: str = Field(description="Document title")
    key_topics: List[str] = Field(description="Main topics covered")
    sentiment: str = Field(description="Overall sentiment: positive, negative, or neutral")
    actionable_items: List[str] = Field(description="Action items extracted from the document")

output_parser = PydanticOutputParser(output_cls=DocumentSummary)
query_engine = index.as_query_engine(output_parser=output_parser)

response = query_engine.query("Summarize the quarterly review document")

# response is a DocumentSummary instance
print(response.title)
print(response.key_topics)
print(response.actionable_items)
```

Multi-Modal RAG (images + text)

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Load documents including images
documents = SimpleDirectoryReader(
    "./data",
    required_exts=[".pdf", ".png", ".jpg", ".md"],
).load_data()

# Build multi-modal index
index = VectorStoreIndex.from_documents(documents)

# Use multi-modal LLM for queries about visual content
mm_llm = OpenAIMultiModal(model="gpt-4o")
query_engine = index.as_query_engine(llm=mm_llm)

response = query_engine.query("Describe the architecture diagram on page 5")
print(response)
```

Vector Store Integrations

```python
# Chroma (local, great for development)
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)

# Pinecone (cloud, production scale)
from llama_index.vector_stores.pinecone import PineconeVectorStore
from pinecone import Pinecone

pc = Pinecone(api_key="your-key")
pinecone_index = pc.Index("my-index")
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# FAISS (fast local similarity search)
from llama_index.vector_stores.faiss import FaissVectorStore
import faiss

faiss_index = faiss.IndexFlatL2(1536)  # Dimension of your embeddings
vector_store = FaissVectorStore(faiss_index=faiss_index)

# Qdrant (self-hosted, production features)
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="my_docs")

# Use any vector store in an index
from llama_index.core import StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

Customization

Swap LLM Provider

```python
from llama_index.core import Settings

# Use Anthropic globally
from llama_index.llms.anthropic import Anthropic
Settings.llm = Anthropic(model="claude-sonnet-4-5-20250929")

# Use local model via Ollama
from llama_index.llms.ollama import Ollama
Settings.llm = Ollama(model="llama3.1", request_timeout=120.0)

# Per-query override (does not change global)
query_engine = index.as_query_engine(llm=Anthropic(model="claude-sonnet-4-5-20250929"))
```

Custom Embeddings

```python
from llama_index.core import Settings

# OpenAI embeddings (default)
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# HuggingFace (free, local)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# Cohere
from llama_index.embeddings.cohere import CohereEmbedding
Settings.embed_model = CohereEmbedding(model_name="embed-english-v3.0")
```

Custom Prompt Templates

```python
from llama_index.core import PromptTemplate

# Override the QA prompt
qa_prompt = PromptTemplate(
    "You are a technical documentation expert.\n"
    "Context from the documentation:\n"
    "-----\n"
    "{context_str}\n"
    "-----\n"
    "Question: {query_str}\n\n"
    "Rules:\n"
    "1. Only answer based on the provided context.\n"
    "2. If the answer is not in the context, say 'Not found in documentation.'\n"
    "3. Include the relevant section name in your answer.\n"
    "4. Use code examples from the context when available.\n\n"
    "Answer: "
)

query_engine = index.as_query_engine(text_qa_template=qa_prompt)
```

Custom Node Parsing (chunking)

```python
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
    MarkdownNodeParser,
    CodeSplitter,
)

# Sentence-based splitting (recommended default)
parser = SentenceSplitter(chunk_size=1024, chunk_overlap=200)

# Semantic splitting (splits by meaning boundaries)
from llama_index.embeddings.openai import OpenAIEmbedding
parser = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(),
    buffer_size=1,
    breakpoint_percentile_threshold=95,
)

# Markdown-aware splitting
parser = MarkdownNodeParser()

# Code-aware splitting
parser = CodeSplitter(language="python", chunk_lines=40, chunk_lines_overlap=10)

# Use in Settings
from llama_index.core import Settings
Settings.node_parser = parser
```

Evaluation

LlamaIndex provides built-in evaluation to measure RAG quality:

```python
from llama_index.core.evaluation import (
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    BatchEvalRunner,
)

# Relevancy: Does the response actually answer the question?
relevancy_evaluator = RelevancyEvaluator()

# Faithfulness: Is the response supported by the retrieved context? (no hallucination)
faithfulness_evaluator = FaithfulnessEvaluator()

# Evaluate a single response
query = "What is the authentication flow?"
response = query_engine.query(query)

relevancy_result = relevancy_evaluator.evaluate_response(query=query, response=response)
faithfulness_result = faithfulness_evaluator.evaluate_response(query=query, response=response)

print(f"Relevant: {relevancy_result.passing} (score: {relevancy_result.score})")
print(f"Faithful: {faithfulness_result.passing} (score: {faithfulness_result.score})")

# Batch evaluation (run inside an async context, e.g. asyncio.run)
eval_questions = [
    "How does authentication work?",
    "What is the database schema?",
    "How are errors handled?",
]

runner = BatchEvalRunner(
    {"relevancy": relevancy_evaluator, "faithfulness": faithfulness_evaluator},
    workers=4,
)
eval_results = await runner.aevaluate_queries(
    query_engine, queries=eval_questions
)

# aevaluate_queries returns a dict keyed by evaluator name,
# with one result per query
for i, query in enumerate(eval_questions):
    print(f"Q: {query}")
    print(f"  Relevancy: {eval_results['relevancy'][i].passing}")
    print(f"  Faithfulness: {eval_results['faithfulness'][i].passing}")
```

Configuration Reference

Settings (Global Defaults)

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| `Settings.llm` | `BaseLLM` | `OpenAI("gpt-3.5-turbo")` | Default LLM for all operations |
| `Settings.embed_model` | `BaseEmbedding` | `OpenAIEmbedding` | Default embedding model |
| `Settings.node_parser` | `NodeParser` | `SentenceSplitter` | Default chunking strategy |
| `Settings.chunk_size` | `int` | `1024` | Default chunk size (tokens) |
| `Settings.chunk_overlap` | `int` | `20` | Default chunk overlap (tokens) |
| `Settings.num_output` | `int` | `256` | Max output tokens for LLM |
| `Settings.callback_manager` | `CallbackManager` | `None` | For observability/tracing |
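Any of these defaults can be overridden once at startup and every later operation picks them up. A minimal configuration sketch (the specific values are illustrative, not recommendations):

```python
from llama_index.core import Settings

# Override the global defaults from the table above
Settings.chunk_size = 512     # smaller chunks for short Q&A-style content
Settings.chunk_overlap = 50
Settings.num_output = 512     # allow longer LLM answers
```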

Query Engine Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `similarity_top_k` | `int` | `2` | Number of chunks to retrieve |
| `response_mode` | `str` | `"compact"` | Response synthesis strategy |
| `streaming` | `bool` | `False` | Enable streaming responses |
| `verbose` | `bool` | `False` | Show retrieval details |
| `text_qa_template` | `PromptTemplate` | default | Override QA prompt |
| `refine_template` | `PromptTemplate` | default | Override refine prompt |

Performance Benchmarks

| Operation | Typical Latency | Notes |
| --- | --- | --- |
| Index 100 documents | 10-30 s | One-time cost, persist to disk |
| Index 10,000 documents | 5-15 min | Use batch embedding, persist |
| Vector query (top-5) | 200-500 ms | Vector search only |
| Full RAG query | 1-3 s | Retrieval + LLM synthesis |
| Streaming first token | 300-600 ms | Much better perceived latency |
| Agent with 2 tool calls | 4-8 s | Multi-step reasoning |
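These figures are typical, not guaranteed; to measure latency on your own corpus, a small library-free timing helper is enough (`query_engine.query` in the usage comment is assumed to exist from the earlier examples):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock latency, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.0f} ms")
    return result

# Usage (assuming a built query_engine):
# response = timed("full RAG query", query_engine.query, "Summarize the key findings")
```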

Best Practices

  1. Persist your index. Always call index.storage_context.persist() after building. Re-embedding documents on every startup wastes time and money.

  2. Use VectorStoreIndex as your default. It handles 90% of RAG use cases. Only reach for TreeIndex or SummaryIndex when you have specific summarization needs.

  3. Tune similarity_top_k. Start with 3-5 and adjust. Too few misses relevant context; too many dilutes with noise and increases LLM cost.

  4. Add metadata to documents. Metadata enables filtering, source attribution, and better retrieval. Always include at least source, date, and category.

  5. Use streaming for all user-facing queries. The difference between 2s of silence and immediate partial output fundamentally changes user perception.

  6. Choose the right response mode. compact is the best default. Use tree_summarize for long documents, refine for highest quality (at higher cost), and no_text when you just need retrieved chunks.

  7. Evaluate your RAG system. Use RelevancyEvaluator and FaithfulnessEvaluator to measure quality. A RAG system without evaluation is a guessing game.

  8. Use chat_engine for conversations, not repeated query_engine calls. The chat engine automatically handles history condensation and context management.

  9. Match chunk size to your content. Technical documentation benefits from larger chunks (1000-1500 tokens) to preserve context. Short Q&A pairs work better with smaller chunks (256-512 tokens).

  10. Use separate indices for separate concerns. Do not dump API docs, architecture docs, and meeting notes into one index. Create separate indices and wrap them as tools for an agent that can choose the right source.
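To make the chunk-size and overlap tradeoff in practice 9 concrete, here is a library-free sketch of fixed-size chunking with overlap, the mechanism `SentenceSplitter` builds on (simplified: whitespace tokens stand in for real tokens, and sentence boundaries are ignored):

```python
def chunk_tokens(tokens, chunk_size=1024, chunk_overlap=200):
    """Split a token list into overlapping chunks.

    Each chunk shares its first `chunk_overlap` tokens with the end of the
    previous chunk, so context spanning a boundary is never fully lost.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(2500)]
chunks = chunk_tokens(tokens, chunk_size=1024, chunk_overlap=200)
print(len(chunks))   # 3 chunks
print(chunks[1][0])  # second chunk starts at token 824 (1024 - 200)
```

Larger `chunk_size` means fewer, more contextual chunks per document; larger `chunk_overlap` means more redundancy (and embedding cost) but fewer answers lost at chunk boundaries.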

Troubleshooting

Query returns "I don't have enough information":

```python
# Increase the number of retrieved chunks
query_engine = index.as_query_engine(similarity_top_k=10)

# Check what's actually being retrieved
retriever = index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("your query here")
for node in nodes:
    print(f"Score: {node.score:.3f} | {node.text[:100]}")
# If scores are low, your chunks may not match the query phrasing
```

Hallucinated answers (not grounded in context):

```python
# Use a stricter prompt template
from llama_index.core import PromptTemplate

strict_prompt = PromptTemplate(
    "Context:\n{context_str}\n\n"
    "Question: {query_str}\n\n"
    "IMPORTANT: Only answer from the context above. "
    "If the answer is not clearly stated in the context, respond with "
    "'The provided documents do not contain this information.'\n"
    "Answer: "
)
query_engine = index.as_query_engine(text_qa_template=strict_prompt)

# Also: evaluate with FaithfulnessEvaluator
```

Slow indexing on large document sets:

```python
# Use batch processing with progress bar
from llama_index.core import VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=200),
        OpenAIEmbedding(model="text-embedding-3-small"),
    ]
)
nodes = pipeline.run(documents=documents, show_progress=True)
index = VectorStoreIndex(nodes)
```

Memory issues with large indices:

```python
# Use an external vector store instead of in-memory default
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

Chat engine loses context after many turns:

```python
# Use condense_plus_context mode (re-retrieves on each turn)
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    verbose=True,
)
# This condenses the full chat history + new question into a standalone query,
# then retrieves fresh context each time
```

LlamaIndex vs LangChain

| Dimension | LlamaIndex | LangChain |
| --- | --- | --- |
| Primary focus | RAG and data retrieval | General LLM applications |
| RAG quality | Best-in-class (core focus) | Good (one of many features) |
| Data connectors | 300+ via LlamaHub | 100+ via community |
| Index types | Vector, Tree, Summary, Keyword, KG | Vector store wrappers |
| Response synthesis | 5+ modes (compact, refine, tree) | Basic (stuff, map_reduce) |
| Evaluation | Built-in (relevancy, faithfulness) | Via LangSmith |
| Agent support | FunctionAgent, ReActAgent | AgentExecutor, tool calling |
| Learning curve | Easy for RAG, moderate for agents | Moderate for everything |
| When to choose | RAG is your primary use case | Agents + tools are primary |

Use LlamaIndex when your application is fundamentally about querying data -- document Q&A, knowledge bases, enterprise search, research assistants.

Use LangChain when your application is fundamentally about agent reasoning, tool orchestration, or you need the broadest integration ecosystem.

Use both together when you need LlamaIndex's superior RAG as a tool within a LangChain agent:

```python
# LlamaIndex index as a LangChain tool
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from langchain.tools import Tool

# Build LlamaIndex index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Wrap as LangChain tool
doc_search_tool = Tool(
    name="DocumentSearch",
    func=lambda q: str(query_engine.query(q)),
    description="Search internal documentation for answers",
)

# Use in a LangChain agent
from langchain.agents import create_tool_calling_agent, AgentExecutor
agent = create_tool_calling_agent(llm, [doc_search_tool, ...], prompt)
```
