# CocoIndex Expert

Battle-tested skill providing a comprehensive toolkit for developing with CocoIndex. Includes structured workflows, validation checks, and reusable patterns for development.
## CocoIndex Data Indexing Skill
A Claude Code skill for building real-time data indexing pipelines with CocoIndex, an open-source framework for transforming, embedding, and indexing data from multiple sources into vector databases and search engines.
## When to Use This Skill
Choose this skill when:
- Building RAG (Retrieval-Augmented Generation) pipelines that need real-time data updates
- Indexing documents, code, or structured data into vector databases
- Creating search systems that need incremental processing (only re-process changed data)
- Building knowledge bases from multiple data sources (files, databases, APIs)
- Implementing semantic search with automatic embedding generation
- Setting up data transformation pipelines with chunking, extraction, and enrichment
Consider alternatives when:
- You need a simple one-time embedding script (use a direct embedding API call)
- You need a full ETL platform with scheduling and monitoring (use Airflow or Dagster)
- You need real-time streaming without indexing (use Kafka or Redis Streams)
## Quick Start
```bash
# Install CocoIndex
pip install cocoindex

# Set up your environment
export OPENAI_API_KEY="your-key"  # For embeddings
export POSTGRES_URL="postgresql://user:pass@localhost:5432/mydb"

# Initialize a new CocoIndex project
cocoindex init my-index
```
```python
import cocoindex

@cocoindex.flow_def(name="document_indexer")
def document_index_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Source: Read documents from a directory
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="./documents", glob="**/*.md")
    )

    # Transform: Chunk documents
    doc = data_scope["documents"]
    doc["chunks"] = doc["content"].transform(
        cocoindex.transforms.SplitRecursively(
            chunk_size=500,
            chunk_overlap=50,
        )
    )

    # Embed: Generate vector embeddings
    chunk = doc["chunks"]
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.transforms.Embed(model="text-embedding-3-small")
    )

    # Export: Index into vector database
    flow_builder.add_target(
        cocoindex.targets.Postgres(
            table_name="document_chunks",
            primary_key=["doc_id", "chunk_index"],
        ),
        data=chunk,
    )
```
## Core Concepts

### Pipeline Components
| Component | Purpose | Examples |
|---|---|---|
| Sources | Read data from external systems | LocalFile, S3, Database, GitHub, Web |
| Transforms | Process and enrich data | SplitRecursively, Embed, ExtractJSON, LLMExtract |
| Targets | Write indexed data to destinations | Postgres (pgvector), Qdrant, Pinecone, Elasticsearch |
| Scoping | Define data relationships | Parent-child, one-to-many, nested scopes |
### Incremental Processing
```python
# CocoIndex tracks data changes automatically.
# Only modified documents are re-processed on subsequent runs.

# First run: indexes all 1000 documents
cocoindex.run("document_indexer")

# Second run: only re-indexes the 3 documents that changed
cocoindex.run("document_indexer")

# Force full re-index if needed
cocoindex.run("document_indexer", force_rebuild=True)
```
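CocoIndex handles change tracking internally; the sketch below is only a minimal illustration of the underlying idea using content hashes, not CocoIndex's actual mechanism. The `detect_changes` helper is hypothetical.

```python
import hashlib

def detect_changes(previous_hashes, documents):
    """Return documents whose content changed since the last run.

    previous_hashes: dict of doc_id -> content hash from the prior run.
    documents: dict of doc_id -> current content.
    """
    changed = {}
    current_hashes = {}
    for doc_id, content in documents.items():
        h = hashlib.sha256(content.encode("utf-8")).hexdigest()
        current_hashes[doc_id] = h
        # A new or modified document has no matching stored hash.
        if previous_hashes.get(doc_id) != h:
            changed[doc_id] = content
    return changed, current_hashes

# First run: no stored hashes, so everything counts as changed.
docs = {"a.md": "alpha", "b.md": "beta"}
changed, hashes = detect_changes({}, docs)

# Second run: only the edited document needs reprocessing.
docs["b.md"] = "beta v2"
changed2, _ = detect_changes(hashes, docs)
```

This is why incremental runs only pay embedding costs for new or modified documents: unchanged content hashes to the same value and is skipped.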
### LLM-Powered Extraction
```python
# Extract structured data from unstructured text
chunk["metadata"] = chunk["text"].transform(
    cocoindex.transforms.LLMExtract(
        model="gpt-4o-mini",
        schema={
            "topic": "string",
            "entities": ["string"],
            "sentiment": "positive | negative | neutral",
            "key_facts": ["string"],
        },
        instruction="Extract structured metadata from this text passage.",
    )
)
```
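LLM outputs do not always conform to the requested schema, so it can be worth validating extracted records before indexing them. The following is a hypothetical standalone validator for the lightweight schema notation used above (`"string"`, `["string"]`, and `"a | b | c"` enumerations); it is not part of CocoIndex.

```python
def validate_extraction(record, schema):
    """Check an LLM-extracted dict against a lightweight schema.

    Schema values: "string" for a plain string, ["string"] for a
    list of strings, or "a | b | c" for an enumeration.
    """
    for field, spec in schema.items():
        value = record.get(field)
        if isinstance(spec, list):       # list of strings
            ok = isinstance(value, list) and all(isinstance(v, str) for v in value)
        elif "|" in spec:                # enumeration of allowed values
            ok = value in [opt.strip() for opt in spec.split("|")]
        else:                            # plain string
            ok = isinstance(value, str)
        if not ok:
            return False
    return True

schema = {
    "topic": "string",
    "entities": ["string"],
    "sentiment": "positive | negative | neutral",
}
good = {"topic": "pricing", "entities": ["Acme"], "sentiment": "neutral"}
bad = {"topic": "pricing", "entities": "Acme", "sentiment": "meh"}
```

Records that fail validation can be retried with the LLM or dropped, rather than polluting the index with malformed metadata.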
## Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | number | 500 | Characters per chunk for text splitting |
| `chunk_overlap` | number | 50 | Overlapping characters between adjacent chunks |
| `embedding_model` | string | "text-embedding-3-small" | Model for vector embeddings |
| `embedding_dimensions` | number | 1536 | Vector dimension size |
| `batch_size` | number | 100 | Items per batch for embedding API calls |
| `concurrency` | number | 4 | Parallel processing threads |
| `force_rebuild` | boolean | false | Skip incremental tracking and reprocess all data |
| `log_level` | string | "info" | Logging verbosity: debug, info, warning, error |
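One way to keep these settings together in your own pipeline code is a small dataclass that mirrors the table and rejects inconsistent values up front. This `IndexerConfig` is a hypothetical helper, not a CocoIndex API.

```python
from dataclasses import dataclass

@dataclass
class IndexerConfig:
    """Hypothetical settings object mirroring the configuration table."""
    chunk_size: int = 500
    chunk_overlap: int = 50
    embedding_model: str = "text-embedding-3-small"
    embedding_dimensions: int = 1536
    batch_size: int = 100
    concurrency: int = 4
    force_rebuild: bool = False
    log_level: str = "info"

    def __post_init__(self):
        # Catch misconfigurations before the pipeline runs.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        if self.log_level not in ("debug", "info", "warning", "error"):
            raise ValueError(f"unknown log_level: {self.log_level}")

cfg = IndexerConfig()
```

Validating at construction time surfaces mistakes like an overlap larger than the chunk size immediately, instead of mid-run.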
## Best Practices
- **Use incremental processing for large datasets** — CocoIndex's change tracking means you only pay embedding API costs for new or modified data; avoid `force_rebuild` unless you've changed your chunking or embedding strategy.
- **Choose chunk sizes based on your retrieval use case** — smaller chunks (200-500 chars) work better for precise question-answering, while larger chunks (500-1500 chars) preserve more context for summarization tasks.
- **Set appropriate chunk overlap** — 10-20% overlap between chunks prevents losing context at chunk boundaries; too much overlap wastes storage and increases noise in search results.
- **Use LLM extraction sparingly on large datasets** — LLM transforms are powerful but expensive; apply them to summaries or metadata fields rather than running full LLM processing on every chunk.
- **Index with primary keys for deduplication** — always define meaningful primary keys so CocoIndex can update existing records rather than creating duplicates when source data changes.
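The chunk-size and overlap guidance above can be made concrete with a simplified character-based splitter. This sketch ignores paragraph and sentence boundaries, which CocoIndex's `SplitRecursively` respects; it only illustrates how overlap keeps adjacent chunks sharing a tail of context.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size chunks whose tails overlap.

    Simplified character-based splitter for illustration only.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far each chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reaches the end of the text
    return chunks

# 200 characters with chunk_size=50 and 10-char (20%) overlap.
chunks = split_with_overlap("abcdefghij" * 20, chunk_size=50, chunk_overlap=10)
```

Each chunk's last 10 characters reappear at the start of the next chunk, so a sentence cut at a boundary is still fully present in one of the two neighbors.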
## Common Issues
**Embeddings are slow for large document sets** — The embedding API is the bottleneck for initial indexing. Increase `batch_size` (up to the API limit) and `concurrency` to parallelize. For very large sets, run the initial index overnight and rely on incremental updates afterward.
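The batching half of that advice looks roughly like this in plain Python. The `embed_all` helper is a hypothetical stand-in for the pipeline's internal batching; the fake embedder below substitutes for a real embedding API call.

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(texts, embed_fn, batch_size=100):
    """Embed texts batch by batch.

    embed_fn takes a list of strings and returns one vector per string,
    mirroring how embedding APIs accept many inputs per request.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Fake embedder standing in for a real API call: one request per batch
# of 100 instead of one request per document.
fake_embed = lambda batch: [[float(len(t))] for t in batch]
vecs = embed_all([f"doc {i}" for i in range(250)], fake_embed, batch_size=100)
```

With 250 documents and `batch_size=100`, this issues 3 requests instead of 250, which is where the speedup comes from; `concurrency` would additionally run several such requests in parallel.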
**Chunks lose important context** — When text is split at arbitrary character boundaries, important context can be lost. Use `SplitRecursively`, which respects paragraph and sentence boundaries, and increase `chunk_overlap` to maintain continuity between chunks.
**Vector search returns irrelevant results** — Poor retrieval quality usually means chunks are too large (diluting the semantic signal) or the embedding model doesn't match your domain. Try reducing `chunk_size`, switching to a domain-specific embedding model, or adding metadata filters to narrow search scope.
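To show what "metadata filters narrow search scope" means, here is a toy in-memory search: candidates are pre-filtered on metadata before cosine ranking, so unrelated-but-similar chunks never compete. In production this filtering happens in the vector store (e.g. a `WHERE` clause alongside a pgvector similarity ordering); the `search` helper and sample index below are purely illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, index, top_k=3, filters=None):
    """Rank chunks by cosine similarity, pre-filtering on metadata."""
    filters = filters or {}
    candidates = [
        row for row in index
        if all(row["metadata"].get(k) == v for k, v in filters.items())
    ]
    candidates.sort(key=lambda row: cosine(query_vec, row["embedding"]), reverse=True)
    return candidates[:top_k]

# Tiny illustrative index with 2-dimensional embeddings.
index = [
    {"text": "pricing FAQ",   "embedding": [1.0, 0.0], "metadata": {"topic": "pricing"}},
    {"text": "api guide",     "embedding": [0.9, 0.1], "metadata": {"topic": "api"}},
    {"text": "pricing tiers", "embedding": [0.8, 0.6], "metadata": {"topic": "pricing"}},
]
hits = search([1.0, 0.0], index, top_k=2, filters={"topic": "pricing"})
```

Note that "api guide" is more similar to the query than "pricing tiers", but the filter excludes it before ranking; that exclusion is exactly the scope-narrowing the troubleshooting tip describes.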