CocoIndex Expert

Battle-tested skill and toolkit for developing with CocoIndex. Includes structured workflows, validation checks, and reusable patterns for development.

Skill · Cliptics · development · v1.0.0 · MIT

CocoIndex Data Indexing Skill

A Claude Code skill for building real-time data indexing pipelines with CocoIndex, an open-source framework for transforming, embedding, and indexing data from multiple sources into vector databases and search engines.

When to Use This Skill

Choose this skill when:

  • Building RAG (Retrieval-Augmented Generation) pipelines that need real-time data updates
  • Indexing documents, code, or structured data into vector databases
  • Creating search systems that need incremental processing (only re-process changed data)
  • Building knowledge bases from multiple data sources (files, databases, APIs)
  • Implementing semantic search with automatic embedding generation
  • Setting up data transformation pipelines with chunking, extraction, and enrichment

Consider alternatives when:

  • You need a simple one-time embedding script (use a direct embedding API call)
  • You need a full ETL platform with scheduling and monitoring (use Airflow or Dagster)
  • You need real-time streaming without indexing (use Kafka or Redis Streams)

Quick Start

```bash
# Install CocoIndex
pip install cocoindex

# Set up your environment
export OPENAI_API_KEY="your-key"  # For embeddings
export POSTGRES_URL="postgresql://user:pass@localhost:5432/mydb"

# Initialize a new CocoIndex project
cocoindex init my-index
```
```python
import cocoindex

@cocoindex.flow_def(name="document_indexer")
def document_index_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Source: Read documents from a directory
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="./documents", glob="**/*.md")
    )

    # Transform: Chunk documents
    doc = data_scope["documents"]
    doc["chunks"] = doc["content"].transform(
        cocoindex.transforms.SplitRecursively(chunk_size=500, chunk_overlap=50)
    )

    # Embed: Generate vector embeddings
    chunk = doc["chunks"]
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.transforms.Embed(model="text-embedding-3-small")
    )

    # Export: Index into vector database
    flow_builder.add_target(
        cocoindex.targets.Postgres(
            table_name="document_chunks",
            primary_key=["doc_id", "chunk_index"],
        ),
        data=chunk,
    )
```

Core Concepts

Pipeline Components

| Component | Purpose | Examples |
|---|---|---|
| Sources | Read data from external systems | LocalFile, S3, Database, GitHub, Web |
| Transforms | Process and enrich data | SplitRecursively, Embed, ExtractJSON, LLMExtract |
| Targets | Write indexed data to destinations | Postgres (pgvector), Qdrant, Pinecone, Elasticsearch |
| Scoping | Define data relationships | Parent-child, one-to-many, nested scopes |
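The Scoping row deserves a concrete picture: a parent scope (documents) owns child scopes (chunks), and exported rows carry both the parent key and the child index. The sketch below illustrates that flattening in plain Python, with no CocoIndex dependency; the function and field names are illustrative, not CocoIndex API.

```python
def flatten_scopes(documents):
    """Flatten a parent-child scope (document -> chunks) into exportable rows.

    Each row keeps the parent key (doc_id) plus the child's position
    (chunk_index), mirroring the composite primary key used for export.
    """
    rows = []
    for doc_id, chunks in documents.items():
        for chunk_index, text in enumerate(chunks):
            rows.append({"doc_id": doc_id, "chunk_index": chunk_index, "text": text})
    return rows

docs = {"a.md": ["intro chunk", "body chunk"], "b.md": ["only chunk"]}
rows = flatten_scopes(docs)
```

This is also why the Quick Start exports with `primary_key=["doc_id", "chunk_index"]`: the pair uniquely identifies a child within its parent.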

Incremental Processing

```python
# CocoIndex tracks data changes automatically.
# Only modified documents are re-processed on subsequent runs.

# First run: indexes all 1000 documents
cocoindex.run("document_indexer")

# Second run: only re-indexes the 3 documents that changed
cocoindex.run("document_indexer")

# Force a full re-index if needed
cocoindex.run("document_indexer", force_rebuild=True)
```
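To make the change-tracking idea concrete, here is a minimal sketch of one common implementation strategy: hash each source item's content and re-process only the keys whose hash differs from the stored one. This is an illustration of the general technique, not CocoIndex's actual internal mechanism.

```python
import hashlib

def changed_keys(previous_hashes, documents):
    """Return keys whose content changed since the last run.

    previous_hashes maps key -> sha256 hex digest from the prior run;
    it is updated in place so the next call sees the new state.
    """
    changed = []
    for key, content in documents.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if previous_hashes.get(key) != digest:
            changed.append(key)
            previous_hashes[key] = digest
    return changed

state = {}
docs = {"a.md": "hello", "b.md": "world"}
first_run = changed_keys(state, docs)    # all keys are new
second_run = changed_keys(state, docs)   # nothing changed
docs["b.md"] = "world, revised"
third_run = changed_keys(state, docs)    # only the edited document
```

Only the keys returned on each run would need re-chunking and re-embedding, which is where the cost savings come from.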

LLM-Powered Extraction

```python
# Extract structured data from unstructured text
chunk["metadata"] = chunk["text"].transform(
    cocoindex.transforms.LLMExtract(
        model="gpt-4o-mini",
        schema={
            "topic": "string",
            "entities": ["string"],
            "sentiment": "positive | negative | neutral",
            "key_facts": ["string"],
        },
        instruction="Extract structured metadata from this text passage.",
    )
)
```
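LLM output does not always conform to the requested schema, so it is worth validating extracted records before indexing them. The sketch below checks a record against the schema convention used above (`"string"`, `["string"]` for lists, and `"a | b | c"` for enums); the validator itself is a hypothetical helper, not part of CocoIndex.

```python
def validate_extraction(schema, record):
    """Return True if record matches the simple schema convention above."""
    for field, spec in schema.items():
        value = record.get(field)
        if isinstance(spec, list):
            # e.g. ["string"]: a list of strings
            if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
                return False
        elif "|" in spec:
            # e.g. "positive | negative | neutral": an enum
            if value not in [option.strip() for option in spec.split("|")]:
                return False
        elif spec == "string":
            if not isinstance(value, str):
                return False
    return True

schema = {
    "topic": "string",
    "entities": ["string"],
    "sentiment": "positive | negative | neutral",
    "key_facts": ["string"],
}
record = {
    "topic": "data indexing",
    "entities": ["CocoIndex"],
    "sentiment": "positive",
    "key_facts": ["supports incremental processing"],
}
ok = validate_extraction(schema, record)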

Configuration

ParameterTypeDefaultDescription
chunk_sizenumber500Characters per chunk for text splitting
chunk_overlapnumber50Overlapping characters between adjacent chunks
embedding_modelstring"text-embedding-3-small"Model for vector embeddings
embedding_dimensionsnumber1536Vector dimension size
batch_sizenumber100Items per batch for embedding API calls
concurrencynumber4Parallel processing threads
force_rebuildbooleanfalseSkip incremental tracking and reprocess all data
log_levelstring"info"Logging verbosity: debug, info, warning, error

Best Practices

  1. Use incremental processing for large datasets — CocoIndex's change tracking means you only pay embedding API costs for new or modified data; avoid force_rebuild unless you've changed your chunking or embedding strategy.

  2. Choose chunk sizes based on your retrieval use case — smaller chunks (200-500 chars) work better for precise question-answering, while larger chunks (500-1500 chars) preserve more context for summarization tasks.

  3. Set appropriate chunk overlap — 10-20% overlap between chunks prevents losing context at chunk boundaries; too much overlap wastes storage and increases noise in search results.

  4. Use LLM extraction sparingly on large datasets — LLM transforms are powerful but expensive; apply them to summaries or metadata fields rather than running full LLM processing on every chunk.

  5. Index with primary keys for deduplication — always define meaningful primary keys so CocoIndex can update existing records rather than creating duplicates when source data changes.

Common Issues

Embeddings are slow for large document sets — The embedding API is the bottleneck for initial indexing. Increase batch_size (up to the API limit) and concurrency to parallelize. For very large sets, run the initial index overnight and rely on incremental updates afterward.

Chunks lose important context — When text is split at arbitrary character boundaries, important context can be lost. Use SplitRecursively which respects paragraph and sentence boundaries, and increase chunk_overlap to maintain continuity between chunks.

Vector search returns irrelevant results — Poor retrieval quality usually means chunks are too large (diluting the semantic signal) or the embedding model doesn't match your domain. Try reducing chunk_size, switching to a domain-specific embedding model, or adding metadata filters to narrow search scope.

Community

Reviews

Write a review

No reviews yet. Be the first to review this template!

Similar Templates