CocoIndex Expert

Battle-tested skill and toolkit for developing with CocoIndex. Includes structured workflows, validation checks, and reusable patterns for development.

Skill · Cliptics · development · v1.0.0 · MIT

CocoIndex Data Indexing Skill

A Claude Code skill for building real-time data indexing pipelines with CocoIndex, an open-source framework for transforming, embedding, and indexing data from multiple sources into vector databases and search engines.

When to Use This Skill

Choose this skill when:

  • Building RAG (Retrieval-Augmented Generation) pipelines that need real-time data updates
  • Indexing documents, code, or structured data into vector databases
  • Creating search systems that need incremental processing (only re-process changed data)
  • Building knowledge bases from multiple data sources (files, databases, APIs)
  • Implementing semantic search with automatic embedding generation
  • Setting up data transformation pipelines with chunking, extraction, and enrichment

Consider alternatives when:

  • You need a simple one-time embedding script (use a direct embedding API call)
  • You need a full ETL platform with scheduling and monitoring (use Airflow or Dagster)
  • You need real-time streaming without indexing (use Kafka or Redis Streams)

Quick Start

```bash
# Install CocoIndex
pip install cocoindex

# Set up your environment
export OPENAI_API_KEY="your-key"  # For embeddings
export POSTGRES_URL="postgresql://user:pass@localhost:5432/mydb"

# Initialize a new CocoIndex project
cocoindex init my-index
```
```python
import cocoindex

@cocoindex.flow_def(name="document_indexer")
def document_index_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Source: Read documents from a directory
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="./documents", glob="**/*.md")
    )

    # Transform: Chunk documents
    doc = data_scope["documents"]
    doc["chunks"] = doc["content"].transform(
        cocoindex.transforms.SplitRecursively(chunk_size=500, chunk_overlap=50)
    )

    # Embed: Generate vector embeddings
    chunk = doc["chunks"]
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.transforms.Embed(model="text-embedding-3-small")
    )

    # Export: Index into vector database
    flow_builder.add_target(
        cocoindex.targets.Postgres(
            table_name="document_chunks",
            primary_key=["doc_id", "chunk_index"],
        ),
        data=chunk,
    )
```

Core Concepts

Pipeline Components

| Component | Purpose | Examples |
|---|---|---|
| Sources | Read data from external systems | LocalFile, S3, Database, GitHub, Web |
| Transforms | Process and enrich data | SplitRecursively, Embed, ExtractJSON, LLMExtract |
| Targets | Write indexed data to destinations | Postgres (pgvector), Qdrant, Pinecone, Elasticsearch |
| Scoping | Define data relationships | Parent-child, one-to-many, nested scopes |
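The Scoping row deserves a concrete picture: a parent scope (documents) owns child scopes (chunks), and exported rows carry both the parent key and the child index. The sketch below illustrates that flattening in plain Python, with no CocoIndex dependency; the function and field names are illustrative, not CocoIndex API.

```python
def flatten_scopes(documents):
    """Flatten a parent-child scope (document -> chunks) into exportable rows.

    Each row keeps the parent key (doc_id) plus the child's position
    (chunk_index), mirroring the composite primary key used for export.
    """
    rows = []
    for doc_id, chunks in documents.items():
        for chunk_index, text in enumerate(chunks):
            rows.append({"doc_id": doc_id, "chunk_index": chunk_index, "text": text})
    return rows

docs = {"a.md": ["intro chunk", "body chunk"], "b.md": ["only chunk"]}
rows = flatten_scopes(docs)
```

This is also why the Quick Start exports with `primary_key=["doc_id", "chunk_index"]`: the pair uniquely identifies a child within its parent.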

Incremental Processing

```python
# CocoIndex tracks data changes automatically.
# Only modified documents are re-processed on subsequent runs.

# First run: indexes all 1000 documents
cocoindex.run("document_indexer")

# Second run: only re-indexes the 3 documents that changed
cocoindex.run("document_indexer")

# Force a full re-index if needed
cocoindex.run("document_indexer", force_rebuild=True)
```
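To make the change-tracking idea concrete, here is a minimal sketch of one common implementation strategy: hash each source item's content and re-process only the keys whose hash differs from the stored one. This is an illustration of the general technique, not CocoIndex's actual internal mechanism.

```python
import hashlib

def changed_keys(previous_hashes, documents):
    """Return keys whose content changed since the last run.

    previous_hashes maps key -> sha256 hex digest from the prior run;
    it is updated in place so the next call sees the new state.
    """
    changed = []
    for key, content in documents.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if previous_hashes.get(key) != digest:
            changed.append(key)
            previous_hashes[key] = digest
    return changed

state = {}
docs = {"a.md": "hello", "b.md": "world"}
first_run = changed_keys(state, docs)    # all keys are new
second_run = changed_keys(state, docs)   # nothing changed
docs["b.md"] = "world, revised"
third_run = changed_keys(state, docs)    # only the edited document
```

Only the keys returned on each run would need re-chunking and re-embedding, which is where the cost savings come from.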

LLM-Powered Extraction

```python
# Extract structured data from unstructured text
chunk["metadata"] = chunk["text"].transform(
    cocoindex.transforms.LLMExtract(
        model="gpt-4o-mini",
        schema={
            "topic": "string",
            "entities": ["string"],
            "sentiment": "positive | negative | neutral",
            "key_facts": ["string"],
        },
        instruction="Extract structured metadata from this text passage.",
    )
)
```
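LLM output does not always conform to the requested schema, so it is worth validating extracted records before indexing them. The sketch below checks a record against the schema convention used above (`"string"`, `["string"]` for lists, and `"a | b | c"` for enums); the validator itself is a hypothetical helper, not part of CocoIndex.

```python
def validate_extraction(schema, record):
    """Return True if record matches the simple schema convention above."""
    for field, spec in schema.items():
        value = record.get(field)
        if isinstance(spec, list):
            # e.g. ["string"]: a list of strings
            if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
                return False
        elif "|" in spec:
            # e.g. "positive | negative | neutral": an enum
            if value not in [option.strip() for option in spec.split("|")]:
                return False
        elif spec == "string":
            if not isinstance(value, str):
                return False
    return True

schema = {
    "topic": "string",
    "entities": ["string"],
    "sentiment": "positive | negative | neutral",
    "key_facts": ["string"],
}
record = {
    "topic": "data indexing",
    "entities": ["CocoIndex"],
    "sentiment": "positive",
    "key_facts": ["supports incremental processing"],
}
ok = validate_extraction(schema, record)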

Configuration

ParameterTypeDefaultDescription
chunk_sizenumber500Characters per chunk for text splitting
chunk_overlapnumber50Overlapping characters between adjacent chunks
embedding_modelstring"text-embedding-3-small"Model for vector embeddings
embedding_dimensionsnumber1536Vector dimension size
batch_sizenumber100Items per batch for embedding API calls
concurrencynumber4Parallel processing threads
force_rebuildbooleanfalseSkip incremental tracking and reprocess all data
log_levelstring"info"Logging verbosity: debug, info, warning, error

Best Practices

  1. Use incremental processing for large datasets — CocoIndex's change tracking means you only pay embedding API costs for new or modified data; avoid force_rebuild unless you've changed your chunking or embedding strategy.

  2. Choose chunk sizes based on your retrieval use case — smaller chunks (200-500 chars) work better for precise question-answering, while larger chunks (500-1500 chars) preserve more context for summarization tasks.

  3. Set appropriate chunk overlap — 10-20% overlap between chunks prevents losing context at chunk boundaries; too much overlap wastes storage and increases noise in search results.

  4. Use LLM extraction sparingly on large datasets — LLM transforms are powerful but expensive; apply them to summaries or metadata fields rather than running full LLM processing on every chunk.

  5. Index with primary keys for deduplication — always define meaningful primary keys so CocoIndex can update existing records rather than creating duplicates when source data changes.

Common Issues

Embeddings are slow for large document sets — The embedding API is the bottleneck for initial indexing. Increase batch_size (up to the API limit) and concurrency to parallelize. For very large sets, run the initial index overnight and rely on incremental updates afterward.

Chunks lose important context — When text is split at arbitrary character boundaries, important context can be lost. Use SplitRecursively which respects paragraph and sentence boundaries, and increase chunk_overlap to maintain continuity between chunks.

Vector search returns irrelevant results — Poor retrieval quality usually means chunks are too large (diluting the semantic signal) or the embedding model doesn't match your domain. Try reducing chunk_size, switching to a domain-specific embedding model, or adding metadata filters to narrow search scope.

Community

Reviews

Write a review

No reviews yet. Be the first to review this template!

Similar Templates