RAG Engineer Engine
Systems architecture toolkit for building production RAG pipelines — covering chunking strategy design, embedding selection, vector store architecture, retrieval optimization, and quality evaluation.
When to Use
Use this toolkit when:
- Designing a new RAG system architecture from scratch
- Optimizing an existing RAG pipeline for better retrieval quality
- Selecting between vector databases, embedding models, and chunking strategies
- Debugging poor answer quality in RAG-powered applications
Use simpler approaches when:
- Documents fit in the LLM context window → direct prompting
- Structured data only → SQL queries or API calls
- Real-time web data → web search integration
Quick Start
Architecture Decision Template
```markdown
## RAG Architecture Decision Record

### Document Characteristics
- Type: {pdf, markdown, html, code}
- Volume: {number_of_documents}
- Update frequency: {static, daily, real-time}
- Average length: {pages_or_tokens}

### Query Patterns
- Type: {factual, analytical, comparative}
- Complexity: {simple, multi-hop, aggregation}
- Language: {single, multilingual}

### Recommended Stack
- Chunking: {strategy}
- Embeddings: {model}
- Vector Store: {database}
- Retrieval: {method}
- Generation: {llm}
```
Chunking Strategy Selection
```python
def select_chunking_strategy(doc_type, avg_length, query_type):
    """Select optimal chunking based on document and query characteristics."""
    if doc_type == "code":
        return {
            "strategy": "ast_based",
            "unit": "function",
            "context": "include class signature and imports",
            "overlap": "none",
        }
    elif doc_type == "technical_docs" and avg_length > 5000:
        return {
            "strategy": "hierarchical",
            "parent_size": 2000,
            "child_size": 500,
            "overlap": 100,
            "preserve": "section headers",
        }
    elif query_type == "multi_hop":
        return {
            "strategy": "semantic",
            "size": 1500,
            "overlap": 300,
            "method": "sentence_transformers_segmentation",
        }
    else:
        return {
            "strategy": "recursive",
            "size": 1000,
            "overlap": 200,
            "separators": ["\n\n", "\n", ". ", " "],
        }
```
Core Concepts
Embedding Model Selection
| Model | Dimensions | Speed | Quality | Best For |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Fast | Good | General purpose, cost-sensitive |
| text-embedding-3-large | 3072 | Medium | Excellent | High-accuracy requirements |
| Cohere embed-v3 | 1024 | Fast | Excellent | Multilingual |
| BGE-large-en-v1.5 | 1024 | Medium | Very good | Self-hosted, English |
| Voyage-3 | 1024 | Medium | Excellent | Code and technical docs |
Vector Store Selection
| Database | Hosting | Filtering | Scale | Best For |
|---|---|---|---|---|
| FAISS | Self-hosted | None | Billions | Batch processing, research |
| Pinecone | Managed | Rich | Billions | Production SaaS, zero-ops |
| Qdrant | Both | Rich | Billions | On-prem with filtering |
| Weaviate | Both | Rich | Millions | Multi-modal, GraphQL |
| pgvector | Self-hosted | SQL | Millions | Postgres users, small-medium |
| ChromaDB | Self-hosted | Basic | Thousands | Prototyping, development |
Retrieval Optimization Pipeline
```
Query → Query Expansion → Hybrid Retrieval → Reranking → Context Assembly
  |            |                  |                |               |
  v            v                  v                v               v
Original   HyDE / multi-query   BM25 + dense   Cohere Rerank   Deduplicate
query      expansion            fusion         or cross-encoder  + order
```
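The Hybrid Retrieval stage above needs a way to merge the BM25 and dense rankings. One common choice is Reciprocal Rank Fusion; the sketch below is illustrative (the function name `rrf_fuse` and the damping constant `k=60` are assumptions, not part of the original pipeline):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists with Reciprocal Rank Fusion.

    rankings: list of ranked doc-id lists, e.g. one from BM25 and one
    from dense retrieval. Larger k dampens the weight of top positions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]
dense_hits = ["d1", "d9", "d3"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

RRF is rank-based, so it avoids having to calibrate BM25 and cosine-similarity scores against each other; a weighted score sum (as in the hybrid_weight setting below) is the main alternative.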
Configuration
| Component | Parameter | Recommended |
|---|---|---|
| Chunking | chunk_size | 500-1500 tokens |
| Chunking | overlap | 10-20% of chunk_size |
| Embeddings | model | text-embedding-3-small |
| Retrieval | top_k | 5-10 initial, 3-5 after rerank |
| Retrieval | hybrid_weight | 0.6 semantic / 0.4 keyword |
| Reranking | model | Cohere rerank-v3 |
| Generation | max_context | 4000-8000 tokens |
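The recommended values above, collected into a single configuration object (a sketch; the key names and nesting are assumptions, not a required schema):

```python
# Defaults from the configuration table; tune per workload.
RAG_CONFIG = {
    "chunking": {"chunk_size": 1000, "overlap": 150},  # overlap = 15% of chunk_size
    "embeddings": {"model": "text-embedding-3-small"},
    "retrieval": {
        "top_k_initial": 10,   # before reranking
        "top_k_final": 5,      # after reranking
        "hybrid_weight": {"semantic": 0.6, "keyword": 0.4},
    },
    "reranking": {"model": "rerank-v3"},
    "generation": {"max_context_tokens": 8000},
}
```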
Best Practices
- Profile your queries first — understand query patterns before choosing architecture
- Chunk with document awareness — respect section boundaries, headers, and logical units
- Always use hybrid retrieval in production — semantic alone misses keyword-dependent queries
- Add reranking — a reranker after initial retrieval improves precision significantly
- Evaluate retrieval independently — measure Recall@K and MRR before evaluating generation
- Version your embeddings — re-embedding is expensive, track which model version generated each vector
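Evaluating retrieval independently, as recommended above, only needs two small metrics. A minimal sketch of Recall@K and MRR (function names and argument shapes are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids that appear in the top-k retrieved."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved_lists, relevant_sets):
    """Mean Reciprocal Rank over a batch of queries.

    For each query, score 1/rank of the first relevant hit (0 if none),
    then average across queries.
    """
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_lists)
```

Run these against a labeled query set before touching the generation prompt: if Recall@K is low, no amount of prompt engineering will fix the answers.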
Common Issues
Low recall — relevant documents not retrieved: Increase top_k. Try query expansion (HyDE or multi-query). Check that your embedding model handles domain-specific vocabulary. Add keyword search as a fallback.
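One way to apply the multi-query expansion suggested above: retrieve for the original query plus LLM-generated paraphrases, then union the results in rank order. `retriever` and `generate_variants` are hypothetical adapters (the latter would wrap an LLM call):

```python
def expand_and_retrieve(query, retriever, generate_variants, top_k=10):
    """Multi-query expansion: union results across query paraphrases.

    retriever(query, top_k) -> ranked list of doc ids.
    generate_variants(query) -> list of paraphrased queries.
    """
    queries = [query] + generate_variants(query)
    seen, merged = set(), []
    for q in queries:
        for doc_id in retriever(q, top_k):
            if doc_id not in seen:   # keep first occurrence, preserve order
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

Because the union grows the candidate set, pair expansion with the reranking step so precision is recovered afterwards.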
High recall but low precision — too many irrelevant results: Add a reranking step. Use metadata filtering to narrow scope before vector search. Reduce chunk size to increase granularity.
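The reranking step that addresses this is a thin orchestration layer around a scorer. In the sketch below, `score_fn` stands in for a cross-encoder or a managed rerank API (a hypothetical adapter, not a specific library call):

```python
def rerank(query, docs, score_fn, top_n=5):
    """Re-order retrieved docs by a pairwise relevance scorer.

    score_fn(query, doc) -> float relevance score; higher is better.
    Returns only the top_n docs, shrinking the context passed to the LLM.
    """
    scored = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return scored[:top_n]
```

In production the scorer would be a cross-encoder (which sees query and document together, unlike the bi-encoder used for initial retrieval) and is therefore far more precise, at the cost of scoring each candidate pair individually.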
Embedding drift after model update: All vectors must be re-embedded when changing the embedding model. Never mix vectors from different models. Use a migration pipeline that re-embeds in batches.
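A minimal sketch of such a batched migration pipeline, assuming hypothetical `fetch_text`, `embed`, and `upsert` adapters for your document store, embedding model, and vector index:

```python
def migrate_embeddings(doc_ids, fetch_text, embed, upsert, batch_size=100):
    """Re-embed every document in batches after an embedding-model change.

    fetch_text(doc_id) -> document text
    embed(texts)       -> list of vectors (one model call per batch)
    upsert(ids, vectors, metadata) -> write to the vector store

    Tagging each vector with an embedding_version makes mixed-model
    states detectable if the migration is interrupted.
    """
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        texts = [fetch_text(doc_id) for doc_id in batch]
        vectors = embed(texts)
        upsert(batch, vectors, metadata={"embedding_version": "v2"})
```

Writing to a fresh index (or namespace) and cutting traffic over only after the migration completes avoids serving queries against a half-migrated index.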