Ultimate RAG with FAISS
Build high-performance vector similarity search systems using Facebook AI's FAISS library — supporting billion-scale datasets with GPU acceleration, approximate nearest neighbors, and production-optimized index types.
When to Use
Choose FAISS when:
- Need fast similarity search on large datasets (millions to billions of vectors)
- GPU acceleration is available and throughput matters
- Pure vector similarity without complex metadata filtering
- Batch processing of embeddings at scale
- Self-hosted deployment with full control
Consider alternatives when:
- Need rich metadata filtering → Qdrant, Pinecone, or Weaviate
- Want managed infrastructure → Pinecone (serverless)
- Small dataset (< 100K vectors) → ChromaDB or pgvector
- Need real-time updates with consistency → Qdrant or Weaviate
Quick Start
Installation
```bash
# CPU only
pip install faiss-cpu

# GPU support (CUDA)
pip install faiss-gpu
```
Basic Similarity Search
```python
import faiss
import numpy as np

# Create vectors (e.g., from embedding model)
dimension = 1536  # text-embedding-3-small dimension
num_vectors = 100000
vectors = np.random.random((num_vectors, dimension)).astype('float32')

# Build index
index = faiss.IndexFlatL2(dimension)  # Exact L2 distance
index.add(vectors)
print(f"Total vectors: {index.ntotal}")

# Search
query = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query, k=5)  # Top 5 nearest
print(f"Nearest indices: {indices[0]}")
print(f"Distances: {distances[0]}")
```
Production Index (IVF + PQ)
```python
import faiss
import numpy as np

dimension = 1536
num_vectors = 10_000_000
# In practice, vectors come from your embedding model
vectors = np.random.random((num_vectors, dimension)).astype('float32')

# IVF (Inverted File) + PQ (Product Quantization)
nlist = 1000  # Number of clusters
m = 64        # Number of sub-quantizers
nbits = 8     # Bits per sub-quantizer

quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)

# Train index on representative sample
training_data = vectors[:100000]  # Use subset for training
index.train(training_data)
index.add(vectors)

# Search with probe parameter
index.nprobe = 50  # Search 50 clusters (accuracy vs speed tradeoff)
query = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query, k=10)
```
Core Concepts
Index Types
| Index | Time | Memory | Accuracy | Best For |
|---|---|---|---|---|
| IndexFlatL2 | O(n) | Full | Exact | < 1M vectors, baseline |
| IndexIVFFlat | O(n/nlist) | Full | High | 1-10M, good accuracy |
| IndexIVFPQ | O(n/nlist) | Low | Good | 10M-1B, memory constrained |
| IndexHNSWFlat | O(log n) | Full | Very high | 1-100M, low latency |
| IndexIVFScalarQuantizer | O(n/nlist) | Medium | High | 1-100M, balanced |
GPU Acceleration
```python
import faiss

# cpu_index: any index already built on CPU (e.g., the IVFPQ index above)
# Move index to GPU
gpu_resource = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(gpu_resource, 0, cpu_index)  # GPU 0

# Multi-GPU
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)

# Search on GPU (10-50x faster)
distances, indices = gpu_index.search(queries, k=10)
```
Persistence
```python
import faiss

# Save index to disk
faiss.write_index(index, "my_index.faiss")

# Load index from disk
index = faiss.read_index("my_index.faiss")
```
Configuration
| Parameter | Default | Description |
|---|---|---|
| dimension | — | Vector dimension (must match embeddings) |
| nlist | sqrt(n) | Number of IVF clusters |
| nprobe | 1 | Clusters to search (accuracy vs speed) |
| m | 8 | PQ sub-quantizers (memory vs accuracy) |
| nbits | 8 | Bits per sub-quantizer code |
| metric | L2 | Distance metric (L2 or inner product) |
Tuning Guidelines
| Dataset Size | Index | nlist | nprobe | Expected Recall@10 |
|---|---|---|---|---|
| < 1M | IndexFlatL2 | — | — | 100% (exact) |
| 1-10M | IndexIVFFlat | 1000 | 50 | ~97% |
| 10-100M | IndexIVFPQ | 4096 | 128 | ~90% |
| 100M-1B | IndexIVFPQ | 16384 | 256 | ~85% |
Best Practices
- Train on representative data — IVF training data should match the distribution of your full dataset
- Tune nprobe for your accuracy/speed tradeoff — start at nprobe=nlist/10 and adjust
- Use HNSW for low-latency requirements — best query time without GPU
- GPU for throughput — batch queries of 100+ to fully utilize GPU
- Normalize vectors for cosine similarity — use faiss.normalize_L2() before adding
- Save and version indexes — rebuilding large indexes takes hours
Common Issues
Low recall with IVF index:
Increase nprobe. If still low, increase nlist and retrain. Ensure training data is representative of the full dataset.
Out of memory:
Use PQ compression (IndexIVFPQ) to reduce memory by 10-50x. Use memory-mapped indexes for datasets larger than RAM. Consider sharding across multiple machines.
Slow training:
Use a representative subset (10-50K vectors) for training, not the full dataset. Move training to GPU with index_cpu_to_gpu.
Similar Templates
Full-Stack Code Reviewer
Comprehensive code review skill that checks for security vulnerabilities, performance issues, accessibility, and best practices across frontend and backend code.
Test Suite Generator
Generates comprehensive test suites with unit tests, integration tests, and edge cases. Supports Jest, Vitest, Pytest, and Go testing.
Pro Architecture Workspace
Battle-tested skill for architectural decision-making frameworks. Includes structured workflows, validation checks, and reusable patterns for development.