Part 2 · Hybrid Retrieval in Practice: FAISS + BM25, Fused with RRF

Sat, 25 Apr 2026 00:00:00 +0000

Part of a series on building CogniVault, a fully local AI study companion.

All abbreviations are fully explained in the appendix at the bottom of the page.

The first version of CogniVault used pure dense retrieval — embed the query with embeddinggemma, search a FAISS index, pass the top-7 chunks to the model. It worked. It worked beautifully — until a user uploaded a PDF containing some German legal text and asked for “§3 Absatz 2.”

The model couldn’t find it.

The chunk was right there. The PDF was indexed. But “§3 Absatz 2” doesn’t embed into anything semantically meaningful — it’s a token-level identifier, not a concept. The dense vector for the query landed nowhere near the dense vector for the chunk, even though the chunk literally contains the string the user asked for.

That bug killed pure dense retrieval for me. This post is about what replaced it.

Two kinds of “similar”

You already use both kinds of search every day. When Spotify builds a “song radio” from a track you like, it’s matching feel — tempo, mood, genre — and it will happily play you a song whose title shares no words with the original. But when you type Bohemian Rhapsody remastered 2011 into the search box, you don’t want feel. You want that exact string, and “a similar operatic rock epic” is a wrong answer.

Search systems formalise that split into two notions of similarity:

Lexical similarity — “do these strings share rare words?” This is what TF-IDF and BM25 model. They thrive on identifiers, names, code, technical terminology, and direct quotes.
Semantic similarity — “do these passages talk about the same idea, even with different words?” This is what embeddings model. They thrive on paraphrase, conceptual queries, and natural-language questions.

Neither subsumes the other. A user asking “how is the practical exam structured?” needs semantic search — the document doesn’t say “structure of practical exam.” A user asking "§3 Absatz 2" needs lexical search — there’s no concept to embed, just a literal string.

Production RAG has to do both. CogniVault does both, and then fuses the result lists with Reciprocal Rank Fusion (RRF).

The stack

Query
 ├── embed via embeddinggemma ──► FAISS IndexFlatIP ──► top-K dense ───┐
 └── tokenize + lowercase     ──► BM25Okapi         ──► top-K sparse ──┤
                                                                       │
                                        Reciprocal Rank Fusion ◄───────┘
                                                   │
                                          top-7 fused chunks

Both indexes live in memory, fronted by a VectorDB singleton. FAISS does inner-product search over normalised embeddings (so dot product = cosine). BM25 is rank_bm25’s BM25Okapi, fed the same chunks tokenised by a simple lowercase-and-split tokenizer.
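To make the sparse half concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring on top of a lowercase-and-split tokenizer. It is not rank_bm25's implementation: the IDF variant, the k1 and b defaults, and the sample chunks are all assumptions picked for illustration. It only shows why a literal identifier query lands on the chunk that contains it.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Same spirit as the post's tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def bm25_scores(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many chunks does each term occur?
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # "Plus one" IDF variant, always positive (an illustrative choice).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long chunks need more hits to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical chunks, invented for this example.
chunks = [
    "§3 Absatz 2 regelt die Dauer der praktischen Prüfung",
    "The practical exam consists of a presentation and a discussion",
    "General remarks on exam regulations and deadlines",
]
scores = bm25_scores("§3 Absatz 2", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Only the first chunk shares any token with the query, so it is the only one with a non-zero score. A dense retriever has no comparable guarantee for a string like this.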

The corpora are kept in lockstep: soft-deleting a file’s chunks triggers a BM25 rebuild over the remaining active chunks, and the singleton reloads both indexes from vector_store.faiss + vector_store.json (chunk metadata + raw text) after every ingestion run and on app start.

Why FAISS `IndexFlatIP`, not HNSW or IVF?

IndexFlatIP is brute-force exact search. It scans every vector, every query. For tens of thousands of chunks that’s fine — sub-millisecond on a laptop. CogniVault is a single-user, local-first app; the index is never going to be billions of vectors. Trading recall for speed via HNSW or IVF would buy nothing here and lose the “exact” guarantee. Boring, correct, fast enough.

When the corpus grows large enough that brute-force gets sticky, switching is a one-line change. Until then, the simplest index wins.
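For intuition, this is roughly what a flat inner-product index computes, written as a pure-Python sketch rather than with FAISS itself. The function names and the toy three-dimensional vectors are made up for the example; a real index would normalise the stored vectors once at ingestion time, not on every query.

```python
import math


def normalise(v: list[float]) -> list[float]:
    # Scale to unit length so that inner product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def flat_ip_search(query: list[float], vectors: list[list[float]], top_k: int) -> list[tuple[int, float]]:
    # Exact search: score every stored vector, then keep the best top_k.
    q = normalise(query)
    scored = [
        (idx, sum(a * b for a, b in zip(q, normalise(v))))
        for idx, v in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings", invented for this example.
vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]]
hits = flat_ip_search([2.0, 0.1, 0.0], vectors, top_k=2)
print(hits)
```

Nothing is skipped and nothing is approximated, which is the whole appeal: the cost grows linearly with the number of chunks, and the result is always the true top-k.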

Reciprocal Rank Fusion

The naive way to combine two ranked lists is to add their scores. That sounds reasonable until you remember that FAISS returns inner products, which on normalised vectors fall in [-1, 1], while BM25 returns unbounded scores that depend on the corpus and the query. The two aren’t comparable without normalisation, and any normalisation you pick is somewhat arbitrary.

RRF sidesteps the problem entirely. It only looks at ranks, not scores. For each result list, an item at rank r contributes 1 / (k + r) to its final score, with k = 60 by convention: large enough to flatten the tail, small enough that the top items still dominate. An item that appears in both lists gets the sum of both contributions; an item missing from a list gets nothing from it.

# Simplified: the real implementation also de-duplicates chunks
# by (source, chunk_id, page) before scoring.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    # Each list is ordered best-first; only the rank position is used.
    scores = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

That’s the whole algorithm. No tuning, no calibration, no per-corpus weights. A chunk that’s #1 in BM25 and #4 in FAISS easily beats a chunk that’s #2 in only one of them. A chunk that both indexes agree on rises to the top deterministically.
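The "#1 and #4 beats a lone #2" claim is easy to check by hand. The sketch below repeats the function so it runs on its own; the chunk ids and the two result lists are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    scores = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical chunk ids: "A" is #1 in BM25 and #4 in FAISS,
# "B" is #2 in BM25 only.
bm25_top = ["A", "B", "C", "D"]
faiss_top = ["E", "F", "G", "A"]

ranking = reciprocal_rank_fusion([faiss_top, bm25_top])
fused = dict(ranking)
print(fused["A"])  # 1/61 + 1/64
print(fused["B"])  # 1/62
```

With k = 60, chunk A scores 1/61 + 1/64 ≈ 0.0320 and chunk B scores 1/62 ≈ 0.0161, so agreement between the two retrievers roughly doubles the score. The flip side is that the best a single-list hit can reach is 1/61.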

The result for the “§3 Absatz 2” query: BM25 finds the literal match and lands it at rank 1. FAISS finds nothing useful (its top hits are about exam regulations in general). A rank-1 hit in one list still scores 1/61, which only chunks returned by both retrievers can beat, so RRF keeps the BM25 hit at or near the top of the fused list. Problem solved.

Scope filtering with ContextVar isolation

One detail that’s easy to get wrong: the retriever has to be scope-aware. CogniVault lets users limit a question to a single category or specific files. The scope is set by the request, but the search is called from deep inside the Strands agent loop, which is called from a streaming FastAPI handler, possibly with multiple concurrent requests in flight per worker.

Threading the scope through every function call would be ugly. A global is unsafe. The right primitive is Python’s contextvars.ContextVar, which gives you per-task isolated state that asyncio and threads both respect.

from contextvars import ContextVar

# DocScope is CogniVault's own scope type; its definition is not shown here.
_doc_scope: ContextVar[DocScope | None] = ContextVar("doc_scope", default=None)

def set_doc_scope(scope: DocScope | None) -> None:
    _doc_scope.set(scope)

def current_doc_scope() -> DocScope | None:
    return _doc_scope.get()

The /rag request handler sets the scope at the very start of each streaming response; the search tool reads it; because the value is task-local, it dies with the request. No globals, no parameter drilling, no race conditions across concurrent users.

This is one of those design choices that looks like over-engineering until you have two browser tabs open and realise that without it, tab A’s scope filter would leak into tab B’s question.

Chunking choices that pay off downstream

Hybrid retrieval is only as good as the chunks. For unstructured text, CogniVault uses a RecursiveCharacterTextSplitter with a chunk size of 1,000 characters and an overlap of 100: small enough to keep retrieval precise, large enough to carry context for the model.

For structured formats it switches strategy:

Markdown → MarkdownHeaderTextSplitter emits one chunk per H1/H2/H3 section with the heading hierarchy prepended as a breadcrumb (“Privacy > Vault Audit > Indicators”). BM25 loves breadcrumbs — they make heading-keyword queries match cleanly.
CSV → header row + 20-row batches per chunk, so a query for a column name lands on the right block.
PPTX → one chunk per slide, title and body text together.
XLSX → header + row batches, per sheet, with a [Sheet: name] prefix.

Tiny fragments get filtered: unstructured text needs at least 100 characters to become a chunk, while the structured formats drop the bar to 20 — a two-line Markdown section or a header-only sheet is short but still meaningful. The recursive splitter is well-trodden territory, but the per-format strategies matter much more than people give them credit for.
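As an illustration of the CSV strategy, here is a minimal sketch of header-plus-batch chunking with the 20-character floor. The function name, the comma-joined output format, and the synthetic data are assumptions for the example, not CogniVault's actual ingestion code.

```python
import csv
import io


def chunk_csv(text: str, rows_per_chunk: int = 20, min_chars: int = 20) -> list[str]:
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    chunks = []
    for start in range(0, len(body), rows_per_chunk):
        batch = body[start:start + rows_per_chunk]
        # Repeat the header so every chunk carries the column names.
        lines = [", ".join(header)] + [", ".join(r) for r in batch]
        chunk = "\n".join(lines)
        if len(chunk) >= min_chars:  # drop tiny fragments
            chunks.append(chunk)
    return chunks


# Synthetic data: a header plus 45 rows.
sample = "student,score\n" + "\n".join(f"s{i},{i}" for i in range(45))
chunks = chunk_csv(sample)
print(len(chunks), chunks[-1].splitlines()[0])
```

Forty-five rows become three chunks of 20, 20, and 5 rows, and each one starts with the column names, so a BM25 query for "score" matches every block and not just the first.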

What I’d do differently

A few things I’d revisit if I were starting over:

Stop tokenising for BM25 with str.split(). It’s fine, but a real tokenizer that handles punctuation and German compounds would meaningfully improve recall on the legal docs.
Add a small reranker. RRF gets the right set, but a cross-encoder rerank on the top 20 would polish the order. Locally-served, of course — there are good small ones now.
Query expansion for thin queries. Two-word questions like “§3 exam” could be expanded via a quick gemma4 call before retrieval. Latency cost, recall gain.
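The first item is the cheapest to prototype. Below is a rough sketch of what a punctuation-aware tokenizer could look like next to str.split(); the regex and the sample sentence are my own assumptions, and it does nothing for German compounds, which would need a proper decompounder.

```python
import re

# Keep "§" attached to the number that follows; otherwise take runs of
# word characters. \w is Unicode-aware in Python 3, so umlauts survive.
_TOKEN = re.compile(r"§\s*\d+|\w+")


def split_tokenize(text: str) -> list[str]:
    # The current approach: lowercase, split on whitespace.
    return text.lower().split()


def regex_tokenize(text: str) -> list[str]:
    # Strip inner whitespace so "§ 3" and "§3" become the same token.
    return [re.sub(r"\s+", "", t) for t in _TOKEN.findall(text.lower())]


chunk = "Gemäß §3 Absatz 2, Satz 1 gilt Folgendes."
query = "§ 3 Absatz 2"

print(split_tokenize(chunk))
print(regex_tokenize(chunk))
print(regex_tokenize(query))
```

With str.split(), the chunk contains the token "2," and the query "§ 3" splits into two tokens, so neither matches what the user typed. The regex version strips the trailing comma and normalises the section marker, and every query token then appears in the chunk.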

None of those are in the box yet. RRF over FAISS + BM25 is already so much better than either alone that I haven’t felt the pull to optimise further.

The takeaway

If your retrieval is “embed + cosine + top-k,” it will fail in exactly the way mine did — on the queries that contain literal identifiers your model has no embedding for. The fix isn’t a better embedding model. It’s a second retriever that doesn’t pretend everything is a concept.

FAISS for ideas. BM25 for strings. RRF to decide which one was right today.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them
FAISS	Facebook AI Similarity Search	Meta’s library for storing vectors and finding the most similar ones fast
BM25	Best Matching 25	A keyword-ranking formula; the 25 marks its place in the series of ranking functions developed for the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	Merges ranked lists using only ranks: each item scores `Σ 1/(k + rank)` across lists
TF-IDF	Term Frequency–Inverse Document Frequency	BM25’s ancestor: score words by how often they appear here vs how rare they are everywhere
IP (in `IndexFlatIP`)	Inner Product	The similarity measure FAISS computes; on normalised vectors it equals cosine similarity
HNSW	Hierarchical Navigable Small World	A popular approximate vector-index structure — deliberately not used here
IVF	Inverted File Index	Another approximate FAISS index type — also deliberately not used
AEVO	Ausbildereignungsverordnung	The German trainer-aptitude regulation whose “§3 Absatz 2” query broke pure dense retrieval
CSV / PPTX / XLSX	Comma-Separated Values / PowerPoint / Excel (Office Open XML)	Structured formats with their own chunking strategies
H1/H2/H3	Heading levels 1–3	The Markdown heading tiers used to split sections

Next up: how CogniVault’s /rag endpoint streams Gemma 4’s thinking before any tool calls run.
