<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>BM25 |</title><link>https://aretascodes.dev/tags/bm25/</link><atom:link href="https://aretascodes.dev/tags/bm25/index.xml" rel="self" type="application/rss+xml"/><description>BM25</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 12 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://aretascodes.dev/media/icon_hu_2ab4f4763b27c75b.png</url><title>BM25</title><link>https://aretascodes.dev/tags/bm25/</link></image><item><title>CogniVault Backend Explained, Part 3 · How a Question Becomes a Cited Answer</title><link>https://aretascodes.dev/blog/backend-explained-rag-agent/</link><pubDate>Fri, 12 Jun 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/backend-explained-rag-agent/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You type a question. A few seconds later you get an answer with footnotes — the exact documents and pages it came from. This part walks through everything that happens in between.&lt;/p&gt;
&lt;p&gt;In an earlier part we built the knowledge base: every document chunked, embedded, and indexed. Now we get to &lt;em&gt;use&lt;/em&gt; it, and this is where CogniVault stops being a pipeline and starts being interesting.&lt;/p&gt;
&lt;h2 id="two-librarians-because-one-keeps-failing-you"&gt;Two librarians, because one keeps failing you&lt;/h2&gt;
&lt;p&gt;Imagine a library with one librarian who organises everything by &lt;em&gt;vibe&lt;/em&gt;. Ask her about &amp;ldquo;server downtime procedures&amp;rdquo; and she&amp;rsquo;s brilliant — she understands what you mean and finds documents that discuss the concept, whatever words they use. But ask her for &amp;ldquo;Error Code 404B&amp;rdquo; and she shrugs, handing you general networking guides. She doesn&amp;rsquo;t do exact strings.&lt;/p&gt;
&lt;p&gt;Down the hall is a second librarian with a card catalogue. He finds the exact string &amp;ldquo;404B&amp;rdquo; instantly — but ask him a conceptual question phrased differently from the source text, and he finds nothing at all.&lt;/p&gt;
&lt;p&gt;These are the two halves of search:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Semantic search (FAISS)&lt;/strong&gt; — your question is embedded into a vector, and the index finds chunks whose vectors point the same way (technically: cosine similarity — how closely two arrows align). Great for meaning, blind to exact identifiers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keyword search (BM25)&lt;/strong&gt; — a scoring formula that rewards chunks containing your &lt;em&gt;exact&lt;/em&gt; words, weighted by how distinctive those words are. Great for identifiers, blind to synonyms.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;CogniVault asks &lt;strong&gt;both librarians every time&lt;/strong&gt;, then merges their answers with &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt; — a formula that combines ranked lists using only the &lt;em&gt;positions&lt;/em&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;score(chunk) = sum over both lists of 1 / (60 + rank)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A chunk ranked highly by either librarian scores well; a chunk both of them liked floats to the top. The elegance is what&amp;rsquo;s &lt;em&gt;missing&lt;/em&gt;: you never have to reconcile FAISS&amp;rsquo;s similarity scores with BM25&amp;rsquo;s completely different scale, because ranks are the only input. The constant 60 comes straight from the original 2009 research paper, and yes, it&amp;rsquo;s cited in the code.&lt;/p&gt;
&lt;p&gt;A few implementation details worth knowing: both searches deliberately over-fetch (at least 20 candidates each) so the fusion has material to work with; very weak semantic matches are dropped, but a keyword-perfect chunk can still be rescued through fusion; and the final answer uses the top 7 chunks. I benchmarked this whole setup against pure vector search in a separate post, if you want the war stories.&lt;/p&gt;
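&lt;p&gt;To make those three details concrete, here is a minimal sketch of the fusion step. It is not CogniVault&amp;rsquo;s actual code: the function name &lt;code&gt;fuse&lt;/code&gt;, the input shapes, and the &lt;code&gt;MIN_SEMANTIC_SCORE&lt;/code&gt; cut-off of 0.3 are assumptions made for illustration (the post gives no threshold value). The constants 60 and 7 come from the text above, and the caller is assumed to have over-fetched at least 20 candidates from each retriever.&lt;/p&gt;

```python
from collections import defaultdict

RRF_K = 60                # RRF constant named in the post
FINAL_TOP_N = 7           # chunks handed to the model, per the post
MIN_SEMANTIC_SCORE = 0.3  # hypothetical cut-off; the post gives no number


def fuse(semantic_hits, keyword_hits):
    """Fuse two ranked candidate lists into the final top chunks.

    semantic_hits: list of (chunk_id, cosine_score), best first.
    keyword_hits:  list of chunk_id, best first.
    """
    # Very weak semantic matches are removed from the semantic list only.
    semantic_ids = [cid for cid, score in semantic_hits
                    if score >= MIN_SEMANTIC_SCORE]

    scores = defaultdict(float)
    for ranked in (semantic_ids, keyword_hits):
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] += 1.0 / (RRF_K + rank)

    # Highest fused score first; keep only the final top N.
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered[:FINAL_TOP_N]


semantic = [("a", 0.9), ("b", 0.8), ("404b", 0.1)]
keyword = ["404b", "a"]
fused = fuse(semantic, keyword)
```

&lt;p&gt;In this toy run the chunk &lt;code&gt;404b&lt;/code&gt; is dropped from the semantic list as a weak match, yet it still reaches the fused result because the keyword list ranks it first. That is the rescue path described above.&lt;/p&gt;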
&lt;h2 id="the-agent-a-model-that-decides-for-itself"&gt;The agent: a model that decides for itself&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s the second idea that trips up beginners: CogniVault&amp;rsquo;s chat is not &amp;ldquo;paste chunks into a prompt, get an answer.&amp;rdquo; It&amp;rsquo;s an &lt;strong&gt;agent&lt;/strong&gt; — a model running in a loop where it can &lt;em&gt;choose&lt;/em&gt; to call tools, read their results, and only then answer.&lt;/p&gt;
&lt;p&gt;Built with the Strands Agents SDK, the agent gets six tools:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_knowledge_base&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The core RAG tool — runs the hybrid search above, returns chunks with source and page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_documents&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;See what&amp;rsquo;s in the vault&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyze_document&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Structured analysis of one document: topics, entities, facts, summary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compare_documents&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Answer a question by comparing two documents side by side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;calculator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe maths — the expression is parsed into a syntax tree and only whitelisted operators run. No &lt;code&gt;eval()&lt;/code&gt;, ever&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;current_time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The date and time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is no hard-coded routing. The &lt;em&gt;model&lt;/em&gt; reads your question and decides which tools to call, guided by its system prompt. Ask &amp;ldquo;compare the two contracts on termination clauses&amp;rdquo; and it reaches for &lt;code&gt;compare_documents&lt;/code&gt;; ask &amp;ldquo;what&amp;rsquo;s 15% of 2,340&amp;rdquo; and it uses the calculator instead of hallucinating arithmetic.&lt;/p&gt;
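&lt;p&gt;The calculator row deserves a closer look, because the &amp;ldquo;parse to a syntax tree, run only whitelisted operators&amp;rdquo; idea is easy to sketch with Python&amp;rsquo;s standard &lt;code&gt;ast&lt;/code&gt; module. The version below is an illustration under assumptions, not the project&amp;rsquo;s implementation: the name &lt;code&gt;safe_eval&lt;/code&gt; and the exact operator whitelist are made up here, and exponentiation is left out on purpose so a huge power cannot stall the process.&lt;/p&gt;

```python
import ast
import operator

# Whitelist: only these AST operator types are ever evaluated.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression):
    """Evaluate plain arithmetic; raise ValueError for anything else."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # Only int and float literals pass (bool and str are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        # Names, calls, attributes, etc. never reach an evaluator.
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))
```

&lt;p&gt;A call such as &lt;code&gt;safe_eval("2 + 3 * 4")&lt;/code&gt; walks the tree and returns 14, while &lt;code&gt;safe_eval("__import__('os')")&lt;/code&gt; raises &lt;code&gt;ValueError&lt;/code&gt; because a function call is not on the whitelist. Malformed input raises &lt;code&gt;SyntaxError&lt;/code&gt; from &lt;code&gt;ast.parse&lt;/code&gt;, which a real tool would also need to catch.&lt;/p&gt;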
&lt;p&gt;Two safety details I want beginners to notice, because they&amp;rsquo;re the difference between a toy and a product: a &lt;strong&gt;fresh agent is constructed for every request&lt;/strong&gt; (no shared state bleeding between concurrent chats), and the document-analysis tools call the model &lt;em&gt;directly&lt;/em&gt; rather than through the agent — otherwise an agent calling a tool that calls the agent could recurse forever.&lt;/p&gt;
&lt;h2 id="watching-the-model-think"&gt;Watching the model think&lt;/h2&gt;
&lt;p&gt;When you send a message, the response streams back as &lt;strong&gt;NDJSON&lt;/strong&gt; (Newline-Delimited JSON — each line of the stream is its own small JSON object). And it arrives in two phases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 1 — thinking.&lt;/strong&gt; Gemma&amp;rsquo;s reasoning chain streams first, rendered in the collapsible panel above the answer. It&amp;rsquo;s deliberately best-effort: if it fails for any reason, the answer still comes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 2 — the agent answer.&lt;/strong&gt; Tools run, citations appear in the Sources panel the moment the search completes — &lt;em&gt;before&lt;/em&gt; the answer finishes writing — and the answer text streams in.&lt;/p&gt;
&lt;div class="mermaid"&gt;flowchart TB
Q["Your question&lt;br/&gt;(plus optional images, files, scope)"] --&gt; P1
subgraph STREAM["POST /rag — one NDJSON stream"]
P1["Phase 1: Thinking&lt;br/&gt;reasoning chunks stream first"]
P1 --&gt; P2["Phase 2: Agent&lt;br/&gt;fresh per request, history restored"]
P2 --&gt;|"decides to call"| T["search_knowledge_base"]
T --&gt; D["FAISS&lt;br/&gt;semantic"]
T --&gt; S["BM25&lt;br/&gt;keywords"]
D --&gt; RRF["RRF fusion — top 7 chunks"]
S --&gt; RRF
RRF --&gt;|"chunks + citations"| P2
P2 --&gt; OUT["citations, then answer text,&lt;br/&gt;then a memory-usage report"]
end
&lt;/div&gt;
&lt;p&gt;Each line in the stream is typed: &lt;code&gt;thinking&lt;/code&gt;, &lt;code&gt;metadata&lt;/code&gt; (a citation), &lt;code&gt;text&lt;/code&gt; (answer), &lt;code&gt;memory&lt;/code&gt; (how full the conversation budget is), or &lt;code&gt;error&lt;/code&gt;. The frontend just reads lines and routes them to the right panel. I dissected this design, including why thinking comes &lt;em&gt;before&lt;/em&gt; the tool calls, in a separate post.&lt;/p&gt;
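&lt;p&gt;Reading such a stream takes very little code. The sketch below groups NDJSON lines by their type. The five type names come from the list above, but the JSON field names &lt;code&gt;type&lt;/code&gt; and &lt;code&gt;data&lt;/code&gt; and the function name &lt;code&gt;route_stream&lt;/code&gt; are assumptions, since the post does not show the wire format.&lt;/p&gt;

```python
import json


def route_stream(lines):
    """Group NDJSON lines by event type.

    Assumes each line looks like {"type": ..., "data": ...};
    those field names are hypothetical.
    """
    panels = {"thinking": [], "metadata": [], "text": [], "memory": [], "error": []}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between events
        event = json.loads(raw)  # every line is a complete JSON object
        kind = event.get("type")
        if kind in panels:       # unknown types are ignored
            panels[kind].append(event.get("data"))
    return panels


sample = [
    '{"type": "thinking", "data": "User wants a definition."}',
    '{"type": "metadata", "data": {"source": "handbook.pdf", "page": 3}}',
    "",
    '{"type": "text", "data": "The answer is"}',
    '{"type": "text", "data": " on page 3."}',
]
panels = route_stream(sample)
answer = "".join(panels["text"])
```

&lt;p&gt;Because each line parses on its own, a consumer can update the thinking panel, the Sources panel, and the answer text as lines arrive instead of waiting for the full response.&lt;/p&gt;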
&lt;h2 id="a-memory-budget-not-a-bottomless-pit"&gt;A memory budget, not a bottomless pit&lt;/h2&gt;
&lt;p&gt;Gemma&amp;rsquo;s context window (the amount of text the model can consider at once) is 128K tokens, but CogniVault doesn&amp;rsquo;t let conversation history sprawl across all of it. Each chat session gets a budget of 48,000 characters — roughly 12,000 tokens. Exceed it, and the &lt;em&gt;oldest&lt;/em&gt; question-answer pair quietly drops out first, keeping the bulk of the window free for what matters: your current question and the retrieved chunks.&lt;/p&gt;
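&lt;p&gt;The trimming rule is simple enough to sketch in a few lines. This is an illustration under assumptions, not the real implementation: the name &lt;code&gt;trim_history&lt;/code&gt; and the representation of history as a list of question-answer string pairs are made up here; only the 48,000-character budget and the oldest-first rule come from the text above.&lt;/p&gt;

```python
HISTORY_BUDGET_CHARS = 48_000  # per-session budget named in the post


def history_size(pairs):
    """Total characters across all (question, answer) pairs."""
    return sum(len(question) + len(answer) for question, answer in pairs)


def trim_history(pairs, budget=HISTORY_BUDGET_CHARS):
    """Drop the oldest pairs until the history fits the budget.

    pairs: list of (question, answer) strings, oldest first.
    """
    kept = list(pairs)  # do not mutate the caller's list
    while kept and history_size(kept) > budget:
        kept.pop(0)     # the oldest pair goes first
    return kept


history = [("q" * 30, "a" * 30), ("q" * 20, "a" * 20), ("q" * 10, "a" * 10)]
trimmed = trim_history(history, budget=70)
usage = history_size(trimmed) / 70  # a fraction that a memory report could show
```

&lt;p&gt;With a toy budget of 70 characters, the 60-character oldest pair is dropped and the remaining two pairs (60 characters in total) are kept.&lt;/p&gt;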
&lt;p&gt;Two resilience touches worth stealing for your own projects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Restart survival.&lt;/strong&gt; In-memory history dies with the process. So the first message in a session after a backend restart rebuilds its history from the chat log the frontend persists. Multi-turn memory survives reboots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edit and regenerate.&lt;/strong&gt; Editing an earlier message rewinds the stored history to that point before re-asking — the model genuinely forgets the timeline that no longer exists.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="scope-pinning-the-ai-to-specific-documents"&gt;Scope: pinning the AI to specific documents&lt;/h2&gt;
&lt;p&gt;One last feature, and a lesson about small local models. You can pin a chat to specific files or a category. The filter travels with the request &lt;em&gt;and&lt;/em&gt; a mandatory-search instruction is injected into both the system prompt and the user message itself.&lt;/p&gt;
&lt;p&gt;Why both? Because small models sometimes skip instructions that live only in the system prompt — but they can&amp;rsquo;t ignore what&amp;rsquo;s inside the question. Belt and braces. When you work with 4-billion-parameter models instead of frontier ones, you learn to make instructions impossible to miss rather than hoping they&amp;rsquo;re followed.&lt;/p&gt;
&lt;h2 id="the-takeaway"&gt;The takeaway&lt;/h2&gt;
&lt;p&gt;A cited answer is four systems cooperating: two retrievers covering each other&amp;rsquo;s blind spots, a fusion formula that needs nothing but ranks, an agent that picks its own tools, and a stream that shows its work. None of the four is exotic on its own — the product is the cooperation.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;Retrieve relevant passages from your own documents first; let the model answer from them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;The semantic (meaning-based) half of hybrid search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;The keyword half — a classic ranking formula from the Okapi information-retrieval system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;Merges the two ranked lists using only each chunk&amp;rsquo;s rank: &lt;code&gt;score = Σ 1/(60 + rank)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NDJSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Newline-Delimited JSON&lt;/td&gt;
&lt;td&gt;A stream where each line is its own complete JSON object — the chat response format&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The universal text format for structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree&lt;/td&gt;
&lt;td&gt;The parsed form of an expression — how the calculator does maths without &lt;code&gt;eval()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large Language Model&lt;/td&gt;
&lt;td&gt;A neural network trained on huge amounts of text that can read and generate language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Software Development Kit&lt;/td&gt;
&lt;td&gt;A library of building blocks — here, Strands, which provides the agent loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;K&lt;/strong&gt; (in 128K)&lt;/td&gt;
&lt;td&gt;Kilo (thousand)&lt;/td&gt;
&lt;td&gt;128K tokens ≈ 128,000 tokens — Gemma&amp;rsquo;s context window&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt; the same machinery pointed at generating quizzes, workshops, flashcards, and mindmaps, plus a table of every byte the app stores and exactly where it lives.&lt;/p&gt;</description></item><item><title>Part 2 · Hybrid Retrieval in Practice: FAISS + BM25, Fused with RRF</title><link>https://aretascodes.dev/blog/hybrid-retrieval-faiss-bm25-rrf/</link><pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/hybrid-retrieval-faiss-bm25-rrf/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building CogniVault, a fully local AI study companion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The first version of CogniVault used pure dense retrieval — embed the query with &lt;code&gt;embeddinggemma&lt;/code&gt;, search a FAISS index, pass the top-7 chunks to the model. It worked. It worked &lt;em&gt;beautifully&lt;/em&gt; — until a user uploaded a PDF containing some German legal text and asked for &amp;ldquo;§3 Absatz 2.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The model couldn&amp;rsquo;t find it.&lt;/p&gt;
&lt;p&gt;The chunk was &lt;em&gt;right there&lt;/em&gt;. The PDF was indexed. But &amp;ldquo;§3 Absatz 2&amp;rdquo; doesn&amp;rsquo;t embed into anything semantically meaningful — it&amp;rsquo;s a token-level identifier, not a concept. The dense vector for the query landed nowhere near the dense vector for the chunk, even though the chunk literally contains the string the user asked for.&lt;/p&gt;
&lt;p&gt;That bug killed pure dense retrieval for me. This post is about what replaced it.&lt;/p&gt;
&lt;h2 id="two-kinds-of-similar"&gt;Two kinds of &amp;ldquo;similar&amp;rdquo;&lt;/h2&gt;
&lt;p&gt;You already use both kinds of search every day. When Spotify builds a &amp;ldquo;song radio&amp;rdquo; from a track you like, it&amp;rsquo;s matching &lt;em&gt;feel&lt;/em&gt; — tempo, mood, genre — and it will happily play you a song whose title shares no words with the original. But when you type &lt;code&gt;Bohemian Rhapsody remastered 2011&lt;/code&gt; into the search box, you don&amp;rsquo;t want &lt;em&gt;feel&lt;/em&gt;. You want that exact string, and &amp;ldquo;a similar operatic rock epic&amp;rdquo; is a wrong answer.&lt;/p&gt;
&lt;p&gt;Search systems formalise that split into two notions of similarity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Lexical similarity&lt;/strong&gt; — &amp;ldquo;do these strings share rare words?&amp;rdquo; This is what TF-IDF and BM25 model. They thrive on identifiers, names, code, technical terminology, and direct quotes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic similarity&lt;/strong&gt; — &amp;ldquo;do these passages talk about the same idea, even with different words?&amp;rdquo; This is what embeddings model. They thrive on paraphrase, conceptual queries, and natural-language questions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Neither subsumes the other. A user asking &lt;em&gt;&amp;ldquo;how is the practical exam structured?&amp;rdquo;&lt;/em&gt; needs &lt;strong&gt;semantic&lt;/strong&gt; search, because the document doesn&amp;rsquo;t say &amp;ldquo;structure of practical exam.&amp;rdquo; A user asking &lt;em&gt;&amp;ldquo;§3 Absatz 2&amp;rdquo;&lt;/em&gt; needs &lt;strong&gt;lexical&lt;/strong&gt; search, because there&amp;rsquo;s no concept to embed, just a literal string.&lt;/p&gt;
&lt;p&gt;Production RAG has to do both. CogniVault does both, and then fuses the result lists with &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="the-stack"&gt;The stack&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Query
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ├── embed via embeddinggemma ──► FAISS IndexFlatIP ──► top-K dense
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── tokenize + lowercase ──► BM25Okapi ──► top-K sparse
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; Reciprocal Rank Fusion ◄──┘
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; top-7 fused chunks
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Both indexes live &lt;strong&gt;in memory&lt;/strong&gt;, fronted by a &lt;code&gt;VectorDB&lt;/code&gt; singleton. FAISS does inner-product search over normalised embeddings (so dot product = cosine). BM25 is &lt;code&gt;rank_bm25&lt;/code&gt;&amp;rsquo;s &lt;code&gt;BM25Okapi&lt;/code&gt;, fed the same chunks tokenised by a simple lowercase-and-split tokenizer.&lt;/p&gt;
&lt;p&gt;The corpora are kept in lockstep: soft-deleting a file&amp;rsquo;s chunks triggers a BM25 rebuild over the remaining active chunks, and the singleton reloads both indexes from &lt;code&gt;vector_store.faiss&lt;/code&gt; + &lt;code&gt;vector_store.json&lt;/code&gt; (chunk metadata + raw text) after every ingestion run and on app start.&lt;/p&gt;
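&lt;p&gt;The remark above that inner product over normalised embeddings equals cosine similarity can be checked by hand. The sketch below uses plain Python lists instead of FAISS so it runs anywhere; the helper names are made up for illustration, and the vectors are assumed to be non-zero.&lt;/p&gt;

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0]
chunk = [4.0, 3.0]

# Inner product after normalising, which is what an inner-product
# index sees when it is fed unit-length embeddings.
ip_on_unit_vectors = dot(normalise(query), normalise(chunk))
cos_on_raw_vectors = cosine(query, chunk)
```

&lt;p&gt;Both values come out at 0.96 (up to floating-point rounding), which is why normalising once at ingestion time lets an inner-product index stand in for a cosine index.&lt;/p&gt;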
&lt;h2 id="why-faiss-indexflatip-not-hnsw-or-ivf"&gt;Why FAISS &lt;code&gt;IndexFlatIP&lt;/code&gt;, not HNSW or IVF?&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;IndexFlatIP&lt;/code&gt; is brute-force exact search. It scans every vector, every query. For tens of thousands of chunks that&amp;rsquo;s fine — sub-millisecond on a laptop. CogniVault is a &lt;strong&gt;single-user, local-first&lt;/strong&gt; app; the index is never going to be billions of vectors. Trading recall for speed via HNSW or IVF would buy nothing here and lose the &amp;ldquo;exact&amp;rdquo; guarantee. Boring, correct, fast enough.&lt;/p&gt;
&lt;p&gt;When the corpus grows large enough that brute-force gets sticky, switching is a one-line change. Until then, the simplest index wins.&lt;/p&gt;
&lt;h2 id="reciprocal-rank-fusion"&gt;Reciprocal Rank Fusion&lt;/h2&gt;
&lt;p&gt;The naive way to combine two ranked lists is to add up their raw scores. That sounds reasonable until you remember that FAISS returns inner products bounded to [-1, 1] (the embeddings are normalised), while BM25 returns scores with no upper bound. The two aren&amp;rsquo;t comparable without normalisation, and any normalisation you pick is somewhat arbitrary.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RRF sidesteps the problem entirely.&lt;/strong&gt; It only looks at &lt;em&gt;ranks&lt;/em&gt;, not scores. For each result list, an item at rank &lt;code&gt;r&lt;/code&gt; contributes &lt;code&gt;1 / (k + r)&lt;/code&gt; to its final score (with &lt;code&gt;k = 60&lt;/code&gt; by convention — large enough to flatten the tail, small enough that the top items still dominate). Items that appear in both lists get summed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified — the real implementation also de-duplicates chunks&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# by (source, chunk_id, page) before scoring.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reciprocal_rank_fusion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_lists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result_lists&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;kv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That&amp;rsquo;s the whole algorithm. No tuning, no calibration, no per-corpus weights. A chunk that&amp;rsquo;s #1 in BM25 and #4 in FAISS easily beats a chunk that&amp;rsquo;s #2 in only one of them. A chunk that &lt;em&gt;both&lt;/em&gt; indexes agree on rises to the top deterministically.&lt;/p&gt;
&lt;p&gt;The result for the &amp;ldquo;§3 Absatz 2&amp;rdquo; query: BM25 finds the literal match and lands it at rank 1. FAISS finds nothing useful (its top hits are about exam regulations in general). A rank-1 hit in either list earns the maximum single-list score of 1/61, so RRF keeps the BM25 hit at the head of the fused list, where only chunks that both retrievers returned can outrank it. Problem solved.&lt;/p&gt;
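&lt;p&gt;The ranking claims above are easy to verify with the numbers plugged in. The helper &lt;code&gt;rrf_score&lt;/code&gt; below is a throwaway name for this worked example; it applies the same &lt;code&gt;1 / (k + rank)&lt;/code&gt; sum with &lt;code&gt;k = 60&lt;/code&gt;.&lt;/p&gt;

```python
K = 60  # the conventional RRF constant used throughout the post


def rrf_score(ranks, k=K):
    """Sum of 1 / (k + rank) over the lists a chunk appears in."""
    return sum(1.0 / (k + rank) for rank in ranks)


both_lists = rrf_score([1, 4])     # 1/61 + 1/64, about 0.0320
one_list = rrf_score([2])          # 1/62, about 0.0161

best_single = rrf_score([1])       # 1/61, about 0.0164
weak_double = rrf_score([20, 20])  # 2/80 = 0.025
```

&lt;p&gt;The chunk ranked #1 and #4 scores roughly twice as much as the chunk ranked #2 in a single list. The second pair shows a consequence worth keeping in mind: with 20 candidates per list, even a chunk ranked last in both lists (0.025) beats a chunk ranked first in only one (about 0.0164), so agreement between retrievers always outweighs a single strong vote.&lt;/p&gt;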
&lt;h2 id="scope-filtering-with-contextvar-isolation"&gt;Scope filtering with ContextVar isolation&lt;/h2&gt;
&lt;p&gt;One detail that&amp;rsquo;s easy to get wrong: the retriever has to be &lt;em&gt;scope-aware&lt;/em&gt;. CogniVault lets users limit a question to a single category or specific files. The scope is set by the request, but the search is called from deep inside the Strands agent loop, which is called from a streaming FastAPI handler, possibly with multiple concurrent requests in flight per worker.&lt;/p&gt;
&lt;p&gt;Threading the scope through every function call would be ugly. A global is unsafe. The right primitive is Python&amp;rsquo;s &lt;code&gt;contextvars.ContextVar&lt;/code&gt;, which gives you isolated state per asyncio task and per thread.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;contextvars&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ContextVar&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;_doc_scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContextVar&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DocScope&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ContextVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;doc_scope&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_doc_scope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DocScope&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;_doc_scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;current_doc_scope&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DocScope&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_doc_scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;/rag&lt;/code&gt; request handler sets the scope at the very start of each streaming response; the search tool reads it; because the value is task-local, it dies with the request. No globals, no parameter drilling, no race conditions across concurrent users.&lt;/p&gt;
&lt;p&gt;This is one of those design choices that looks like over-engineering until you have two browser tabs open and realise that without it, tab A&amp;rsquo;s scope filter would leak into tab B&amp;rsquo;s question.&lt;/p&gt;
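&lt;p&gt;The two-tabs scenario can be reproduced in miniature. The sketch below is a stand-alone demonstration, not the app&amp;rsquo;s handler: &lt;code&gt;handle_request&lt;/code&gt;, the scope strings, and the sleep delays are all invented to force the two tasks to interleave.&lt;/p&gt;

```python
import asyncio
from contextvars import ContextVar

_scope = ContextVar("scope", default=None)


async def handle_request(scope, delay):
    """Stand-in for a request handler: set the scope, do work, read it back."""
    _scope.set(scope)            # set at the start of the "request"
    await asyncio.sleep(delay)   # yield so the other task runs in between
    return _scope.get()          # still this task's own value


async def main():
    # gather() wraps each coroutine in its own task, and every task
    # runs in a copy of the current context.
    return await asyncio.gather(
        handle_request("tab-A: contracts", 0.02),
        handle_request("tab-B: invoices", 0.01),
    )


results = asyncio.run(main())
after = _scope.get()  # the value seen outside the tasks
```

&lt;p&gt;Each task reads back its own scope even though the other task set a different one in between, and the value outside the tasks is still the default &lt;code&gt;None&lt;/code&gt;. Swap the &lt;code&gt;ContextVar&lt;/code&gt; for a module-level global and tab A would read tab B&amp;rsquo;s scope.&lt;/p&gt;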
&lt;h2 id="chunking-choices-that-pay-off-downstream"&gt;Chunking choices that pay off downstream&lt;/h2&gt;
&lt;p&gt;Hybrid retrieval is only as good as the chunks. CogniVault uses a &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; with &lt;strong&gt;1,000 characters, 100 overlap&lt;/strong&gt; for unstructured text — small enough to keep retrieval precise, large enough to carry context for the model.&lt;/p&gt;
&lt;p&gt;For structured formats it switches strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Markdown&lt;/strong&gt; → &lt;code&gt;MarkdownHeaderTextSplitter&lt;/code&gt; emits one chunk per H1/H2/H3 section with the heading hierarchy prepended as a breadcrumb (&amp;ldquo;Privacy &amp;gt; Vault Audit &amp;gt; Indicators&amp;rdquo;). BM25 loves breadcrumbs — they make heading-keyword queries match cleanly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CSV&lt;/strong&gt; → header row + 20-row batches per chunk, so a query for a column name lands on the right block.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PPTX&lt;/strong&gt; → one chunk per slide, title and body text together.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;XLSX&lt;/strong&gt; → header + row batches, per sheet, with a &lt;code&gt;[Sheet: name]&lt;/code&gt; prefix.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tiny fragments get filtered: unstructured text needs at least &lt;strong&gt;100 characters&lt;/strong&gt; to become a chunk, while the structured formats drop the bar to &lt;strong&gt;20&lt;/strong&gt; — a two-line Markdown section or a header-only sheet is short but still meaningful. The recursive splitter is well-trodden territory, but the per-format strategies matter much more than people give them credit for.&lt;/p&gt;
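&lt;p&gt;As one example of a per-format strategy, here is a rough sketch of the CSV rule: repeat the header on every chunk, batch 20 rows at a time, and drop anything under the 20-character minimum. The function name &lt;code&gt;chunk_csv&lt;/code&gt; and the way rows are joined into text are assumptions; the post states only the batch size and the length threshold.&lt;/p&gt;

```python
import csv
import io

ROWS_PER_CHUNK = 20  # batch size named in the post
MIN_CHARS = 20       # minimum length for structured formats, per the post


def chunk_csv(text, rows_per_chunk=ROWS_PER_CHUNK, min_chars=MIN_CHARS):
    """Split CSV text into chunks that each start with the header row."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    header_line = ", ".join(header)

    chunks = []
    # max(..., 1) makes a header-only file still yield one candidate chunk.
    for start in range(0, max(len(body), 1), rows_per_chunk):
        batch = body[start:start + rows_per_chunk]
        lines = [header_line] + [", ".join(row) for row in batch]
        chunk = "\n".join(lines)
        if len(chunk) >= min_chars:
            chunks.append(chunk)
    return chunks


rows = "\n".join(f"student{i},{i}" for i in range(45))
chunks = chunk_csv("name,score\n" + rows)
```

&lt;p&gt;Forty-five data rows become three chunks (20, 20, and 5 rows), each opening with the header, so a query for a column name can match any of them.&lt;/p&gt;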
&lt;h2 id="what-id-do-differently"&gt;What I&amp;rsquo;d do differently&lt;/h2&gt;
&lt;p&gt;A few things I&amp;rsquo;d revisit if I were starting over:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stop tokenising for BM25 with &lt;code&gt;str.split()&lt;/code&gt;.&lt;/strong&gt; It&amp;rsquo;s fine, but a real tokenizer that handles punctuation and German compounds would meaningfully improve recall on the legal docs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add a small reranker.&lt;/strong&gt; RRF gets the right &lt;em&gt;set&lt;/em&gt;, but a cross-encoder rerank on the top 20 would polish the &lt;em&gt;order&lt;/em&gt;. Locally-served, of course — there are good small ones now.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query expansion for thin queries.&lt;/strong&gt; Two-word questions like &amp;ldquo;§3 exam&amp;rdquo; could be expanded via a quick &lt;code&gt;gemma4&lt;/code&gt; call before retrieval. Latency cost, recall gain.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of those are in the box yet. RRF over FAISS + BM25 is already so much better than either alone that I haven&amp;rsquo;t felt the pull to optimise further.&lt;/p&gt;
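&lt;p&gt;To show what the first item would buy, here is a small comparison of lowercase-and-split against a regex tokenizer. The regex version is a hypothetical alternative, not something in the codebase; it strips punctuation but does nothing for German compounds, which would need a proper language-aware tokenizer.&lt;/p&gt;

```python
import re

# Word characters plus the section sign, which \w does not cover.
_TOKEN = re.compile(r"[\w§]+")


def split_tokens(text):
    """The approach described above: lowercase, then str.split()."""
    return text.lower().split()


def regex_tokens(text):
    """Hypothetical alternative: keep word characters and the section sign."""
    return _TOKEN.findall(text.lower())


chunk = "Siehe §3 Absatz 2, Satz 1."
query = "§3 Absatz 2"

# Does every query token occur as a token of the chunk?
split_match = set(split_tokens(query)).issubset(split_tokens(chunk))
regex_match = set(regex_tokens(query)).issubset(regex_tokens(chunk))
```

&lt;p&gt;With plain splitting the chunk contains the token &lt;code&gt;2,&lt;/code&gt; (comma attached), so the query token &lt;code&gt;2&lt;/code&gt; never matches and &lt;code&gt;split_match&lt;/code&gt; is false. The regex tokenizer strips the comma and all three query tokens match.&lt;/p&gt;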
&lt;h2 id="the-takeaway"&gt;The takeaway&lt;/h2&gt;
&lt;p&gt;If your retrieval is &amp;ldquo;embed + cosine + top-k,&amp;rdquo; it will fail in exactly the way mine did — on the queries that contain literal identifiers your model has no embedding for. The fix isn&amp;rsquo;t a better embedding model. It&amp;rsquo;s a second retriever that doesn&amp;rsquo;t pretend everything is a concept.&lt;/p&gt;
&lt;p&gt;FAISS for ideas. BM25 for strings. RRF to decide which one was right today.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;Retrieve relevant passages from your own documents first; let the model answer from them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;Meta&amp;rsquo;s library for storing vectors and finding the most similar ones fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;A keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;Merges ranked lists using only ranks: each item scores &lt;code&gt;Σ 1/(k + rank)&lt;/code&gt; across lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TF-IDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Term Frequency–Inverse Document Frequency&lt;/td&gt;
&lt;td&gt;BM25&amp;rsquo;s ancestor: score words by how often they appear here vs how rare they are everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IP&lt;/strong&gt; (in &lt;code&gt;IndexFlatIP&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Inner Product&lt;/td&gt;
&lt;td&gt;The similarity measure FAISS computes; on normalised vectors it equals cosine similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HNSW&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hierarchical Navigable Small World&lt;/td&gt;
&lt;td&gt;A popular &lt;em&gt;approximate&lt;/em&gt; vector-index structure — deliberately not used here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IVF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inverted File Index&lt;/td&gt;
&lt;td&gt;Another approximate FAISS index type — also deliberately not used&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AEVO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ausbildereignungsverordnung&lt;/td&gt;
&lt;td&gt;The German trainer-aptitude regulation whose &amp;ldquo;§3 Absatz 2&amp;rdquo; query broke pure dense retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV / PPTX / XLSX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Comma-Separated Values / PowerPoint / Excel (Office Open XML)&lt;/td&gt;
&lt;td&gt;Structured formats with their own chunking strategies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H1/H2/H3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Heading levels 1–3&lt;/td&gt;
&lt;td&gt;The Markdown heading tiers used to split sections&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt; how CogniVault&amp;rsquo;s &lt;code&gt;/rag&lt;/code&gt; endpoint streams Gemma 4&amp;rsquo;s &lt;em&gt;thinking&lt;/em&gt; before any tool calls run.&lt;/p&gt;</description></item></channel></rss>