<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AI |</title><link>https://aretascodes.dev/tags/ai/</link><atom:link href="https://aretascodes.dev/tags/ai/index.xml" rel="self" type="application/rss+xml"/><description>AI</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 25 May 2026 00:00:00 +0000</lastBuildDate><image><url>https://aretascodes.dev/media/icon_hu_2ab4f4763b27c75b.png</url><title>AI</title><link>https://aretascodes.dev/tags/ai/</link></image><item><title>Gemma CogniVault</title><link>https://aretascodes.dev/projects/cognivault/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/projects/cognivault/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Gemma CogniVault&lt;/strong&gt; is a 100% local, privacy-first AI study companion. Your documents stay on your hardware. Inference runs via Ollama on &lt;code&gt;localhost&lt;/code&gt;. No telemetry, no embeddings sent to third parties, no exceptions. A live Privacy Vault Audit Panel confirms zero external connections at runtime.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s also genuinely capable — Gemma 4&amp;rsquo;s full surface (completion, vision, tools, reasoning) running on your laptop, wrapped in an app that turns your documents into &lt;strong&gt;quizzes, multi-lesson workshops, flashcard decks, and visual mindmaps&lt;/strong&gt;, with a learning-progress dashboard and 25 achievement badges.&lt;/p&gt;
&lt;h2 id="whats-inside"&gt;What&amp;rsquo;s inside&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM &amp;amp; Embeddings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ollama · &lt;code&gt;gemma4:e4b&lt;/code&gt; · &lt;code&gt;embeddinggemma&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Framework&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strands Agents SDK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI · Python 3.10+ · Pydantic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector Search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FAISS IndexFlatIP + BM25Okapi · Reciprocal Rank Fusion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Document Parsing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;pypdf · python-docx · python-pptx · openpyxl · trafilatura&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;pytesseract · pymupdf · Pillow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;faster-whisper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflow Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DBOS + PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;React 19 · TypeScript · Vite · Tailwind v4 · Framer Motion · TanStack Query&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="four-sections"&gt;Four sections&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;What it&amp;rsquo;s for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;💬 Chat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ask anything about your documents. Cited answers, scope filter, voice, attachments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;📚 Knowledge Base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upload, categorise, and manage your documents. SHA-256 change detection on re-upload.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;🎓 Study Hub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Four AI-powered study modes: Quiz · Workshop · Flashcards · Mindmaps.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;📊 Dashboard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total study time, current streak, 25 achievement badges, 90-day activity heatmap.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="highlights"&gt;Highlights&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;🧠 Thinking Mode&lt;/strong&gt; — collapsible reasoning panel streams Gemma 4&amp;rsquo;s chain of thought before the answer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;🔍 Hybrid Retrieval&lt;/strong&gt; — FAISS dense + BM25 keyword fused with Reciprocal Rank Fusion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;🖼️ Multimodal&lt;/strong&gt; — attach images, PDFs, and DOCX inline in chat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;🛟 Durable workflows&lt;/strong&gt; — DBOS-checkpointed ingestion; crash-safe and resumable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;🏆 25 achievement badges&lt;/strong&gt; — auto-tracked across chat, quizzes, workshops, flashcards, mindmaps&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;🔒 Vault Audit Panel&lt;/strong&gt; — live &amp;ldquo;zero external connections&amp;rdquo; indicator&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="writing-about-it"&gt;Writing about it&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m publishing a series of posts unpacking the engineering decisions behind CogniVault — privacy framing, the retrieval stack, the agent loop, ingestion durability, getting JSON out of a local model, drawing mindmaps without a graph library, the gamification layer, and how the test suite avoids needing any infrastructure to run.&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;See the
for the full series.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="try-it"&gt;Try it&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;git clone https://github.com/ndimoforaretas/local-gemma-rag.git
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; local-gemma-rag
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;./scripts/setup.sh &lt;span class="c1"&gt;# one-time&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;./scripts/start.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then open
.&lt;/p&gt;</description></item><item><title>Part 8 · Testing a Local-AI App: 351 Tests, Zero Infrastructure</title><link>https://aretascodes.dev/blog/testing-local-ai-app/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/testing-local-ai-app/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building
. Previously:
.
All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CogniVault has &lt;strong&gt;351 tests across 22 files&lt;/strong&gt; (at the time of writing — the suite grows with the app). None of them need Ollama. None of them need Postgres. None of them need a real PDF, a microphone, or an internet connection. The whole suite runs in &lt;strong&gt;about three seconds&lt;/strong&gt; on my laptop.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s not because there isn&amp;rsquo;t much to test — the surface is wide. It&amp;rsquo;s because the test suite is built around one principle: &lt;strong&gt;mock at the edge, real everywhere else.&lt;/strong&gt; This post is about what &amp;ldquo;the edge&amp;rdquo; means in a local-AI app, and how to draw the line so the suite stays useful instead of decorative.&lt;/p&gt;
&lt;h2 id="the-22-test-files"&gt;The 22 test files&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;What it covers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_api.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The HTTP endpoints (upload, ingest, RAG, history, KB browsing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_tools.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Calculator, clock, KB search tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_thinking.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Two-phase stream, thinking tokens, session isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_chat_attachments.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-file attach, PDF/DOCX extraction, size limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_chat_memory.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Session history budget, trimming, restart rebuild&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_doc_scope_filter.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-request ContextVar isolation, search filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_doc_tools.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;list_documents&lt;/code&gt;, &lt;code&gt;analyze_document&lt;/code&gt;, &lt;code&gt;compare_documents&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_edit_regenerate.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;History rewind, trim_history_to_turns validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_structure_chunking.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Markdown header splits, CSV row batches, doc types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_ocr_fallback.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OCR trigger threshold, graceful degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_new_formats.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PPTX, XLSX, HTML extractors, extension routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_docx_url.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DOCX ingestion and URL import (with the SSRF guard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_reingest.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SHA-256 change detection, idempotency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_vector_db.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BM25, FAISS, RRF fusion, hybrid search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_audio.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whisper transcription endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_progress.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sessions, daily aggregation, achievement criteria&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_prompts.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The prompt-template loader and custom overrides&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_vault_stats.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The Privacy Vault Audit numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_quiz.py&lt;/code&gt; / &lt;code&gt;test_workshop.py&lt;/code&gt; / &lt;code&gt;test_flashcards.py&lt;/code&gt; / &lt;code&gt;test_mindmaps.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-mode parsing, endpoints, achievements&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Everything that &lt;em&gt;can&lt;/em&gt; be tested in isolation is tested in isolation. Everything that needs to be tested through the FastAPI layer is, but the &lt;em&gt;only&lt;/em&gt; things mocked are the calls that cross the process boundary.&lt;/p&gt;
&lt;h2 id="what-gets-mocked-what-doesnt"&gt;What gets mocked, what doesn&amp;rsquo;t&lt;/h2&gt;
&lt;p&gt;The single most important question in a project like this: where do you stub?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;React&lt;/span&gt; &lt;span class="n"&gt;frontend&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt; &lt;span class="n"&gt;handlers&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;tested&lt;/span&gt; &lt;span class="n"&gt;directly&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;tested&lt;/span&gt; &lt;span class="n"&gt;directly&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rag_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generators&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;├─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;BM25&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;real&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fast&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;├─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;SQLite&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;real&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;against&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;tmp_path&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;├─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;DBOS&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;patched&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="n"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="n"&gt;Postgres&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;├─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;Ollama&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;patched&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s import site&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;└─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;Whisper&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;stubbed&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="mi"&gt;145&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The rule of thumb: &lt;strong&gt;anything that crosses a process or network boundary, mock. Anything in-process, run for real.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;FAISS and BM25 are real because they&amp;rsquo;re libraries we link into the test process. SQLite is real because it&amp;rsquo;s a file. DBOS is patched because launching it expects a Postgres connection, and that&amp;rsquo;s network. Ollama is patched because it&amp;rsquo;s HTTP. Whisper is stubbed because loading a 145 MB model in a unit test is silly.&lt;/p&gt;
&lt;p&gt;That principle keeps the test suite fast (no I/O the OS can&amp;rsquo;t handle in milliseconds) and meaningful (the real code paths through retrieval, chunking, parsing, scope filtering all execute).&lt;/p&gt;
&lt;h2 id="mocking-ollama"&gt;Mocking Ollama&lt;/h2&gt;
&lt;p&gt;Most CogniVault tests need &lt;em&gt;some&lt;/em&gt; model output, but they don&amp;rsquo;t care what model produced it. Each service imports the &lt;code&gt;ollama&lt;/code&gt; module directly, so the tests patch that reference &lt;strong&gt;at the service&amp;rsquo;s own import site&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Real pattern from test_quiz.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;unittest.mock&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;backend.services&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;quiz_generator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_quiz_parses_questions&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;message&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;questions&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VALID_MCQ&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quiz_generator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;ollama&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mock_ollama&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mock_ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quiz_generator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate_quiz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;difficulty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;beginner&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_questions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;mcq&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A streaming variant feeds chunk sequences instead of a single response, used by the RAG and thinking tests. The key property: one &lt;code&gt;patch.object&lt;/code&gt; against the module the service actually uses. No deep mock hierarchies, no fragile string paths into third-party internals. Easy to read in a code review, easy to debug when a test fails.&lt;/p&gt;
&lt;h2 id="mocking-dbos"&gt;Mocking DBOS&lt;/h2&gt;
&lt;p&gt;DBOS expects &lt;code&gt;launch()&lt;/code&gt; to connect to Postgres. The shared &lt;code&gt;client&lt;/code&gt; fixture in &lt;code&gt;conftest.py&lt;/code&gt; simply patches the &lt;code&gt;dbos&lt;/code&gt; instance before the app is exercised:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Real pattern from conftest.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;A FastAPI TestClient with DBOS launch mocked out — no Postgres needed.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;backend.services.ingest.dbos&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mock_dbos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mock_dbos&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;launch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MagicMock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;backend.main&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The decorated workflow steps still execute as ordinary Python functions — we lose the durability semantics, but the tests aren&amp;rsquo;t testing durability, they&amp;rsquo;re testing the &lt;em&gt;business logic inside the steps&lt;/em&gt; (hash detection, extraction, chunking). The durability layer has its own tests upstream, in DBOS&amp;rsquo;s own suite.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a second isolation layer that runs on &lt;strong&gt;every&lt;/strong&gt; test automatically: an autouse fixture points the docs folder, FAISS index, and metadata file at a per-test &lt;code&gt;tmp_path&lt;/code&gt; via environment variables, so no test can ever touch real data on disk.&lt;/p&gt;
&lt;h2 id="real-sqlite-with-one-override"&gt;Real SQLite, with one override&lt;/h2&gt;
&lt;p&gt;Progress tracking, achievements, quiz storage, deck CRUD — all SQLite. The progress tracker exposes a single test seam: a module-level path override.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Real pattern from test_quiz.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autouse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_isolate_progress_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;progress_tracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;_db_path_override&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_path&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;progress_test.db&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Every test gets a fresh database file; the schema auto-creates on first use. No connection pooling drama, no leaked state between tests, no in-memory &lt;code&gt;:memory:&lt;/code&gt; gymnastics. Just a temp file per test.&lt;/p&gt;
&lt;p&gt;This is the kind of test that catches bugs an SQL-level mock would never see — a missing index, a botched migration, a constraint that doesn&amp;rsquo;t fire. SQLite is fast enough on every machine I&amp;rsquo;ve ever owned that &amp;ldquo;use the real database&amp;rdquo; isn&amp;rsquo;t even a trade-off.&lt;/p&gt;
&lt;h2 id="the-testclient-pattern"&gt;The TestClient pattern&lt;/h2&gt;
&lt;p&gt;For HTTP tests, FastAPI&amp;rsquo;s &lt;code&gt;TestClient&lt;/code&gt; runs the app in-process. The upload, the validation, the chunking, the vector-store update, the response serialisation — every layer runs for real. Only the calls that would leave the process (the Ollama embedding call inside ingestion, the model call inside generation) are patched. That&amp;rsquo;s the right line: the test verifies the &lt;em&gt;integration&lt;/em&gt; of those layers, but doesn&amp;rsquo;t depend on an external service.&lt;/p&gt;
&lt;p&gt;The streaming endpoint tests use a slightly different style — they iterate the response body and parse each &lt;strong&gt;NDJSON&lt;/strong&gt; line (one JSON envelope per line, as described in
) — but the principle is identical.&lt;/p&gt;
&lt;h2 id="coverage-gaps-i-accept"&gt;Coverage gaps I accept&lt;/h2&gt;
&lt;p&gt;Three things the test suite &lt;em&gt;doesn&amp;rsquo;t&lt;/em&gt; cover:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The frontend.&lt;/strong&gt; No React testing in this suite — that&amp;rsquo;s a separate concern. Most failures show up in API tests anyway, because the frontend is a thin client over a typed API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real Ollama prompt quality.&lt;/strong&gt; Whether &lt;code&gt;gemma4:e4b&lt;/code&gt; actually produces &lt;em&gt;useful&lt;/em&gt; quiz questions is not a thing tests can answer. That&amp;rsquo;s evaluation, not testing. It belongs in a separate harness with a real model running.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Race conditions across DBOS workflow restarts.&lt;/strong&gt; The resume path is exercised at the logic level, but the full state space of &amp;ldquo;what happens if Postgres goes away at this exact instant&amp;rdquo; is too large to enumerate.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These are conscious gaps. The test suite is for catching regressions in code I wrote; it&amp;rsquo;s not a replacement for evaluation, integration testing, or actual chaos engineering.&lt;/p&gt;
&lt;h2 id="what-the-suite-is-actually-for"&gt;What the suite is actually for&lt;/h2&gt;
&lt;p&gt;Two things, in order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Refactor confidence.&lt;/strong&gt; When I rip out the agent loop and put a new one in, do the tests still pass? If yes, the API contracts I care about haven&amp;rsquo;t drifted.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PR review surface.&lt;/strong&gt; Every PR runs the suite in CI. A green run is a precondition for merge. The suite is loud enough that a real regression makes the noise.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Notice what it &lt;em&gt;isn&amp;rsquo;t&lt;/em&gt; for: proving the model works. It can&amp;rsquo;t. Tests can pin behaviour but they can&amp;rsquo;t pin quality. That&amp;rsquo;s a different muscle, and it belongs in a different harness.&lt;/p&gt;
&lt;h2 id="whats-worth-borrowing"&gt;What&amp;rsquo;s worth borrowing&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;re building a local-AI app and your tests need Ollama running:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Patch the &lt;code&gt;ollama&lt;/code&gt; module at each service&amp;rsquo;s import site with &lt;code&gt;patch.object(service_module, &amp;quot;ollama&amp;quot;)&lt;/code&gt; — one seam per service, no shims required.&lt;/li&gt;
&lt;li&gt;Give your DB layer a path override and run against a &lt;code&gt;tmp_path&lt;/code&gt; SQLite file.&lt;/li&gt;
&lt;li&gt;Use an autouse fixture to redirect every on-disk artefact (docs folder, index files) to &lt;code&gt;tmp_path&lt;/code&gt;, so no test can touch real data even by accident.&lt;/li&gt;
&lt;li&gt;For each external service (model, audio, workflow engine), draw the seam at the process boundary. Test everything above it with real code.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is a suite where every test runs in any environment, finishes in milliseconds, and exercises the actual integration of every layer of code you wrote. 351 tests in about three seconds isn&amp;rsquo;t an optimisation, it&amp;rsquo;s a side-effect of mocking only at the edges.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continuous Integration&lt;/td&gt;
&lt;td&gt;Automatically running the test suite on every push/PR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pull Request&lt;/td&gt;
&lt;td&gt;A proposed code change — merged only when the suite is green&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The HTTP surface the TestClient exercises in-process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HyperText Transfer Protocol&lt;/td&gt;
&lt;td&gt;The protocol the (in-process) endpoint tests speak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;The retrieval-then-answer pipeline under test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Knowledge Base&lt;/td&gt;
&lt;td&gt;The indexed document collection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;Real in tests — it&amp;rsquo;s an in-process library&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;The keyword index — also real in tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;The rank-merging formula covered by &lt;code&gt;test_vector_db.py&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQLite / SQL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;(SQL = Structured Query Language)&lt;/td&gt;
&lt;td&gt;The real, file-based database every progress test runs against&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database-Oriented Operating System&lt;/td&gt;
&lt;td&gt;The durable-workflow library — patched so no Postgres is needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optical Character Recognition&lt;/td&gt;
&lt;td&gt;The scanned-PDF fallback with its own trigger-threshold tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-Side Request Forgery&lt;/td&gt;
&lt;td&gt;The URL-import attack class covered in &lt;code&gt;test_docx_url.py&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NDJSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Newline-Delimited JSON&lt;/td&gt;
&lt;td&gt;The streaming format the endpoint tests parse line by line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SHA-256&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure Hash Algorithm, 256-bit&lt;/td&gt;
&lt;td&gt;The content fingerprint behind the re-ingest tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CRUD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create, Read, Update, Delete&lt;/td&gt;
&lt;td&gt;The basic storage operations for decks, quizzes, and maps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF / DOCX / PPTX / XLSX / HTML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Portable Document Format / Word / PowerPoint / Excel / HyperText Markup Language&lt;/td&gt;
&lt;td&gt;The extractor formats with dedicated tests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;That&amp;rsquo;s the series. Eight posts on the parts of
I&amp;rsquo;m most proud of — and a handful I&amp;rsquo;d build differently. If any of it was useful to you, the code is open source at
, and the
is on YouTube.&lt;/p&gt;
&lt;p&gt;Your data. Your hardware. Your AI. Your vault.&lt;/p&gt;</description></item><item><title>Part 5 · Getting Reliable JSON Out of a Local LLM</title><link>https://aretascodes.dev/blog/reliable-json-local-llm/</link><pubDate>Sun, 10 May 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/reliable-json-local-llm/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building
. Previously:
.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CogniVault&amp;rsquo;s Study Hub generates four kinds of structured artefacts from your documents: quizzes, multi-lesson workshops, flashcard decks, and mindmaps. All four need the model to return structured JSON, not prose. All four ride on Gemma 4 running locally via Ollama. And all four would fail far too often if I trusted the model to &amp;ldquo;just return JSON.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the defensive pattern that brings that failure rate close to zero — and what to do about the cases that still get through.&lt;/p&gt;
&lt;h2 id="the-pattern"&gt;The pattern&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;1.&lt;/span&gt; &lt;span class="n"&gt;Retrieve&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;hybrid&lt;/span&gt; &lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="n"&gt;restricted&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;2.&lt;/span&gt; &lt;span class="n"&gt;Prompt&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;strict&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;explicit&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="n"&gt;rules&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;3.&lt;/span&gt; &lt;span class="n"&gt;Generate&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;json&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grammar&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;constrained&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;4.&lt;/span&gt; &lt;span class="n"&gt;Parse&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tolerant&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;object&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;fenced&lt;/span&gt; &lt;span class="n"&gt;shapes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;trailing&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;comma&lt;/span&gt; &lt;span class="n"&gt;repair&lt;/span&gt; &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;5.&lt;/span&gt; &lt;span class="n"&gt;Validate&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;drop&lt;/span&gt; &lt;span class="n"&gt;malformed&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="n"&gt;rather&lt;/span&gt; &lt;span class="n"&gt;than&lt;/span&gt; &lt;span class="n"&gt;fail&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;whole&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;6.&lt;/span&gt; &lt;span class="n"&gt;Retry&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;workshop&lt;/span&gt; &lt;span class="n"&gt;outline&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt; &lt;span class="n"&gt;once&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;stronger&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="mf"&gt;7.&lt;/span&gt; &lt;span class="n"&gt;Persist&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;SQLite&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;so&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;come&lt;/span&gt; &lt;span class="n"&gt;back&lt;/span&gt; &lt;span class="n"&gt;later&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Every generator in CogniVault follows it. The interesting moves are 2, 4, and 5.&lt;/p&gt;
&lt;h2 id="step-3-formatjson-does-real-work"&gt;Step 3: &lt;code&gt;format=&amp;quot;json&amp;quot;&lt;/code&gt; does real work&lt;/h2&gt;
&lt;p&gt;Ollama exposes a &lt;code&gt;format=&amp;quot;json&amp;quot;&lt;/code&gt; option that puts the model under a &lt;strong&gt;grammar constraint&lt;/strong&gt; during sampling. The decoder won&amp;rsquo;t emit tokens that would make the output invalid JSON. It&amp;rsquo;s not perfect — schemas are bigger than &amp;ldquo;valid JSON,&amp;rdquo; and the model can still produce well-formed garbage — but it eliminates the entire class of &amp;ldquo;the model started writing prose before the closing brace&amp;rdquo; failures.&lt;/p&gt;
&lt;p&gt;If your local-LLM stack supports a grammar option (Ollama, llama.cpp, vLLM, etc.), turn it on. It&amp;rsquo;s not free (sampling is slightly slower) but the failure-mode improvement is enormous. Without it, you&amp;rsquo;ll spend most of your error budget on truncated objects.&lt;/p&gt;
&lt;h2 id="step-2-schema-in-prompt-that-the-model-can-actually-obey"&gt;Step 2: schema-in-prompt that the model can actually obey&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;format=&amp;quot;json&amp;quot;&lt;/code&gt; guarantees the &lt;em&gt;shape&lt;/em&gt; of the output is JSON. It says nothing about whether the JSON matches your domain schema. That&amp;rsquo;s the prompt&amp;rsquo;s job.&lt;/p&gt;
&lt;p&gt;The pattern that works for me: instead of dumping a formal JSON Schema and saying &amp;ldquo;obey this,&amp;rdquo; include a &lt;strong&gt;filled-in example&lt;/strong&gt; that shows the model the exact shape, plus explicit counts. Here&amp;rsquo;s the heart of CogniVault&amp;rsquo;s real quiz template (it lives as an editable Markdown file in &lt;code&gt;backend/prompts/quiz.md&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Output ONLY a single JSON object — no prose, no markdown fences,
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;no text outside the JSON.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;NUMBER OF QUESTIONS: EXACTLY $num_questions. This is a hard requirement.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;OUTPUT SCHEMA:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &amp;#34;questions&amp;#34;: [
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &amp;#34;type&amp;#34;: one of [$types_csv],
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &amp;#34;question&amp;#34;: the question text (string, no leading numbering),
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &amp;#34;options&amp;#34;: array of strings (length 4 for mcq, length 2 for true_false),
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &amp;#34;correct_index&amp;#34;: integer index into options (0-based),
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &amp;#34;explanation&amp;#34;: 1-2 sentence explanation of the correct answer
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; },
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ... exactly $num_questions entries
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A few choices that matter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Show the shape, don&amp;rsquo;t describe it.&lt;/strong&gt; &amp;ldquo;Each item has a &lt;code&gt;type&lt;/code&gt; field&amp;rdquo; gets ignored more often than the literal example.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pin the count.&lt;/strong&gt; &amp;ldquo;EXACTLY 10&amp;rdquo; — repeated, in capitals, as a hard requirement — is much more reliable than &amp;ldquo;around 10.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Index, don&amp;rsquo;t repeat.&lt;/strong&gt; The correct answer is &lt;code&gt;correct_index&lt;/code&gt;, an integer pointing into &lt;code&gt;options&lt;/code&gt; — not the answer text again. Repeated text invites paraphrase drift (&amp;ldquo;Paris&amp;rdquo; vs &amp;ldquo;Paris, France&amp;rdquo;), and then your grading comparison breaks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;One artefact per call.&lt;/strong&gt; I tried generating a full workshop (outline + every lesson) in one call. The model&amp;rsquo;s quality degrades sharply as the response grows. Splitting into outline-first, lesson-on-demand is the two-pass strategy below.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="step-4-parse-tolerantly"&gt;Step 4: parse, tolerantly&lt;/h2&gt;
&lt;p&gt;Even with &lt;code&gt;format=&amp;quot;json&amp;quot;&lt;/code&gt;, two parsing problems survive in practice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The shape surprise.&lt;/strong&gt; This one bit me in production: I&amp;rsquo;d assumed the model would return a bare JSON array of questions. With &lt;code&gt;format=&amp;quot;json&amp;quot;&lt;/code&gt;, Gemma consistently returns an &lt;strong&gt;object&lt;/strong&gt; — &lt;code&gt;{&amp;quot;questions&amp;quot;: [...]}&lt;/code&gt; — and for a while the parser only accepted the array. Result: a 502 on every quiz generation until I found it. The fix is a parser that meets the model where it is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified from backend/services/quiz_generator.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extract_json_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;extract_json_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_json_lenient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="c1"&gt;# bare array&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;questions&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# the expected object shape&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Lexical glitches.&lt;/strong&gt; Occasionally a trailing comma slips through. The repair is deliberately narrow — one regex pass, then give up:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_json_lenient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;repaired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;,(\s*[\]}])&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;\1&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# strip trailing commas&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repaired&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I don&amp;rsquo;t try to balance brackets, complete truncated strings, or guess at missing fields. Either the output is fixable with a trailing-comma pass and some substring extraction, or it isn&amp;rsquo;t, and we move to step 5.&lt;/p&gt;
&lt;h2 id="step-5-drop-malformed-items-dont-fail-the-batch"&gt;Step 5: drop malformed items, don&amp;rsquo;t fail the batch&lt;/h2&gt;
&lt;p&gt;This is the call that took me a while to make peace with.&lt;/p&gt;
&lt;p&gt;When the model returns 10 quiz questions but #7 is missing its &lt;code&gt;options&lt;/code&gt; field, the temptation is to error out and regenerate the whole batch. &lt;em&gt;Don&amp;rsquo;t&lt;/em&gt;. Validate each item independently and drop the ones that fail.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# CogniVault does this with explicit field checks into a dataclass;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# pydantic works just as well.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;questions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;raw_item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validate_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allowed_types&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# returns None if malformed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The user gets 9 questions instead of 10. They don&amp;rsquo;t notice. Re-running the whole generation to fix question #7 takes 30 seconds and might introduce &lt;em&gt;new&lt;/em&gt; failures in questions 1-6. The dropped-item approach is strictly better UX. (The model also sometimes overshoots the count — the validated list is simply trimmed back to what was asked for.)&lt;/p&gt;
&lt;h2 id="step-6-the-outline-retries-once"&gt;Step 6: the outline retries once&lt;/h2&gt;
&lt;p&gt;Workshops are the exception that proves the rule. A workshop is a structured outline (title, summary, lesson list) plus each lesson&amp;rsquo;s content. The outline &lt;em&gt;must&lt;/em&gt; parse — there&amp;rsquo;s no partial success for a table of contents — so a parse failure there triggers exactly &lt;strong&gt;one&lt;/strong&gt; retry, with the prompt re-sent plus a stern reminder: &amp;ldquo;Your previous response was unparseable. Output ONLY a single valid JSON object.&amp;rdquo; If the second attempt fails too, the user gets a clear error suggesting a narrower scope.&lt;/p&gt;
&lt;p&gt;One retry, not three. Three retries when the model is consistently confused is just wasted seconds and watts.&lt;/p&gt;
&lt;p&gt;The lessons themselves, interestingly, are &lt;strong&gt;not JSON at all&lt;/strong&gt;. A lesson body is prose — forcing it into a JSON string would buy nothing and cost escaping headaches. Lessons are generated as plain Markdown, then run through a small cleanup pass that strips chat-isms the model sometimes adds despite instructions (&amp;ldquo;I hope this helps!&amp;rdquo;, &amp;ldquo;Let me know if…&amp;rdquo;). Different output, different contract.&lt;/p&gt;
&lt;h2 id="two-pass-outline-first-lessons-on-demand"&gt;Two-pass: outline first, lessons on demand&lt;/h2&gt;
&lt;p&gt;Workshops use a two-pass generation pattern:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Pass 1 — generate outline: {&amp;#34;title&amp;#34;: ..., &amp;#34;lessons&amp;#34;: [{&amp;#34;title&amp;#34;: ...}, ...]} (cheap, JSON)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Pass 2 — for each lesson: a full Markdown lesson body (on demand)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The outline is fast and lets the user see the shape of the workshop immediately. Each lesson is generated when the user opens it — meaning the user is &lt;em&gt;reading&lt;/em&gt; lesson 1 while deciding whether they even want lesson 5. The total wall-clock time to &amp;ldquo;first useful content&amp;rdquo; is small even for a 10-lesson workshop.&lt;/p&gt;
&lt;p&gt;This is the same architectural move the chat side makes with
: split a slow operation into a tiny fast part and a larger slow part, hand the user the fast part immediately.&lt;/p&gt;
&lt;h2 id="what-i-learned-so-far-putting-those-generators-together"&gt;What I learned so far putting those generators together&lt;/h2&gt;
&lt;p&gt;A few principles distilled from the four generators:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Use the grammar option in your inference stack.&lt;/strong&gt; Don&amp;rsquo;t try to coax JSON out of a free-form decoder.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pin every quantifier in the prompt.&lt;/strong&gt; &amp;ldquo;Exactly 10,&amp;rdquo; &amp;ldquo;exactly 4 options,&amp;rdquo; &amp;ldquo;one or two sentences.&amp;rdquo; Vague counts = inconsistent output.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don&amp;rsquo;t assume the top-level shape.&lt;/strong&gt; Grammar-constrained Gemma likes objects; your code might expect arrays. Accept both — the parser is cheaper than relying on the model to return the expected shape.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Drop, don&amp;rsquo;t fail.&lt;/strong&gt; Lossy success beats brittle perfection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;One retry, never more.&lt;/strong&gt; If two tries can&amp;rsquo;t produce valid output, the prompt is wrong, not the model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Split large generations.&lt;/strong&gt; Outline + lessons. Skeleton + body. Two small calls beat one big one almost every time. And if a part of the output is naturally prose, let it &lt;em&gt;be&lt;/em&gt; prose.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Local LLMs in 2026 are good enough that structured generation is genuinely usable for production-shaped features. They are not so good that you can skip the defensive scaffolding. The scaffolding above is maybe 80 lines of code total across all four generators, and it&amp;rsquo;s the difference between &amp;ldquo;demo-quality&amp;rdquo; and &amp;ldquo;I trust this enough to ship.&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The structured text format the generators must produce&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large Language Model&lt;/td&gt;
&lt;td&gt;A neural network trained on huge amounts of text that can read and generate language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Artificial Intelligence&lt;/td&gt;
&lt;td&gt;Software performing tasks that normally need human intelligence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCQ&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple-Choice Question&lt;/td&gt;
&lt;td&gt;One of the two quiz question types (the other is true/false)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User Experience&lt;/td&gt;
&lt;td&gt;Why 9 valid questions beat a regeneration error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQLite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;(SQL = Structured Query Language)&lt;/td&gt;
&lt;td&gt;The single-file database where generated artefacts persist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database-Oriented Operating System&lt;/td&gt;
&lt;td&gt;The durable-workflow library from the previous post&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP 502&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bad Gateway (HyperText Transfer Protocol status code)&lt;/td&gt;
&lt;td&gt;The error my array-only parser produced until I accepted Gemma&amp;rsquo;s object shape&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt;
— what hand-rolling an SVG radial layout taught me, and why version two uses React Flow anyway.&lt;/p&gt;</description></item><item><title>Part 3 · Two-Phase Streaming: Showing the Model Think Before It Acts</title><link>https://aretascodes.dev/blog/two-phase-streaming-strands-agents/</link><pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/two-phase-streaming-strands-agents/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building
. Previously:
.
All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When I first wired up Gemma 4 with
inside CogniVault, the chat felt slow. Not laggy — slow in a way that&amp;rsquo;s worse than laggy. The user types a question. The cursor sits there. Then, eventually, an answer drops out of the void.&lt;/p&gt;
&lt;p&gt;The model wasn&amp;rsquo;t idle. It was &lt;em&gt;thinking&lt;/em&gt;. Gemma 4 has a chain-of-thought mode that produces a (sometimes long) reasoning trace before its final reply. With a single-phase agent stream, all of that thinking is happening &lt;em&gt;inside the agent loop&lt;/em&gt; — silently — before any tool calls run, before any tokens get emitted to the UI.&lt;/p&gt;
&lt;p&gt;So I split the call into two phases.&lt;/p&gt;
&lt;h2 id="the-shape"&gt;The shape&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;POST /rag
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ├── Phase 1 — Direct Ollama call, thinking enabled
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ stream: {&amp;#34;type&amp;#34;:&amp;#34;thinking&amp;#34;,&amp;#34;data&amp;#34;:&amp;#34;...&amp;#34;} (reasoning tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── Phase 2 — Strands Agent (thinking disabled)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;metadata&amp;#34;,&amp;#34;data&amp;#34;:{...}} (citations, as soon as search runs)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;text&amp;#34;,&amp;#34;data&amp;#34;:&amp;#34;...&amp;#34;} (answer tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;memory&amp;#34;,&amp;#34;data&amp;#34;:{...}} (end-of-stream: session memory usage)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The endpoint streams &lt;strong&gt;newline-delimited JSON&lt;/strong&gt; (NDJSON): each line of the response body is one self-contained JSON envelope with a &lt;code&gt;type&lt;/code&gt; and a &lt;code&gt;data&lt;/code&gt;. The frontend dispatches on &lt;code&gt;type&lt;/code&gt; and renders accordingly: a &lt;strong&gt;collapsible reasoning panel&lt;/strong&gt; for the thinking tokens, the main message bubble for the text tokens, a sidebar card per citation.&lt;/p&gt;
&lt;p&gt;The user sees the model start thinking &lt;em&gt;immediately&lt;/em&gt;. Latency to first byte drops from &amp;ldquo;long enough to wonder if it crashed&amp;rdquo; to &amp;ldquo;instant.&amp;rdquo; Total time to final answer doesn&amp;rsquo;t change. Perceived speed does.&lt;/p&gt;
&lt;h2 id="phase-1--thinking-only"&gt;Phase 1 — Thinking only&lt;/h2&gt;
&lt;p&gt;Phase 1 is a single direct call to Ollama with thinking enabled. It gets exactly what Phase 2 will see — the same system prompt, the current question, and any attached images — so the reasoning reflects reality. Only the &lt;em&gt;reasoning&lt;/em&gt; tokens are consumed; whatever answer text Phase 1 starts to produce is discarded, because we don&amp;rsquo;t want a half-formed answer competing with the real one.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified from backend/services/rag_agent.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;system&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;images&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Phase 1 is deliberately &lt;strong&gt;best-effort&lt;/strong&gt;: any failure here is swallowed and logged, and the stream moves straight on to Phase 2. A broken reasoning panel should never cost the user their answer.&lt;/p&gt;
&lt;h2 id="phase-2--agent-with-tools"&gt;Phase 2 — Agent with tools&lt;/h2&gt;
&lt;p&gt;Phase 2 builds a &lt;strong&gt;fresh Strands &lt;code&gt;Agent&lt;/code&gt; per request&lt;/strong&gt; — no shared mutable state between concurrent chats — restores the session&amp;rsquo;s conversation history into it, and runs the tool loop with six tools registered:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_knowledge_base(query)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hybrid FAISS + BM25 search, top-7, RRF fusion. Scope-filter-aware.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_documents()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inventory of every indexed file with type and chunk count.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyze_document(filename)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inner Gemma call → structured summary (topics, entities, key facts).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compare_documents(doc_a, doc_b, question)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inner Gemma call answering across two documents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;calculator(expression)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe AST evaluator — no &lt;code&gt;eval()&lt;/code&gt;, no arbitrary code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;current_time()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Timestamp for time-aware queries.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The agent decides which tools to call and in what order. There&amp;rsquo;s no hard-coded router; the system prompt explains what&amp;rsquo;s available and Strands handles the loop. For most document questions the path is: &lt;code&gt;search_knowledge_base&lt;/code&gt; → answer. For comparisons: &lt;code&gt;compare_documents&lt;/code&gt; → answer. For &amp;ldquo;what files do I have?&amp;rdquo;: &lt;code&gt;list_documents&lt;/code&gt; → answer. For greetings and arithmetic, the system prompt tells the agent it may skip search entirely. The model picks.&lt;/p&gt;
&lt;p&gt;Two details that took debugging to get right:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 runs with thinking explicitly disabled.&lt;/strong&gt; Without that flag, Gemma&amp;rsquo;s default behaviour can leak &lt;code&gt;&amp;lt;think&amp;gt;…&amp;lt;/think&amp;gt;&lt;/code&gt; tags into the visible answer, and everything before the closing tag gets swallowed by the Markdown renderer. One model option — &lt;code&gt;options={&amp;quot;thinking&amp;quot;: False}&lt;/code&gt; — fixed a &amp;ldquo;truncated responses&amp;rdquo; bug that looked much scarier than it was.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Citations are flushed before the first answer token.&lt;/strong&gt; Tools run before text deltas arrive, so by the time the first visible token streams, every source the search found is already in the sidebar. The accumulator is a request-local &lt;code&gt;ContextVar&lt;/code&gt; the search tool appends to.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified — the real loop reads Strands&amp;#39; raw event dicts&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;event&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;contentBlockDelta&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;delta&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_citations&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="c1"&gt;# drain the ContextVar accumulator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="why-this-matters-more-than-it-sounds"&gt;Why this matters more than it sounds&lt;/h2&gt;
&lt;p&gt;You could implement similar behaviour with one agent call that interleaves &lt;code&gt;thinking&lt;/code&gt; events with &lt;code&gt;text&lt;/code&gt; events. The reasons I split it anyway:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The thinking model and the tool model can be different.&lt;/strong&gt; Right now they&amp;rsquo;re both &lt;code&gt;gemma4:e4b&lt;/code&gt;, but the architecture lets me swap a smaller, faster model in for Phase 1 reasoning and keep the big one for Phase 2 tool use. I&amp;rsquo;m not doing that yet — but I want the option.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Phase 1 always streams immediately.&lt;/strong&gt; A pure agent loop only starts producing tokens after the model has decided what to say. Two-phase guarantees the user sees activity almost as soon as they press Enter, regardless of how complex the Phase 2 tool work gets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Failures isolate.&lt;/strong&gt; If Phase 2 falls over (Ollama timeout, tool error), Phase 1&amp;rsquo;s reasoning is still visible — the user can see &lt;em&gt;what the model was trying to do&lt;/em&gt;, which makes the error far less frustrating than a blank &amp;ldquo;something went wrong.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="contextvar-isolation-again"&gt;ContextVar isolation, again&lt;/h2&gt;
&lt;p&gt;The same &lt;code&gt;ContextVar&lt;/code&gt; trick that scopes retrieval in
carries here. At the start of each &lt;code&gt;/rag&lt;/code&gt; stream, the handler sets two request-local variables: the &lt;strong&gt;document-scope filter&lt;/strong&gt; and the &lt;strong&gt;citation accumulator&lt;/strong&gt;. The agent&amp;rsquo;s tools read and write them implicitly. Conversation history itself lives in a per-session store guarded by per-session &lt;code&gt;asyncio&lt;/code&gt; locks, so two concurrent requests in the same chat can&amp;rsquo;t corrupt each other either.&lt;/p&gt;
&lt;p&gt;Tested with two browser tabs open on the same backend, scoped to different document categories, sending overlapping queries simultaneously. Zero cross-contamination. The test suite covers this explicitly in &lt;code&gt;test_thinking.py&lt;/code&gt; and &lt;code&gt;test_doc_scope_filter.py&lt;/code&gt; — see
for the broader story.&lt;/p&gt;
&lt;h2 id="the-frontend-side-of-the-contract"&gt;The frontend side of the contract&lt;/h2&gt;
&lt;p&gt;A detail that tripped me up: this is a &lt;code&gt;POST&lt;/code&gt; endpoint, so the browser&amp;rsquo;s &lt;code&gt;EventSource&lt;/code&gt; API (which only does GET) is out. The frontend uses &lt;code&gt;fetch&lt;/code&gt; and reads the response body incrementally, splitting on newlines and parsing each line as JSON:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-tsx" data-lang="tsx"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// Simplified from useRagStream.ts
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;/rag&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;POST&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;body&lt;/span&gt;: &lt;span class="kt"&gt;JSON.stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getReader&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;TextDecoder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;read&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;: &lt;span class="kt"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;\n&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// keep the trailing partial line
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;line&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kr"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;appendThinking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;appendText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;addCitation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;memory&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;updateMemoryMeter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The reasoning panel starts &lt;strong&gt;collapsed&lt;/strong&gt;, with a small pulsing indicator while thinking tokens are still streaming — enough to signal &amp;ldquo;the model is working&amp;rdquo; without shoving a wall of chain-of-thought at the user. One click expands the full trace, during or after the stream.&lt;/p&gt;
&lt;h2 id="what-id-revisit"&gt;What I&amp;rsquo;d revisit&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 reasons toward a full answer, and we throw the answer part away.&lt;/strong&gt; A dedicated &amp;ldquo;plan your approach, don&amp;rsquo;t answer yet&amp;rdquo; prompt for Phase 1 would make the reasoning trace tighter and cheaper. Today it shares the main system prompt — simpler, but the trace can ramble.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No interrupt yet.&lt;/strong&gt; Once Phase 1 starts, it runs to completion. If the user types a follow-up mid-stream we let it finish. A real cancel button would mean wiring an abort signal through Ollama&amp;rsquo;s HTTP client — feasible, not yet done.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 occasionally over-thinks.&lt;/strong&gt; Greetings and trivial questions still produce a paragraph of reasoning. A &amp;ldquo;should I think?&amp;rdquo; gate (probably a tiny classifier or even a heuristic on query length) would skip Phase 1 entirely for those cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="takeaway"&gt;Takeaway&lt;/h2&gt;
&lt;p&gt;Streaming is &lt;em&gt;not&lt;/em&gt; just an optimisation. It&amp;rsquo;s a UX primitive. Two-phase streaming buys you a free property: the &lt;em&gt;visible&lt;/em&gt; part of the interaction starts before the &lt;em&gt;slow&lt;/em&gt; part does. The user gets to watch the model think, which is — genuinely — more interesting than watching a spinner.&lt;/p&gt;
&lt;p&gt;If your agent app feels slow even though the answers are fast, look at &lt;em&gt;when&lt;/em&gt; tokens start flowing. The fix often isn&amp;rsquo;t a faster model.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NDJSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Newline-Delimited JSON&lt;/td&gt;
&lt;td&gt;A stream where each line is its own complete JSON object — what &lt;code&gt;/rag&lt;/code&gt; emits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The universal text format for structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User Experience&lt;/td&gt;
&lt;td&gt;How the product feels to use — the real beneficiary of two-phase streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User Interface&lt;/td&gt;
&lt;td&gt;The visible surface the stream renders into&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;The dense half of hybrid retrieval (previous post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;The keyword half of hybrid retrieval (previous post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;The rank-only formula that merges the two result lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree&lt;/td&gt;
&lt;td&gt;The parsed form of an expression — how the calculator evaluates maths without &lt;code&gt;eval()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HyperText Transfer Protocol&lt;/td&gt;
&lt;td&gt;The protocol carrying the stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-Sent Events&lt;/td&gt;
&lt;td&gt;The browser&amp;rsquo;s built-in GET-only streaming format — notably &lt;em&gt;not&lt;/em&gt; usable here, because &lt;code&gt;/rag&lt;/code&gt; is a POST&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The boundary the frontend calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt;
— how CogniVault re-ingests edited PDFs without re-embedding everything, and survives a kill -9 mid-pipeline.&lt;/p&gt;</description></item><item><title>Part 2 · Hybrid Retrieval in Practice: FAISS + BM25, Fused with RRF</title><link>https://aretascodes.dev/blog/hybrid-retrieval-faiss-bm25-rrf/</link><pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/hybrid-retrieval-faiss-bm25-rrf/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building
, a fully local AI study companion. Previous:
.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The first version of CogniVault used pure dense retrieval — embed the query with &lt;code&gt;embeddinggemma&lt;/code&gt;, search a FAISS index, pass the top-7 chunks to the model. It worked. It worked &lt;em&gt;beautifully&lt;/em&gt; — until a user uploaded a PDF containing some German legal text and asked for &amp;ldquo;§3 Absatz 2.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The model couldn&amp;rsquo;t find it.&lt;/p&gt;
&lt;p&gt;The chunk was &lt;em&gt;right there&lt;/em&gt;. The PDF was indexed. But &amp;ldquo;§3 Absatz 2&amp;rdquo; doesn&amp;rsquo;t embed into anything semantically meaningful — it&amp;rsquo;s a token-level identifier, not a concept. The dense vector for the query landed nowhere near the dense vector for the chunk, even though the chunk literally contains the string the user asked for.&lt;/p&gt;
&lt;p&gt;That bug killed pure dense retrieval for me. This post is about what replaced it.&lt;/p&gt;
&lt;h2 id="two-kinds-of-similar"&gt;Two kinds of &amp;ldquo;similar&amp;rdquo;&lt;/h2&gt;
&lt;p&gt;You already use both kinds of search every day. When Spotify builds a &amp;ldquo;song radio&amp;rdquo; from a track you like, it&amp;rsquo;s matching &lt;em&gt;feel&lt;/em&gt; — tempo, mood, genre — and it will happily play you a song whose title shares no words with the original. But when you type &lt;code&gt;Bohemian Rhapsody remastered 2011&lt;/code&gt; into the search box, you don&amp;rsquo;t want &lt;em&gt;feel&lt;/em&gt;. You want that exact string, and &amp;ldquo;a similar operatic rock epic&amp;rdquo; is a wrong answer.&lt;/p&gt;
&lt;p&gt;Search systems formalise that split into two notions of similarity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Lexical similarity&lt;/strong&gt; — &amp;ldquo;do these strings share rare words?&amp;rdquo; This is what TF-IDF and BM25 model. They thrive on identifiers, names, code, technical terminology, and direct quotes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic similarity&lt;/strong&gt; — &amp;ldquo;do these passages talk about the same idea, even with different words?&amp;rdquo; This is what embeddings model. They thrive on paraphrase, conceptual queries, and natural-language questions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Neither subsumes the other. A user asking &lt;em&gt;&amp;ldquo;how is the practical exam structured?&amp;rdquo;&lt;/em&gt; needs &lt;strong&gt;semantic&lt;/strong&gt; search — the document doesn&amp;rsquo;t say &amp;ldquo;structure of practical exam.&amp;rdquo; A user asking &lt;em&gt;&amp;quot;§3 Absatz 2&amp;quot;&lt;/em&gt; needs &lt;strong&gt;lexical&lt;/strong&gt; search — there&amp;rsquo;s no concept to embed, just a literal string.&lt;/p&gt;
&lt;p&gt;Production RAG has to do both. CogniVault does both, and then fuses the result lists with &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="the-stack"&gt;The stack&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Query
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ├── embed via embeddinggemma ──► FAISS IndexFlatIP ──► top-K dense
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── tokenize + lowercase ──► BM25Okapi ──► top-K sparse
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; Reciprocal Rank Fusion ◄──┘
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; top-7 fused chunks
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Both indexes live &lt;strong&gt;in memory&lt;/strong&gt;, fronted by a &lt;code&gt;VectorDB&lt;/code&gt; singleton. FAISS does inner-product search over normalised embeddings (so dot product = cosine). BM25 is &lt;code&gt;rank_bm25&lt;/code&gt;&amp;rsquo;s &lt;code&gt;BM25Okapi&lt;/code&gt;, fed the same chunks tokenised by a simple lowercase-and-split tokenizer.&lt;/p&gt;
&lt;p&gt;The corpora are kept in lockstep: soft-deleting a file&amp;rsquo;s chunks triggers a BM25 rebuild over the remaining active chunks, and the singleton reloads both indexes from &lt;code&gt;vector_store.faiss&lt;/code&gt; + &lt;code&gt;vector_store.json&lt;/code&gt; (chunk metadata + raw text) after every ingestion run and on app start.&lt;/p&gt;
&lt;h2 id="why-faiss-indexflatip-not-hnsw-or-ivf"&gt;Why FAISS &lt;code&gt;IndexFlatIP&lt;/code&gt;, not HNSW or IVF?&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;IndexFlatIP&lt;/code&gt; is brute-force exact search. It scans every vector, every query. For tens of thousands of chunks that&amp;rsquo;s fine — sub-millisecond on a laptop. CogniVault is a &lt;strong&gt;single-user, local-first&lt;/strong&gt; app; the index is never going to be billions of vectors. Trading recall for speed via HNSW or IVF would buy nothing here and lose the &amp;ldquo;exact&amp;rdquo; guarantee. Boring, correct, fast enough.&lt;/p&gt;
&lt;p&gt;When the corpus grows large enough that brute-force gets sticky, switching is a one-line change. Until then, the simplest index wins.&lt;/p&gt;
&lt;h2 id="reciprocal-rank-fusion"&gt;Reciprocal Rank Fusion&lt;/h2&gt;
&lt;p&gt;The naive way to combine two ranked lists is to score them and add. That sounds reasonable until you remember FAISS returns inner-product scores in some bounded range and BM25 returns scores in an unbounded one — they aren&amp;rsquo;t comparable without normalisation, and any normalisation you pick is somewhat arbitrary.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RRF sidesteps the problem entirely.&lt;/strong&gt; It only looks at &lt;em&gt;ranks&lt;/em&gt;, not scores. For each result list, an item at rank &lt;code&gt;r&lt;/code&gt; contributes &lt;code&gt;1 / (k + r)&lt;/code&gt; to its final score (with &lt;code&gt;k = 60&lt;/code&gt; by convention — large enough to flatten the tail, small enough that the top items still dominate). Items that appear in both lists get summed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified — the real implementation also de-duplicates chunks&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# by (source, chunk_id, page) before scoring.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reciprocal_rank_fusion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_lists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result_lists&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;kv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That&amp;rsquo;s the whole algorithm. No tuning, no calibration, no per-corpus weights. A chunk that&amp;rsquo;s #1 in BM25 and #4 in FAISS easily beats a chunk that&amp;rsquo;s #2 in only one of them. A chunk that &lt;em&gt;both&lt;/em&gt; indexes agree on rises to the top deterministically.&lt;/p&gt;
&lt;p&gt;The result for the &amp;ldquo;§3 Absatz 2&amp;rdquo; query: BM25 finds the literal match and lands it at rank 1. FAISS finds nothing useful (its top hits are about exam regulations in general). RRF surfaces the BM25 hit at the top of the fused list. Problem solved.&lt;/p&gt;
&lt;h2 id="scope-filtering-with-contextvar-isolation"&gt;Scope filtering with ContextVar isolation&lt;/h2&gt;
&lt;p&gt;One detail that&amp;rsquo;s easy to get wrong: the retriever has to be &lt;em&gt;scope-aware&lt;/em&gt;. CogniVault lets users limit a question to a single category or specific files. The scope is set by the request, but the search is called from deep inside the Strands agent loop, which is called from a streaming FastAPI handler, possibly with multiple concurrent requests in flight per worker.&lt;/p&gt;
&lt;p&gt;Threading the scope through every function call would be ugly. A global is unsafe. The right primitive is Python&amp;rsquo;s
, which gives you per-task isolated state that asyncio and threads both respect.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;contextvars&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ContextVar&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;_doc_scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ContextVar&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DocScope&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ContextVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;doc_scope&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_doc_scope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DocScope&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;_doc_scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;current_doc_scope&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DocScope&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_doc_scope&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;/rag&lt;/code&gt; request handler sets the scope at the very start of each streaming response; the search tool reads it; because the value is task-local, it dies with the request. No globals, no parameter drilling, no race conditions across concurrent users.&lt;/p&gt;
&lt;p&gt;This is one of those design choices that looks like over-engineering until you have two browser tabs open and realise that without it, tab A&amp;rsquo;s scope filter would leak into tab B&amp;rsquo;s question.&lt;/p&gt;
&lt;h2 id="chunking-choices-that-pay-off-downstream"&gt;Chunking choices that pay off downstream&lt;/h2&gt;
&lt;p&gt;Hybrid retrieval is only as good as the chunks. CogniVault uses a &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; with &lt;strong&gt;1,000 characters, 100 overlap&lt;/strong&gt; for unstructured text — small enough to keep retrieval precise, large enough to carry context for the model.&lt;/p&gt;
&lt;p&gt;For structured formats it switches strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Markdown&lt;/strong&gt; → &lt;code&gt;MarkdownHeaderTextSplitter&lt;/code&gt; emits one chunk per H1/H2/H3 section with the heading hierarchy prepended as a breadcrumb (&amp;ldquo;Privacy &amp;gt; Vault Audit &amp;gt; Indicators&amp;rdquo;). BM25 loves breadcrumbs — they make heading-keyword queries match cleanly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CSV&lt;/strong&gt; → header row + 20-row batches per chunk, so a query for a column name lands on the right block.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PPTX&lt;/strong&gt; → one chunk per slide, title and body text together.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;XLSX&lt;/strong&gt; → header + row batches, per sheet, with a &lt;code&gt;[Sheet: name]&lt;/code&gt; prefix.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tiny fragments get filtered: unstructured text needs at least &lt;strong&gt;100 characters&lt;/strong&gt; to become a chunk, while the structured formats drop the bar to &lt;strong&gt;20&lt;/strong&gt; — a two-line Markdown section or a header-only sheet is short but still meaningful. The recursive splitter is well-trodden territory, but the per-format strategies matter much more than people give them credit for.&lt;/p&gt;
&lt;h2 id="what-id-do-differently"&gt;What I&amp;rsquo;d do differently&lt;/h2&gt;
&lt;p&gt;A few things I&amp;rsquo;d revisit if I were starting over:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stop tokenising for BM25 with &lt;code&gt;str.split()&lt;/code&gt;.&lt;/strong&gt; It&amp;rsquo;s fine, but a real tokenizer that handles punctuation and German compounds would meaningfully improve recall on the legal docs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add a small reranker.&lt;/strong&gt; RRF gets the right &lt;em&gt;set&lt;/em&gt;, but a cross-encoder rerank on the top 20 would polish the &lt;em&gt;order&lt;/em&gt;. Locally-served, of course — there are good small ones now.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query expansion for thin queries.&lt;/strong&gt; Two-word questions like &amp;ldquo;§3 exam&amp;rdquo; could be expanded via a quick &lt;code&gt;gemma4&lt;/code&gt; call before retrieval. Latency cost, recall gain.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of those are in the box yet. RRF over FAISS + BM25 is already so much better than either alone that I haven&amp;rsquo;t felt the pull to optimise further.&lt;/p&gt;
&lt;h2 id="the-takeaway"&gt;The takeaway&lt;/h2&gt;
&lt;p&gt;If your retrieval is &amp;ldquo;embed + cosine + top-k,&amp;rdquo; it will fail in exactly the way mine did — on the queries that contain literal identifiers your model has no embedding for. The fix isn&amp;rsquo;t a better embedding model. It&amp;rsquo;s a second retriever that doesn&amp;rsquo;t pretend everything is a concept.&lt;/p&gt;
&lt;p&gt;FAISS for ideas. BM25 for strings. RRF to decide which one was right today.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;Retrieve relevant passages from your own documents first; let the model answer from them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;Meta&amp;rsquo;s library for storing vectors and finding the most similar ones fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;A keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;Merges ranked lists using only ranks: each item scores &lt;code&gt;Σ 1/(k + rank)&lt;/code&gt; across lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TF-IDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Term Frequency–Inverse Document Frequency&lt;/td&gt;
&lt;td&gt;BM25&amp;rsquo;s ancestor: score words by how often they appear here vs how rare they are everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IP&lt;/strong&gt; (in &lt;code&gt;IndexFlatIP&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Inner Product&lt;/td&gt;
&lt;td&gt;The similarity measure FAISS computes; on normalised vectors it equals cosine similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HNSW&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hierarchical Navigable Small World&lt;/td&gt;
&lt;td&gt;A popular &lt;em&gt;approximate&lt;/em&gt; vector-index structure — deliberately not used here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IVF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inverted File Index&lt;/td&gt;
&lt;td&gt;Another approximate FAISS index type — also deliberately not used&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AEVO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ausbildereignungsverordnung&lt;/td&gt;
&lt;td&gt;The German trainer-aptitude regulation whose &amp;ldquo;§3 Absatz 2&amp;rdquo; query broke pure dense retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV / PPTX / XLSX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Comma-Separated Values / PowerPoint / Excel (Office Open XML)&lt;/td&gt;
&lt;td&gt;Structured formats with their own chunking strategies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H1/H2/H3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Heading levels 1–3&lt;/td&gt;
&lt;td&gt;The Markdown heading tiers used to split sections&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt;
— how CogniVault&amp;rsquo;s &lt;code&gt;/rag&lt;/code&gt; endpoint streams Gemma 4&amp;rsquo;s &lt;em&gt;thinking&lt;/em&gt; before any tool calls run.&lt;/p&gt;</description></item><item><title>Part 1 · Why I Built a Local-First RAG</title><link>https://aretascodes.dev/blog/why-local-first-rag/</link><pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/why-local-first-rag/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve spent the last few years in front of virtual classrooms full of career-changers in Germany, walking them through programming basics, web development, and introductory AI courses. Most of the information we deal with is fine to paste into cloud-based AI tools. Some of it really isn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Exam materials under confidentiality. A trainee&amp;rsquo;s portfolio with personal details. Other private documents that should never end up training someone else&amp;rsquo;s model.&lt;/p&gt;
&lt;p&gt;So I built
— a fully local AI study and productivity tool. No cloud. No telemetry. No &amp;ldquo;we may use this data to improve our service.&amp;rdquo; Just Gemma 4 running on Ollama, on my laptop, talking to my files.&lt;/p&gt;
&lt;h2 id="the-leaky-abstraction"&gt;The leaky abstraction&lt;/h2&gt;
&lt;p&gt;The pitch for cloud AI is great: a giant model, available instantly, billed by the token. The fine print is where it gets uncomfortable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Where does the data physically live during inference?&lt;/li&gt;
&lt;li&gt;Whose jurisdiction governs that hardware this afternoon?&lt;/li&gt;
&lt;li&gt;Does the &lt;em&gt;audit trail&lt;/em&gt; stop at the API boundary, or can you actually trace what happened to your bytes?&lt;/li&gt;
&lt;li&gt;When you tick &amp;ldquo;do not train on my data,&amp;rdquo; are you trusting a control, a contract, or both?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For most consumer use cases, those questions are fine to wave away. For &lt;strong&gt;education, healthcare, finance, legal, public administration&lt;/strong&gt; — the answer &amp;ldquo;trust us&amp;rdquo; isn&amp;rsquo;t an answer.&lt;/p&gt;
&lt;h2 id="what-local-first-actually-means-here"&gt;What &amp;ldquo;local-first&amp;rdquo; actually means here&lt;/h2&gt;
&lt;p&gt;Lots of products say &amp;ldquo;private.&amp;rdquo; I wanted three concrete properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The model lives on your machine.&lt;/strong&gt; Gemma 4 (&lt;code&gt;gemma4:e4b&lt;/code&gt;) and &lt;code&gt;embeddinggemma&lt;/code&gt; are pulled via Ollama. Inference is a localhost HTTP call.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Your documents never leave.&lt;/strong&gt; Vectors, chunks, chat history, study sessions, achievements — all on disk on your computer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You can &lt;em&gt;verify&lt;/em&gt; it.&lt;/strong&gt; Gemma CogniVault ships a &lt;strong&gt;Privacy Audit Panel&lt;/strong&gt; that shows a live &amp;ldquo;zero external connections&amp;rdquo; indicator alongside document counts and the Ollama host. It&amp;rsquo;s not a promise — it&amp;rsquo;s a status light.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If a future build of Gemma CogniVault ever made an outbound call, that panel would be the first thing to scream.&lt;/p&gt;
&lt;h2 id="what-you-get-back"&gt;What you get back&lt;/h2&gt;
&lt;p&gt;Going local sounds like a trade-off — surely you lose the magic of the giant frontier models? In practice, with &lt;strong&gt;Gemma 4&lt;/strong&gt; you get more than enough:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Thinking mode&lt;/strong&gt; — Gemma 4&amp;rsquo;s chain-of-thought streams into a collapsible panel before the answer. Watching the model reason about your documents is genuinely useful as a teaching tool.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tool use&lt;/strong&gt; — through the
, the model decides when to search the knowledge base, summarise a document, compare two files, or check the time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vision&lt;/strong&gt; — attach images and PDFs straight into a chat turn.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generation that&amp;rsquo;s actually structured&lt;/strong&gt; — quizzes, multi-lesson workshops, flashcard decks, and interactive mindmaps, generated with &lt;code&gt;format=&amp;quot;json&amp;quot;&lt;/code&gt; so the output parses reliably.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Cognivault doesn&amp;rsquo;t try to be a giant ecosystem. It&amp;rsquo;s a single-purpose tool that does one thing well: use your own documents with a capable local model in a private environment. I must admit that it was inspired to a great extent by
, which I&amp;rsquo;ve found incredibly useful but not private enough for my needs.&lt;/p&gt;
&lt;h2 id="the-shape-of-the-app"&gt;The shape of the app&lt;/h2&gt;
&lt;p&gt;CogniVault is split into four sections that map to how I actually work with information on cloud-based AI tools:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;What it&amp;rsquo;s for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ask anything about your documents. Cited answers, scope filter, voice in.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge Base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upload, categorise, manage. SHA-256 detects edits on re-upload.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Study Hub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quiz · Workshop · Flashcards · Mindmaps — four ways to drill into the source.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dashboard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total study time, streak, 25 badges, GitHub-style 90-day heatmap.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Everything reachable from a sidebar that remembers where you left off, on a stack that fits in your &lt;code&gt;~/Documents&lt;/code&gt; folder.&lt;/p&gt;
&lt;h2 id="what-comes-next"&gt;What comes next&lt;/h2&gt;
&lt;p&gt;This is the first in a short series. Over the next few posts I&amp;rsquo;ll dig into the parts I&amp;rsquo;m most proud of — and a few I&amp;rsquo;d build differently next time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hybrid retrieval&lt;/strong&gt; — why FAISS &lt;em&gt;and&lt;/em&gt; BM25, fused with Reciprocal Rank Fusion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Two-phase streaming&lt;/strong&gt; with Gemma 4 and Strands Agents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crash-resumable ingestion&lt;/strong&gt; with DBOS, hash-aware re-ingest, OCR fallback&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Getting reliable JSON&lt;/strong&gt; out of a local LLM (and what to do when it fails)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The mindmap renderer&lt;/strong&gt; — what hand-rolling SVG taught me, and why v2 uses React Flow&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gamifying learning&lt;/strong&gt; — 25 badges, idle-gap sessions, 90-day heatmap&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing a local-AI app&lt;/strong&gt; with 350+ tests and zero infrastructure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want to skip ahead, the code is open source at
, and there&amp;rsquo;s a
.&lt;/p&gt;
&lt;p&gt;Your data. Your hardware. Your AI. Your vault.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Artificial Intelligence&lt;/td&gt;
&lt;td&gt;Software performing tasks that normally need human intelligence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large Language Model&lt;/td&gt;
&lt;td&gt;A neural network trained on huge amounts of text that can read and generate language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HyperText Transfer Protocol&lt;/td&gt;
&lt;td&gt;The protocol browsers and APIs use to exchange requests and responses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The boundary where you call someone else&amp;rsquo;s software — and where cloud audit trails stop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IHK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Industrie- und Handelskammer&lt;/td&gt;
&lt;td&gt;The German Chamber of Commerce and Industry, which administers trainer certification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AEVO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ausbildereignungsverordnung&lt;/td&gt;
&lt;td&gt;The German trainer-aptitude regulation — the exam material that motivated this project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;Meta&amp;rsquo;s vector-search library (covered in the next post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;A classic keyword-ranking formula (also next post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Software Development Kit&lt;/td&gt;
&lt;td&gt;A library of building blocks — here, Strands, which provides the agent loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The universal text format for structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Portable Document Format&lt;/td&gt;
&lt;td&gt;One of the eight-plus file types CogniVault ingests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SHA-256&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure Hash Algorithm, 256-bit&lt;/td&gt;
&lt;td&gt;A content fingerprint used to detect edited files on re-upload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optical Character Recognition&lt;/td&gt;
&lt;td&gt;Turning pictures of text (scans) into machine-readable text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database-Oriented Operating System&lt;/td&gt;
&lt;td&gt;The durable-workflow library behind crash-resumable ingestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SVG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scalable Vector Graphics&lt;/td&gt;
&lt;td&gt;The browser&amp;rsquo;s built-in vector drawing format&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;</description></item></channel></rss>