<link>https://aretascodes.dev/</link></image><item><title>CogniVault Backend Explained, Part 1 · Meet the Backend: Three Processes, Four Layers</title> <article> <h1>CogniVault Backend Explained, Part 1 · Meet the Backend: Three Processes, Four Layers</h1> <p>Fri, 12 Jun 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>When people first open the CogniVault repository, the question I hear most is some version of: <em>“Where do I even start?”</em> There’s a RAG agent, a FAISS index, a DBOS workflow, an Ollama host — and if you’re transitioning into tech, every one of those words is a closed door.</p> <p>This series opens the doors one at a time. No prior RAG knowledge assumed, every abbreviation spelled out, and every claim checkable against the . If you’ve already read my , think of this series as the guided tour that should have come first.</p> <p>Let’s map this out.</p> <h2 id="the-whole-app-is-three-processes">The whole app is three processes</h2> <p>CogniVault lets you chat with your own documents and turn them into quizzes, workshops, flashcards, and mindmaps — and nothing ever leaves your machine. (The <em>why</em> behind that constraint is its own story: .)</p> <p>You might expect an app like that to be a sprawl of microservices. 
It’s three processes:</p> <table> <thead> <tr> <th>Process</th> <th>What it does</th> </tr> </thead> <tbody> <tr> <td><strong>The Python backend</strong></td> <td>One FastAPI app on port 8000 — it also serves the compiled React frontend as static files</td> </tr> <tr> <td><strong>Ollama</strong></td> <td>The local model server on port 11434, running the AI models</td> </tr> <tr> <td><strong>PostgreSQL</strong></td> <td>One Docker container, used <em>only</em> for workflow checkpoints — never for your documents</td> </tr> </tbody> </table> <p>Everything else — your files, the search index, your chat history, your quiz scores — is a plain file on disk. That’s not laziness; it’s the privacy argument made physical. You can open every byte the app stores with a text editor and a SQLite browser.</p> <h2 id="the-four-layers">The four layers</h2> <p>Before we name technologies, here’s the mental model I want you to keep for the whole series. The backend is four layers, top to bottom:</p> <p><strong>Layer 1 — the web layer.</strong> A FastAPI application receives every HTTP request and routes it to one of six routers: chat (<code>/rag</code>), knowledge management (<code>/upload</code>, <code>/ingest</code>), study tools (<code>/api/study/*</code>), progress (<code>/api/progress/*</code>), voice (<code>/api/transcribe</code>), and chat history (<code>/api/history</code>). FastAPI (a modern Python web framework) also auto-generates interactive API documentation at <code>/api/docs</code>, which is the best way to explore the backend without reading a line of code.</p> <p><strong>Layer 2 — the intelligence layer.</strong> Two AI models with two different jobs. <code>gemma4:e4b</code> <em>generates</em>: chat answers, reasoning, image analysis, and tool calls. <code>embeddinggemma</code> <em>embeds</em>: it turns text into vectors (lists of numbers that capture meaning) so similar ideas can be found mathematically. 
Both run inside Ollama — think of Ollama as Docker, but for AI models.</p> <p><strong>Layer 3 — the retrieval layer.</strong> A search engine over your documents that combines <em>semantic</em> search (find things that mean the same) with <em>keyword</em> search (find the exact string). Part 3 of this series is entirely about this layer.</p> <p><strong>Layer 4 — the persistence layer.</strong> Four storage systems, each picked for one job: a FAISS index plus a JSON file for searchable knowledge, SQLite for study data, PostgreSQL for workflow checkpoints, and plain JSON files for chat history.</p> <h2 id="one-diagram-every-major-piece">One diagram, every major piece</h2> <div class="mermaid">flowchart TB subgraph CLIENT["Browser"] UI["React Frontend<br/>(compiled, served by FastAPI)"] end subgraph SERVER["FastAPI Backend — port 8000"] ROUTERS["6 Routers<br/>rag · knowledge · study ·<br/>progress · audio · history"] AGENT["RAG Agent<br/>(Strands SDK, 6 tools)"] VDB["VectorDB<br/>FAISS + BM25 + RRF"] INGEST["Ingestion<br/>(DBOS durable workflow)"] GEN["Study generators<br/>quiz · workshop · cards · mindmap"] PROG["Progress tracker<br/>+ 25 achievements"] end subgraph OLLAMA["Ollama — port 11434"] GEMMA["gemma4:e4b<br/>chat · thinking · vision · tools"] EMBED["embeddinggemma<br/>text to vectors"] end subgraph STORAGE["Local storage"] FAISSF["vector_store.faiss + .json"] SQLITE["progress.db (SQLite)"] PG["PostgreSQL<br/>workflow state only"] DOCS["docs/ folder + chat_history.json"] end UI --> ROUTERS ROUTERS --> AGENT --> VDB AGENT --> GEMMA VDB --> EMBED ROUTERS --> INGEST --> EMBED INGEST --> PG INGEST --> FAISSF VDB --- FAISSF ROUTERS --> GEN --> GEMMA GEN --> SQLITE ROUTERS --> PROG --> SQLITE ROUTERS --> DOCS </div> <p>Keep this picture handy — Parts 2, 3, and 4 each zoom into one region of it.</p> <h2 id="the-tech-stack-and-why-each-piece-earned-its-place">The tech stack, and why each piece earned its place</h2> <p>The full dependency list lives in 
<code>requirements.txt</code>. Here’s what matters, grouped by job:</p> <p><strong>Serving requests.</strong> FastAPI defines the endpoints and validates every request and response with Pydantic (a data-validation library — think of it as a strict customs officer for JSON). Uvicorn is the ASGI server (Asynchronous Server Gateway Interface — the Python standard that lets one process juggle many simultaneous requests) that actually runs it.</p> <p><strong>Thinking.</strong> Ollama serves <code>gemma4:e4b</code> — the <code>e4b</code> tag is the roughly four-billion effective-parameter variant, about a 9.6 GB download — and <code>embeddinggemma</code> (about 622 MB). The agent behaviour is built with the Strands Agents SDK, which wraps the model in a loop where it can call tools, read the results, and only then answer. (Where I run Ollama relative to Docker is a deliberate choice with a story behind it: .)</p> <p><strong>Finding things.</strong> FAISS (Facebook AI Similarity Search — Meta’s vector search library) handles semantic lookups; <code>rank-bm25</code> handles keyword lookups; a formula called Reciprocal Rank Fusion merges the two. Part 3 unpacks all of this.</p> <p><strong>Reading documents.</strong> <code>pypdf</code> for PDFs, with an OCR fallback (Optical Character Recognition — turning pictures of text into actual text) for scanned pages via <code>pymupdf</code> and Tesseract. Word, PowerPoint, and Excel each get their own extractor. <code>trafilatura</code> pulls clean article text out of web pages.</p> <p><strong>Not losing work.</strong> DBOS makes the ingestion pipeline durable — every step is checkpointed in PostgreSQL so a crash resumes instead of restarting. 
Part 2 shows this in action.</p> <p><strong>Remembering.</strong> SQLite — a complete database engine that lives in a single file, <code>progress.db</code> — holds your study sessions, achievements, quizzes, workshops, flashcard decks, and mindmaps.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <p>This series’ promise is “no unexplained abbreviations,” so here is the table I wish every technical tutorial shipped with.</p> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Plain-English meaning</th> </tr> </thead> <tbody> <tr> <td><strong>LLM</strong></td> <td>Large Language Model</td> <td>A neural network trained on huge amounts of text that can read and generate language</td> </tr> <tr> <td><strong>RAG</strong></td> <td>Retrieval-Augmented Generation</td> <td>Fetch relevant passages from <em>your</em> documents first, then let the model answer from them — instead of from its training memory</td> </tr> <tr> <td><strong>API</strong></td> <td>Application Programming Interface</td> <td>The set of URLs the frontend calls to talk to the backend</td> </tr> <tr> <td><strong>ASGI</strong></td> <td>Asynchronous Server Gateway Interface</td> <td>The Python standard that lets the server handle many requests concurrently</td> </tr> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The universal text format for structured data</td> </tr> <tr> <td><strong>NDJSON</strong></td> <td>Newline-Delimited JSON</td> <td>A stream where each line is its own JSON object — ideal for streaming AI answers chunk by chunk</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>Meta’s library for storing vectors and finding the most similar ones fast</td> </tr> <tr> <td><strong>BM25</strong></td> <td>Best Match 25</td> <td>A classic keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system</td> </tr> <tr> 
<td><strong>RRF</strong></td> <td>Reciprocal Rank Fusion</td> <td>A formula for merging multiple ranked result lists using only the ranks</td> </tr> <tr> <td><strong>ANN</strong></td> <td>Approximate Nearest Neighbour</td> <td>A speed shortcut many vector databases take. CogniVault deliberately uses an <em>exact</em> index instead — precise, and plenty fast at personal-library scale</td> </tr> <tr> <td><strong>DBOS</strong></td> <td>Database-Oriented Operating System (the research project it grew from)</td> <td>A library that checkpoints workflow steps in a database so crashed jobs resume</td> </tr> <tr> <td><strong>SQL / SQLite</strong></td> <td>Structured Query Language / SQLite</td> <td>The language of relational databases / a tiny database that lives in one file</td> </tr> <tr> <td><strong>OCR</strong></td> <td>Optical Character Recognition</td> <td>Turning pictures of text (scans) into machine-readable text</td> </tr> <tr> <td><strong>SHA-256</strong></td> <td>Secure Hash Algorithm, 256-bit</td> <td>A fingerprint function — any file maps to a unique hash, used to detect changed files</td> </tr> <tr> <td><strong>CORS</strong></td> <td>Cross-Origin Resource Sharing</td> <td>Browser rules controlling which websites may call the API</td> </tr> <tr> <td><strong>SSRF</strong></td> <td>Server-Side Request Forgery</td> <td>An attack where a server is tricked into fetching internal URLs — the URL-import endpoint guards against it</td> </tr> <tr> <td><strong>MCQ</strong></td> <td>Multiple-Choice Question</td> <td>One of the two quiz question types</td> </tr> <tr> <td><strong>KB</strong></td> <td>Knowledge Base</td> <td>All your ingested, searchable documents</td> </tr> </tbody> </table> <p>(Every claim in this series can be checked directly against the — the relevant file is named whenever it matters, and the repository README maps the full architecture.)</p> <h2 id="the-takeaway">The takeaway</h2> <p>Strip away the abbreviations and CogniVault is a small system: one 
web server, one model runtime, one durability database, and a handful of files. The sophistication isn’t in the part count — it’s in how a few well-chosen pieces cooperate. That cooperation is what the next three parts are about.</p> <hr> <p><strong>Next up:</strong> Part 2 — how a 1,000-page scanned PDF becomes something the AI can search in seconds, and why the pipeline survives a crash at page 800.</p> </article> <article> <h1>CogniVault Backend Explained, Part 2 · From File to Searchable Knowledge</h1> <p>Fri, 12 Jun 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>An LLM cannot “open” your PDF. That sentence surprises a lot of newcomers, so let’s sit with it for a second: when you chat with your documents in CogniVault, the model never touches the original files. Something has to happen <em>between</em> “I dropped a file into the browser” and “the AI just quoted page 47 back at me.”</p> <p>That something is <strong>ingestion</strong>, and it’s the subject of this part. 
In Part 1 we drew the whole map; today we zoom into one region — the conveyor belt that turns files into searchable knowledge.</p> <h2 id="the-conveyor-belt">The conveyor belt</h2> <p>Think of ingestion as a four-station assembly line:</p> <ol> <li><strong>Extract</strong> the text out of each file — even scanned ones.</li> <li><strong>Chunk</strong> it into pieces small enough to fit into a prompt.</li> <li><strong>Embed</strong> each chunk — turn it into a vector (a list of numbers that captures its meaning) so similar ideas land near each other in vector space.</li> <li><strong>Store</strong> vectors and metadata so they can be searched later.</li> </ol> <div class="mermaid">flowchart TD A["Upload<br/>POST /upload<br/>saved to docs/"] --> B subgraph WF["DBOS durable workflow"] B["Step 1<br/>Which files changed?<br/>SHA-256 fingerprints"] --> C["Step 2<br/>Extract text<br/>per-format + OCR fallback"] C --> D["Chunk<br/>1000 chars, 100 overlap"] D --> E["Step 3<br/>Embed<br/>embeddinggemma, batches of 5"] E --> F["Step 4<br/>Save<br/>FAISS index + metadata JSON"] end F --> G["Reload in-memory index<br/>instantly searchable"] </div> <p>Simple enough. The interesting engineering is in the failure cases — so let’s start there.</p> <h2 id="the-factory-ledger-why-the-pipeline-cant-lose-work">The factory ledger: why the pipeline can’t lose work</h2> <p>Embedding a large library takes minutes. What happens when your laptop goes to sleep at page 800 of a 1,000-page manual? With a plain Python script: everything restarts from page 1.</p> <p>CogniVault instead writes the pipeline as a <strong>DBOS durable workflow</strong>. Picture a factory where every station stamps a permanent ledger the moment it finishes a box. If the power cuts out, nobody rebuilds finished boxes — the workers read the ledger and resume at the first unstamped entry.</p> <p>DBOS is that ledger, and PostgreSQL is the book it’s written in. 
Each pipeline station is a checkpointed step; on restart, completed steps return their recorded results instantly and execution continues from the first unfinished one. A failed embedding batch is simply retried.</p> <p>This is also what powers the live progress timeline in the UI: starting an ingestion returns a <code>workflow_id</code>, and the frontend polls a status endpoint that reports which steps have completed, which are running, and which are still waiting.</p> <p>I wrote a whole deep dive on this mechanism — including what happens when you <code>kill -9</code> the process mid-ingest — in .</p> <h2 id="fingerprints-not-faith-sha-256-change-detection">Fingerprints, not faith: SHA-256 change detection</h2> <p>Re-embedding your whole library every time you add one file would be wasteful. So before any work happens, the pipeline computes each file’s <strong>SHA-256 hash</strong> (a content fingerprint — change one character in the file and the fingerprint changes completely) and compares it to the fingerprint stored with the file’s existing chunks:</p> <ul> <li><strong>Never seen before</strong> → ingest it.</li> <li><strong>Fingerprint changed</strong> → the old chunks are <em>soft-deleted</em> and the file is re-ingested.</li> <li><strong>Fingerprint identical</strong> → skip it entirely.</li> </ul> <p>Why “soft”-deleted? Because the FAISS index type CogniVault uses cannot remove individual vectors. Stale chunks are just marked <code>deleted: true</code> in the metadata; their vectors stay in the index but every search filters them out. It’s an honest, boring solution — and it never corrupts the index.</p> <h2 id="every-format-gets-its-own-treatment">Every format gets its own treatment</h2> <p>Here’s a detail that separates a demo from a product. A naive pipeline extracts “all the text” and calls it a day. 
CogniVault gives each format an extractor that preserves the <em>structure</em> that retrieval will need later:</p> <table> <thead> <tr> <th>Format</th> <th>Strategy</th> </tr> </thead> <tbody> <tr> <td><strong>PDF</strong></td> <td>Page by page, keeping page numbers (those become citations later). Any page yielding fewer than 50 characters is presumed scanned and sent to OCR</td> </tr> <tr> <td><strong>Scanned page</strong></td> <td>The page is rendered to an image at roughly 144 dpi, then Tesseract OCR (Optical Character Recognition — reading text out of images) extracts the words</td> </tr> <tr> <td><strong>Markdown</strong></td> <td>Split on headings; each section chunk gets a breadcrumb prefix like <code>[Section: Intro > Setup]</code> so its embedding carries the document hierarchy</td> </tr> <tr> <td><strong>CSV</strong></td> <td>Rows grouped 20 per chunk — and <em>every</em> chunk is prefixed with the header row, so the model always knows the column names</td> </tr> <tr> <td><strong>Excel</strong></td> <td>Same row-group idea per sheet, prefixed <code>[Sheet: name]</code></td> </tr> <tr> <td><strong>PowerPoint</strong></td> <td>One chunk per slide</td> </tr> <tr> <td><strong>Word</strong></td> <td>Paragraphs plus table cells</td> </tr> <tr> <td><strong>Web pages</strong></td> <td>Fetched on request and stripped to clean article text — behind an SSRF guard (Server-Side Request Forgery protection: the server refuses to fetch private or internal addresses)</td> </tr> </tbody> </table> <p>Ask yourself why the CSV detail matters. If chunk 14 of a spreadsheet is just twenty naked rows of numbers, no search will ever connect it to the question “what was the Q3 budget?” Prefix it with the header row, and the chunk <em>knows</em> it contains budget columns. 
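</p> <p>To make that concrete, here is a minimal Python sketch of the header-prefix idea. It is not CogniVault’s actual extractor: the function name <code>chunk_csv_rows</code> and the sample data are made up for illustration, and only the 20-rows-per-chunk grouping and the repeated header row come from the table above.</p>

```python
import csv
import io


def chunk_csv_rows(csv_text: str, rows_per_chunk: int = 20) -> list[str]:
    """Group CSV rows into chunks, repeating the header row at the top of each.

    Hypothetical helper for illustration; not CogniVault's real extractor.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    header_line = ", ".join(header)
    chunks = []
    for start in range(0, len(body), rows_per_chunk):
        group = body[start:start + rows_per_chunk]
        # Every chunk begins with the column names, then its own rows.
        lines = [header_line] + [", ".join(row) for row in group]
        chunks.append("\n".join(lines))
    return chunks


# Made-up sample: 45 data rows under a budget-style header.
sample = "quarter,department,budget\n" + "\n".join(
    f"Q{(i % 4) + 1},dept{i},{1000 + i}" for i in range(45)
)
chunks = chunk_csv_rows(sample)
```

<p>With 45 data rows the sketch yields three chunks (20, 20, and 5 rows), and each one starts with the column names.</p> <p>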
Structure is retrieval fuel.</p> <h2 id="chunking-1000-characters-with-a-100-character-safety-overlap">Chunking: 1,000 characters with a 100-character safety overlap</h2> <p>Long text is split into pieces of about 1,000 characters, with neighbouring pieces overlapping by 100. The overlap is insurance: a sentence sliced at a chunk boundary still appears whole in one of the two neighbours, so no idea falls into the gap between chunks.</p> <h2 id="embedding-and-saving">Embedding and saving</h2> <p>Chunks are embedded by <code>embeddinggemma</code> (via Ollama) in batches of five — each chunk becomes one vector. The vectors are normalised and appended to a FAISS index; alongside it, a JSON file records each chunk’s source filename, page number, category, fingerprint, and the text itself. The index holds the numbers; the JSON holds the meaning.</p> <p>One choice worth highlighting for beginners: this is an <strong>exact</strong> index, not an approximate one. Many vector databases use ANN (Approximate Nearest Neighbour) shortcuts that trade a little accuracy for speed at massive scale. 
At personal-library scale you don’t need the trade — CogniVault checks every vector on every search and is still fast.</p> <h2 id="the-whole-journey-end-to-end">The whole journey, end to end</h2> <div class="mermaid">%%{init: {'sequence': {'actorFontSize': 28, 'messageFontSize': 24, 'loopTextFontSize': 22, 'noteFontSize': 22}}}%% sequenceDiagram actor U as You participant F as Frontend participant B as FastAPI participant W as DBOS Workflow participant O as Ollama (embeddinggemma) participant V as FAISS + metadata U->>F: Drag and drop a file, pick a category F->>B: POST /upload B->>B: Validate type and size, save to docs/ F->>B: POST /ingest B->>W: Start durable workflow B-->>F: workflow_id loop Poll status F->>B: GET /ingest/status/{workflow_id} B-->>F: Step list (drives the progress timeline) end W->>W: SHA-256 change detection W->>W: Extract text (per format, OCR if scanned) W->>W: Chunk (1000 chars / 100 overlap) W->>O: Embed in batches of 5 O-->>W: Vectors W->>V: Append vectors + metadata B-->>F: SUCCESS — index reloaded F-->>U: "Knowledge Sync Complete" </div> <h2 id="the-takeaway">The takeaway</h2> <p>Ingestion is where most RAG quality is actually won or lost — long before any clever prompting. 
Page numbers preserved, headers carried into every spreadsheet chunk, scans rescued by OCR, and a ledger that makes the whole thing crash-proof: none of it is glamorous, all of it shows up later as answers that cite the right page.</p> <hr> <h3 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h3> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>LLM</strong></td> <td>Large Language Model</td> <td>A neural network trained on huge amounts of text that can read and generate language</td> </tr> <tr> <td><strong>DBOS</strong></td> <td>Database-Oriented Operating System</td> <td>The library that checkpoints workflow steps in PostgreSQL so crashed jobs resume</td> </tr> <tr> <td><strong>SHA-256</strong></td> <td>Secure Hash Algorithm, 256-bit</td> <td>A content fingerprint — change one byte of a file and the hash changes completely</td> </tr> <tr> <td><strong>OCR</strong></td> <td>Optical Character Recognition</td> <td>Reading text out of images — the rescue path for scanned PDF pages</td> </tr> <tr> <td><strong>SSRF</strong></td> <td>Server-Side Request Forgery</td> <td>An attack where a server is tricked into fetching internal URLs; the URL importer blocks it</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>The vector index the embeddings are appended to</td> </tr> <tr> <td><strong>ANN</strong></td> <td>Approximate Nearest Neighbour</td> <td>The accuracy-for-speed shortcut CogniVault deliberately does <em>not</em> take</td> </tr> <tr> <td><strong>dpi</strong></td> <td>Dots Per Inch</td> <td>Image resolution — scanned pages are rendered at ~144 dpi before OCR</td> </tr> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The format of the chunk-metadata file beside the FAISS index</td> </tr> <tr> <td><strong>PDF / CSV</strong></td> <td>Portable Document Format / Comma-Separated Values</td> <td>Two of the 
eight-plus supported file formats</td> </tr> <tr> <td><strong>API</strong></td> <td>Application Programming Interface</td> <td>The endpoints (<code>/upload</code>, <code>/ingest</code>, <code>/ingest/status/…</code>) driving the flow</td> </tr> </tbody> </table> <hr> <p><strong>Next up:</strong> Part 3 — hybrid retrieval, the six-tool agent, and the two-phase stream that shows the model thinking before it answers.</p> </article> <article> <h1>CogniVault Backend Explained, Part 3 · How a Question Becomes a Cited Answer</h1> <p>Fri, 12 Jun 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>You type a question. A few seconds later you get an answer with footnotes — the exact documents and pages it came from. This part walks through everything that happens in between.</p> <p>In Part 2 we built the knowledge base: every document chunked, embedded, and indexed. Now we get to <em>use</em> it — and this is where CogniVault stops being a pipeline and starts being interesting.</p> <h2 id="two-librarians-because-one-keeps-failing-you">Two librarians, because one keeps failing you</h2> <p>Imagine a library with one librarian who organises everything by <em>vibe</em>. Ask her about “server downtime procedures” and she’s brilliant — she understands what you mean and finds documents that discuss the concept, whatever words they use. But ask her for “Error Code 404B” and she shrugs, handing you general networking guides. She doesn’t do exact strings.</p> <p>Down the hall is a second librarian with a card catalogue. 
He finds the exact string “404B” instantly — but ask him a conceptual question phrased differently from the source text, and he finds nothing at all.</p> <p>These are the two halves of search:</p> <ul> <li><strong>Semantic search (FAISS)</strong> — your question is embedded into a vector, and the index finds chunks whose vectors point the same way (technically: cosine similarity — how closely two arrows align). Great for meaning, blind to exact identifiers.</li> <li><strong>Keyword search (BM25)</strong> — a scoring formula that rewards chunks containing your <em>exact</em> words, weighted by how distinctive those words are. Great for identifiers, blind to synonyms.</li> </ul> <p>CogniVault asks <strong>both librarians every time</strong>, then merges their answers with <strong>Reciprocal Rank Fusion (RRF)</strong> — a formula that combines ranked lists using only the <em>positions</em>:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">score(chunk) = sum over both lists of 1 / (60 + rank) </span></span></code></pre></div><p>A chunk ranked highly by either librarian scores well; a chunk both of them liked floats to the top. The elegance is what’s <em>missing</em>: you never have to reconcile FAISS’s similarity scores with BM25’s completely different scale, because ranks are the only input. The constant 60 comes straight from the original 2009 research paper, and yes, it’s cited in the code.</p> <p>A few implementation details worth knowing: both searches deliberately over-fetch (at least 20 candidates each) so the fusion has material to work with; very weak semantic matches are dropped, but a keyword-perfect chunk can still be rescued through fusion; and the final answer uses the top 7 chunks. 
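</p> <p>The fusion step is short enough to sketch in a few lines of Python. This is an illustration under assumptions, not the project’s code: the function name <code>reciprocal_rank_fusion</code> and the chunk IDs are hypothetical, and the sketch assumes ranks start at 1, which the formula above does not spell out.</p>

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60, top_n: int = 7
) -> list[tuple[str, float]]:
    """Merge ranked lists of chunk IDs using only their positions.

    Hypothetical helper; assumes 1-based ranks.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it returned.
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_n]


# Made-up results from the two "librarians".
semantic = ["c3", "c1", "c7"]
keyword = ["c9", "c3", "c4"]
fused = reciprocal_rank_fusion([semantic, keyword])
```

<p>In the sample, <code>c3</code> appears in both lists, so it collects two contributions and ends up first, ahead of chunks that only one search returned.</p> <p>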
I benchmarked this whole setup against pure vector search in if you want the war stories.</p> <h2 id="the-agent-a-model-that-decides-for-itself">The agent: a model that decides for itself</h2> <p>Here’s the second idea that trips up beginners: CogniVault’s chat is not “paste chunks into a prompt, get an answer.” It’s an <strong>agent</strong> — a model running in a loop where it can <em>choose</em> to call tools, read their results, and only then answer.</p> <p>Built with the Strands Agents SDK, the agent gets six tools:</p> <table> <thead> <tr> <th>Tool</th> <th>Job</th> </tr> </thead> <tbody> <tr> <td><code>search_knowledge_base</code></td> <td>The core RAG tool — runs the hybrid search above, returns chunks with source and page</td> </tr> <tr> <td><code>list_documents</code></td> <td>See what’s in the vault</td> </tr> <tr> <td><code>analyze_document</code></td> <td>Structured analysis of one document: topics, entities, facts, summary</td> </tr> <tr> <td><code>compare_documents</code></td> <td>Answer a question by comparing two documents side by side</td> </tr> <tr> <td><code>calculator</code></td> <td>Safe maths — the expression is parsed into a syntax tree and only whitelisted operators run. No <code>eval()</code>, ever</td> </tr> <tr> <td><code>current_time</code></td> <td>The date and time</td> </tr> </tbody> </table> <p>There is no hard-coded routing. The <em>model</em> reads your question and decides which tools to call, guided by its system prompt. 
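</p> <p>The calculator row deserves a quick illustration. Below is a minimal sketch of whitelist-based evaluation using Python’s standard <code>ast</code> module. The helper names <code>safe_eval</code> and <code>_evaluate</code> and the exact operator whitelist are assumptions for this example; the real tool may allow a different set.</p>

```python
import ast
import operator

# Whitelisted operators (an assumed set); anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST) -> float:
    """Walk the syntax tree, computing only whitelisted node types."""
    if isinstance(node, ast.Constant):
        value = node.value
        # Accept plain numbers only; booleans and strings are rejected.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        func = _BINARY_OPS[type(node.op)]
        return func(_evaluate(node.left), _evaluate(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def safe_eval(expression: str) -> float:
    """Parse the expression into a tree and evaluate it without eval()."""
    tree = ast.parse(expression, mode="eval")
    return _evaluate(tree.body)


result = safe_eval("0.15 * 2340")
```

<p>Anything outside the whitelist, such as a function call or an attribute lookup, raises <code>ValueError</code> instead of executing.</p> <p>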
Ask “compare the two contracts on termination clauses” and it reaches for <code>compare_documents</code>; ask “what’s 15% of 2,340” and it uses the calculator instead of hallucinating arithmetic.</p> <p>Two safety details I want beginners to notice, because they’re the difference between a toy and a product: a <strong>fresh agent is constructed for every request</strong> (no shared state bleeding between concurrent chats), and the document-analysis tools call the model <em>directly</em> rather than through the agent — otherwise an agent calling a tool that calls the agent could recurse forever.</p> <h2 id="watching-the-model-think">Watching the model think</h2> <p>When you send a message, the response streams back as <strong>NDJSON</strong> (Newline-Delimited JSON — each line of the stream is its own small JSON object). And it arrives in two phases:</p> <p><strong>Phase 1 — thinking.</strong> Gemma’s reasoning chain streams first, rendered in the collapsible panel above the answer. It’s deliberately best-effort: if it fails for any reason, the answer still comes.</p> <p><strong>Phase 2 — the agent answer.</strong> Tools run, citations appear in the Sources panel the moment the search completes — <em>before</em> the answer finishes writing — and the answer text streams in.</p> <div class="mermaid">flowchart TB Q["Your question<br/>(plus optional images, files, scope)"] --> P1 subgraph STREAM["POST /rag — one NDJSON stream"] P1["Phase 1: Thinking<br/>reasoning chunks stream first"] P1 --> P2["Phase 2: Agent<br/>fresh per request, history restored"] P2 -->|"decides to call"| T["search_knowledge_base"] T --> D["FAISS<br/>semantic"] T --> S["BM25<br/>keywords"] D --> RRF["RRF fusion — top 7 chunks"] S --> RRF RRF -->|"chunks + citations"| P2 P2 --> OUT["citations, then answer text,<br/>then a memory-usage report"] end </div> <p>Each line in the stream is typed: <code>thinking</code>, <code>metadata</code> (a citation), <code>text</code> (answer), <code>memory</code> 
(how full the conversation budget is), or <code>error</code>. The frontend just reads lines and routes them to the right panel. I dissected this design — and why thinking comes <em>before</em> the tool calls — in .</p> <h2 id="a-memory-budget-not-a-bottomless-pit">A memory budget, not a bottomless pit</h2> <p>Gemma’s context window (the amount of text the model can consider at once) is 128K tokens, but CogniVault doesn’t let conversation history sprawl across all of it. Each chat session gets a budget of 48,000 characters — roughly 12,000 tokens. Exceed it, and the <em>oldest</em> question-answer pair quietly drops out first, keeping the bulk of the window free for what matters: your current question and the retrieved chunks.</p> <p>Two resilience touches worth stealing for your own projects:</p> <ul> <li><strong>Restart survival.</strong> In-memory history dies with the process. So the first message in a session after a backend restart rebuilds its history from the chat log the frontend persists. Multi-turn memory survives reboots.</li> <li><strong>Edit and regenerate.</strong> Editing an earlier message rewinds the stored history to that point before re-asking — the model genuinely forgets the timeline that no longer exists.</li> </ul> <h2 id="scope-pinning-the-ai-to-specific-documents">Scope: pinning the AI to specific documents</h2> <p>One last feature, and a lesson about small local models. You can pin a chat to specific files or a category. The filter travels with the request <em>and</em> a mandatory-search instruction is injected into both the system prompt and the user message itself.</p> <p>Why both? Because small models sometimes skip instructions that live only in the system prompt — but they can’t ignore what’s inside the question. Belt and braces. 
When you work with 4-billion-parameter models instead of frontier ones, you learn to make instructions impossible to miss rather than hoping they’re followed.</p> <h2 id="the-takeaway">The takeaway</h2> <p>A cited answer is four systems cooperating: two retrievers covering each other’s blind spots, a fusion formula that needs nothing but ranks, an agent that picks its own tools, and a stream that shows its work. None of the four is exotic on its own — the product is the cooperation.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>RAG</strong></td> <td>Retrieval-Augmented Generation</td> <td>Retrieve relevant passages from your own documents first; let the model answer from them</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>The semantic (meaning-based) half of hybrid search</td> </tr> <tr> <td><strong>BM25</strong></td> <td>Best Match 25</td> <td>The keyword half — a classic ranking formula from the Okapi information-retrieval system</td> </tr> <tr> <td><strong>RRF</strong></td> <td>Reciprocal Rank Fusion</td> <td>Merges the two ranked lists using only each chunk’s rank: <code>score = Σ 1/(60 + rank)</code></td> </tr> <tr> <td><strong>NDJSON</strong></td> <td>Newline-Delimited JSON</td> <td>A stream where each line is its own complete JSON object — the chat response format</td> </tr> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The universal text format for structured data</td> </tr> <tr> <td><strong>AST</strong></td> <td>Abstract Syntax Tree</td> <td>The parsed form of an expression — how the calculator does maths without <code>eval()</code></td> </tr> <tr> <td><strong>LLM</strong></td> <td>Large Language Model</td> <td>A neural network trained on huge amounts of text that can read and generate language</td> </tr> <tr> 
<td><strong>SDK</strong></td> <td>Software Development Kit</td> <td>A library of building blocks — here, Strands, which provides the agent loop</td> </tr> <tr> <td><strong>K</strong> (in 128K)</td> <td>Kilo (thousand)</td> <td>128K tokens ≈ 128,000 tokens — Gemma’s context window</td> </tr> </tbody> </table> <hr> <p><strong>Next up:</strong> Part 4 — the same machinery pointed at generating quizzes, workshops, flashcards, and mindmaps, plus a table of every byte the app stores and exactly where it lives.</p> </article> <article> <h1>CogniVault Backend Explained, Part 4 · Study Tools, Progress, and the Privacy Receipts</h1> <p>Fri, 12 Jun 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>In Part 3 we followed a question through hybrid retrieval and the agent loop to a cited answer. In this final part, the same machinery gets pointed at a different goal: <em>teaching you</em> — and then we close the series by auditing the project’s central promise: nothing leaves your machine.</p> <h2 id="one-recipe-four-study-tools">One recipe, four study tools</h2> <p>CogniVault generates quizzes, multi-lesson workshops, flashcard decks, and mindmaps from your documents. Four different outputs — but under the hood, one shared five-step recipe:</p> <ol> <li> <p><strong>Retrieve.</strong> The same hybrid search from Part 3, but instead of your question, the probe is a broad query like <em>“key concepts, definitions, important facts, main ideas”</em>, scoped to the documents you selected. Up to 15 representative chunks come back.</p> </li> <li> <p><strong>Prompt from a template.</strong> The instructions sent to Gemma are not buried in Python — they’re editable Markdown files in <code>backend/prompts/</code> (<code>quiz.md</code>, <code>flashcards.md</code>, and so on). 
Drop a modified copy into <code>backend/prompts/custom/</code> and it overrides the shipped version on the very next request. No restart, no code change. Prompt engineering as configuration.</p> </li> <li> <p><strong>Constrain the output.</strong> Asking a small local model to “please return JSON” works most of the time — and <em>most of the time</em> is a production bug. CogniVault uses Ollama’s grammar-constrained generation (<code>format="json"</code>), which makes invalid JSON impossible rather than unlikely, plus low temperature for consistency. The full saga of getting reliable structure out of a 4-billion-parameter model is told in a separate post.</p> </li> <li> <p><strong>Validate defensively.</strong> Every generated item is checked field by field, and malformed items are <em>dropped</em> rather than failing the whole batch. Small models occasionally fumble one question out of ten; a product shouldn’t collapse because of it.</p> </li> <li> <p><strong>Persist.</strong> Everything lands in SQLite, so quizzes are resumable, workshop progress survives restarts, and flashcard statuses are remembered per deck.</p> </li> </ol> <p>Here’s the recipe in motion for a quiz:</p> <div class="mermaid">
%%{init: {'sequence': {'actorFontSize': 28, 'messageFontSize': 24, 'loopTextFontSize': 22, 'noteFontSize': 22}}}%%
sequenceDiagram
    actor U as You
    participant F as Study Hub UI
    participant B as FastAPI
    participant V as VectorDB
    participant O as Ollama (gemma4:e4b)
    participant S as SQLite
    U->>F: Pick scope, difficulty, question count
    F->>B: POST /api/study/quiz/generate
    B->>V: Hybrid search, scoped to your documents
    V-->>B: Up to 15 representative chunks
    B->>B: Render the quiz.md prompt template
    B->>O: chat(format="json", low temperature)
    O-->>B: Grammar-constrained JSON
    B->>B: Validate each question, drop bad ones
    B->>S: Save quiz (resumable later)
    B-->>F: Typed response
    F-->>U: Play, submit, score — and maybe a new badge
</div> <p>The four tools differ only in their template and their shape: 
quizzes produce multiple-choice and true/false questions with explanations; workshops produce an outline first and then write each lesson <em>on demand</em> when you open it; flashcards produce front/back pairs; mindmaps produce a topic tree that the frontend renders as an interactive diagram. (That renderer is its own adventure, covered in a separate post.)</p> <h2 id="sessions-that-track-themselves">Sessions that track themselves</h2> <p>Most study apps make you press a start button, and most people forget. CogniVault takes a different stance: <strong>study sessions are inferred, not declared</strong>.</p> <p>Every chat message either extends the current session or — after a 15-minute idle gap — quietly starts a new one. Walk away for coffee, come back, keep working: same session. Come back tomorrow: new session. No buttons, no forgetting.</p> <p>Each message also records a tiny event (timestamp, whether you used a scope filter or attachments) into <code>progress.db</code> — a SQLite database, which is a complete relational database living in a single file. Eleven tables hold everything: sessions, message events, earned badges, quiz attempts and saved quizzes, workshops and lessons, decks and cards, and mindmaps.</p> <p>One engineering note worth copying: the tracking call inside the chat endpoint is wrapped so that it can <em>never</em> block or break the chat. Analytics must be a passenger, never a driver.</p> <h2 id="25-badges-defined-as-data">25 badges, defined as data</h2> <p>The achievements aren’t scattered through the code as <code>if</code> statements. They live in one JSON file — 25 entries, each with a code, a name, an icon, the metric it watches, and a target. After each relevant action, an evaluator checks every definition against the database and persists anything newly earned. Some badges form ladders, each pointing to its next level.</p> <p>Declarative beats imperative here for a simple reason: adding badge number 26 means adding a JSON entry, not writing new logic. 
The design behind the streaks, the idle-gap rule, and the 90-day heatmap got its own post.</p> <h2 id="voice-input-without-a-cloud-microphone">Voice input, without a cloud microphone</h2> <p>The microphone button is powered by <strong>faster-whisper</strong> — OpenAI’s Whisper speech-recognition model re-implemented on a faster inference engine — running on your CPU with int8 quantisation (8-bit numbers instead of 32-bit: smaller, faster, accurate enough). No audio ever leaves the machine.</p> <p>The model is lazy-loaded on the first transcription so app startup stays instant, and if faster-whisper isn’t installed at all, the frontend simply hides the mic button. Features should degrade, not detonate.</p> <h2 id="the-privacy-receipts">The privacy receipts</h2> <p>The series began with a promise: <em>nothing leaves your machine.</em> Promises are cheap — here’s the audit. Every byte CogniVault stores, and where it lives:</p> <table> <thead> <tr> <th>Data</th> <th>Location</th> <th>Format</th> </tr> </thead> <tbody> <tr> <td>Your uploaded files</td> <td><code>docs/</code> folder</td> <td>The original files</td> </tr> <tr> <td>Search vectors</td> <td><code>vector_store.faiss</code></td> <td>FAISS binary index</td> </tr> <tr> <td>Chunk text and metadata</td> <td><code>vector_store.json</code></td> <td>JSON</td> </tr> <tr> <td>File-to-category map</td> <td><code>categories.json</code></td> <td>JSON</td> </tr> <tr> <td>Chat sessions</td> <td><code>chat_history.json</code></td> <td>JSON</td> </tr> <tr> <td>Sessions, badges, quizzes, workshops, decks, mindmaps</td> <td><code>progress.db</code></td> <td>SQLite</td> </tr> <tr> <td>Ingestion checkpoints</td> <td>PostgreSQL (local Docker volume)</td> <td>DBOS system tables</td> </tr> <tr> <td>The AI models themselves</td> <td>Ollama’s local model store</td> <td>Model weights</td> </tr> </tbody> </table> <p>Nothing in that table is on someone else’s computer. Inference goes to <code>localhost</code>. 
Embeddings go to <code>localhost</code>. The only outbound request the backend ever makes is the URL-import feature — at your explicit request, and guarded against fetching private addresses. The app even surfaces these stats live in its Privacy Vault Audit panel.</p> <p>And because trust needs more than a table: the whole backend is covered by a pytest suite you can run yourself — the approach is documented in the testing post, <em>Testing a Local-AI App: 351 Tests, Zero Infrastructure</em>.</p> <h2 id="series-wrap-up">Series wrap-up</h2> <p>Four parts, one architecture:</p> <ol> <li><strong>Part 1</strong> — three processes, four layers, and a decoder ring for the jargon</li> <li><strong>Part 2</strong> — a durable, format-aware pipeline that turns any document into searchable vectors</li> <li><strong>Part 3</strong> — two retrievers covering each other’s blind spots, fused by rank, driven by an agent</li> <li><strong>Part 4</strong> — the same machinery generating study materials, tracking progress without buttons, and a storage map with no cloud rows in it</li> </ol> <p>If there’s one theme, it’s this: <strong>boring, verifiable choices in service of privacy</strong>. Exact search instead of approximate. SQLite files instead of hosted databases. Grammar-constrained JSON instead of hopeful parsing. Soft deletes instead of clever index surgery. 
Every piece is something you can open, read, and check — which is exactly the point.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The structured format the generators force the model to produce</td> </tr> <tr> <td><strong>SQLite / SQL</strong></td> <td>(SQL = Structured Query Language)</td> <td>A complete relational database living in one file, <code>progress.db</code></td> </tr> <tr> <td><strong>MCQ</strong></td> <td>Multiple-Choice Question</td> <td>One of the two quiz question types (the other is true/false)</td> </tr> <tr> <td><strong>CPU</strong></td> <td>Central Processing Unit</td> <td>Where Whisper runs — no graphics card required</td> </tr> <tr> <td><strong>int8</strong></td> <td>8-bit integer (quantisation)</td> <td>Storing model weights as small integers: smaller, faster, accurate enough</td> </tr> <tr> <td><strong>AI</strong></td> <td>Artificial Intelligence</td> <td>Software performing tasks that normally need human intelligence</td> </tr> <tr> <td><strong>API</strong></td> <td>Application Programming Interface</td> <td>The endpoints the Study Hub and dashboard call</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>The vector index in the privacy-receipts table</td> </tr> <tr> <td><strong>DBOS</strong></td> <td>Database-Oriented Operating System</td> <td>The durable-workflow library whose checkpoints live in PostgreSQL</td> </tr> <tr> <td><strong>SSRF</strong></td> <td>Server-Side Request Forgery</td> <td>The attack class the URL importer guards against</td> </tr> <tr> <td><strong>PNG / PDF</strong></td> <td>Portable Network Graphics / Portable Document Format</td> <td>Two of the mindmap export formats (plus Markdown)</td> </tr> <tr> <td><strong>SVG</strong></td> <td>Scalable Vector 
Graphics</td> <td>The browser drawing format behind the interactive mindmap rendering</td> </tr> </tbody> </table> <hr> <p><strong>Next steps:</strong> clone the repository and read along — the README maps the full architecture, and every claim in this series can be checked directly against the code in <code>backend/</code>. And if you want the deep-dive versions of these topics, the deep-dive series picks up where this tour ends.</p> </article> <article> <h1>Part 3 · CogniVault Architecture: Why We Keep Ollama Out of Docker</h1> <p>Wed, 03 Jun 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>The golden rule of modern software deployment is containerization. Put everything in Docker to isolate the dependencies, and it will run the exact same way on every machine.</p> <p>When initially designing CogniVault, the impulse was to put the FastAPI server, the PostgreSQL database, and the Ollama LLM engine all inside a single, secure Docker network.</p> <p>But we didn’t. We left Ollama running natively on the host machine. Let’s break down why.</p> <h2 id="the-gpu-passthrough-problem">The GPU Passthrough Problem</h2> <p>Think of your GPU like the kitchen in a restaurant. The chefs (your AI models) need to <em>be in the kitchen</em> — standing at the stove, hands on the equipment. Now imagine telling the chefs they must cook from a sealed meeting room down the hall, passing instructions through a serving hatch. Technically food might still come out. It will not come out fast.</p> <p>That sealed room is a container. Large Language Models like Gemma 4 need direct, unhindered access to your hardware’s GPU (like Apple Silicon’s Unified Memory or a dedicated Nvidia card) to generate text fast enough for a real-time chat interface. 
And the picture varies by platform:</p> <ul> <li><strong>On macOS</strong>, Docker runs containers inside a lightweight virtual machine — and there is currently <strong>no GPU (Metal) passthrough at all</strong>. An Ollama container on a Mac runs CPU-only. For a chat app, that’s disqualifying on its own.</li> <li><strong>On Linux</strong>, Nvidia GPU passthrough exists and works, but it requires extra toolkit configuration that breaks the “it just works” philosophy of local development.</li> </ul> <p>Running Ollama natively sidesteps the whole category of problems.</p> <h2 id="the-bridge-solution">The Bridge Solution</h2> <p>CogniVault uses a split deployment model, separating the application logic from the heavy AI processing.</p> <ol> <li><strong>The Secure Rooms (Docker):</strong> PostgreSQL — which holds the DBOS workflow ledger from Part 2 — lives in a <strong>Docker Bridge Network</strong> (a private virtual network). Isolated, clean, reproducible.</li> <li><strong>The Main Building (Native Host):</strong> Ollama runs directly on your Mac, Windows, or Linux host OS, giving it direct metal access to your GPU.</li> </ol> <p>CogniVault actually ships <strong>two run modes</strong>, and it’s worth being precise about them:</p> <ul> <li><strong>The default (<code>scripts/start.sh</code>):</strong> only PostgreSQL runs in Docker. The FastAPI backend runs natively too (<code>python -m backend.main</code>), right next to Ollama. Simplest possible loop for local development.</li> <li><strong>The fully containerized mode (<code>docker-compose.yaml</code>):</strong> the FastAPI app joins Postgres inside the compose network. 
In this mode the app container reaches the native Ollama engine through a special Docker routing address: <code>host.docker.internal:11434</code>.</li> </ul> <p>Either way, the rule stays the same: <strong>the model never goes in the box.</strong></p> <div class="mermaid">
graph TD
    Client[📱 Browser / User] -->|HTTP: 8000| App
    subgraph Host Machine [Host OS: Native GPU Access]
        Ollama[🧠 Ollama Engine]
        Models[(gemma4:e4b)]
        Ollama <--> Models
        subgraph Docker Compose Network
            App[🖥️ FastAPI App Container]
            Postgres[(🐘 PostgreSQL)]
            App <-->|Internal Port 5432| Postgres
        end
        App <-->|host.docker.internal:11434| Ollama
    end
</div> <h3 id="what-about-the-vector-database">What about the Vector Database?</h3> <p>You might notice FAISS isn’t a container here. Unlike massive SQL databases, FAISS is extremely lightweight. In CogniVault, FAISS runs directly inside the FastAPI Python process’s memory and saves its data to a local folder. It doesn’t need its own container.</p> <p>By keeping the heavy LLM lifting on the metal and the bookkeeping in containers, we get the balance that notoriously trips up local AI development: zero dependency conflicts combined with maximum AI performance.</p> <hr> <h3 id="see-it-in-action">See It In Action</h3> <p>That wraps up the CogniVault architecture series! 
If you want to run this 100% local, privacy-first Study Companion on your own hardware:</p> <ul> <li><strong>Grab the code:</strong> </li> <li><strong>Watch the walkthrough:</strong> </li> </ul> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>GPU</strong></td> <td>Graphics Processing Unit</td> <td>The hardware that makes local model inference fast; containers struggle to reach it</td> </tr> <tr> <td><strong>LLM</strong></td> <td>Large Language Model</td> <td>A neural network trained on huge amounts of text that can read and generate language</td> </tr> <tr> <td><strong>AI</strong></td> <td>Artificial Intelligence</td> <td>Software performing tasks that normally need human intelligence</td> </tr> <tr> <td><strong>API</strong></td> <td>Application Programming Interface</td> <td>The set of URLs the frontend calls to talk to the backend</td> </tr> <tr> <td><strong>HTTP</strong></td> <td>HyperText Transfer Protocol</td> <td>The protocol browsers and APIs use to exchange requests and responses</td> </tr> <tr> <td><strong>OS</strong></td> <td>Operating System</td> <td>macOS, Windows, or Linux — where Ollama runs natively</td> </tr> <tr> <td><strong>DBOS</strong></td> <td>Database-Oriented Operating System</td> <td>The durable-workflow library whose ledger lives in the Postgres container (see Part 2)</td> </tr> <tr> <td><strong>SQL</strong></td> <td>Structured Query Language</td> <td>The language of relational databases like PostgreSQL</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>The in-process vector index — deliberately <em>not</em> a separate container</td> </tr> <tr> <td><strong>VM</strong></td> <td>Virtual Machine</td> <td>The hidden layer Docker uses on macOS — and the reason Mac containers can’t reach the GPU</td> </tr> </tbody> </table> </article> <article> 
<h1>Part 2 · CogniVault Architecture: Durable Ingestion with DBOS</h1> <p>Tue, 02 Jun 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>In a basic local AI setup, adding documents to your database is usually just a simple Python script. You open a PDF, chop the text into chunks, turn those chunks into math (embeddings), and save them.</p> <p>This works great for a five-page essay. But what happens when you are ingesting a 1,000-page technical manual and your laptop goes to sleep at page 800?</p> <p>The script dies. When you wake your laptop up, you have to start all over from page 1, wasting time and compute power. A simple script wasn’t going to cut it for CogniVault. We needed a <strong>Durable Workflow</strong>.</p> <h2 id="the-factory-ledger-dbos">The Factory Ledger (DBOS)</h2> <p>Think of data ingestion like a factory assembly line. If the power goes out, the workers shouldn’t have to rebuild every product from scratch. They should just look at a permanent ledger, see exactly which box they were packing when the lights went out, and resume from there.</p> <p>CogniVault uses a framework called <strong>DBOS (Database-Oriented Operating System)</strong> backed by a PostgreSQL database to act as this ledger.</p> <p>Every step of the ingestion process records its completion in Postgres. If the server crashes mid-way, nothing dramatic happens in the moment — the magic is on restart: DBOS reads the ledger, sees which steps already finished, replays their recorded results instantly, and resumes from the first unfinished step.</p> <p>One important boundary: Postgres holds <strong>only the ledger</strong> — which steps ran and what they returned. Your documents, chunks, and vectors never live there. 
They go to a FAISS index plus a JSON metadata file on disk.</p> <h2 id="sha-256-hashing-the-idempotency-trick">SHA-256 Hashing: The Idempotency Trick</h2> <p>The system also needs to be smart about re-uploads. If you fix a typo in a massive document and upload it again, you don’t want the system to waste 10 minutes re-embedding the whole thing.</p> <p>CogniVault achieves <strong>Idempotency</strong> (the ability to run the same operation multiple times without changing the result beyond the initial application) with the workflow’s very first step: it scans the <code>docs/</code> folder and generates a <strong>SHA-256 hash</strong> (a unique digital fingerprint) for every file.</p> <ul> <li>If the hash is new, it processes the file.</li> <li>If the hash has changed (because you edited the file), it soft-deletes the old text chunks and only re-embeds the new version.</li> <li>If the hash is identical, it skips the file entirely.</li> </ul> <p>Here is how this flows logically:</p> <div class="mermaid">
graph TD
    Raw[📄 Uploaded Document] --> DBOS[🐘 DBOS Workflow Starts]
    subgraph Durable Ingestion Pipeline
        DBOS -->|Step 1| Hash{Hash Check SHA-256}
        Hash -->|Unchanged| Skip[Skip Processing]
        Hash -->|New / Changed| Extract[✂️ Step 2: Extract Text per Document]
        Extract --> Chunk[Chunk: 1000 chars, 100 overlap]
        Chunk -->|Step 3, batches of 5| Embed[🔢 embeddinggemma Embeddings]
        Embed -->|Step 4| Save[(💾 FAISS Index + Metadata JSON)]
    end
    Save -->|Workflow Complete| Done[✅ Ready for Search]
</div> <p>(A detail for the curious: the checkpointed <em>steps</em> are the scan, the per-document extraction, each embedding batch, and the save. 
The chunking in between is fast pure-Python work, so it simply re-runs as part of the workflow body — checkpointing it would cost more than redoing it.)</p> <hr> <h3 id="whats-next">What’s Next?</h3> <p>By wrapping the ingestion pipeline in DBOS, the system transforms from a fragile script into a resilient, production-grade state machine.</p> <p>Now that our data is safely ingested, how do we deploy this entire pipeline without melting our laptop’s GPU? <strong>Read Part 3: Why We Keep Ollama Out of Docker</strong></p> <p><em>You can also explore the DBOS implementation directly in the <code>backend/services/ingest.py</code> file of the repository.</em></p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>DBOS</strong></td> <td>Database-Oriented Operating System</td> <td>A library that checkpoints workflow steps in a database so crashed jobs resume instead of restarting</td> </tr> <tr> <td><strong>SHA-256</strong></td> <td>Secure Hash Algorithm, 256-bit</td> <td>A fingerprint function: any file maps to a unique 64-character hash; change one byte and the hash changes completely</td> </tr> <tr> <td><strong>PDF</strong></td> <td>Portable Document Format</td> <td>The document format whose text (and scans) the pipeline extracts</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>Meta’s vector-search library — where the embeddings actually live</td> </tr> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The text format used for the chunk-metadata file stored next to the FAISS index</td> </tr> <tr> <td><strong>AI</strong></td> <td>Artificial Intelligence</td> <td>Software performing tasks that normally need human intelligence</td> </tr> <tr> <td><strong>GPU</strong></td> <td>Graphics Processing Unit</td> <td>The hardware that makes local model inference fast 
— the subject of Part 3</td> </tr> </tbody> </table> </article> <article> <h1>Part 1 · CogniVault Architecture: Why Standard RAG Isn't Enough (Hybrid Search)</h1> <p>Mon, 01 Jun 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>Vector search is the process of finding the most similar items in a dataset based on their vector embeddings. This is how RAG systems usually work. But what happens when you need to find the most similar items in a dataset based not only on their semantic meaning but also on the exact wording of the query?</p> <p>This becomes critical when the information you’re looking for isn’t just related but must match a specific string or keyword exactly.</p> <h2 id="two-ways-of-finding-a-book">Two ways of finding a book</h2> <p>Picture a good local bookshop. The owner has read everything, and she recommends by <em>feel</em>. Tell her you loved <em>The Martian</em> and she hands you <em>Project Hail Mary</em> — different title, different plot, but the same DNA: a lone scientist, an impossible survival problem, jokes under pressure. Ask for “something like <em>Pride and Prejudice</em>” and you’ll walk out with <em>Emma</em>. She isn’t matching words. She’s matching <em>meaning</em>.</p> <p>Now ask her a different kind of question: “I need the book with ISBN 978-0-553-41802-6,” or “the manual that mentions error code 404B on the cover.” Her superpower is useless here. No amount of literary intuition finds an exact string. For that, you walk to the till and check the <strong>catalogue</strong> — a boring, literal index that knows exactly which shelf holds which identifier, and nothing about vibes.</p> <p>A well-run bookshop needs both. 
So does a well-run RAG system:</p> <ol> <li><strong>FAISS — Facebook AI Similarity Search (the well-read owner):</strong> a vector index that finds chunks of text whose <em>meaning</em> is mathematically close to your prompt. Brilliant for “how is the practical exam structured?”, blind to “§3 Absatz 2”.</li> <li><strong>BM25 — Best Match 25 (the catalogue):</strong> a classic keyword-scoring algorithm that rewards exact word matches, weighted by how rare and distinctive those words are. Brilliant for identifiers and quoted phrases, blind to paraphrase.</li> </ol> <p>CogniVault runs <strong>both</strong> retrievers on every search — this is <strong>Hybrid Search</strong> — and then merges the two ranked lists with a formula called <strong>Reciprocal Rank Fusion (RRF)</strong>. RRF scores each chunk purely by its <em>position</em> in each list: a chunk ranked highly by either retriever scores well, and a chunk both retrievers agree on rises to the top. Because only ranks are used, the two retrievers’ incompatible scoring scales never have to be reconciled.</p> <h2 id="the-agent-decides-when-to-search">The agent decides when to search</h2> <p>Here’s the part most diagrams get backwards (mine included, in an earlier draft): retrieval doesn’t happen <em>before</em> the model gets involved. It happens <em>inside</em> the model’s own loop.</p> <p>CogniVault wraps Gemma in the <strong>Strands Agents SDK</strong>. The model receives your question along with a set of <strong>Tools</strong> (pre-written Python functions like <code>search_knowledge_base</code>, <code>calculator</code>, or <code>compare_documents</code>). It then reasons about the question and <em>decides for itself</em> whether — and which — tools to call. 
For most document questions it calls <code>search_knowledge_base</code>, reads the retrieved chunks, and only then writes its answer, grounded in what it found.</p> <p>Here is the architectural blueprint of that loop:</p> <div class="mermaid">
graph TD
    Client[📱 User Query] --> App[🖥️ FastAPI Server]
    subgraph AgentLoop["The Strands Agent Loop (powered by Gemma 4)"]
        App --> Agent[🧠 Agent reasons about the question]
        Agent -->|Decides to search| Search[search_knowledge_base]
        subgraph Hybrid Search Engine
            Search -->|Semantic| FAISS[(FAISS Vector)]
            Search -->|Exact match| BM25[(BM25 Keyword)]
            FAISS --> RRF{RRF Fusion}
            BM25 --> RRF
        end
        RRF -->|Best chunks + citations| Agent
        Agent -->|Grounded answer| Answer[Streamed response]
    end
    Answer --> Client
</div> <p>One subtlety worth noting: the agent <em>is</em> Gemma. There is no separate “formatting model” at the end — the same model that decided to search also writes the final answer, now with the retrieved chunks in front of it.</p> <hr> <h3 id="whats-next">What’s Next?</h3> <p>Building a toy RAG app is easy, but building one that actually retrieves the exact document you need requires hybrid engines and an agent that knows when to use them.</p> <p>Want to see how this system safely ingests massive documents without losing work when something crashes? 
<strong>Read Part 2: Durable Ingestion with DBOS</strong></p> <p><em>Or, if you prefer to jump straight into the code, the hybrid search lives in <code>backend/services/vector_db.py</code> of the repository.</em></p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>RAG</strong></td> <td>Retrieval-Augmented Generation</td> <td>Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>Meta’s library for storing vectors and finding the most similar ones fast</td> </tr> <tr> <td><strong>BM25</strong></td> <td>Best Match 25</td> <td>A keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system</td> </tr> <tr> <td><strong>RRF</strong></td> <td>Reciprocal Rank Fusion</td> <td>A formula that merges multiple ranked lists using only each item’s rank: <code>score = Σ 1/(k + rank)</code></td> </tr> <tr> <td><strong>LLM</strong></td> <td>Large Language Model</td> <td>A neural network trained on huge amounts of text that can read and generate language</td> </tr> <tr> <td><strong>SDK</strong></td> <td>Software Development Kit</td> <td>A library of building blocks — here, Strands, which provides the agent loop</td> </tr> <tr> <td><strong>API</strong></td> <td>Application Programming Interface</td> <td>The set of URLs the frontend calls to talk to the backend</td> </tr> <tr> <td><strong>ISBN</strong></td> <td>International Standard Book Number</td> <td>The unique identifier printed on every published book — the catalogue’s best friend</td> </tr> </tbody> </table> </article> <article> <h1>Gemma CogniVault</h1> <p>Mon, 25 May 2026 00:00:00 +0000</p> <h2 id="overview">Overview</h2> <p><strong>Gemma CogniVault</strong> is a 100% local, 
privacy-first AI study companion. Your documents stay on your hardware. Inference runs via Ollama on <code>localhost</code>. No telemetry, no embeddings sent to third parties, no exceptions. A live Privacy Vault Audit Panel confirms zero external connections at runtime.</p> <p>It’s also genuinely capable — Gemma 4’s full surface (completion, vision, tools, reasoning) running on your laptop, wrapped in an app that turns your documents into <strong>quizzes, multi-lesson workshops, flashcard decks, and visual mindmaps</strong>, with a learning-progress dashboard and 25 achievement badges.</p> <h2 id="whats-inside">What’s inside</h2> <table> <thead> <tr> <th>Layer</th> <th>Technology</th> </tr> </thead> <tbody> <tr> <td><strong>LLM & Embeddings</strong></td> <td>Ollama · <code>gemma4:e4b</code> · <code>embeddinggemma</code></td> </tr> <tr> <td><strong>Agent Framework</strong></td> <td>Strands Agents SDK</td> </tr> <tr> <td><strong>Backend</strong></td> <td>FastAPI · Python 3.10+ · Pydantic</td> </tr> <tr> <td><strong>Vector Search</strong></td> <td>FAISS IndexFlatIP + BM25Okapi · Reciprocal Rank Fusion</td> </tr> <tr> <td><strong>Document Parsing</strong></td> <td>pypdf · python-docx · python-pptx · openpyxl · trafilatura</td> </tr> <tr> <td><strong>OCR</strong></td> <td>pytesseract · pymupdf · Pillow</td> </tr> <tr> <td><strong>Audio</strong></td> <td>faster-whisper</td> </tr> <tr> <td><strong>Workflow Engine</strong></td> <td>DBOS + PostgreSQL</td> </tr> <tr> <td><strong>Frontend</strong></td> <td>React 19 · TypeScript · Vite · Tailwind v4 · Framer Motion · TanStack Query</td> </tr> </tbody> </table> <h2 id="four-sections">Four sections</h2> <table> <thead> <tr> <th>Section</th> <th>What it’s for</th> </tr> </thead> <tbody> <tr> <td><strong>💬 Chat</strong></td> <td>Ask anything about your documents. Cited answers, scope filter, voice, attachments.</td> </tr> <tr> <td><strong>📚 Knowledge Base</strong></td> <td>Upload, categorise, and manage your documents. 
SHA-256 change detection on re-upload.</td> </tr> <tr> <td><strong>🎓 Study Hub</strong></td> <td>Four AI-powered study modes: Quiz · Workshop · Flashcards · Mindmaps.</td> </tr> <tr> <td><strong>📊 Dashboard</strong></td> <td>Total study time, current streak, 25 achievement badges, 90-day activity heatmap.</td> </tr> </tbody> </table> <h2 id="highlights">Highlights</h2> <ul> <li><strong>🧠 Thinking Mode</strong> — collapsible reasoning panel streams Gemma 4’s chain of thought before the answer</li> <li><strong>🔍 Hybrid Retrieval</strong> — FAISS dense + BM25 keyword fused with Reciprocal Rank Fusion</li> <li><strong>🖼️ Multimodal</strong> — attach images, PDFs, and DOCX inline in chat</li> <li><strong>🛟 Durable workflows</strong> — DBOS-checkpointed ingestion; crash-safe and resumable</li> <li><strong>🏆 25 achievement badges</strong> — auto-tracked across chat, quizzes, workshops, flashcards, mindmaps</li> <li><strong>🔒 Vault Audit Panel</strong> — live “zero external connections” indicator</li> </ul> <h2 id="writing-about-it">Writing about it</h2> <p>I’m publishing a series of posts unpacking the engineering decisions behind CogniVault — privacy framing, the retrieval stack, the agent loop, ingestion durability, getting JSON out of a local model, drawing mindmaps without a graph library, the gamification layer, and how the test suite avoids needing any infrastructure to run.</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>See the for the full series.</p> </blockquote> <h2 id="try-it">Try it</h2> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/ndimoforaretas/local-gemma-rag.git </span></span><span class="line"><span class="cl"><span class="nb">cd</span> local-gemma-rag </span></span><span class="line"><span class="cl">./scripts/setup.sh <span class="c1"># 
one-time</span> </span></span><span class="line"><span class="cl">./scripts/start.sh </span></span></code></pre></div><p>Then open .</p> </article> <article> <h1>Part 8 · Testing a Local-AI App: 351 Tests, Zero Infrastructure</h1> <p>Mon, 25 May 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>Part of a series on building . Previously: . All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>CogniVault has <strong>351 tests across 22 files</strong> (at the time of writing — the suite grows with the app). None of them need Ollama. None of them need Postgres. None of them need a real PDF, a microphone, or an internet connection. The whole suite runs in <strong>about three seconds</strong> on my laptop.</p> <p>That’s not because there isn’t much to test — the surface is wide. It’s because the test suite is built around one principle: <strong>mock at the edge, real everywhere else.</strong> This post is about what “the edge” means in a local-AI app, and how to draw the line so the suite stays useful instead of decorative.</p> <h2 id="the-22-test-files">The 22 test files</h2> <table> <thead> <tr> <th>File</th> <th>What it covers</th> </tr> </thead> <tbody> <tr> <td><code>test_api.py</code></td> <td>The HTTP endpoints (upload, ingest, RAG, history, KB browsing)</td> </tr> <tr> <td><code>test_tools.py</code></td> <td>Calculator, clock, KB search tool</td> </tr> <tr> <td><code>test_thinking.py</code></td> <td>Two-phase stream, thinking tokens, session isolation</td> </tr> <tr> <td><code>test_chat_attachments.py</code></td> <td>Multi-file attach, PDF/DOCX extraction, size limits</td> </tr> <tr> <td><code>test_chat_memory.py</code></td> <td>Session history budget, trimming, restart rebuild</td> </tr> <tr> <td><code>test_doc_scope_filter.py</code></td> <td>Per-request ContextVar isolation, search filtering</td> </tr> <tr> 
<td><code>test_doc_tools.py</code></td> <td><code>list_documents</code>, <code>analyze_document</code>, <code>compare_documents</code></td> </tr> <tr> <td><code>test_edit_regenerate.py</code></td> <td>History rewind, trim_history_to_turns validation</td> </tr> <tr> <td><code>test_structure_chunking.py</code></td> <td>Markdown header splits, CSV row batches, doc types</td> </tr> <tr> <td><code>test_ocr_fallback.py</code></td> <td>OCR trigger threshold, graceful degradation</td> </tr> <tr> <td><code>test_new_formats.py</code></td> <td>PPTX, XLSX, HTML extractors, extension routing</td> </tr> <tr> <td><code>test_docx_url.py</code></td> <td>DOCX ingestion and URL import (with the SSRF guard)</td> </tr> <tr> <td><code>test_reingest.py</code></td> <td>SHA-256 change detection, idempotency</td> </tr> <tr> <td><code>test_vector_db.py</code></td> <td>BM25, FAISS, RRF fusion, hybrid search</td> </tr> <tr> <td><code>test_audio.py</code></td> <td>Whisper transcription endpoint</td> </tr> <tr> <td><code>test_progress.py</code></td> <td>Sessions, daily aggregation, achievement criteria</td> </tr> <tr> <td><code>test_prompts.py</code></td> <td>The prompt-template loader and custom overrides</td> </tr> <tr> <td><code>test_vault_stats.py</code></td> <td>The Privacy Vault Audit numbers</td> </tr> <tr> <td><code>test_quiz.py</code> / <code>test_workshop.py</code> / <code>test_flashcards.py</code> / <code>test_mindmaps.py</code></td> <td>Per-mode parsing, endpoints, achievements</td> </tr> </tbody> </table> <p>Everything that <em>can</em> be tested in isolation is tested in isolation. 
Everything that needs to be tested through the FastAPI layer is, but the <em>only</em> things mocked are the calls that cross the process boundary.</p> <h2 id="what-gets-mocked-what-doesnt">What gets mocked, what doesn’t</h2> <p>The single most important question in a project like this: where do you stub?</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="p">[</span> <span class="n">React</span> <span class="n">frontend</span> <span class="p">]</span> <span class="err">←─</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">scope</span> <span class="k">for</span> <span class="n">backend</span> <span class="n">tests</span> </span></span><span class="line"><span class="cl"> <span class="err">│</span> </span></span><span class="line"><span class="cl"> <span class="err">▼</span> </span></span><span class="line"><span class="cl"><span class="p">[</span> <span class="n">FastAPI</span> <span class="n">handlers</span> <span class="p">]</span> <span class="err">←─</span> <span class="n">tested</span> <span class="n">directly</span> <span class="n">with</span> <span class="n">TestClient</span> </span></span><span class="line"><span class="cl"> <span class="err">│</span> </span></span><span class="line"><span class="cl"> <span class="err">▼</span> </span></span><span class="line"><span class="cl"><span class="p">[</span> <span class="n">services</span><span class="o">/</span> <span class="p">]</span> <span class="err">←─</span> <span class="n">tested</span> <span class="n">directly</span> <span class="p">(</span><span class="n">vector_db</span><span class="p">,</span> <span class="n">rag_agent</span><span class="p">,</span> <span class="n">generators</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="err">│</span> </span></span><span class="line"><span class="cl"> <span class="err">├─►</span> 
<span class="p">[</span> <span class="n">FAISS</span> <span class="o">+</span> <span class="n">BM25</span> <span class="p">]</span> <span class="err">←─</span> <span class="n">real</span><span class="p">,</span> <span class="ow">in</span><span class="o">-</span><span class="n">memory</span><span class="p">,</span> <span class="n">fast</span> </span></span><span class="line"><span class="cl"> <span class="err">├─►</span> <span class="p">[</span> <span class="n">SQLite</span> <span class="p">]</span> <span class="err">←─</span> <span class="n">real</span><span class="p">,</span> <span class="n">against</span> <span class="n">a</span> <span class="n">tmp_path</span> <span class="n">file</span> </span></span><span class="line"><span class="cl"> <span class="err">├─►</span> <span class="p">[</span> <span class="n">DBOS</span> <span class="p">]</span> <span class="err">←─</span> <span class="n">patched</span> <span class="p">(</span><span class="n">no</span> <span class="n">launch</span><span class="p">,</span> <span class="n">no</span> <span class="n">Postgres</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="err">├─►</span> <span class="p">[</span> <span class="n">Ollama</span> <span class="p">]</span> <span class="err">←─</span> <span class="n">patched</span> <span class="n">at</span> <span class="n">each</span> <span class="n">service</span><span class="s1">'s import site</span> </span></span><span class="line"><span class="cl"> <span class="err">└─►</span> <span class="p">[</span> <span class="n">Whisper</span> <span class="p">]</span> <span class="err">←─</span> <span class="n">stubbed</span> <span class="p">(</span><span class="n">no</span> <span class="mi">145</span> <span class="n">MB</span> <span class="n">model</span> <span class="nb">load</span><span class="p">)</span> </span></span></code></pre></div><p>The rule of thumb: <strong>anything that crosses a process or network boundary, mock. 
Anything in-process, run for real.</strong></p> <p>FAISS and BM25 are real because they’re libraries we link into the test process. SQLite is real because it’s a file. DBOS is patched because launching it expects a Postgres connection, and that’s network. Ollama is patched because it’s HTTP. Whisper is stubbed because loading a 145 MB model in a unit test is silly.</p> <p>That principle keeps the test suite fast (no I/O the OS can’t handle in milliseconds) and meaningful (the real code paths through retrieval, chunking, parsing, and scope filtering all execute).</p> <h2 id="mocking-ollama">Mocking Ollama</h2> <p>Most CogniVault tests need <em>some</em> model output, but they don’t care what model produced it. Each service imports the <code>ollama</code> module directly, so the tests patch that reference <strong>at the service’s own import site</strong>:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Real pattern from test_quiz.py</span> </span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">json</span> </span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">unittest.mock</span> <span class="kn">import</span> <span class="n">patch</span> </span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">backend.services</span> <span class="kn">import</span> <span class="n">quiz_generator</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">test_quiz_parses_questions</span><span class="p">():</span> </span></span><span class="line"><span class="cl"> <span class="n">fake</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"message"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"content"</span><span class="p">:</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span 
class="p">({</span><span class="s2">"questions"</span><span class="p">:</span> <span class="p">[</span><span class="n">VALID_MCQ</span><span class="p">]</span> <span class="o">*</span> <span class="mi">5</span><span class="p">})}}</span> </span></span><span class="line"><span class="cl"> <span class="k">with</span> <span class="n">patch</span><span class="o">.</span><span class="n">object</span><span class="p">(</span><span class="n">quiz_generator</span><span class="p">,</span> <span class="s2">"ollama"</span><span class="p">)</span> <span class="k">as</span> <span class="n">mock_ollama</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">mock_ollama</span><span class="o">.</span><span class="n">chat</span><span class="o">.</span><span class="n">return_value</span> <span class="o">=</span> <span class="n">fake</span> </span></span><span class="line"><span class="cl"> <span class="n">result</span> <span class="o">=</span> <span class="n">quiz_generator</span><span class="o">.</span><span class="n">generate_quiz</span><span class="p">(</span> </span></span><span class="line"><span class="cl"> <span class="n">difficulty</span><span class="o">=</span><span class="s2">"beginner"</span><span class="p">,</span> <span class="n">num_questions</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">question_types</span><span class="o">=</span><span class="p">[</span><span class="s2">"mcq"</span><span class="p">],</span> </span></span><span class="line"><span class="cl"> <span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">questions</span><span class="p">)</span> <span class="o">==</span> <span class="mi">5</span> </span></span></code></pre></div><p>A streaming variant feeds chunk sequences instead of a single response, 
used by the RAG and thinking tests. The key property: one <code>patch.object</code> against the module the service actually uses. No deep mock hierarchies, no fragile string paths into third-party internals. Easy to read in a code review, easy to debug when a test fails.</p> <h2 id="mocking-dbos">Mocking DBOS</h2> <p>DBOS expects <code>launch()</code> to connect to Postgres. The shared <code>client</code> fixture in <code>conftest.py</code> simply patches the <code>dbos</code> instance before the app is exercised:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Real pattern from conftest.py</span> </span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">unittest.mock</span> <span class="kn">import</span> <span class="n">MagicMock</span><span class="p">,</span> <span class="n">patch</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pytest</span> </span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">fastapi.testclient</span> <span class="kn">import</span> <span class="n">TestClient</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="nd">@pytest.fixture</span><span class="p">()</span> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">client</span><span class="p">():</span> </span></span><span class="line"><span class="cl"> <span class="s2">"""A FastAPI TestClient with DBOS launch mocked out — no Postgres needed."""</span> </span></span><span class="line"><span class="cl"> <span class="k">with</span> <span class="n">patch</span><span class="p">(</span><span class="s2">"backend.services.ingest.dbos"</span><span class="p">)</span> <span class="k">as</span> <span class="n">mock_dbos</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">mock_dbos</span><span class="o">.</span><span class="n">launch</span> <span class="o">=</span> <span class="n">MagicMock</span><span class="p">()</span> </span></span><span class="line"><span class="cl"> <span class="kn">from</span> <span class="nn">backend.main</span> <span class="kn">import</span> <span class="n">app</span> </span></span><span class="line"><span class="cl"> <span class="k">with</span> <span class="n">TestClient</span><span class="p">(</span><span class="n">app</span><span class="p">)</span> 
<span class="k">as</span> <span class="n">c</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">yield</span> <span class="n">c</span> </span></span></code></pre></div><p>The decorated workflow steps still execute as ordinary Python functions — we lose the durability semantics, but the tests aren’t testing durability, they’re testing the <em>business logic inside the steps</em> (hash detection, extraction, chunking). The durability layer has its own tests upstream, in DBOS’s own suite.</p> <p>There’s a second isolation layer that runs on <strong>every</strong> test automatically: an autouse fixture points the docs folder, FAISS index, and metadata file at a per-test <code>tmp_path</code> via environment variables, so no test can ever touch real data on disk.</p> <h2 id="real-sqlite-with-one-override">Real SQLite, with one override</h2> <p>Progress tracking, achievements, quiz storage, deck CRUD — all SQLite. The progress tracker exposes a single test seam: a module-level path override.</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Real pattern from test_quiz.py</span> </span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pytest</span> </span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">backend.services</span> <span class="kn">import</span> <span class="n">progress_tracker</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="nd">@pytest.fixture</span><span class="p">(</span><span class="n">autouse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">_isolate_progress_db</span><span class="p">(</span><span class="n">tmp_path</span><span class="p">,</span> <span class="n">monkeypatch</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="n">monkeypatch</span><span class="o">.</span><span class="n">setattr</span><span class="p">(</span><span class="n">progress_tracker</span><span class="p">,</span> <span 
class="s2">"_db_path_override"</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nb">str</span><span class="p">(</span><span class="n">tmp_path</span> <span class="o">/</span> <span class="s2">"progress_test.db"</span><span class="p">))</span> </span></span></code></pre></div><p>Every test gets a fresh database file; the schema auto-creates on first use. No connection pooling drama, no leaked state between tests, no in-memory <code>:memory:</code> gymnastics. Just a temp file per test.</p> <p>This is the kind of test that catches bugs an SQL-level mock would never see — a missing index, a botched migration, a constraint that doesn’t fire. SQLite is fast enough on every machine I’ve ever owned that “use the real database” isn’t even a trade-off.</p> <h2 id="the-testclient-pattern">The TestClient pattern</h2> <p>For HTTP tests, FastAPI’s <code>TestClient</code> runs the app in-process. The upload, the validation, the chunking, the vector-store update, the response serialisation — every layer runs for real. Only the calls that would leave the process (the Ollama embedding call inside ingestion, the model call inside generation) are patched. That’s the right line: the test verifies the <em>integration</em> of those layers, but doesn’t depend on an external service.</p> <p>The streaming endpoint tests use a slightly different style — they iterate the response body and parse each <strong>NDJSON</strong> line (one JSON envelope per line, as described in ) — but the principle is identical.</p> <h2 id="coverage-gaps-i-accept">Coverage gaps I accept</h2> <p>Three things the test suite <em>doesn’t</em> cover:</p> <ol> <li><strong>The frontend.</strong> No React testing in this suite — that’s a separate concern. 
Most failures show up in API tests anyway, because the frontend is a thin client over a typed API.</li> <li><strong>Real Ollama prompt quality.</strong> Whether <code>gemma4:e4b</code> actually produces <em>useful</em> quiz questions is not a thing tests can answer. That’s evaluation, not testing. It belongs in a separate harness with a real model running.</li> <li><strong>Race conditions across DBOS workflow restarts.</strong> The resume path is exercised at the logic level, but the full state space of “what happens if Postgres goes away at this exact instant” is too large to enumerate.</li> </ol> <p>These are conscious gaps. The test suite is for catching regressions in code I wrote; it’s not a replacement for evaluation, integration testing, or actual chaos engineering.</p> <h2 id="what-the-suite-is-actually-for">What the suite is actually for</h2> <p>Two things, in order:</p> <ol> <li><strong>Refactor confidence.</strong> When I rip out the agent loop and put a new one in, do the tests still pass? If yes, the API contracts I care about haven’t drifted.</li> <li><strong>PR review surface.</strong> Every PR runs the suite in CI. A green run is a precondition for merge. The suite is loud enough that a real regression makes the noise.</li> </ol> <p>Notice what it <em>isn’t</em> for: proving the model works. It can’t. Tests can pin behaviour but they can’t pin quality. 
That’s a different muscle, and it belongs in a different harness.</p> <h2 id="whats-worth-borrowing">What’s worth borrowing</h2> <p>If you’re building a local-AI app and your tests need Ollama running:</p> <ul> <li>Patch the <code>ollama</code> module at each service’s import site with <code>patch.object(service_module, "ollama")</code> — one seam per service, no shims required.</li> <li>Give your DB layer a path override and run against a <code>tmp_path</code> SQLite file.</li> <li>Use an autouse fixture to redirect every on-disk artefact (docs folder, index files) to <code>tmp_path</code>, so no test can touch real data even by accident.</li> <li>For each external service (model, audio, workflow engine), draw the seam at the process boundary. Test everything above it with real code.</li> </ul> <p>The result is a suite where every test runs in any environment, finishes in milliseconds, and exercises the actual integration of every layer of code you wrote. 351 tests in about three seconds isn’t an optimisation, it’s a side-effect of mocking only at the edges.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>CI</strong></td> <td>Continuous Integration</td> <td>Automatically running the test suite on every push/PR</td> </tr> <tr> <td><strong>PR</strong></td> <td>Pull Request</td> <td>A proposed code change — merged only when the suite is green</td> </tr> <tr> <td><strong>API</strong></td> <td>Application Programming Interface</td> <td>The HTTP surface the TestClient exercises in-process</td> </tr> <tr> <td><strong>HTTP</strong></td> <td>HyperText Transfer Protocol</td> <td>The protocol the (in-process) endpoint tests speak</td> </tr> <tr> <td><strong>RAG</strong></td> <td>Retrieval-Augmented Generation</td> <td>The retrieval-then-answer pipeline under test</td> </tr> <tr> <td><strong>KB</strong></td> 
<td>Knowledge Base</td> <td>The indexed document collection</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>Real in tests — it’s an in-process library</td> </tr> <tr> <td><strong>BM25</strong></td> <td>Best Match 25</td> <td>The keyword index — also real in tests</td> </tr> <tr> <td><strong>RRF</strong></td> <td>Reciprocal Rank Fusion</td> <td>The rank-merging formula covered by <code>test_vector_db.py</code></td> </tr> <tr> <td><strong>SQLite / SQL</strong></td> <td>(SQL = Structured Query Language)</td> <td>The real, file-based database every progress test runs against</td> </tr> <tr> <td><strong>DBOS</strong></td> <td>Database-Oriented Operating System</td> <td>The durable-workflow library — patched so no Postgres is needed</td> </tr> <tr> <td><strong>OCR</strong></td> <td>Optical Character Recognition</td> <td>The scanned-PDF fallback with its own trigger-threshold tests</td> </tr> <tr> <td><strong>SSRF</strong></td> <td>Server-Side Request Forgery</td> <td>The URL-import attack class covered in <code>test_docx_url.py</code></td> </tr> <tr> <td><strong>NDJSON</strong></td> <td>Newline-Delimited JSON</td> <td>The streaming format the endpoint tests parse line by line</td> </tr> <tr> <td><strong>SHA-256</strong></td> <td>Secure Hash Algorithm, 256-bit</td> <td>The content fingerprint behind the re-ingest tests</td> </tr> <tr> <td><strong>CRUD</strong></td> <td>Create, Read, Update, Delete</td> <td>The basic storage operations for decks, quizzes, and maps</td> </tr> <tr> <td><strong>PDF / DOCX / PPTX / XLSX / HTML</strong></td> <td>Portable Document Format / Word / PowerPoint / Excel / HyperText Markup Language</td> <td>The extractor formats with dedicated tests</td> </tr> </tbody> </table> <hr> <p>That’s the series. Eight posts on the parts of I’m most proud of — and a handful I’d build differently. If any of it was useful to you, the code is open source at , and the is on YouTube.</p> <p>Your data. Your hardware. 
Your AI. Your vault.</p> </article> <article> <h1>Part 7 · Gamifying Learning: 25 Badges, Idle-Gap Sessions, and a 90-Day Heatmap</h1> <p>Wed, 20 May 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>Part of a series on building . Previously: .</p> </blockquote> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>I taught secondary-school ICT for eight years before pivoting to full-stack development, and the most reliable lesson from that period was uncomfortably simple: <strong>students who showed up consistently learned. Students who didn’t, didn’t.</strong> Talent, prior knowledge, even motivation on any given day — all of it was downstream of attendance.</p> <p>CogniVault’s “Dashboard” tab is a small attempt to engineer for that. It’s not a Duolingo streak panic machine. It’s three things:</p> <ul> <li><strong>Hero stats</strong> — total study time, total sessions, current streak.</li> <li><strong>25 achievement badges</strong> — auto-tracked across chat, quizzes, workshops, flashcards, mindmaps.</li> <li><strong>A 90-day activity heatmap</strong> — GitHub-style, five purple intensity levels.</li> </ul> <p>The whole thing is a small SQLite table-set and a few React components. The interesting part isn’t the code, though — it’s the design calls.</p> <h2 id="idle-gap-sessions">Idle-gap sessions</h2> <p>The hardest question turned out to be the simplest-sounding: <strong>what counts as one study session?</strong></p> <p>The naive answer is “anything bookended by open/close of the app.” That’s wrong. People leave tabs open. People bounce away for an hour and come back. 
People open the app at 9am, do nothing, and check in at 2pm.</p> <p>The answer I landed on: a session ends when you’ve been <strong>idle for 15 minutes</strong>. Ask a question, idle for 16 minutes — that’s one session. Come back, ask another — that starts a new one. The threshold is configurable via <code>STUDY_SESSION_IDLE_GAP_SECONDS=900</code>.</p> <p>The clock keys off <strong>chat messages</strong> — the conversational core of studying in CogniVault. Every message either extends the open session (bumping its <code>ended_at</code> timestamp and message count) or, if the gap since the last activity exceeds the threshold, closes it implicitly and opens a new one:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Simplified from backend/services/progress_tracker.py</span> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">record_message</span><span class="p">(</span><span class="n">now</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">idle_gap</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="n">last</span> <span class="o">=</span> <span class="n">most_recent_session</span><span class="p">()</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">last</span> <span class="ow">and</span> <span class="p">(</span><span class="n">now</span> <span class="o">-</span> <span class="n">last</span><span class="o">.</span><span class="n">ended_at</span><span class="p">)</span> <span class="o"><=</span> <span class="n">idle_gap</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">extend</span><span class="p">(</span><span class="n">last</span><span class="p">,</span> <span 
class="n">ended_at</span><span class="o">=</span><span class="n">now</span><span class="p">)</span> <span class="c1"># same session continues</span> </span></span><span class="line"><span class="cl"> <span class="k">else</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">open_session</span><span class="p">(</span><span class="n">started_at</span><span class="o">=</span><span class="n">now</span><span class="p">,</span> <span class="n">ended_at</span><span class="o">=</span><span class="n">now</span><span class="p">)</span> <span class="c1"># new session begins</span> </span></span></code></pre></div><p>Two writes per message. A session’s duration is <code>ended_at - started_at</code>, which means “total time” reflects <em>engaged</em> time, not “had a tab open.” Which is the only number that actually means anything. (Study Hub actions — quiz attempts, card flips, mindmap exports — are recorded as their own events and feed the badge metrics below; the session clock itself stays message-driven and honest.)</p> <h2 id="25-badges-not-250">25 badges, not 250</h2> <p>Most gamified apps absolutely flood you with achievements. There’s a reason: more badges, more dopamine, more daily active users. 
The cost is that each badge means less — eventually the whole layer becomes wallpaper.</p> <p>I capped CogniVault at <strong>25</strong>, split across the five activity surfaces:</p> <ul> <li>10 for <strong>chat & study habits</strong> (first question, 10 messages in a day, 100 total, an hour of total study, 3- and 7-day streaks, a 30-minute deep-dive session, night-owl and early-bird sessions, first use of the scope filter)</li> <li>4 for <strong>quizzes</strong> (first quiz, a perfect score, passing on advanced difficulty, 10 quizzes)</li> <li>4 for <strong>workshops</strong> (first outline, first completed lesson, first completed workshop, 5 completed)</li> <li>4 for <strong>flashcards</strong> (first deck, 50 card flips, fully mastering a deck, 5 decks)</li> <li>3 for <strong>mindmaps</strong> (first mindmap, first export, 5 mindmaps)</li> </ul> <p>Each badge has a one-line unlock criterion that’s auto-evaluated on relevant events. Nothing manual, nothing the user has to “claim.” They just appear.</p> <p>And the definitions aren’t code at all — they’re <strong>data</strong>. 
All 25 live in a single JSON file, each entry naming the metric it watches and the target to hit:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="nt">"code"</span><span class="p">:</span> <span class="s2">"card_reviewer"</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nt">"name"</span><span class="p">:</span> <span class="s2">"Card Reviewer"</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nt">"icon"</span><span class="p">:</span> <span class="s2">"🃏"</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nt">"metric"</span><span class="p">:</span> <span class="s2">"total_card_flips"</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nt">"target"</span><span class="p">:</span> <span class="mi">50</span> </span></span><span class="line"><span class="cl"><span class="p">}</span> </span></span></code></pre></div><p>A single evaluator reads the current stats, compares every definition against its metric, diffs against already-earned badges, and inserts new unlocks into <code>progress.db</code>. Adding badge number 26 means adding a JSON entry, not writing new logic. Several badges form ladders — each knows which badge is its “next level,” which powers the detail view’s nudge toward the next goal.</p> <h2 id="the-heatmap">The heatmap</h2> <p>The 90-day heatmap is the part I’m proudest of, and also the simplest. 
It’s a 13×7 grid of cells, one per day, coloured by total study time that day.</p> <p>Five intensity levels:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">level 0 — no activity </span></span><span class="line"><span class="cl">level 1 — under 15 minutes (a quick check-in) </span></span><span class="line"><span class="cl">level 2 — 15-60 minutes (a focused session) </span></span><span class="line"><span class="cl">level 3 — 1-3 hours (substantial study) </span></span><span class="line"><span class="cl">level 4 — 3+ hours (a marathon) </span></span></code></pre></div><p>The data is conceptually one aggregation over the sessions table (the duration arithmetic here assumes ISO-8601 text timestamps, which SQLite cannot subtract directly, hence <code>strftime('%s', …)</code>):</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="k">SELECT</span><span class="w"> </span><span class="nb">date</span><span class="p">(</span><span class="n">started_at</span><span class="p">)</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="k">day</span><span class="p">,</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="w"> </span><span class="k">SUM</span><span class="p">(</span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%s'</span><span class="p">,</span><span class="w"> </span><span class="n">ended_at</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%s'</span><span class="p">,</span><span class="w"> </span><span class="n">started_at</span><span class="p">))</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">seconds</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="k">FROM</span><span class="w"> </span><span class="n">study_sessions</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="k">WHERE</span><span class="w"> </span><span class="n">started_at</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span 
class="nb">date</span><span class="p">(</span><span class="s1">'now'</span><span class="p">,</span><span class="w"> </span><span class="s1">'-90 days'</span><span class="p">)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="k">day</span><span class="p">;</span><span class="w"> </span></span></span></code></pre></div><p>The backend zero-fills the missing days so the frontend always receives exactly 90 entries, and a small client-side function bins each day’s total into the five levels. Click any cell and a <code>DayDetailModal</code> opens with that day’s numbers — study time, sessions, messages — plus any badges earned that day.</p> <p>The reason I love this component: it makes the <em>texture</em> of a study habit visible. Streaks are great, but a streak is one number. A heatmap shows you that you study harder on weekends, or that you’ve been on a slow drift downward all month, or that the gap between your last “level 4 day” and today is longer than you thought. It reflects something the user can act on.</p> <h2 id="what-i-deliberately-left-out">What I deliberately left out</h2> <p>Three things you’d find in most gamified apps, intentionally absent from CogniVault:</p> <ol> <li> <p><strong>Streak panic.</strong> No “your streak is in danger!” pop-up. No streak freeze rules. No yellow exclamation marks. The streak is shown — that’s the entirety of the feedback loop. If a user breaks their streak, they break their streak. Adults don’t need shaming UX.</p> </li> <li> <p><strong>Leaderboards.</strong> This is a single-user, fully local app. There’s no global comparison. (And there shouldn’t be — leaderboards optimise the wrong thing for studying.)</p> </li> <li> <p><strong>Confetti, fanfare, push notifications.</strong> A newly earned badge shows up in the quiz results screen and on the dashboard grid. 
That’s the entire celebration. Anything bigger is theft of the user’s attention for the app’s benefit, not theirs.</p> </li> </ol> <p>The general principle: <strong>measure what matters, surface it without nagging.</strong> The app should notice that the user came back, reflect that back to them, and not pretend to care more than it does.</p> <h2 id="what-the-dashboard-doesn-try-to-optimise">What the dashboard <em>doesn’t</em> try to optimise</h2> <p>A common trap with these dashboards is Goodhart’s law: once a number becomes a target, the user starts gaming it instead of doing the underlying thing. A daily-question count, for instance, gets you users who ask one filler question per day to keep their streak alive.</p> <p>So the bar is deliberately low in exactly one place and high everywhere else. There is <em>one</em> zero-effort badge — “First Question,” earned on your very first message — because every game needs an on-ramp that proves the system works. After that, the metrics get hard to game without doing the actual work:</p> <ul> <li><strong>Total study time</strong> — only accrues during active engagement, with idle-gap cutoffs.</li> <li><strong>Sessions</strong> — adding more requires actually starting separate work periods.</li> <li><strong>Badges</strong> — almost all require depth (sending 100 messages, acing a quiz, mastering a deck, completing 5 workshops), not just a first touch.</li> <li><strong>Heatmap intensity</strong> — needs sustained engagement on a given day.</li> </ul> <h2 id="implementation-small-on-purpose">Implementation: small on purpose</h2> <p>The gamification core is three SQLite tables —</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sql" data-lang="sql"><span class="line"><span class="cl"><span class="n">study_sessions</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">started_at</span><span class="p">,</span><span class="w"> </span><span class="n">ended_at</span><span 
class="p">,</span><span class="w"> </span><span class="n">message_count</span><span class="p">)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="n">message_events</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">sent_at</span><span class="p">,</span><span class="w"> </span><span class="n">session_id</span><span class="p">,</span><span class="w"> </span><span class="n">had_scope_filter</span><span class="p">,</span><span class="w"> </span><span class="n">had_attachments</span><span class="p">)</span><span class="w"> </span></span></span><span class="line"><span class="cl"><span class="n">achievements_earned</span><span class="w"> </span><span class="p">(</span><span class="n">code</span><span class="p">,</span><span class="w"> </span><span class="n">earned_at</span><span class="p">)</span><span class="w"> </span></span></span></code></pre></div><p>— plus the JSON badge definitions, one evaluator module, and a handful of React components (<code>SummaryCards</code>, <code>AchievementGrid</code>, <code>ActivityHeatmap</code>, <code>DayDetailModal</code>). The same <code>progress.db</code> file has since grown more tables for the Study Hub’s saved quizzes, workshops, decks, and mindmaps — but the badge-and-session machinery itself remains a couple hundred lines.</p> <p>There’s nothing fancy in any of it. The dashboard works because the <em>design calls</em> are right, not because the implementation is clever.</p> <h2 id="takeaway">Takeaway</h2> <p>If you’re building a learning tool — or any tool that lives on user habit — gamify <em>consciously</em>. Pick the metrics that reflect what you actually want to encourage. Cap your achievement surface. Skip the streak-panic UX. Make the texture of usage visible without making the app desperate.</p> <p>Or, more bluntly: don’t build Duolingo. 
Build a dashboard the user occasionally glances at and then closes, feeling slightly more inclined to keep going. That’s the whole job.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>UX</strong></td> <td>User Experience</td> <td>How the product feels — the thing streak-panic mechanics sacrifice</td> </tr> <tr> <td><strong>ICT</strong></td> <td>Information and Communications Technology</td> <td>The subject I taught for eight years before going full-stack</td> </tr> <tr> <td><strong>SQLite</strong></td> <td>(SQL = Structured Query Language)</td> <td>A complete relational database living in one file, <code>progress.db</code></td> </tr> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The data format the 25 badge definitions live in</td> </tr> <tr> <td><strong>UI</strong></td> <td>User Interface</td> <td>The dashboard surface: stats, grid, heatmap</td> </tr> </tbody> </table> </article> <article> <h1>Part 6 · The Mindmap Renderer: What Hand-Rolling SVG Taught Me (and Why v2 Uses React Flow)</h1> <p>Fri, 15 May 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>Part of a series on building . Previously: .</p> </blockquote> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>CogniVault’s Study Hub has four modes. Three of them — Quiz, Workshop, Flashcards — are list-shaped. The fourth, <strong>Mindmap</strong>, isn’t. 
It’s a tree of concepts radiating from a central topic, and I wanted it to be:</p> <ul> <li>Visually clean enough that the user actually wants to look at it.</li> <li>Interactive: pan, zoom, explore.</li> <li>Exportable to PNG and PDF with high fidelity.</li> </ul> <p>This post is the honest version of how that renderer got built — <strong>twice</strong>. Version one was hand-rolled SVG with no graph library, and it shipped. Version two, the one in the codebase today, is built on <code>@xyflow/react</code> (React Flow) and a dagre auto-layout. I think both decisions were correct <em>at the time they were made</em>, and the journey between them taught me more about build-vs-buy than either version alone.</p> <h2 id="round-one-hand-rolling-it">Round one: hand-rolling it</h2> <p>My first instinct, like everyone else’s, was to reach for a library on day one. I resisted, for reasons that were sound: the default styling would need full customisation anyway, the layout I wanted was simple, export would need extra dependencies regardless, and the bundle cost wasn’t nothing. For the small trees Gemma generates, SVG alone looked sufficient — it pans and zooms with the <code>viewBox</code> attribute, draws arbitrary shapes, serialises to a string, and rasterises cleanly.</p> <p>So v1 was pure SVG. 
And the core of it really was small.</p> <h3 id="radial-layout-in-40-lines">Radial layout in 40 lines</h3> <p>A radial layout places the root at the centre and arranges children in concentric rings:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ts" data-lang="ts"><span class="line"><span class="cl"><span class="kr">type</span> <span class="nx">Node</span> <span class="o">=</span> <span class="p">{</span> <span class="nx">id</span>: <span class="kt">string</span><span class="p">;</span> <span class="nx">label</span>: <span class="kt">string</span><span class="p">;</span> <span class="nx">children</span>: <span class="kt">Node</span><span class="p">[]</span> <span class="p">};</span> </span></span><span class="line"><span class="cl"><span class="kr">type</span> <span class="nx">Placed</span> <span class="o">=</span> <span class="nx">Node</span> <span class="o">&</span> <span class="p">{</span> <span class="nx">x</span>: <span class="kt">number</span><span class="p">;</span> <span class="nx">y</span>: <span class="kt">number</span><span class="p">;</span> <span class="nx">angle</span>: <span class="kt">number</span> <span class="p">};</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="kd">function</span> <span class="nx">layout</span><span class="p">(</span><span class="nx">root</span>: <span class="kt">Node</span><span class="p">,</span> <span class="nx">radiusStep</span> <span class="o">=</span> <span class="mi">180</span><span class="p">)</span><span class="o">:</span> <span class="nx">Placed</span><span class="p">[]</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="kr">const</span> <span class="nx">placed</span>: <span class="kt">Placed</span><span class="p">[]</span> <span class="o">=</span> <span class="p">[];</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span 
class="cl"> <span class="kd">function</span> <span class="nx">place</span><span class="p">(</span> </span></span><span class="line"><span class="cl"> <span class="nx">node</span>: <span class="kt">Node</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">depth</span>: <span class="kt">number</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">fromAngle</span>: <span class="kt">number</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">toAngle</span>: <span class="kt">number</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="p">(</span><span class="nx">depth</span> <span class="o">===</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="nx">placed</span><span class="p">.</span><span class="nx">push</span><span class="p">({</span> <span class="p">...</span><span class="nx">node</span><span class="p">,</span> <span class="nx">x</span>: <span class="kt">0</span><span class="p">,</span> <span class="nx">y</span>: <span class="kt">0</span><span class="p">,</span> <span class="nx">angle</span>: <span class="kt">0</span> <span class="p">});</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="kr">const</span> <span class="nx">angle</span> <span class="o">=</span> <span class="p">(</span><span class="nx">fromAngle</span> <span class="o">+</span> <span class="nx">toAngle</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span 
class="nx">placed</span><span class="p">.</span><span class="nx">push</span><span class="p">({</span> </span></span><span class="line"><span class="cl"> <span class="p">...</span><span class="nx">node</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">x</span>: <span class="kt">depth</span> <span class="o">*</span> <span class="nx">radiusStep</span> <span class="o">*</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">cos</span><span class="p">(</span><span class="nx">angle</span><span class="p">),</span> </span></span><span class="line"><span class="cl"> <span class="nx">y</span>: <span class="kt">depth</span> <span class="o">*</span> <span class="nx">radiusStep</span> <span class="o">*</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">sin</span><span class="p">(</span><span class="nx">angle</span><span class="p">),</span> </span></span><span class="line"><span class="cl"> <span class="nx">angle</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="p">});</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="kr">const</span> <span class="nx">slice</span> <span class="o">=</span> <span class="p">(</span><span class="nx">toAngle</span> <span class="o">-</span> <span class="nx">fromAngle</span><span class="p">)</span> <span class="o">/</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">max</span><span class="p">(</span><span class="nx">node</span><span class="p">.</span><span class="nx">children</span><span class="p">.</span><span class="nx">length</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="nx">node</span><span class="p">.</span><span 
class="nx">children</span><span class="p">.</span><span class="nx">forEach</span><span class="p">((</span><span class="nx">child</span><span class="p">,</span> <span class="nx">i</span><span class="p">)</span> <span class="o">=></span> </span></span><span class="line"><span class="cl"> <span class="nx">place</span><span class="p">(</span> </span></span><span class="line"><span class="cl"> <span class="nx">child</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">depth</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">fromAngle</span> <span class="o">+</span> <span class="nx">i</span> <span class="o">*</span> <span class="nx">slice</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">fromAngle</span> <span class="o">+</span> <span class="p">(</span><span class="nx">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="nx">slice</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="p">),</span> </span></span><span class="line"><span class="cl"> <span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="nx">place</span><span class="p">(</span><span class="nx">root</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">2</span> <span class="o">*</span> <span class="nb">Math</span><span class="p">.</span><span class="nx">PI</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="nx">placed</span><span class="p">;</span> </span></span><span class="line"><span 
class="cl"><span class="p">}</span> </span></span></code></pre></div><p>Each level inherits an angular slice from its parent and subdivides it among its children. Pan and zoom were pure <code>viewBox</code> arithmetic — no transform matrices, no event library, just numbers. Edges were quadratic Bézier curves pulled toward the centre. It looked good, it was fast, and the whole renderer fit comfortably in one component.</p> <h3 id="the-export-trick-thats-still-worth-knowing">The export trick that’s still worth knowing</h3> <p>To export an SVG to PNG with zero dependencies, the browser does all the work:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">SVG DOM ─► XMLSerializer ─► string ─► <img> ─► <canvas> ─► PNG blob </span></span></code></pre></div><p>Serialise the SVG to a string, load it into an <code>Image</code>, draw that image onto a scaled <code><canvas></code>, and ask the canvas for a PNG. Fonts, anti-aliasing, <code>currentColor</code> — the browser resolves all of it natively. If your visual <em>is</em> an SVG element, this is still the cleanest export pipeline there is, and I’d use it again without hesitation.</p> <p>One v1 detail survived to the present day completely unchanged: the save flow. Instead of the classic “straight to the Downloads folder” experience, exports go through the <strong>File System Access API</strong> (<code>showSaveFilePicker</code>) where the browser supports it, with an anchor-tag download fallback for Firefox and Safari. A real “Save As…” dialog, no Electron required. 
That helper (<code>lib/saveBlob.ts</code>) now serves the quiz and workshop exports too.</p> <h2 id="the-requirements-that-broke-v1">The requirements that broke v1</h2> <p>Then the feature met its users (well — met me, using it seriously while studying), and three requirements emerged that the elegant hand-rolled version handled badly:</p> <ol> <li> <p><strong>“Let me move that node.”</strong> A generated layout is a starting point; a <em>useful</em> mindmap is one you rearrange to match how you think. v1’s nodes were fixed in their computed positions. Adding drag meant building hit-testing, drag state, and position persistence from scratch — exactly the unglamorous interaction machinery that graph libraries exist to provide.</p> </li> <li> <p><strong>Text wanted to be HTML.</strong> SVG <code><text></code> doesn’t wrap. Long concept labels needed manual line-breaking, measuring, and truncation — a constant fight. HTML nodes (real <code><div></code>s with CSS) wrap, ellipsize, and theme for free.</p> </li> <li> <p><strong>Radial wasn’t the best reading layout after all.</strong> For the wide-and-shallow trees Gemma actually generates, a left-to-right or top-down tree (the kind a layout engine like <strong>dagre</strong> computes) reads better than rings. And once layouts became switchable, “auto-layout plus remembered manual tweaks” became the natural model.</p> </li> </ol> <p>I could have built all of that on the SVG foundation. But look at the list: viewport management, node dragging, HTML nodes inside a graph canvas, pluggable layouts. That is <em>precisely</em> React Flow’s feature set. In v1, the library would have been a wrapper around things I didn’t need. 
By v2, my requirements had grown into exactly the things it does well.</p> <p>So I changed my mind.</p> <h2 id="round-two-react-flow--dagre">Round two: React Flow + dagre</h2> <p>Today’s renderer (<code>frontend/src/components/study/mindmaps/</code>):</p> <ul> <li><strong><code>@xyflow/react</code> (React Flow)</strong> provides the canvas: native pan/zoom, draggable nodes, a minimap-style controls cluster, and dark-mode support via <code>colorMode</code>.</li> <li><strong>dagre</strong> computes the automatic layout, with a user-facing toggle between left-to-right and top-down. The tree-to-graph conversion is a small pure function.</li> <li><strong>Custom HTML nodes</strong> carry the design system: the root gets a gradient, themes get a tint, leaves stay subtle — and text wraps like text should.</li> <li><strong>Dragged positions persist.</strong> Moving a node fires a save; reopening the map restores your arrangement. A “Reset layout” button clears the saved positions and returns to the dagre auto-layout. The layout choice and positions live with the mindmap in SQLite.</li> <li><strong>Export adapted to the new reality.</strong> The nodes are HTML now, so the v1 SVG-serialisation trick no longer applies. PNG export uses <code>html-to-image</code> over the React Flow viewport, framed to the node bounds regardless of current zoom; PDF embeds that PNG via a lazy-loaded <code>jsPDF</code>; Markdown export is a zero-dependency recursive walk of the tree. Yes — v2 uses the exact library (<code>html-to-image</code>) I was proud of avoiding in v1. The requirements changed; the trade-off changed with them.</li> </ul> <h2 id="what-the-journey-actually-taught-me">What the journey actually taught me</h2> <p>I went back and forth on how to write this post, because v1’s story (“look how little code you need!”) is more flattering. 
But the two-version truth is the more useful lesson:</p> <ol> <li> <p><strong>Hand-rolling first was still right.</strong> v1 shipped in a weekend, taught me the problem’s real shape (layout, viewport, export are separate concerns), and cost nothing to throw away because it was small. If I’d started with React Flow, I’d have configured a library before understanding the problem.</p> </li> <li> <p><strong>Libraries earn their place when your requirements converge on their feature set — not before.</strong> The moment “drag nodes and remember where I put them” became a requirement, the build-vs-buy maths flipped completely.</p> </li> <li> <p><strong>Some pieces outlive the rewrite.</strong> The save-dialog helper, the Markdown walk, the instinct to frame exports to content bounds — all carried over. Rewrites are rarely total.</p> </li> <li> <p><strong>The browser’s native pipelines are worth knowing even when you end up not using them.</strong> SVG → canvas → PNG is still the best zero-dependency export trick in frontend development. It just stops applying the day your nodes become HTML.</p> </li> </ol> <h2 id="takeaway">Takeaway</h2> <p>“Build or buy” is a function of requirements — and requirements move. Build while the problem is small and you’re still learning its shape. Buy when your feature list starts reading like the library’s README. And when you switch, write down why, so the next person (or the next you) knows it wasn’t indecision. 
It was the plan growing up.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>SVG</strong></td> <td>Scalable Vector Graphics</td> <td>The browser’s built-in vector drawing format — v1’s entire foundation</td> </tr> <tr> <td><strong>PNG</strong></td> <td>Portable Network Graphics</td> <td>The raster image format exports produce</td> </tr> <tr> <td><strong>PDF</strong></td> <td>Portable Document Format</td> <td>The print-ready export, built by embedding the PNG</td> </tr> <tr> <td><strong>DOM</strong></td> <td>Document Object Model</td> <td>The browser’s live representation of the page — what <code>html-to-image</code> rasterises in v2</td> </tr> <tr> <td><strong>HTML / CSS</strong></td> <td>HyperText Markup Language / Cascading Style Sheets</td> <td>What v2’s nodes are made of — and why their text wraps for free</td> </tr> <tr> <td><strong>API</strong></td> <td>Application Programming Interface</td> <td>As in the File System Access API, which provides real “Save As…” dialogs</td> </tr> <tr> <td><strong>UI / UX</strong></td> <td>User Interface / User Experience</td> <td>The drag-a-node requirement that triggered the rewrite</td> </tr> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The tree structure Gemma generates for each mindmap (see the previous post)</td> </tr> <tr> <td><strong>SQLite</strong></td> <td>(SQL = Structured Query Language)</td> <td>The single-file database where layout choices and node positions persist</td> </tr> </tbody> </table> </article> <article> <h1>Part 5 · Getting Reliable JSON Out of a Local LLM</h1> <p>Sun, 10 May 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>Part of a series on building . 
Previously: .</p> </blockquote> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>CogniVault’s Study Hub generates four kinds of structured artefacts from your documents: quizzes, multi-lesson workshops, flashcard decks, and mindmaps. All four need the model to return structured JSON, not prose. All four ride on Gemma 4 running locally via Ollama. And all four would fail far too often if I trusted the model to “just return JSON.”</p> <p>Here’s the defensive pattern that brings that failure rate close to zero — and what to do about the cases that still get through.</p> <h2 id="the-pattern">The pattern</h2> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="mf">1.</span> <span class="n">Retrieve</span> <span class="err">→</span> <span class="n">hybrid</span> <span class="n">search</span> <span class="n">restricted</span> <span class="n">by</span> <span class="n">user</span><span class="o">-</span><span class="n">selected</span> <span class="n">scope</span> </span></span><span class="line"><span class="cl"><span class="mf">2.</span> <span class="n">Prompt</span> <span class="err">→</span> <span class="n">strict</span> <span class="n">schema</span><span class="o">-</span><span class="n">by</span><span class="o">-</span><span class="n">example</span> <span class="n">with</span> <span class="n">explicit</span> <span class="n">count</span> <span class="o">+</span> <span class="n">shape</span> <span class="n">rules</span> </span></span><span class="line"><span class="cl"><span class="mf">3.</span> <span class="n">Generate</span> <span class="err">→</span> <span class="n">ollama</span><span class="o">.</span><span class="n">chat</span> <span class="n">with</span> <span 
class="n">format</span><span class="o">=</span><span class="s2">"json"</span> <span class="p">(</span><span class="n">grammar</span><span class="o">-</span><span class="n">constrained</span><span class="p">)</span> </span></span><span class="line"><span class="cl"><span class="mf">4.</span> <span class="n">Parse</span> <span class="err">→</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">,</span> <span class="n">tolerant</span> <span class="n">of</span> <span class="n">object</span> <span class="o">/</span> <span class="n">array</span> <span class="o">/</span> <span class="n">fenced</span> <span class="n">shapes</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="n">with</span> <span class="n">a</span> <span class="n">trailing</span><span class="o">-</span><span class="n">comma</span> <span class="n">repair</span> <span class="k">pass</span> </span></span><span class="line"><span class="cl"><span class="mf">5.</span> <span class="n">Validate</span> <span class="err">→</span> <span class="n">drop</span> <span class="n">malformed</span> <span class="n">items</span> <span class="n">rather</span> <span class="n">than</span> <span class="n">fail</span> <span class="n">the</span> <span class="n">whole</span> <span class="n">batch</span> </span></span><span class="line"><span class="cl"><span class="mf">6.</span> <span class="n">Retry</span> <span class="err">→</span> <span class="n">the</span> <span class="n">workshop</span> <span class="n">outline</span> <span class="n">retries</span> <span class="n">once</span> <span class="n">with</span> <span class="n">a</span> <span class="n">stronger</span> <span class="n">prompt</span> </span></span><span class="line"><span class="cl"><span class="mf">7.</span> <span class="n">Persist</span> <span class="err">→</span> <span class="n">SQLite</span> <span class="p">(</span><span class="n">progress</span><span class="o">.</span><span 
class="n">db</span><span class="p">)</span> <span class="n">so</span> <span class="n">the</span> <span class="n">user</span> <span class="n">can</span> <span class="n">come</span> <span class="n">back</span> <span class="n">later</span> </span></span></code></pre></div><p>Every generator in CogniVault follows it. The interesting moves are 2, 4, and 5.</p> <h2 id="step-3-formatjson-does-real-work">Step 3: <code>format="json"</code> does real work</h2> <p>Ollama exposes a <code>format="json"</code> option that puts the model under a <strong>grammar constraint</strong> during sampling. The decoder won’t emit tokens that would make the output invalid JSON. It’s not perfect — schemas are bigger than “valid JSON,” and the model can still produce well-formed garbage — but it eliminates the entire class of “the model started writing prose before the closing brace” failures.</p> <p>If your local-LLM stack supports a grammar option (Ollama, llama.cpp, vLLM, etc.), turn it on. It’s not free (sampling is slightly slower) but the failure-mode improvement is enormous. Without it, you’ll spend most of your error budget on truncated objects.</p> <h2 id="step-2-schema-in-prompt-that-the-model-can-actually-obey">Step 2: schema-in-prompt that the model can actually obey</h2> <p><code>format="json"</code> guarantees the <em>shape</em> of the output is JSON. It says nothing about whether the JSON matches your domain schema. That’s the prompt’s job.</p> <p>The pattern that works for me: instead of dumping a formal JSON Schema and saying “obey this,” include a <strong>filled-in example</strong> that shows the model the exact shape, plus explicit counts. 
Here’s the heart of CogniVault’s real quiz template (it lives as an editable Markdown file in <code>backend/prompts/quiz.md</code>):</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Output ONLY a single JSON object — no prose, no markdown fences, </span></span><span class="line"><span class="cl">no text outside the JSON. </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl">NUMBER OF QUESTIONS: EXACTLY $num_questions. This is a hard requirement. </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl">OUTPUT SCHEMA: </span></span><span class="line"><span class="cl">{ </span></span><span class="line"><span class="cl"> "questions": [ </span></span><span class="line"><span class="cl"> { </span></span><span class="line"><span class="cl"> "type": one of [$types_csv], </span></span><span class="line"><span class="cl"> "question": the question text (string, no leading numbering), </span></span><span class="line"><span class="cl"> "options": array of strings (length 4 for mcq, length 2 for true_false), </span></span><span class="line"><span class="cl"> "correct_index": integer index into options (0-based), </span></span><span class="line"><span class="cl"> "explanation": 1-2 sentence explanation of the correct answer </span></span><span class="line"><span class="cl"> }, </span></span><span class="line"><span class="cl"> ... 
exactly $num_questions entries </span></span><span class="line"><span class="cl"> ] </span></span><span class="line"><span class="cl">} </span></span></code></pre></div><p>A few choices that matter:</p> <ul> <li><strong>Show the shape, don’t describe it.</strong> “Each item has a <code>type</code> field” gets ignored more often than the literal example.</li> <li><strong>Pin the count.</strong> “EXACTLY 10” — repeated, in capitals, as a hard requirement — is much more reliable than “around 10.”</li> <li><strong>Index, don’t repeat.</strong> The correct answer is <code>correct_index</code>, an integer pointing into <code>options</code> — not the answer text again. Repeated text invites paraphrase drift (“Paris” vs “Paris, France”), and then your grading comparison breaks.</li> <li><strong>One artefact per call.</strong> I tried generating a full workshop (outline + every lesson) in one call. The model’s quality degrades sharply as the response grows. Splitting into outline-first, lesson-on-demand is the two-pass strategy below.</li> </ul> <h2 id="step-4-parse-tolerantly">Step 4: parse, tolerantly</h2> <p>Even with <code>format="json"</code>, two parsing problems survive in practice.</p> <p><strong>The shape surprise.</strong> This one bit me in production: I’d assumed the model would return a bare JSON array of questions. With <code>format="json"</code>, Gemma consistently returns an <strong>object</strong> — <code>{"questions": [...]}</code> — and for a while the parser only accepted the array. Result: a 502 on every quiz generation until I found it. 
The fix is a parser that meets the model where it is:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Simplified from backend/services/quiz_generator.py</span> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">extract_items</span><span class="p">(</span><span class="n">raw</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">list</span> <span class="o">|</span> <span class="kc">None</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">candidate</span> <span class="ow">in</span> <span class="p">(</span><span class="n">raw</span><span class="p">,</span> <span class="n">extract_json_object</span><span class="p">(</span><span class="n">raw</span><span class="p">),</span> <span class="n">extract_json_array</span><span class="p">(</span><span class="n">raw</span><span class="p">)):</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">candidate</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">continue</span> </span></span><span class="line"><span class="cl"> <span class="n">data</span> <span class="o">=</span> <span class="n">load_json_lenient</span><span class="p">(</span><span class="n">candidate</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nb">list</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="n">data</span> <span class="c1"># bare array</span> 
</span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="n">items</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"questions"</span><span class="p">)</span> <span class="c1"># the expected object shape</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="nb">list</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="n">items</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="kc">None</span> </span></span></code></pre></div><p><strong>Lexical glitches.</strong> Occasionally a trailing comma slips through. 
The repair is deliberately narrow — one regex pass, then give up:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">json</span> </span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">load_json_lenient</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="k">try</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="k">except</span> <span class="n">json</span><span class="o">.</span><span class="n">JSONDecodeError</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">repaired</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s2">",(\s*[\]}])"</span><span class="p">,</span> <span class="sa">r</span><span class="s2">"\1"</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span> <span class="c1"># strip trailing commas</span> </span></span><span class="line"><span class="cl"> <span class="k">try</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">repaired</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="k">except</span> <span class="n">json</span><span class="o">.</span><span class="n">JSONDecodeError</span><span class="p">:</span> </span></span><span class="line"><span
class="cl"> <span class="k">return</span> <span class="kc">None</span> </span></span></code></pre></div><p>I don’t try to balance brackets, complete truncated strings, or guess at missing fields. Either the output is fixable with a trailing-comma pass and some substring extraction, or it isn’t, and we move to step 5.</p> <h2 id="step-5-drop-malformed-items-dont-fail-the-batch">Step 5: drop malformed items, don’t fail the batch</h2> <p>This is the call that took me a while to make peace with.</p> <p>When the model returns 10 quiz questions but #7 is missing its <code>options</code> field, the temptation is to error out and regenerate the whole batch. <em>Don’t</em>. Validate each item independently and drop the ones that fail.</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># CogniVault does this with explicit field checks into a dataclass;</span> </span></span><span class="line"><span class="cl"><span class="c1"># pydantic works just as well.</span> </span></span><span class="line"><span class="cl"><span class="n">questions</span> <span class="o">=</span> <span class="p">[]</span> </span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">raw_item</span> <span class="ow">in</span> <span class="n">parsed_items</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">q</span> <span class="o">=</span> <span class="n">validate_item</span><span class="p">(</span><span class="n">raw_item</span><span class="p">,</span> <span class="n">allowed_types</span><span class="p">)</span> <span class="c1"># returns None if malformed</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">q</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> 
<span class="n">questions</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">q</span><span class="p">)</span> </span></span></code></pre></div><p>The user gets 9 questions instead of 10. They don’t notice. Re-running the whole generation to fix question #7 takes 30 seconds and might introduce <em>new</em> failures in questions 1-6. The dropped-item approach is strictly better UX. (The model also sometimes overshoots the count — the validated list is simply trimmed back to what was asked for.)</p> <h2 id="step-6-the-outline-retries-once">Step 6: the outline retries once</h2> <p>Workshops are the exception that proves the rule. A workshop is a structured outline (title, summary, lesson list) plus each lesson’s content. The outline <em>must</em> parse — there’s no partial success for a table of contents — so a parse failure there triggers exactly <strong>one</strong> retry, with the prompt re-sent plus a stern reminder: “Your previous response was unparseable. Output ONLY a single valid JSON object.” If the second attempt fails too, the user gets a clear error suggesting a narrower scope.</p> <p>One retry, not three. Three retries when the model is consistently confused is just wasted seconds and watts.</p> <p>The lessons themselves, interestingly, are <strong>not JSON at all</strong>. A lesson body is prose — forcing it into a JSON string would buy nothing and cost escaping headaches. Lessons are generated as plain Markdown, then run through a small cleanup pass that strips chat-isms the model sometimes adds despite instructions (“I hope this helps!”, “Let me know if…”). 
Different output, different contract.</p> <h2 id="two-pass-outline-first-lessons-on-demand">Two-pass: outline first, lessons on demand</h2> <p>Workshops use a two-pass generation pattern:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Pass 1 — generate outline: {"title": ..., "lessons": [{"title": ...}, ...]} (cheap, JSON) </span></span><span class="line"><span class="cl">Pass 2 — for each lesson: a full Markdown lesson body (on demand) </span></span></code></pre></div><p>The outline is fast and lets the user see the shape of the workshop immediately. Each lesson is generated when the user opens it — meaning the user is <em>reading</em> lesson 1 while deciding whether they even want lesson 5. The total wall-clock time to “first useful content” is small even for a 10-lesson workshop.</p> <p>This is the same architectural move the chat side makes with two-phase streaming: split a slow operation into a tiny fast part and a larger slow part, and hand the user the fast part immediately.</p> <h2 id="what-i-learned-so-far-putting-those-generators-together">What I learned so far putting those generators together</h2> <p>A few principles distilled from the four generators:</p> <ol> <li><strong>Use the grammar option in your inference stack.</strong> Don’t try to coax JSON out of a free-form decoder.</li> <li><strong>Pin every quantifier in the prompt.</strong> “Exactly 10,” “exactly 4 options,” “one or two sentences.” Vague counts produce inconsistent output.</li> <li><strong>Don’t assume the top-level shape.</strong> Grammar-constrained Gemma likes objects; your code might expect arrays. 
Accept both — the parser is cheaper than relying on the model to return the expected shape.</li> <li><strong>Drop, don’t fail.</strong> Lossy success beats brittle perfection.</li> <li><strong>One retry, never more.</strong> If two tries can’t produce valid output, the prompt is wrong, not the model.</li> <li><strong>Split large generations.</strong> Outline + lessons. Skeleton + body. Two small calls beat one big one almost every time. And if a part of the output is naturally prose, let it <em>be</em> prose.</li> </ol> <p>Local LLMs in 2026 are good enough that structured generation is genuinely usable for production-shaped features. They are not so good that you can skip the defensive scaffolding. The scaffolding above is maybe 80 lines of code total across all four generators, and it’s the difference between “demo-quality” and “I trust this enough to ship.”</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The structured text format the generators must produce</td> </tr> <tr> <td><strong>LLM</strong></td> <td>Large Language Model</td> <td>A neural network trained on huge amounts of text that can read and generate language</td> </tr> <tr> <td><strong>AI</strong></td> <td>Artificial Intelligence</td> <td>Software performing tasks that normally need human intelligence</td> </tr> <tr> <td><strong>MCQ</strong></td> <td>Multiple-Choice Question</td> <td>One of the two quiz question types (the other is true/false)</td> </tr> <tr> <td><strong>UX</strong></td> <td>User Experience</td> <td>Why 9 valid questions beat a regeneration error</td> </tr> <tr> <td><strong>SQLite</strong></td> <td>(SQL = Structured Query Language)</td> <td>The single-file database where generated artefacts persist</td> </tr> <tr> <td><strong>DBOS</strong></td> 
<td>Database-Oriented Operating System</td> <td>The durable-workflow library from the previous post</td> </tr> <tr> <td><strong>HTTP 502</strong></td> <td>Bad Gateway (HyperText Transfer Protocol status code)</td> <td>The error my array-only parser produced until I accepted Gemma’s object shape</td> </tr> </tbody> </table> <hr> <p><strong>Next up:</strong> what hand-rolling an SVG radial layout taught me, and why version two uses React Flow anyway.</p> </article> <article> <h1>Part 4 · Crash-Resumable Ingestion: DBOS, SHA-256, and Surviving a kill -9</h1> <p>Tue, 05 May 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>Part of a series on building CogniVault.</p> </blockquote> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>There are two things you absolutely don’t want your RAG ingestion pipeline to do:</p> <ol> <li>Re-embed a 200-page PDF because you fixed a typo on page 12.</li> <li>Lose its progress if you close the laptop lid halfway through.</li> </ol> <p>The first wastes time and compute. The second erodes trust in the system. Both have the same root: ingestion is treated like a fire-and-forget function, when it’s actually a long-running pipeline with intermediate state worth preserving.</p> <p>CogniVault treats ingestion as a <strong>durable workflow</strong>: specifically, one checkpointed in Postgres, with content hashing for incremental work. This post walks through both pieces.</p> <h2 id="the-pipeline">The pipeline</h2> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">1. 
Scan docs/ → SHA-256 hash per file </span></span><span class="line"><span class="cl"> ├── New file → queue for embedding </span></span><span class="line"><span class="cl"> ├── Changed file → soft-delete old chunks, re-embed </span></span><span class="line"><span class="cl"> └── Unchanged → skip (idempotent) </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl">2. Extract text → per-format extractor (PDF/OCR, DOCX, PPTX, XLSX, MD, CSV, TXT, HTML) </span></span><span class="line"><span class="cl">3. Chunk → RecursiveCharacterTextSplitter (1000 chars, 100 overlap) </span></span><span class="line"><span class="cl">4. Embed → embeddinggemma via Ollama, batches of 5 </span></span><span class="line"><span class="cl">5. Save → append to FAISS IndexFlatIP + JSON metadata on disk </span></span></code></pre></div><p>The heavy stages run as DBOS steps inside one parent workflow, each one checkpointed: if the process dies between steps, the next start picks up at the last completed one.</p> <h2 id="sha-256-as-the-source-of-truth">SHA-256 as the source of truth</h2> <p>The naive approach is to track ingestion by filename. That breaks the first time someone edits a file in place. Filename is the same; content isn’t. The vector store quietly carries stale chunks.</p> <p>The fix is content-addressed: hash the file bytes, store the hash alongside the chunks. 
Every ingestion run:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">current_hash</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">file_bytes</span><span class="p">)</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span> </span></span><span class="line"><span class="cl"><span class="n">stored_hash</span> <span class="o">=</span> <span class="n">chunk_metadata_for</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"file_hash"</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">stored_hash</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">schedule_ingest</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span> <span class="c1"># new file</span> </span></span><span class="line"><span class="cl"><span class="k">elif</span> <span class="n">stored_hash</span> <span class="o">==</span> <span class="n">current_hash</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">skip</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span> <span class="c1"># unchanged</span> </span></span><span class="line"><span class="cl"><span class="k">else</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">soft_delete_chunks_for</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span> <span class="c1"># changed</span> 
</span></span><span class="line"><span class="cl"> <span class="n">schedule_ingest</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span> </span></span></code></pre></div><p>This makes ingestion <strong>idempotent</strong>, a property that’s worth its weight in gold: running the pipeline twice in a row does almost nothing the second time. That’s not just an optimisation — it’s what makes the next section possible.</p> <h2 id="dbos-workflows">DBOS workflows</h2> <p>DBOS is a Python library that turns regular functions into checkpointed workflows backed by Postgres. The model is dead simple: decorate a function with <code>@DBOS.workflow()</code>, mark each long-running call inside it as a <code>@DBOS.step()</code>, and DBOS records each step’s input, output, and status in Postgres as it runs.</p> <p>If the workflow crashes — process killed, OS reboot, Postgres connection drop — the next start sees there’s an unfinished workflow with the same ID, replays the <em>recorded</em> step outputs from Postgres (without re-running them), and resumes from the first incomplete step.</p> <p>Here’s the actual step structure (slightly simplified from <code>backend/services/ingest.py</code>):</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nd">@DBOS.workflow</span><span class="p">()</span> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">ingest_workflow</span><span class="p">()</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">filenames</span> <span class="o">=</span> <span class="n">list_document_files</span><span class="p">()</span> <span class="c1"># @DBOS.step — scan + hash check</span> </span></span><span class="line"><span class="cl"> <span class="n">docs</span> <span class="o">=</span> <span
class="p">[]</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">filenames</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="n">docs</span> <span class="o">+=</span> <span class="n">process_single_document</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="c1"># @DBOS.step — extract text, one file each</span> </span></span><span class="line"><span class="cl"> <span class="n">chunks</span> <span class="o">=</span> <span class="n">chunk</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span> <span class="c1"># plain Python — fast, re-runs freely</span> </span></span><span class="line"><span class="cl"> <span class="n">embeddings</span> <span class="o">=</span> <span class="p">[]</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">batches_of_5</span><span class="p">(</span><span class="n">chunks</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="n">embeddings</span> <span class="o">+=</span> <span class="n">embed_batch</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span> <span class="c1"># @DBOS.step — the slow one, retried on failure</span> </span></span><span class="line"><span class="cl"> <span class="n">save_vector_store</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">chunks</span><span class="p">)</span> <span class="c1"># @DBOS.step — append to FAISS + metadata</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span> </span></span></code></pre></div><p>The granularity of 
<code>@DBOS.step</code> is the granularity of crash recovery, and it’s chosen deliberately. Extraction is one step <strong>per file</strong>, so a crash during file 9 of 10 doesn’t re-read the first eight. Embedding is one step <strong>per batch of five chunks</strong>, for one specific reason: <strong><code>embed_batch</code> is the slow one.</strong> If the laptop dies during embeddings, we resume the embedding loop at the failed batch, not at PDF extraction.</p> <p>Notice what <em>isn’t</em> a step: chunking. Splitting text is fast pure-Python work — checkpointing it would cost more ledger bookkeeping than simply redoing it on a resume.</p> <p>There’s a related sizing trick hiding in the batch number. DBOS records each step’s output in Postgres, and <code>embed_batch</code> returns its vectors — so each ledger entry contains five embeddings’ worth of floats. Small batches keep each checkpoint record small and each retry cheap. One giant “embed everything” step would mean one giant ledger row and zero resume granularity.</p> <h2 id="the-format-extractors">The format extractors</h2> <p>Step 2 (<code>process_single_document</code>) is a dispatch on file extension. 
Each extractor is small and obvious; the interesting choices are in the chunking strategy each one feeds downstream.</p> <table> <thead> <tr> <th>Format</th> <th>Library</th> <th>Chunking note</th> </tr> </thead> <tbody> <tr> <td><strong>PDF</strong></td> <td><code>pypdf</code> page-by-page; <code>pytesseract</code> OCR fallback for image-only pages</td> <td>Recursive splitter, 1000/100</td> </tr> <tr> <td><strong>DOCX</strong></td> <td><code>python-docx</code> (paragraphs + table rows joined as text)</td> <td>Recursive splitter</td> </tr> <tr> <td><strong>PPTX</strong></td> <td><code>python-pptx</code></td> <td>One chunk per slide (title + body text)</td> </tr> <tr> <td><strong>XLSX</strong></td> <td><code>openpyxl</code></td> <td>Header + 20-row batches, per sheet</td> </tr> <tr> <td><strong>MD</strong></td> <td><code>MarkdownHeaderTextSplitter</code></td> <td>One chunk per H1/H2/H3 section, breadcrumb prepended</td> </tr> <tr> <td><strong>CSV</strong></td> <td>manual reader</td> <td>Header row + 20-row batches</td> </tr> <tr> <td><strong>TXT</strong></td> <td>raw UTF-8 read</td> <td>Recursive splitter</td> </tr> <tr> <td><strong>HTML</strong></td> <td><code>trafilatura</code> clean text</td> <td>Recursive splitter</td> </tr> </tbody> </table> <p>The OCR fallback is the one worth pausing on. PDFs come in two flavours: ones with a real text layer, and ones that are basically scanned images wearing a PDF costume. <code>pypdf</code> returns <em>nothing useful</em> for the second kind, but it doesn’t raise — it just hands back empty strings. Without a fallback, your “ingestion succeeded” log is lying to you.</p> <p>The detector is a heuristic: if <code>pypdf</code> returns fewer than 50 characters for a page, route the page through <code>pymupdf</code> → <code>Pillow</code> → <code>pytesseract</code> OCR. That path is slower, but it at least produces text. 
The threshold is tuned to be sensitive enough to catch scanned pages while not punishing legitimately short pages (a chapter cover, a colophon).</p> <h2 id="soft-delete-not-hard-delete">Soft delete, not hard delete</h2> <p>When a file changes and we re-ingest, the old chunks need to go. The temptation is to physically remove them from the FAISS index, but FAISS <code>IndexFlatIP</code> doesn’t support efficient deletion — you’d have to rebuild.</p> <p><strong>Soft delete</strong> instead: changed files get their old chunks marked with a <code>deleted: true</code> flag in the metadata; new chunks are appended without it. Search filters on the flag at query time, so stale vectors sit harmlessly in the index. If enough dead weight ever accumulates, the escape valve is obvious — rebuild the index from active chunks only — but in practice I haven’t needed it.</p> <p>This is the same pattern most append-only systems use. It pairs naturally with content hashing — flag-and-append is much cheaper than remove-and-rebuild. One subtlety: the keyword index has to follow suit. CogniVault’s <code>VectorDB.delete_by_source()</code> flips the flags <strong>and rebuilds BM25</strong> over the remaining active chunks, so the two retrievers never disagree about what exists.</p> <h2 id="what-the-user-sees">What the user sees</h2> <p>Starting an ingestion (<code>POST /ingest</code>) returns a <code>workflow_id</code>, and the frontend polls <code>GET /ingest/status/{workflow_id}</code> to draw a live timeline of the workflow’s steps — scanning, per-file extraction (“Reading pages… 3 of 21”), embedding (“Calibrating batch 4 of 12”), saving. If the user closes the tab mid-ingest and reopens it five minutes later, the workflow has finished in the background regardless. The next call to <code>GET /api/vault/stats</code> reflects the new chunk count. 
No “click to resume” button, no manual recovery dance.</p> <p>The first time I closed the lid mid-embedding and watched the workflow pick itself up from the next step on resume, I’ll admit I was a little smug. That’s exactly the property I wanted, with surprisingly little code.</p> <h2 id="pitfalls-and-edges">Pitfalls and edges</h2> <p>A few things I had to learn the hard way:</p> <ul> <li><strong>Don’t make <code>embed_batch</code> too big.</strong> Ollama isn’t great at backpressure. Batches of 5 are a sweet spot for <code>embeddinggemma</code> on a 16 GB machine — bigger batches stall on memory, smaller ones waste round-trip overhead. (And as noted above, the batch size doubles as your checkpoint-record size.)</li> <li><strong>Be careful with file deletion.</strong> Soft-deleted chunks must also disappear from BM25’s corpus, or keyword search will keep returning text that dense search no longer sees. Rebuilding BM25 inside <code>delete_by_source()</code> keeps the two in lockstep.</li> <li><strong>OCR is slow.</strong> A 50-page scan can take a minute or more. Surface that latency to the user; otherwise they think it’s hanging.</li> </ul> <h2 id="takeaway">Takeaway</h2> <p>Durable workflows aren’t only for distributed systems. A single-user local app benefits from them in <em>exactly the same ways</em>: incremental work, crash recovery, idempotent retries. DBOS makes the cost of opting in trivially low — decorate your function, run Postgres locally, and you get a pipeline that survives lid-closes, OS updates, and your own <code>Ctrl-C</code>.</p> <p>Combined with content-addressed hashing, ingestion stops being a thing you avoid touching for fear of having to wait 20 minutes. 
It becomes a thing you re-run whenever you feel like it — because re-running is free when nothing has changed.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>DBOS</strong></td> <td>Database-Oriented Operating System</td> <td>A library that checkpoints workflow steps in Postgres so crashed jobs resume instead of restarting</td> </tr> <tr> <td><strong>SHA-256</strong></td> <td>Secure Hash Algorithm, 256-bit</td> <td>A content fingerprint: change one byte of a file and the hash changes completely</td> </tr> <tr> <td><strong>RAG</strong></td> <td>Retrieval-Augmented Generation</td> <td>Retrieve relevant passages from your own documents first; let the model answer from them</td> </tr> <tr> <td><strong>OCR</strong></td> <td>Optical Character Recognition</td> <td>Turning pictures of text (scanned pages) into machine-readable text</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>The vector index the embeddings are appended to</td> </tr> <tr> <td><strong>IP</strong> (in <code>IndexFlatIP</code>)</td> <td>Inner Product</td> <td>FAISS’s similarity measure; equals cosine similarity on normalised vectors</td> </tr> <tr> <td><strong>BM25</strong></td> <td>Best Match 25</td> <td>The keyword index that must stay in lockstep with FAISS on deletes</td> </tr> <tr> <td><strong>PDF / DOCX / PPTX / XLSX / MD / CSV / TXT / HTML</strong></td> <td>Portable Document Format / Word / PowerPoint / Excel / Markdown / Comma-Separated Values / plain text / HyperText Markup Language</td> <td>The formats the per-extension extractors handle</td> </tr> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The format of the chunk-metadata file next to the FAISS index</td> </tr> <tr> <td><strong>UTF-8</strong></td> <td>Unicode Transformation Format, 8-bit</td> <td>The text 
encoding used when reading plain-text files</td> </tr> <tr> <td><strong>OS</strong></td> <td>Operating System</td> <td>What reboots underneath you mid-ingest</td> </tr> </tbody> </table> <hr> <p><strong>Next up:</strong> what happens after Gemma 4 enthusiastically returns <code>{"questions": [{"text": "..."},}]</code>.</p> </article> <article> <h1>Part 3 · Two-Phase Streaming: Showing the Model Think Before It Acts</h1> <p>Thu, 30 Apr 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>Part of a series on building CogniVault. All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>When I first wired up Gemma 4 with Strands inside CogniVault, the chat felt slow. Not laggy: slow in a way that’s worse than laggy. The user types a question. The cursor sits there. Then, eventually, an answer drops out of the void.</p> <p>The model wasn’t idle. It was <em>thinking</em>. Gemma 4 has a chain-of-thought mode that produces a (sometimes long) reasoning trace before its final reply. 
With a single-phase agent stream, all of that thinking is happening <em>inside the agent loop</em> — silently — before any tool calls run, before any tokens get emitted to the UI.</p> <p>So I split the call into two phases.</p> <h2 id="the-shape">The shape</h2> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">POST /rag </span></span><span class="line"><span class="cl"> │ </span></span><span class="line"><span class="cl"> ├── Phase 1 — Direct Ollama call, thinking enabled </span></span><span class="line"><span class="cl"> │ stream: {"type":"thinking","data":"..."} (reasoning tokens) </span></span><span class="line"><span class="cl"> │ </span></span><span class="line"><span class="cl"> └── Phase 2 — Strands Agent (thinking disabled) </span></span><span class="line"><span class="cl"> stream: {"type":"metadata","data":{...}} (citations, as soon as search runs) </span></span><span class="line"><span class="cl"> stream: {"type":"text","data":"..."} (answer tokens) </span></span><span class="line"><span class="cl"> stream: {"type":"memory","data":{...}} (end-of-stream: session memory usage) </span></span></code></pre></div><p>The endpoint streams <strong>newline-delimited JSON</strong> (NDJSON): each line of the response body is one self-contained JSON envelope with a <code>type</code> and a <code>data</code>. The frontend dispatches on <code>type</code> and renders accordingly: a <strong>collapsible reasoning panel</strong> for the thinking tokens, the main message bubble for the text tokens, a sidebar card per citation.</p> <p>The user sees the model start thinking <em>immediately</em>. Latency to first byte drops from “long enough to wonder if it crashed” to “instant.” Total time to final answer doesn’t change. Perceived speed does.</p> <h2 id="phase-1--thinking-only">Phase 1 — Thinking only</h2> <p>Phase 1 is a single direct call to Ollama with thinking enabled. 
It gets exactly what Phase 2 will see — the same system prompt, the current question, and any attached images — so the reasoning reflects reality. Only the <em>reasoning</em> tokens are consumed; whatever answer text Phase 1 starts to produce is discarded, because we don’t want a half-formed answer competing with the real one.</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Simplified from backend/services/rag_agent.py</span> </span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">ollama</span><span class="o">.</span><span class="n">AsyncClient</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">settings</span><span class="o">.</span><span class="n">ollama_host</span><span class="p">)</span> </span></span><span class="line"><span class="cl"><span class="n">stream</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span> </span></span><span class="line"><span class="cl"> <span class="n">model</span><span class="o">=</span><span class="n">settings</span><span class="o">.</span><span class="n">llm_model</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="n">messages</span><span class="o">=</span><span class="p">[</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span><span class="s2">"role"</span><span class="p">:</span> <span class="s2">"system"</span><span class="p">,</span> <span class="s2">"content"</span><span class="p">:</span> <span class="n">system_prompt</span><span class="p">},</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span><span class="s2">"role"</span><span class="p">:</span> <span class="s2">"user"</span><span 
class="p">,</span> <span class="s2">"content"</span><span class="p">:</span> <span class="n">query</span><span class="p">,</span> <span class="s2">"images"</span><span class="p">:</span> <span class="n">images</span><span class="p">},</span> </span></span><span class="line"><span class="cl"> <span class="p">],</span> </span></span><span class="line"><span class="cl"> <span class="n">options</span><span class="o">=</span><span class="p">{</span><span class="s2">"thinking"</span><span class="p">:</span> <span class="kc">True</span><span class="p">},</span> </span></span><span class="line"><span class="cl"> <span class="n">stream</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> </span></span><span class="line"><span class="cl"><span class="p">)</span> </span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">stream</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">chunk</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">thinking</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">yield</span> <span class="n">envelope</span><span class="p">(</span><span class="s2">"thinking"</span><span class="p">,</span> <span class="n">chunk</span><span class="o">.</span><span class="n">message</span><span class="o">.</span><span class="n">thinking</span><span class="p">)</span> </span></span></code></pre></div><p>Phase 1 is deliberately <strong>best-effort</strong>: any failure here is swallowed and logged, and the stream moves straight on to Phase 2. 
A broken reasoning panel should never cost the user their answer.</p> <h2 id="phase-2--agent-with-tools">Phase 2 — Agent with tools</h2> <p>Phase 2 builds a <strong>fresh Strands <code>Agent</code> per request</strong> — no shared mutable state between concurrent chats — restores the session’s conversation history into it, and runs the tool loop with six tools registered:</p> <table> <thead> <tr> <th>Tool</th> <th>Purpose</th> </tr> </thead> <tbody> <tr> <td><code>search_knowledge_base(query)</code></td> <td>Hybrid FAISS + BM25 search, top-7, RRF fusion. Scope-filter-aware.</td> </tr> <tr> <td><code>list_documents()</code></td> <td>Inventory of every indexed file with type and chunk count.</td> </tr> <tr> <td><code>analyze_document(filename)</code></td> <td>Inner Gemma call → structured summary (topics, entities, key facts).</td> </tr> <tr> <td><code>compare_documents(doc_a, doc_b, question)</code></td> <td>Inner Gemma call answering across two documents.</td> </tr> <tr> <td><code>calculator(expression)</code></td> <td>Safe AST evaluator — no <code>eval()</code>, no arbitrary code.</td> </tr> <tr> <td><code>current_time()</code></td> <td>Timestamp for time-aware queries.</td> </tr> </tbody> </table> <p>The agent decides which tools to call and in what order. There’s no hard-coded router; the system prompt explains what’s available and Strands handles the loop. For most document questions the path is: <code>search_knowledge_base</code> → answer. For comparisons: <code>compare_documents</code> → answer. For “what files do I have?”: <code>list_documents</code> → answer. For greetings and arithmetic, the system prompt tells the agent it may skip search entirely. 
The model picks.</p> <p>Two details that took debugging to get right:</p> <ul> <li><strong>Phase 2 runs with thinking explicitly disabled.</strong> Without that flag, Gemma’s default behaviour can leak <code>&lt;think&gt;…&lt;/think&gt;</code> tags into the visible answer, and everything before the closing tag gets swallowed by the Markdown renderer. One model option, <code>options={"thinking": False}</code>, fixed a “truncated responses” bug that looked much scarier than it was.</li> <li><strong>Citations are flushed before the first answer token.</strong> Tools run before text deltas arrive, so by the time the first visible token streams, every source the search found is already in the sidebar. The accumulator is a request-local <code>ContextVar</code> that the search tool appends to.</li> </ul> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Simplified: the real loop reads Strands' raw event dicts (.get() tolerates events without an "event" key)</span> </span></span><span class="line"><span class="cl"><span class="k">async</span> <span class="k">for</span> <span class="n">event</span> <span class="ow">in</span> <span class="n">agent</span><span class="o">.</span><span class="n">stream_async</span><span class="p">(</span><span class="n">user_input</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="n">delta</span> <span class="o">=</span> <span class="n">event</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"event"</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"contentBlockDelta"</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"delta"</span><span class="p">,</span> <span class="p">{})</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"text"</span><span 
class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="n">delta</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">new_citations</span><span class="p">():</span> <span class="c1"># drain the ContextVar accumulator</span> </span></span><span class="line"><span class="cl"> <span class="k">yield</span> <span class="n">envelope</span><span class="p">(</span><span class="s2">"metadata"</span><span class="p">,</span> <span class="n">doc</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="k">yield</span> <span class="n">envelope</span><span class="p">(</span><span class="s2">"text"</span><span class="p">,</span> <span class="n">delta</span><span class="p">)</span> </span></span></code></pre></div><h2 id="why-this-matters-more-than-it-sounds">Why this matters more than it sounds</h2> <p>You could implement similar behaviour with one agent call that interleaves <code>thinking</code> events with <code>text</code> events. The reasons I split it anyway:</p> <ol> <li> <p><strong>The thinking model and the tool model can be different.</strong> Right now they’re both <code>gemma4:e4b</code>, but the architecture lets me swap a smaller, faster model in for Phase 1 reasoning and keep the big one for Phase 2 tool use. I’m not doing that yet — but I want the option.</p> </li> <li> <p><strong>Phase 1 always streams immediately.</strong> A pure agent loop only starts producing tokens after the model has decided what to say. 
Two-phase streaming guarantees the user sees activity almost as soon as they press Enter, regardless of how complex the Phase 2 tool work gets.</p> </li> <li> <p><strong>Failures isolate.</strong> If Phase 2 falls over (Ollama timeout, tool error), Phase 1’s reasoning is still visible, so the user can see <em>what the model was trying to do</em>, which makes the error far less frustrating than a blank “something went wrong.”</p> </li> </ol> <h2 id="contextvar-isolation-again">ContextVar isolation, again</h2> <p>The same <code>ContextVar</code> trick that scopes retrieval in Part 2 carries over here. At the start of each <code>/rag</code> stream, the handler sets two request-local variables: the <strong>document-scope filter</strong> and the <strong>citation accumulator</strong>. The agent’s tools read and write them implicitly. Conversation history itself lives in a per-session store guarded by per-session <code>asyncio</code> locks, so two concurrent requests in the same chat can’t corrupt each other either.</p> <p>I tested this with two browser tabs open on the same backend, scoped to different document categories, sending overlapping queries simultaneously. The result: zero cross-contamination. The test suite covers this explicitly in <code>test_thinking.py</code> and <code>test_doc_scope_filter.py</code>.</p> <h2 id="the-frontend-side-of-the-contract">The frontend side of the contract</h2> <p>A detail that tripped me up: this is a <code>POST</code> endpoint, so the browser’s <code>EventSource</code> API (which only does GET) is out. 
The frontend uses <code>fetch</code> and reads the response body incrementally, splitting on newlines and parsing each line as JSON:</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-tsx" data-lang="tsx"><span class="line"><span class="cl"><span class="c1">// Simplified from useRagStream.ts </span></span></span><span class="line"><span class="cl"><span class="kr">const</span> <span class="nx">res</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">fetch</span><span class="p">(</span><span class="s2">"/rag"</span><span class="p">,</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="nx">method</span><span class="o">:</span> <span class="s2">"POST"</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">headers</span><span class="o">:</span> <span class="p">{</span> <span class="s2">"Content-Type"</span><span class="o">:</span> <span class="s2">"application/json"</span> <span class="p">},</span> <span class="c1">// tell the server the body is JSON </span></span></span><span class="line"><span class="cl"> <span class="nx">body</span>: <span class="kt">JSON.stringify</span><span class="p">(</span><span class="nx">payload</span><span class="p">),</span> </span></span><span class="line"><span class="cl"><span class="p">});</span> </span></span><span class="line"><span class="cl"><span class="kr">const</span> <span class="nx">reader</span> <span class="o">=</span> <span class="nx">res</span><span class="p">.</span><span class="nx">body</span><span class="o">!</span><span class="p">.</span><span class="nx">getReader</span><span class="p">();</span> </span></span><span class="line"><span class="cl"><span class="kr">const</span> <span class="nx">decoder</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">TextDecoder</span><span class="p">();</span> </span></span><span class="line"><span class="cl"><span class="kd">let</span> <span class="nx">buffer</span> <span class="o">=</span> <span class="s2">""</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="k">while</span> <span class="p">(</span><span class="kc">true</span><span 
class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="kr">const</span> <span class="p">{</span> <span class="nx">done</span><span class="p">,</span> <span class="nx">value</span> <span class="p">}</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">reader</span><span class="p">.</span><span class="nx">read</span><span class="p">();</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="p">(</span><span class="nx">done</span><span class="p">)</span> <span class="k">break</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="nx">buffer</span> <span class="o">+=</span> <span class="nx">decoder</span><span class="p">.</span><span class="nx">decode</span><span class="p">(</span><span class="nx">value</span><span class="p">,</span> <span class="p">{</span> <span class="nx">stream</span>: <span class="kt">true</span> <span class="p">});</span> </span></span><span class="line"><span class="cl"> <span class="kr">const</span> <span class="nx">lines</span> <span class="o">=</span> <span class="nx">buffer</span><span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="s2">"\n"</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="nx">buffer</span> <span class="o">=</span> <span class="nx">lines</span><span class="p">.</span><span class="nx">pop</span><span class="p">()</span><span class="o">!</span><span class="p">;</span> <span class="c1">// keep the trailing partial line </span></span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="p">(</span><span class="kr">const</span> <span class="nx">line</span> <span class="k">of</span> <span class="nx">lines</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span 
class="p">(</span><span class="o">!</span><span class="nx">line</span><span class="p">.</span><span class="nx">trim</span><span class="p">())</span> <span class="k">continue</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="kr">const</span> <span class="p">{</span> <span class="kr">type</span><span class="p">,</span> <span class="nx">data</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">line</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">switch</span> <span class="p">(</span><span class="kr">type</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">case</span> <span class="s2">"thinking"</span><span class="o">:</span> </span></span><span class="line"><span class="cl"> <span class="nx">appendThinking</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">break</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">case</span> <span class="s2">"text"</span><span class="o">:</span> </span></span><span class="line"><span class="cl"> <span class="nx">appendText</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">break</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">case</span> <span class="s2">"metadata"</span><span class="o">:</span> </span></span><span class="line"><span class="cl"> <span class="nx">addCitation</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">break</span><span 
class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">case</span> <span class="s2">"memory"</span><span class="o">:</span> </span></span><span class="line"><span class="cl"> <span class="nx">updateMemoryMeter</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">break</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"><span class="p">}</span> </span></span></code></pre></div><p>The reasoning panel starts <strong>collapsed</strong>, with a small pulsing indicator while thinking tokens are still streaming — enough to signal “the model is working” without shoving a wall of chain-of-thought at the user. One click expands the full trace, during or after the stream.</p> <h2 id="what-id-revisit">What I’d revisit</h2> <ul> <li><strong>Phase 1 reasons toward a full answer, and we throw the answer part away.</strong> A dedicated “plan your approach, don’t answer yet” prompt for Phase 1 would make the reasoning trace tighter and cheaper. Today it shares the main system prompt — simpler, but the trace can ramble.</li> <li><strong>No interrupt yet.</strong> Once Phase 1 starts, it runs to completion. If the user types a follow-up mid-stream we let it finish. A real cancel button would mean wiring an abort signal through Ollama’s HTTP client — feasible, not yet done.</li> <li><strong>Phase 1 occasionally over-thinks.</strong> Greetings and trivial questions still produce a paragraph of reasoning. A “should I think?” gate (probably a tiny classifier or even a heuristic on query length) would skip Phase 1 entirely for those cases.</li> </ul> <h2 id="takeaway">Takeaway</h2> <p>Streaming is <em>not</em> just an optimisation. It’s a UX primitive. 
Two-phase streaming buys you a free property: the <em>visible</em> part of the interaction starts before the <em>slow</em> part does. The user gets to watch the model think, which is — genuinely — more interesting than watching a spinner.</p> <p>If your agent app feels slow even though the answers are fast, look at <em>when</em> tokens start flowing. The fix often isn’t a faster model.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>NDJSON</strong></td> <td>Newline-Delimited JSON</td> <td>A stream where each line is its own complete JSON object — what <code>/rag</code> emits</td> </tr> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The universal text format for structured data</td> </tr> <tr> <td><strong>UX</strong></td> <td>User Experience</td> <td>How the product feels to use — the real beneficiary of two-phase streaming</td> </tr> <tr> <td><strong>UI</strong></td> <td>User Interface</td> <td>The visible surface the stream renders into</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>The dense half of hybrid retrieval (previous post)</td> </tr> <tr> <td><strong>BM25</strong></td> <td>Best Match 25</td> <td>The keyword half of hybrid retrieval (previous post)</td> </tr> <tr> <td><strong>RRF</strong></td> <td>Reciprocal Rank Fusion</td> <td>The rank-only formula that merges the two result lists</td> </tr> <tr> <td><strong>AST</strong></td> <td>Abstract Syntax Tree</td> <td>The parsed form of an expression — how the calculator evaluates maths without <code>eval()</code></td> </tr> <tr> <td><strong>HTTP</strong></td> <td>HyperText Transfer Protocol</td> <td>The protocol carrying the stream</td> </tr> <tr> <td><strong>SSE</strong></td> <td>Server-Sent Events</td> <td>The browser’s built-in GET-only streaming format — notably 
<em>not</em> usable here, because <code>/rag</code> is a POST</td> </tr> <tr> <td><strong>API</strong></td> <td>Application Programming Interface</td> <td>The boundary the frontend calls</td> </tr> </tbody> </table> <hr> <p><strong>Next up:</strong> how CogniVault re-ingests edited PDFs without re-embedding everything, and survives a <code>kill -9</code> mid-pipeline.</p> </article> <article> <h1>Part 2 · Hybrid Retrieval in Practice: FAISS + BM25, Fused with RRF</h1> <p>Sat, 25 Apr 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>Part of a series on building CogniVault, a fully local AI study companion.</p> </blockquote> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>The first version of CogniVault used pure dense retrieval: embed the query with <code>embeddinggemma</code>, search a FAISS index, and pass the top-7 chunks to the model. It worked. It worked <em>beautifully</em>, until a user uploaded a PDF containing some German legal text and asked for “§3 Absatz 2.”</p> <p>The model couldn’t find it.</p> <p>The chunk was <em>right there</em>. The PDF was indexed. But “§3 Absatz 2” doesn’t embed into anything semantically meaningful; it’s a token-level identifier, not a concept. The dense vector for the query landed nowhere near the dense vector for the chunk, even though the chunk literally contains the string the user asked for.</p> <p>That bug killed pure dense retrieval for me. This post is about what replaced it.</p> <h2 id="two-kinds-of-similar">Two kinds of “similar”</h2> <p>You already use both kinds of search every day. 
When Spotify builds a “song radio” from a track you like, it’s matching <em>feel</em> — tempo, mood, genre — and it will happily play you a song whose title shares no words with the original. But when you type <code>Bohemian Rhapsody remastered 2011</code> into the search box, you don’t want <em>feel</em>. You want that exact string, and “a similar operatic rock epic” is a wrong answer.</p> <p>Search systems formalise that split into two notions of similarity:</p> <ul> <li><strong>Lexical similarity</strong> — “do these strings share rare words?” This is what TF-IDF and BM25 model. They thrive on identifiers, names, code, technical terminology, and direct quotes.</li> <li><strong>Semantic similarity</strong> — “do these passages talk about the same idea, even with different words?” This is what embeddings model. They thrive on paraphrase, conceptual queries, and natural-language questions.</li> </ul> <p>Neither subsumes the other. A user asking <em>“how is the practical exam structured?”</em> needs <strong>semantic</strong> search — the document doesn’t say “structure of practical exam.” A user asking <em>"§3 Absatz 2"</em> needs <strong>lexical</strong> search — there’s no concept to embed, just a literal string.</p> <p>Production RAG has to do both. 
CogniVault does both, and then fuses the result lists with <strong>Reciprocal Rank Fusion (RRF)</strong>.</p> <h2 id="the-stack">The stack</h2> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Query </span></span><span class="line"><span class="cl"> ├── embed via embeddinggemma ──► FAISS IndexFlatIP ──► top-K dense </span></span><span class="line"><span class="cl"> └── tokenize + lowercase ──► BM25Okapi ──► top-K sparse </span></span><span class="line"><span class="cl"> │ </span></span><span class="line"><span class="cl"> Reciprocal Rank Fusion ◄──┘ </span></span><span class="line"><span class="cl"> │ </span></span><span class="line"><span class="cl"> top-7 fused chunks </span></span></code></pre></div><p>Both indexes live <strong>in memory</strong>, fronted by a <code>VectorDB</code> singleton. FAISS does inner-product search over normalised embeddings (so dot product = cosine). BM25 is <code>rank_bm25</code>’s <code>BM25Okapi</code>, fed the same chunks tokenised by a simple lowercase-and-split tokenizer.</p> <p>The corpora are kept in lockstep: soft-deleting a file’s chunks triggers a BM25 rebuild over the remaining active chunks, and the singleton reloads both indexes from <code>vector_store.faiss</code> + <code>vector_store.json</code> (chunk metadata + raw text) after every ingestion run and on app start.</p> <h2 id="why-faiss-indexflatip-not-hnsw-or-ivf">Why FAISS <code>IndexFlatIP</code>, not HNSW or IVF?</h2> <p><code>IndexFlatIP</code> is brute-force exact search. It scans every vector, every query. For tens of thousands of chunks that’s fine — sub-millisecond on a laptop. CogniVault is a <strong>single-user, local-first</strong> app; the index is never going to be billions of vectors. Trading recall for speed via HNSW or IVF would buy nothing here and lose the “exact” guarantee. 
Boring, correct, fast enough.</p> <p>When the corpus grows large enough that brute-force gets sticky, switching is a one-line change. Until then, the simplest index wins.</p> <h2 id="reciprocal-rank-fusion">Reciprocal Rank Fusion</h2> <p>The naive way to combine two ranked lists is to add their scores. That sounds reasonable until you remember that FAISS returns inner-product scores bounded in [-1, 1] (the vectors are normalised) while BM25 returns scores with no upper bound. The two aren’t comparable without normalisation, and any normalisation you pick is somewhat arbitrary.</p> <p><strong>RRF sidesteps the problem entirely.</strong> It only looks at <em>ranks</em>, not scores. For each result list, an item at rank <code>r</code> contributes <code>1 / (k + r)</code> to its final score (with <code>k = 60</code> by convention: large enough to flatten the tail, small enough that the top items still dominate). An item that appears in both lists gets the sum of its contributions.</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="c1"># Simplified: the real implementation also de-duplicates chunks</span> </span></span><span class="line"><span class="cl"><span class="c1"># by (source, chunk_id, page) before scoring.</span> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">reciprocal_rank_fusion</span><span class="p">(</span><span class="n">result_lists</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">60</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="n">scores</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">results</span> <span class="ow">in</span> <span class="n">result_lists</span><span 
class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">chunk_id</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">start</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span> </span></span><span class="line"><span class="cl"> <span class="n">scores</span><span class="p">[</span><span class="n">chunk_id</span><span class="p">]</span> <span class="o">+=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="n">k</span> <span class="o">+</span> <span class="n">rank</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">scores</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">kv</span><span class="p">:</span> <span class="n">kv</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> </span></span></code></pre></div><p>That’s the whole algorithm. No tuning, no calibration, no per-corpus weights. A chunk that’s #1 in BM25 and #4 in FAISS easily beats a chunk that’s #2 in only one of them. A chunk that <em>both</em> indexes agree on rises to the top deterministically.</p> <p>The result for the “§3 Absatz 2” query: BM25 finds the literal match and lands it at rank 1. FAISS finds nothing useful (its top hits are about exam regulations in general). RRF surfaces the BM25 hit at the top of the fused list. 
Problem solved.</p> <h2 id="scope-filtering-with-contextvar-isolation">Scope filtering with ContextVar isolation</h2> <p>One detail that’s easy to get wrong: the retriever has to be <em>scope-aware</em>. CogniVault lets users limit a question to a single category or specific files. The scope is set by the request, but the search is called from deep inside the Strands agent loop, which is called from a streaming FastAPI handler, possibly with multiple concurrent requests in flight per worker.</p> <p>Threading the scope through every function call would be ugly. A global is unsafe. The right primitive is Python’s <code>contextvars.ContextVar</code>, which gives you per-task isolated state that asyncio and threads both respect.</p> <div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">contextvars</span> <span class="kn">import</span> <span class="n">ContextVar</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="n">_doc_scope</span><span class="p">:</span> <span class="n">ContextVar</span><span class="p">[</span><span class="n">DocScope</span> <span class="o">|</span> <span class="kc">None</span><span class="p">]</span> <span class="o">=</span> <span class="n">ContextVar</span><span class="p">(</span><span class="s2">"doc_scope"</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">set_doc_scope</span><span class="p">(</span><span class="n">scope</span><span class="p">:</span> <span class="n">DocScope</span> <span class="o">|</span> <span class="kc">None</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"> <span class="n">_doc_scope</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">scope</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">current_doc_scope</span><span class="p">()</span> <span class="o">-></span> <span class="n">DocScope</span> <span class="o">|</span> <span class="kc">None</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="n">_doc_scope</span><span class="o">.</span><span class="n">get</span><span class="p">()</span> </span></span></code></pre></div><p>The <code>/rag</code> request handler sets the scope at the very start of each streaming response; the search tool reads it; because the value is task-local, it dies with the request. No globals, no parameter drilling, no race conditions across concurrent users.</p> <p>This is one of those design choices that looks like over-engineering until you have two browser tabs open and realise that without it, tab A’s scope filter would leak into tab B’s question.</p> <h2 id="chunking-choices-that-pay-off-downstream">Chunking choices that pay off downstream</h2> <p>Hybrid retrieval is only as good as the chunks. CogniVault uses a <code>RecursiveCharacterTextSplitter</code> with <strong>1,000 characters, 100 overlap</strong> for unstructured text — small enough to keep retrieval precise, large enough to carry context for the model.</p> <p>For structured formats it switches strategy:</p> <ul> <li><strong>Markdown</strong> → <code>MarkdownHeaderTextSplitter</code> emits one chunk per H1/H2/H3 section with the heading hierarchy prepended as a breadcrumb (“Privacy > Vault Audit > Indicators”). 
BM25 loves breadcrumbs — they make heading-keyword queries match cleanly.</li> <li><strong>CSV</strong> → header row + 20-row batches per chunk, so a query for a column name lands on the right block.</li> <li><strong>PPTX</strong> → one chunk per slide, title and body text together.</li> <li><strong>XLSX</strong> → header + row batches, per sheet, with a <code>[Sheet: name]</code> prefix.</li> </ul> <p>Tiny fragments get filtered: unstructured text needs at least <strong>100 characters</strong> to become a chunk, while the structured formats drop the bar to <strong>20</strong> — a two-line Markdown section or a header-only sheet is short but still meaningful. The recursive splitter is well-trodden territory, but the per-format strategies matter much more than people give them credit for.</p> <h2 id="what-id-do-differently">What I’d do differently</h2> <p>A few things I’d revisit if I were starting over:</p> <ul> <li><strong>Stop tokenising for BM25 with <code>str.split()</code>.</strong> It’s fine, but a real tokenizer that handles punctuation and German compounds would meaningfully improve recall on the legal docs.</li> <li><strong>Add a small reranker.</strong> RRF gets the right <em>set</em>, but a cross-encoder rerank on the top 20 would polish the <em>order</em>. Locally-served, of course — there are good small ones now.</li> <li><strong>Query expansion for thin queries.</strong> Two-word questions like “§3 exam” could be expanded via a quick <code>gemma4</code> call before retrieval. Latency cost, recall gain.</li> </ul> <p>None of those are in the box yet. RRF over FAISS + BM25 is already so much better than either alone that I haven’t felt the pull to optimise further.</p> <h2 id="the-takeaway">The takeaway</h2> <p>If your retrieval is “embed + cosine + top-k,” it will fail in exactly the way mine did — on the queries that contain literal identifiers your model has no embedding for. The fix isn’t a better embedding model. 
It’s a second retriever that doesn’t pretend everything is a concept.</p> <p>FAISS for ideas. BM25 for strings. RRF to decide which one was right today.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>RAG</strong></td> <td>Retrieval-Augmented Generation</td> <td>Retrieve relevant passages from your own documents first; let the model answer from them</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>Meta’s library for storing vectors and finding the most similar ones fast</td> </tr> <tr> <td><strong>BM25</strong></td> <td>Best Match 25</td> <td>A keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system</td> </tr> <tr> <td><strong>RRF</strong></td> <td>Reciprocal Rank Fusion</td> <td>Merges ranked lists using only ranks: each item scores <code>Σ 1/(k + rank)</code> across lists</td> </tr> <tr> <td><strong>TF-IDF</strong></td> <td>Term Frequency–Inverse Document Frequency</td> <td>BM25’s ancestor: score words by how often they appear here vs how rare they are everywhere</td> </tr> <tr> <td><strong>IP</strong> (in <code>IndexFlatIP</code>)</td> <td>Inner Product</td> <td>The similarity measure FAISS computes; on normalised vectors it equals cosine similarity</td> </tr> <tr> <td><strong>HNSW</strong></td> <td>Hierarchical Navigable Small World</td> <td>A popular <em>approximate</em> vector-index structure — deliberately not used here</td> </tr> <tr> <td><strong>IVF</strong></td> <td>Inverted File Index</td> <td>Another approximate FAISS index type — also deliberately not used</td> </tr> <tr> <td><strong>AEVO</strong></td> <td>Ausbildereignungsverordnung</td> <td>The German trainer-aptitude regulation whose “§3 Absatz 2” query broke pure dense retrieval</td> </tr> <tr> <td><strong>CSV / PPTX / XLSX</strong></td> 
<td>Comma-Separated Values / PowerPoint / Excel (Office Open XML)</td> <td>Structured formats with their own chunking strategies</td> </tr> <tr> <td><strong>H1/H2/H3</strong></td> <td>Heading levels 1–3</td> <td>The Markdown heading tiers used to split sections</td> </tr> </tbody> </table> <hr> <p><strong>Next up:</strong> how CogniVault’s <code>/rag</code> endpoint streams Gemma 4’s <em>thinking</em> before any tool calls run.</p> </article> <article> <h1>Part 1 · Why I Built a Local-First RAG</h1> <p>Mon, 20 Apr 2026 00:00:00 +0000</p> <blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"> <p>All abbreviations are fully explained in the appendix at the bottom of the page.</p> </blockquote> <p>I’ve spent the last few years in front of virtual classrooms full of career-changers in Germany, walking them through programming basics, web development, and introductory AI courses. Most of the information we deal with is fine to paste into cloud-based AI tools. Some of it really isn’t.</p> <p>Exam materials under confidentiality. A trainee’s portfolio with personal details. Other private documents that should never end up training someone else’s model.</p> <p>So I built CogniVault — a fully local AI study and productivity tool. No cloud. No telemetry. No “we may use this data to improve our service.” Just Gemma 4 running on Ollama, on my laptop, talking to my files.</p> <h2 id="the-leaky-abstraction">The leaky abstraction</h2> <p>The pitch for cloud AI is great: a giant model, available instantly, billed by the token. 
The fine print is where it gets uncomfortable:</p> <ul> <li>Where does the data physically live during inference?</li> <li>Whose jurisdiction governs that hardware this afternoon?</li> <li>Does the <em>audit trail</em> stop at the API boundary, or can you actually trace what happened to your bytes?</li> <li>When you tick “do not train on my data,” are you trusting a control, a contract, or both?</li> </ul> <p>For most consumer use cases, those questions are fine to wave away. For <strong>education, healthcare, finance, legal, public administration</strong> — the answer “trust us” isn’t an answer.</p> <h2 id="what-local-first-actually-means-here">What “local-first” actually means here</h2> <p>Lots of products say “private.” I wanted three concrete properties:</p> <ol> <li><strong>The model lives on your machine.</strong> Gemma 4 (<code>gemma4:e4b</code>) and <code>embeddinggemma</code> are pulled via Ollama. Inference is a localhost HTTP call.</li> <li><strong>Your documents never leave.</strong> Vectors, chunks, chat history, study sessions, achievements — all on disk on your computer.</li> <li><strong>You can <em>verify</em> it.</strong> Gemma CogniVault ships a <strong>Privacy Audit Panel</strong> that shows a live “zero external connections” indicator alongside document counts and the Ollama host. It’s not a promise — it’s a status light.</li> </ol> <p>If a future build of Gemma CogniVault ever made an outbound call, that panel would be the first thing to scream.</p> <h2 id="what-you-get-back">What you get back</h2> <p>Going local sounds like a trade-off — surely you lose the magic of the giant frontier models? In practice, with <strong>Gemma 4</strong> you get more than enough:</p> <ul> <li><strong>Thinking mode</strong> — Gemma 4’s chain-of-thought streams into a collapsible panel before the answer. 
Watching the model reason about your documents is genuinely useful as a teaching tool.</li> <li><strong>Tool use</strong> — through the Strands Agents SDK, the model decides when to search the knowledge base, summarise a document, compare two files, or check the time.</li> <li><strong>Vision</strong> — attach images and PDFs straight into a chat turn.</li> <li><strong>Generation that’s actually structured</strong> — quizzes, multi-lesson workshops, flashcard decks, and interactive mindmaps, generated with <code>format="json"</code> so the output parses reliably.</li> </ul> <p>CogniVault doesn’t try to be a giant ecosystem. It’s a single-purpose tool that does one thing well: use your own documents with a capable local model in a private environment. I must admit that it was inspired to a great extent by , which I’ve found incredibly useful but not private enough for my needs.</p> <h2 id="the-shape-of-the-app">The shape of the app</h2> <p>CogniVault is split into four sections that map to how I actually work with information on cloud-based AI tools:</p> <table> <thead> <tr> <th>Section</th> <th>What it’s for</th> </tr> </thead> <tbody> <tr> <td><strong>Chat</strong></td> <td>Ask anything about your documents. Cited answers, scope filter, voice in.</td> </tr> <tr> <td><strong>Knowledge Base</strong></td> <td>Upload, categorise, manage. SHA-256 detects edits on re-upload.</td> </tr> <tr> <td><strong>Study Hub</strong></td> <td>Quiz · Workshop · Flashcards · Mindmaps — four ways to drill into the source.</td> </tr> <tr> <td><strong>Dashboard</strong></td> <td>Total study time, streak, 25 badges, GitHub-style 90-day heatmap.</td> </tr> </tbody> </table> <p>Everything is reachable from a sidebar that remembers where you left off, on a stack that fits in your <code>~/Documents</code> folder.</p> <h2 id="what-comes-next">What comes next</h2> <p>This is the first in a short series. 
Over the next few posts I’ll dig into the parts I’m most proud of — and a few I’d build differently next time:</p> <ul> <li><strong>Hybrid retrieval</strong> — why FAISS <em>and</em> BM25, fused with Reciprocal Rank Fusion</li> <li><strong>Two-phase streaming</strong> with Gemma 4 and Strands Agents</li> <li><strong>Crash-resumable ingestion</strong> with DBOS, hash-aware re-ingest, OCR fallback</li> <li><strong>Getting reliable JSON</strong> out of a local LLM (and what to do when it fails)</li> <li><strong>The mindmap renderer</strong> — what hand-rolling SVG taught me, and why v2 uses React Flow</li> <li><strong>Gamifying learning</strong> — 25 badges, idle-gap sessions, 90-day heatmap</li> <li><strong>Testing a local-AI app</strong> with 350+ tests and zero infrastructure</li> </ul> <p>If you want to skip ahead, the code is open source at , and there’s a .</p> <p>Your data. Your hardware. Your AI. Your vault.</p> <hr> <h2 id="appendix-abbreviations-in-this-post">Appendix: Abbreviations in this post</h2> <table> <thead> <tr> <th>Abbreviation</th> <th>Full form</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><strong>RAG</strong></td> <td>Retrieval-Augmented Generation</td> <td>Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory</td> </tr> <tr> <td><strong>AI</strong></td> <td>Artificial Intelligence</td> <td>Software performing tasks that normally need human intelligence</td> </tr> <tr> <td><strong>LLM</strong></td> <td>Large Language Model</td> <td>A neural network trained on huge amounts of text that can read and generate language</td> </tr> <tr> <td><strong>HTTP</strong></td> <td>HyperText Transfer Protocol</td> <td>The protocol browsers and APIs use to exchange requests and responses</td> </tr> <tr> <td><strong>API</strong></td> <td>Application Programming Interface</td> <td>The boundary where you call someone else’s software — and where cloud audit trails stop</td> </tr> <tr> 
<td><strong>IHK</strong></td> <td>Industrie- und Handelskammer</td> <td>The German Chamber of Commerce and Industry, which administers trainer certification</td> </tr> <tr> <td><strong>AEVO</strong></td> <td>Ausbildereignungsverordnung</td> <td>The German trainer-aptitude regulation — the exam material that motivated this project</td> </tr> <tr> <td><strong>FAISS</strong></td> <td>Facebook AI Similarity Search</td> <td>Meta’s vector-search library (covered in the next post)</td> </tr> <tr> <td><strong>BM25</strong></td> <td>Best Match 25</td> <td>A classic keyword-ranking formula (also next post)</td> </tr> <tr> <td><strong>SDK</strong></td> <td>Software Development Kit</td> <td>A library of building blocks — here, Strands, which provides the agent loop</td> </tr> <tr> <td><strong>JSON</strong></td> <td>JavaScript Object Notation</td> <td>The universal text format for structured data</td> </tr> <tr> <td><strong>PDF</strong></td> <td>Portable Document Format</td> <td>One of the eight-plus file types CogniVault ingests</td> </tr> <tr> <td><strong>SHA-256</strong></td> <td>Secure Hash Algorithm, 256-bit</td> <td>A content fingerprint used to detect edited files on re-upload</td> </tr> <tr> <td><strong>OCR</strong></td> <td>Optical Character Recognition</td> <td>Turning pictures of text (scans) into machine-readable text</td> </tr> <tr> <td><strong>DBOS</strong></td> <td>Database-Oriented Operating System</td> <td>The durable-workflow library behind crash-resumable ingestion</td> </tr> <tr> <td><strong>SVG</strong></td> <td>Scalable Vector Graphics</td> <td>The browser’s built-in vector drawing format</td> </tr> </tbody> </table> </article> <article> <h1>Uses</h1> <p>Tue, 24 Oct 2023 00:00:00 +0000</p> <p>This page is a living document of the tools, technologies, and setup I use daily as a developer and trainer.</p> <h2 id="languages--frameworks">Languages & Frameworks</h2> <ul> <li>HTML, CSS, Vanilla JavaScript</li> <li>TypeScript</li> <li>ReactJS, NextJS, 
AngularJS</li> <li>React Native</li> <li>NodeJS, ExpressJS</li> <li>C#, .NET</li> </ul> <h2 id="databases">Databases</h2> <ul> <li>MongoDB, Firebase</li> <li>PostgreSQL, MySQL</li> </ul> <h2 id="devops--tooling">DevOps & Tooling</h2> <ul> <li>Git, GitHub, GitHub Actions</li> <li>Docker</li> <li>AWS (S3, DynamoDB, Amplify)</li> <li>CI/CD pipelines</li> </ul> <h2 id="styling">Styling</h2> <ul> <li>CSS, SCSS, TailwindCSS</li> </ul> <h2 id="editor--terminal">Editor + Terminal</h2> <ul> <li>VS Code — my editor of choice</li> <li>Chrome — main browser</li> <li>Integrated terminal in VS Code</li> </ul> <h2 id="website">Website</h2> <ul> <li>Built with and deployed on </li> </ul> </article> </main></body></html>