<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>FastAPI |</title><link>https://aretascodes.dev/tags/fastapi/</link><atom:link href="https://aretascodes.dev/tags/fastapi/index.xml" rel="self" type="application/rss+xml"/><description>FastAPI</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 12 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://aretascodes.dev/media/icon_hu_2ab4f4763b27c75b.png</url><title>FastAPI</title><link>https://aretascodes.dev/tags/fastapi/</link></image><item><title>CogniVault Backend Explained, Part 1 · Meet the Backend: Three Processes, Four Layers</title><link>https://aretascodes.dev/blog/backend-explained-meet-the-backend/</link><pubDate>Fri, 12 Jun 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/backend-explained-meet-the-backend/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When people first open the CogniVault repository, the question I hear most is some version of: &lt;em&gt;&amp;ldquo;Where do I even start?&amp;rdquo;&lt;/em&gt; There&amp;rsquo;s a RAG agent, a FAISS index, a DBOS workflow, an Ollama host — and if you&amp;rsquo;re transitioning into tech, every one of those words is a closed door.&lt;/p&gt;
&lt;p&gt;This series opens the doors one at a time. No prior RAG knowledge assumed, every abbreviation spelled out, and every claim checkable against the
. If you&amp;rsquo;ve already read my
, think of this series as the guided tour that should have come first.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s map this out.&lt;/p&gt;
&lt;h2 id="the-whole-app-is-three-processes"&gt;The whole app is three processes&lt;/h2&gt;
&lt;p&gt;CogniVault lets you chat with your own documents and turn them into quizzes, workshops, flashcards, and mindmaps — and nothing ever leaves your machine. (The &lt;em&gt;why&lt;/em&gt; behind that constraint is its own story:
.)&lt;/p&gt;
&lt;p&gt;You might expect an app like that to be a sprawl of microservices. It&amp;rsquo;s three processes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Process&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Python backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One FastAPI app on port 8000 — it also serves the compiled React frontend as static files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The local model server on port 11434, running the AI models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One Docker container, used &lt;em&gt;only&lt;/em&gt; for workflow checkpoints — never for your documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Everything else — your files, the search index, your chat history, your quiz scores — is a plain file on disk. That&amp;rsquo;s not laziness; it&amp;rsquo;s the privacy argument made physical. You can open every byte the app stores with a text editor and a SQLite browser.&lt;/p&gt;
&lt;h2 id="the-four-layers"&gt;The four layers&lt;/h2&gt;
&lt;p&gt;Before we name technologies, here&amp;rsquo;s the mental model I want you to keep for the whole series. The backend is four layers, top to bottom:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 1 — the web layer.&lt;/strong&gt; A FastAPI application receives every HTTP request and routes it to one of six routers: chat (&lt;code&gt;/rag&lt;/code&gt;), knowledge management (&lt;code&gt;/upload&lt;/code&gt;, &lt;code&gt;/ingest&lt;/code&gt;), study tools (&lt;code&gt;/api/study/*&lt;/code&gt;), progress (&lt;code&gt;/api/progress/*&lt;/code&gt;), voice (&lt;code&gt;/api/transcribe&lt;/code&gt;), and chat history (&lt;code&gt;/api/history&lt;/code&gt;). FastAPI (a modern Python web framework) also auto-generates interactive API documentation at &lt;code&gt;/api/docs&lt;/code&gt;, which is the best way to explore the backend without reading a line of code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 2 — the intelligence layer.&lt;/strong&gt; Two AI models with two different jobs. &lt;code&gt;gemma4:e4b&lt;/code&gt; &lt;em&gt;generates&lt;/em&gt;: chat answers, reasoning, image analysis, and tool calls. &lt;code&gt;embeddinggemma&lt;/code&gt; &lt;em&gt;embeds&lt;/em&gt;: it turns text into vectors (lists of numbers that capture meaning) so similar ideas can be found mathematically. Both run inside Ollama — think of Ollama as Docker, but for AI models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 3 — the retrieval layer.&lt;/strong&gt; A search engine over your documents that combines &lt;em&gt;semantic&lt;/em&gt; search (find things that mean the same) with &lt;em&gt;keyword&lt;/em&gt; search (find the exact string). Part 3 of this series is entirely about this layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 4 — the persistence layer.&lt;/strong&gt; Four storage systems, each picked for one job: a FAISS index plus a JSON file for searchable knowledge, SQLite for study data, PostgreSQL for workflow checkpoints, and plain JSON files for chat history.&lt;/p&gt;
&lt;h2 id="one-diagram-every-major-piece"&gt;One diagram, every major piece&lt;/h2&gt;
&lt;div class="mermaid"&gt;flowchart TB
subgraph CLIENT["Browser"]
UI["React Frontend&lt;br/&gt;(compiled, served by FastAPI)"]
end
subgraph SERVER["FastAPI Backend — port 8000"]
ROUTERS["6 Routers&lt;br/&gt;rag · knowledge · study ·&lt;br/&gt;progress · audio · history"]
AGENT["RAG Agent&lt;br/&gt;(Strands SDK, 6 tools)"]
VDB["VectorDB&lt;br/&gt;FAISS + BM25 + RRF"]
INGEST["Ingestion&lt;br/&gt;(DBOS durable workflow)"]
GEN["Study generators&lt;br/&gt;quiz · workshop · cards · mindmap"]
PROG["Progress tracker&lt;br/&gt;+ 25 achievements"]
end
subgraph OLLAMA["Ollama — port 11434"]
GEMMA["gemma4:e4b&lt;br/&gt;chat · thinking · vision · tools"]
EMBED["embeddinggemma&lt;br/&gt;text to vectors"]
end
subgraph STORAGE["Local storage"]
FAISSF["vector_store.faiss + .json"]
SQLITE["progress.db (SQLite)"]
PG["PostgreSQL&lt;br/&gt;workflow state only"]
DOCS["docs/ folder + chat_history.json"]
end
UI --&gt; ROUTERS
ROUTERS --&gt; AGENT --&gt; VDB
AGENT --&gt; GEMMA
VDB --&gt; EMBED
ROUTERS --&gt; INGEST --&gt; EMBED
INGEST --&gt; PG
INGEST --&gt; FAISSF
VDB --- FAISSF
ROUTERS --&gt; GEN --&gt; GEMMA
GEN --&gt; SQLITE
ROUTERS --&gt; PROG --&gt; SQLITE
ROUTERS --&gt; DOCS
&lt;/div&gt;
&lt;p&gt;Keep this picture handy — Parts 2, 3, and 4 each zoom into one region of it.&lt;/p&gt;
&lt;h2 id="the-tech-stack-and-why-each-piece-earned-its-place"&gt;The tech stack, and why each piece earned its place&lt;/h2&gt;
&lt;p&gt;The full dependency list lives in &lt;code&gt;requirements.txt&lt;/code&gt;. Here&amp;rsquo;s what matters, grouped by job:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Serving requests.&lt;/strong&gt; FastAPI defines the endpoints and validates every request and response with Pydantic (a data-validation library — think of it as a strict customs officer for JSON). Uvicorn is the ASGI server (Asynchronous Server Gateway Interface — the Python standard that lets one process juggle many simultaneous requests) that actually runs it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thinking.&lt;/strong&gt; Ollama serves &lt;code&gt;gemma4:e4b&lt;/code&gt; — the &lt;code&gt;e4b&lt;/code&gt; tag is the roughly four-billion effective-parameter variant, about a 9.6 GB download — and &lt;code&gt;embeddinggemma&lt;/code&gt; (about 622 MB). The agent behaviour is built with the Strands Agents SDK, which wraps the model in a loop where it can call tools, read the results, and only then answer. (Where I run Ollama relative to Docker is a deliberate choice with a story behind it:
.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Finding things.&lt;/strong&gt; FAISS (Facebook AI Similarity Search — Meta&amp;rsquo;s vector search library) handles semantic lookups; &lt;code&gt;rank-bm25&lt;/code&gt; handles keyword lookups; a formula called Reciprocal Rank Fusion merges the two. Part 3 unpacks all of this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reading documents.&lt;/strong&gt; &lt;code&gt;pypdf&lt;/code&gt; for PDFs, with an OCR fallback (Optical Character Recognition — turning pictures of text into actual text) for scanned pages via &lt;code&gt;pymupdf&lt;/code&gt; and Tesseract. Word, PowerPoint, and Excel each get their own extractor. &lt;code&gt;trafilatura&lt;/code&gt; pulls clean article text out of web pages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Not losing work.&lt;/strong&gt; DBOS makes the ingestion pipeline durable — every step is checkpointed in PostgreSQL so a crash resumes instead of restarting. Part 2 shows this in action.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Remembering.&lt;/strong&gt; SQLite — a complete database engine that lives in a single file, &lt;code&gt;progress.db&lt;/code&gt; — holds your study sessions, achievements, quizzes, workshops, flashcard decks, and mindmaps.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;p&gt;This series&amp;rsquo; promise is &amp;ldquo;no unexplained abbreviations,&amp;rdquo; so here is the table I wish every technical tutorial shipped with.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Plain-English meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large Language Model&lt;/td&gt;
&lt;td&gt;A neural network trained on huge amounts of text that can read and generate language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;Fetch relevant passages from &lt;em&gt;your&lt;/em&gt; documents first, then let the model answer from them — instead of from its training memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The set of URLs the frontend calls to talk to the backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ASGI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Asynchronous Server Gateway Interface&lt;/td&gt;
&lt;td&gt;The Python standard that lets the server handle many requests concurrently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The universal text format for structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NDJSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Newline-Delimited JSON&lt;/td&gt;
&lt;td&gt;A stream where each line is its own JSON object — ideal for streaming AI answers chunk by chunk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;Meta&amp;rsquo;s library for storing vectors and finding the most similar ones fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;A classic keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;A formula for merging multiple ranked result lists using only the ranks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ANN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Approximate Nearest Neighbour&lt;/td&gt;
&lt;td&gt;A speed shortcut many vector databases take. CogniVault deliberately uses an &lt;em&gt;exact&lt;/em&gt; index instead — precise, and plenty fast at personal-library scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database-Oriented Operating System (the research project it grew from)&lt;/td&gt;
&lt;td&gt;A library that checkpoints workflow steps in a database so crashed jobs resume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQL / SQLite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured Query Language / SQLite&lt;/td&gt;
&lt;td&gt;The language of relational databases / a tiny database that lives in one file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optical Character Recognition&lt;/td&gt;
&lt;td&gt;Turning pictures of text (scans) into machine-readable text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SHA-256&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure Hash Algorithm, 256-bit&lt;/td&gt;
&lt;td&gt;A fingerprint function — any file maps to a unique hash, used to detect changed files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CORS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-Origin Resource Sharing&lt;/td&gt;
&lt;td&gt;Browser rules controlling which websites may call the API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-Side Request Forgery&lt;/td&gt;
&lt;td&gt;An attack where a server is tricked into fetching internal URLs — the URL-import endpoint guards against it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCQ&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple-Choice Question&lt;/td&gt;
&lt;td&gt;One of the two quiz question types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Knowledge Base&lt;/td&gt;
&lt;td&gt;All your ingested, searchable documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;(Every claim in this series can be checked directly against the
— the relevant file is named whenever it matters, and the repository README maps the full architecture.)&lt;/p&gt;
&lt;h2 id="the-takeaway"&gt;The takeaway&lt;/h2&gt;
&lt;p&gt;Strip away the abbreviations and CogniVault is a small system: one web server, one model runtime, one durability database, and a handful of files. The sophistication isn&amp;rsquo;t in the part count — it&amp;rsquo;s in how a few well-chosen pieces cooperate. That cooperation is what the next three parts are about.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt;
— how a 1,000-page scanned PDF becomes something the AI can search in seconds, and why the pipeline survives a crash at page 800.&lt;/p&gt;</description></item><item><title>Part 8 · Testing a Local-AI App: 351 Tests, Zero Infrastructure</title><link>https://aretascodes.dev/blog/testing-local-ai-app/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/testing-local-ai-app/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building
. Previously:
.
All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CogniVault has &lt;strong&gt;351 tests across 22 files&lt;/strong&gt; (at the time of writing — the suite grows with the app). None of them need Ollama. None of them need Postgres. None of them need a real PDF, a microphone, or an internet connection. The whole suite runs in &lt;strong&gt;about three seconds&lt;/strong&gt; on my laptop.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s not because there isn&amp;rsquo;t much to test — the surface is wide. It&amp;rsquo;s because the test suite is built around one principle: &lt;strong&gt;mock at the edge, real everywhere else.&lt;/strong&gt; This post is about what &amp;ldquo;the edge&amp;rdquo; means in a local-AI app, and how to draw the line so the suite stays useful instead of decorative.&lt;/p&gt;
&lt;h2 id="the-22-test-files"&gt;The 22 test files&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;What it covers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_api.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The HTTP endpoints (upload, ingest, RAG, history, KB browsing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_tools.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Calculator, clock, KB search tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_thinking.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Two-phase stream, thinking tokens, session isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_chat_attachments.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-file attach, PDF/DOCX extraction, size limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_chat_memory.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Session history budget, trimming, restart rebuild&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_doc_scope_filter.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-request ContextVar isolation, search filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_doc_tools.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;list_documents&lt;/code&gt;, &lt;code&gt;analyze_document&lt;/code&gt;, &lt;code&gt;compare_documents&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_edit_regenerate.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;History rewind, trim_history_to_turns validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_structure_chunking.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Markdown header splits, CSV row batches, doc types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_ocr_fallback.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OCR trigger threshold, graceful degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_new_formats.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PPTX, XLSX, HTML extractors, extension routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_docx_url.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DOCX ingestion and URL import (with the SSRF guard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_reingest.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SHA-256 change detection, idempotency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_vector_db.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BM25, FAISS, RRF fusion, hybrid search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_audio.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whisper transcription endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_progress.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sessions, daily aggregation, achievement criteria&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_prompts.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The prompt-template loader and custom overrides&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_vault_stats.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The Privacy Vault Audit numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_quiz.py&lt;/code&gt; / &lt;code&gt;test_workshop.py&lt;/code&gt; / &lt;code&gt;test_flashcards.py&lt;/code&gt; / &lt;code&gt;test_mindmaps.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-mode parsing, endpoints, achievements&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Everything that &lt;em&gt;can&lt;/em&gt; be tested in isolation is tested in isolation. Everything that needs to be tested through the FastAPI layer is, but the &lt;em&gt;only&lt;/em&gt; things mocked are the calls that cross the process boundary.&lt;/p&gt;
&lt;h2 id="what-gets-mocked-what-doesnt"&gt;What gets mocked, what doesn&amp;rsquo;t&lt;/h2&gt;
&lt;p&gt;The single most important question in a project like this: where do you stub?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;React&lt;/span&gt; &lt;span class="n"&gt;frontend&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt; &lt;span class="n"&gt;handlers&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;tested&lt;/span&gt; &lt;span class="n"&gt;directly&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;tested&lt;/span&gt; &lt;span class="n"&gt;directly&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rag_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generators&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;├─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;BM25&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;real&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fast&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;├─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;SQLite&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;real&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;against&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;tmp_path&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;├─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;DBOS&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;patched&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="n"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="n"&gt;Postgres&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;├─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;Ollama&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;patched&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;s import site&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="err"&gt;└─►&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;Whisper&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;←─&lt;/span&gt; &lt;span class="n"&gt;stubbed&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="mi"&gt;145&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The rule of thumb: &lt;strong&gt;anything that crosses a process or network boundary, mock. Anything in-process, run for real.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;FAISS and BM25 are real because they&amp;rsquo;re libraries we link into the test process. SQLite is real because it&amp;rsquo;s a file. DBOS is patched because launching it expects a Postgres connection, and that&amp;rsquo;s network. Ollama is patched because it&amp;rsquo;s HTTP. Whisper is stubbed because loading a 145 MB model in a unit test is silly.&lt;/p&gt;
&lt;p&gt;That principle keeps the test suite fast (no I/O the OS can&amp;rsquo;t handle in milliseconds) and meaningful (the real code paths through retrieval, chunking, parsing, scope filtering all execute).&lt;/p&gt;
&lt;h2 id="mocking-ollama"&gt;Mocking Ollama&lt;/h2&gt;
&lt;p&gt;Most CogniVault tests need &lt;em&gt;some&lt;/em&gt; model output, but they don&amp;rsquo;t care what model produced it. Each service imports the &lt;code&gt;ollama&lt;/code&gt; module directly, so the tests patch that reference &lt;strong&gt;at the service&amp;rsquo;s own import site&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Real pattern from test_quiz.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;unittest.mock&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;backend.services&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;quiz_generator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_quiz_parses_questions&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;message&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;questions&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;VALID_MCQ&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quiz_generator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;ollama&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mock_ollama&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mock_ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quiz_generator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate_quiz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;difficulty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;beginner&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_questions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;mcq&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;questions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A streaming variant feeds chunk sequences instead of a single response, used by the RAG and thinking tests. The key property: one &lt;code&gt;patch.object&lt;/code&gt; against the module the service actually uses. No deep mock hierarchies, no fragile string paths into third-party internals. Easy to read in a code review, easy to debug when a test fails.&lt;/p&gt;
&lt;h2 id="mocking-dbos"&gt;Mocking DBOS&lt;/h2&gt;
&lt;p&gt;DBOS expects &lt;code&gt;launch()&lt;/code&gt; to connect to Postgres. The shared &lt;code&gt;client&lt;/code&gt; fixture in &lt;code&gt;conftest.py&lt;/code&gt; simply patches the &lt;code&gt;dbos&lt;/code&gt; instance before the app is exercised:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Real pattern from conftest.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&amp;#34;A FastAPI TestClient with DBOS launch mocked out — no Postgres needed.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;backend.services.ingest.dbos&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mock_dbos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;mock_dbos&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;launch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MagicMock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;backend.main&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The decorated workflow steps still execute as ordinary Python functions — we lose the durability semantics, but the tests aren&amp;rsquo;t testing durability, they&amp;rsquo;re testing the &lt;em&gt;business logic inside the steps&lt;/em&gt; (hash detection, extraction, chunking). The durability layer has its own tests upstream, in DBOS&amp;rsquo;s own suite.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a second isolation layer that runs on &lt;strong&gt;every&lt;/strong&gt; test automatically: an autouse fixture points the docs folder, FAISS index, and metadata file at a per-test &lt;code&gt;tmp_path&lt;/code&gt; via environment variables, so no test can ever touch real data on disk.&lt;/p&gt;
&lt;h2 id="real-sqlite-with-one-override"&gt;Real SQLite, with one override&lt;/h2&gt;
&lt;p&gt;Progress tracking, achievements, quiz storage, deck CRUD — all SQLite. The progress tracker exposes a single test seam: a module-level path override.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Real pattern from test_quiz.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autouse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_isolate_progress_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;progress_tracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;_db_path_override&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp_path&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;progress_test.db&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Every test gets a fresh database file; the schema auto-creates on first use. No connection pooling drama, no leaked state between tests, no in-memory &lt;code&gt;:memory:&lt;/code&gt; gymnastics. Just a temp file per test.&lt;/p&gt;
&lt;p&gt;This is the kind of test that catches bugs an SQL-level mock would never see — a missing index, a botched migration, a constraint that doesn&amp;rsquo;t fire. SQLite is fast enough on every machine I&amp;rsquo;ve ever owned that &amp;ldquo;use the real database&amp;rdquo; isn&amp;rsquo;t even a trade-off.&lt;/p&gt;
&lt;h2 id="the-testclient-pattern"&gt;The TestClient pattern&lt;/h2&gt;
&lt;p&gt;For HTTP tests, FastAPI&amp;rsquo;s &lt;code&gt;TestClient&lt;/code&gt; runs the app in-process. The upload, the validation, the chunking, the vector-store update, the response serialisation — every layer runs for real. Only the calls that would leave the process (the Ollama embedding call inside ingestion, the model call inside generation) are patched. That&amp;rsquo;s the right line: the test verifies the &lt;em&gt;integration&lt;/em&gt; of those layers, but doesn&amp;rsquo;t depend on an external service.&lt;/p&gt;
&lt;p&gt;The streaming endpoint tests use a slightly different style — they iterate the response body and parse each &lt;strong&gt;NDJSON&lt;/strong&gt; line (one JSON envelope per line, as described in
) — but the principle is identical.&lt;/p&gt;
&lt;h2 id="coverage-gaps-i-accept"&gt;Coverage gaps I accept&lt;/h2&gt;
&lt;p&gt;Three things the test suite &lt;em&gt;doesn&amp;rsquo;t&lt;/em&gt; cover:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The frontend.&lt;/strong&gt; No React testing in this suite — that&amp;rsquo;s a separate concern. Most failures show up in API tests anyway, because the frontend is a thin client over a typed API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real Ollama prompt quality.&lt;/strong&gt; Whether &lt;code&gt;gemma4:e4b&lt;/code&gt; actually produces &lt;em&gt;useful&lt;/em&gt; quiz questions is not a thing tests can answer. That&amp;rsquo;s evaluation, not testing. It belongs in a separate harness with a real model running.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Race conditions across DBOS workflow restarts.&lt;/strong&gt; The resume path is exercised at the logic level, but the full state space of &amp;ldquo;what happens if Postgres goes away at this exact instant&amp;rdquo; is too large to enumerate.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These are conscious gaps. The test suite is for catching regressions in code I wrote; it&amp;rsquo;s not a replacement for evaluation, integration testing, or actual chaos engineering.&lt;/p&gt;
&lt;h2 id="what-the-suite-is-actually-for"&gt;What the suite is actually for&lt;/h2&gt;
&lt;p&gt;Two things, in order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Refactor confidence.&lt;/strong&gt; When I rip out the agent loop and put a new one in, do the tests still pass? If yes, the API contracts I care about haven&amp;rsquo;t drifted.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PR review surface.&lt;/strong&gt; Every PR runs the suite in CI. A green run is a precondition for merge. The suite is loud enough that a real regression makes the noise.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Notice what it &lt;em&gt;isn&amp;rsquo;t&lt;/em&gt; for: proving the model works. It can&amp;rsquo;t. Tests can pin behaviour but they can&amp;rsquo;t pin quality. That&amp;rsquo;s a different muscle, and it belongs in a different harness.&lt;/p&gt;
&lt;h2 id="whats-worth-borrowing"&gt;What&amp;rsquo;s worth borrowing&lt;/h2&gt;
&lt;p&gt;If you&amp;rsquo;re building a local-AI app and your tests need Ollama running:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Patch the &lt;code&gt;ollama&lt;/code&gt; module at each service&amp;rsquo;s import site with &lt;code&gt;patch.object(service_module, &amp;quot;ollama&amp;quot;)&lt;/code&gt; — one seam per service, no shims required.&lt;/li&gt;
&lt;li&gt;Give your DB layer a path override and run against a &lt;code&gt;tmp_path&lt;/code&gt; SQLite file.&lt;/li&gt;
&lt;li&gt;Use an autouse fixture to redirect every on-disk artefact (docs folder, index files) to &lt;code&gt;tmp_path&lt;/code&gt;, so no test can touch real data even by accident.&lt;/li&gt;
&lt;li&gt;For each external service (model, audio, workflow engine), draw the seam at the process boundary. Test everything above it with real code.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is a suite where every test runs in any environment, finishes in milliseconds, and exercises the actual integration of every layer of code you wrote. 351 tests in about three seconds isn&amp;rsquo;t an optimisation, it&amp;rsquo;s a side-effect of mocking only at the edges.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continuous Integration&lt;/td&gt;
&lt;td&gt;Automatically running the test suite on every push/PR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pull Request&lt;/td&gt;
&lt;td&gt;A proposed code change — merged only when the suite is green&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The HTTP surface the TestClient exercises in-process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HyperText Transfer Protocol&lt;/td&gt;
&lt;td&gt;The protocol the (in-process) endpoint tests speak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;The retrieval-then-answer pipeline under test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Knowledge Base&lt;/td&gt;
&lt;td&gt;The indexed document collection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;Real in tests — it&amp;rsquo;s an in-process library&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;The keyword index — also real in tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;The rank-merging formula covered by &lt;code&gt;test_vector_db.py&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQLite / SQL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;(SQL = Structured Query Language)&lt;/td&gt;
&lt;td&gt;The real, file-based database every progress test runs against&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database-Oriented Operating System&lt;/td&gt;
&lt;td&gt;The durable-workflow library — patched so no Postgres is needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optical Character Recognition&lt;/td&gt;
&lt;td&gt;The scanned-PDF fallback with its own trigger-threshold tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-Side Request Forgery&lt;/td&gt;
&lt;td&gt;The URL-import attack class covered in &lt;code&gt;test_docx_url.py&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NDJSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Newline-Delimited JSON&lt;/td&gt;
&lt;td&gt;The streaming format the endpoint tests parse line by line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SHA-256&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure Hash Algorithm, 256-bit&lt;/td&gt;
&lt;td&gt;The content fingerprint behind the re-ingest tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CRUD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create, Read, Update, Delete&lt;/td&gt;
&lt;td&gt;The basic storage operations for decks, quizzes, and maps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF / DOCX / PPTX / XLSX / HTML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Portable Document Format / Word / PowerPoint / Excel / HyperText Markup Language&lt;/td&gt;
&lt;td&gt;The extractor formats with dedicated tests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;That&amp;rsquo;s the series. Eight posts on the parts of
I&amp;rsquo;m most proud of — and a handful I&amp;rsquo;d build differently. If any of it was useful to you, the code is open source at
, and the
is on YouTube.&lt;/p&gt;
&lt;p&gt;Your data. Your hardware. Your AI. Your vault.&lt;/p&gt;</description></item><item><title>Part 3 · Two-Phase Streaming: Showing the Model Think Before It Acts</title><link>https://aretascodes.dev/blog/two-phase-streaming-strands-agents/</link><pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/two-phase-streaming-strands-agents/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building
. Previously:
.
All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When I first wired up Gemma 4 with
inside CogniVault, the chat felt slow. Not laggy — slow in a way that&amp;rsquo;s worse than laggy. The user types a question. The cursor sits there. Then, eventually, an answer drops out of the void.&lt;/p&gt;
&lt;p&gt;The model wasn&amp;rsquo;t idle. It was &lt;em&gt;thinking&lt;/em&gt;. Gemma 4 has a chain-of-thought mode that produces a (sometimes long) reasoning trace before its final reply. With a single-phase agent stream, all of that thinking is happening &lt;em&gt;inside the agent loop&lt;/em&gt; — silently — before any tool calls run, before any tokens get emitted to the UI.&lt;/p&gt;
&lt;p&gt;So I split the call into two phases.&lt;/p&gt;
&lt;h2 id="the-shape"&gt;The shape&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;POST /rag
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ├── Phase 1 — Direct Ollama call, thinking enabled
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ stream: {&amp;#34;type&amp;#34;:&amp;#34;thinking&amp;#34;,&amp;#34;data&amp;#34;:&amp;#34;...&amp;#34;} (reasoning tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── Phase 2 — Strands Agent (thinking disabled)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;metadata&amp;#34;,&amp;#34;data&amp;#34;:{...}} (citations, as soon as search runs)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;text&amp;#34;,&amp;#34;data&amp;#34;:&amp;#34;...&amp;#34;} (answer tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;memory&amp;#34;,&amp;#34;data&amp;#34;:{...}} (end-of-stream: session memory usage)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The endpoint streams &lt;strong&gt;newline-delimited JSON&lt;/strong&gt; (NDJSON): each line of the response body is one self-contained JSON envelope with a &lt;code&gt;type&lt;/code&gt; and a &lt;code&gt;data&lt;/code&gt;. The frontend dispatches on &lt;code&gt;type&lt;/code&gt; and renders accordingly: a &lt;strong&gt;collapsible reasoning panel&lt;/strong&gt; for the thinking tokens, the main message bubble for the text tokens, a sidebar card per citation.&lt;/p&gt;
&lt;p&gt;The user sees the model start thinking &lt;em&gt;immediately&lt;/em&gt;. Latency to first byte drops from &amp;ldquo;long enough to wonder if it crashed&amp;rdquo; to &amp;ldquo;instant.&amp;rdquo; Total time to final answer doesn&amp;rsquo;t change. Perceived speed does.&lt;/p&gt;
&lt;h2 id="phase-1--thinking-only"&gt;Phase 1 — Thinking only&lt;/h2&gt;
&lt;p&gt;Phase 1 is a single direct call to Ollama with thinking enabled. It gets exactly what Phase 2 will see — the same system prompt, the current question, and any attached images — so the reasoning reflects reality. Only the &lt;em&gt;reasoning&lt;/em&gt; tokens are consumed; whatever answer text Phase 1 starts to produce is discarded, because we don&amp;rsquo;t want a half-formed answer competing with the real one.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified from backend/services/rag_agent.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;system&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;images&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Phase 1 is deliberately &lt;strong&gt;best-effort&lt;/strong&gt;: any failure here is swallowed and logged, and the stream moves straight on to Phase 2. A broken reasoning panel should never cost the user their answer.&lt;/p&gt;
&lt;h2 id="phase-2--agent-with-tools"&gt;Phase 2 — Agent with tools&lt;/h2&gt;
&lt;p&gt;Phase 2 builds a &lt;strong&gt;fresh Strands &lt;code&gt;Agent&lt;/code&gt; per request&lt;/strong&gt; — no shared mutable state between concurrent chats — restores the session&amp;rsquo;s conversation history into it, and runs the tool loop with six tools registered:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_knowledge_base(query)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hybrid FAISS + BM25 search, top-7, RRF fusion. Scope-filter-aware.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_documents()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inventory of every indexed file with type and chunk count.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyze_document(filename)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inner Gemma call → structured summary (topics, entities, key facts).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compare_documents(doc_a, doc_b, question)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inner Gemma call answering across two documents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;calculator(expression)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe AST evaluator — no &lt;code&gt;eval()&lt;/code&gt;, no arbitrary code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;current_time()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Timestamp for time-aware queries.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The agent decides which tools to call and in what order. There&amp;rsquo;s no hard-coded router; the system prompt explains what&amp;rsquo;s available and Strands handles the loop. For most document questions the path is: &lt;code&gt;search_knowledge_base&lt;/code&gt; → answer. For comparisons: &lt;code&gt;compare_documents&lt;/code&gt; → answer. For &amp;ldquo;what files do I have?&amp;rdquo;: &lt;code&gt;list_documents&lt;/code&gt; → answer. For greetings and arithmetic, the system prompt tells the agent it may skip search entirely. The model picks.&lt;/p&gt;
&lt;p&gt;Two details that took debugging to get right:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 runs with thinking explicitly disabled.&lt;/strong&gt; Without that flag, Gemma&amp;rsquo;s default behaviour can leak &lt;code&gt;&amp;lt;think&amp;gt;…&amp;lt;/think&amp;gt;&lt;/code&gt; tags into the visible answer, and everything before the closing tag gets swallowed by the Markdown renderer. One model option — &lt;code&gt;options={&amp;quot;thinking&amp;quot;: False}&lt;/code&gt; — fixed a &amp;ldquo;truncated responses&amp;rdquo; bug that looked much scarier than it was.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Citations are flushed before the first answer token.&lt;/strong&gt; Tools run before text deltas arrive, so by the time the first visible token streams, every source the search found is already in the sidebar. The accumulator is a request-local &lt;code&gt;ContextVar&lt;/code&gt; the search tool appends to.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified — the real loop reads Strands&amp;#39; raw event dicts&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;event&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;contentBlockDelta&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;delta&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_citations&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="c1"&gt;# drain the ContextVar accumulator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="why-this-matters-more-than-it-sounds"&gt;Why this matters more than it sounds&lt;/h2&gt;
&lt;p&gt;You could implement similar behaviour with one agent call that interleaves &lt;code&gt;thinking&lt;/code&gt; events with &lt;code&gt;text&lt;/code&gt; events. The reasons I split it anyway:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The thinking model and the tool model can be different.&lt;/strong&gt; Right now they&amp;rsquo;re both &lt;code&gt;gemma4:e4b&lt;/code&gt;, but the architecture lets me swap a smaller, faster model in for Phase 1 reasoning and keep the big one for Phase 2 tool use. I&amp;rsquo;m not doing that yet — but I want the option.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Phase 1 always streams immediately.&lt;/strong&gt; A pure agent loop only starts producing tokens after the model has decided what to say. Two-phase guarantees the user sees activity almost as soon as they press Enter, regardless of how complex the Phase 2 tool work gets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Failures isolate.&lt;/strong&gt; If Phase 2 falls over (Ollama timeout, tool error), Phase 1&amp;rsquo;s reasoning is still visible — the user can see &lt;em&gt;what the model was trying to do&lt;/em&gt;, which makes the error far less frustrating than a blank &amp;ldquo;something went wrong.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="contextvar-isolation-again"&gt;ContextVar isolation, again&lt;/h2&gt;
&lt;p&gt;The same &lt;code&gt;ContextVar&lt;/code&gt; trick that scopes retrieval in
carries here. At the start of each &lt;code&gt;/rag&lt;/code&gt; stream, the handler sets two request-local variables: the &lt;strong&gt;document-scope filter&lt;/strong&gt; and the &lt;strong&gt;citation accumulator&lt;/strong&gt;. The agent&amp;rsquo;s tools read and write them implicitly. Conversation history itself lives in a per-session store guarded by per-session &lt;code&gt;asyncio&lt;/code&gt; locks, so two concurrent requests in the same chat can&amp;rsquo;t corrupt each other either.&lt;/p&gt;
&lt;p&gt;Tested with two browser tabs open on the same backend, scoped to different document categories, sending overlapping queries simultaneously. Zero cross-contamination. The test suite covers this explicitly in &lt;code&gt;test_thinking.py&lt;/code&gt; and &lt;code&gt;test_doc_scope_filter.py&lt;/code&gt; — see
for the broader story.&lt;/p&gt;
&lt;h2 id="the-frontend-side-of-the-contract"&gt;The frontend side of the contract&lt;/h2&gt;
&lt;p&gt;A detail that tripped me up: this is a &lt;code&gt;POST&lt;/code&gt; endpoint, so the browser&amp;rsquo;s &lt;code&gt;EventSource&lt;/code&gt; API (which only does GET) is out. The frontend uses &lt;code&gt;fetch&lt;/code&gt; and reads the response body incrementally, splitting on newlines and parsing each line as JSON:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-tsx" data-lang="tsx"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// Simplified from useRagStream.ts
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;/rag&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;POST&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;body&lt;/span&gt;: &lt;span class="kt"&gt;JSON.stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getReader&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;TextDecoder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;read&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;: &lt;span class="kt"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;\n&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// keep the trailing partial line
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;line&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kr"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;appendThinking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;appendText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;addCitation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;memory&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;updateMemoryMeter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The reasoning panel starts &lt;strong&gt;collapsed&lt;/strong&gt;, with a small pulsing indicator while thinking tokens are still streaming — enough to signal &amp;ldquo;the model is working&amp;rdquo; without shoving a wall of chain-of-thought at the user. One click expands the full trace, during or after the stream.&lt;/p&gt;
&lt;h2 id="what-id-revisit"&gt;What I&amp;rsquo;d revisit&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 reasons toward a full answer, and we throw the answer part away.&lt;/strong&gt; A dedicated &amp;ldquo;plan your approach, don&amp;rsquo;t answer yet&amp;rdquo; prompt for Phase 1 would make the reasoning trace tighter and cheaper. Today it shares the main system prompt — simpler, but the trace can ramble.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No interrupt yet.&lt;/strong&gt; Once Phase 1 starts, it runs to completion. If the user types a follow-up mid-stream we let it finish. A real cancel button would mean wiring an abort signal through Ollama&amp;rsquo;s HTTP client — feasible, not yet done.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 occasionally over-thinks.&lt;/strong&gt; Greetings and trivial questions still produce a paragraph of reasoning. A &amp;ldquo;should I think?&amp;rdquo; gate (probably a tiny classifier or even a heuristic on query length) would skip Phase 1 entirely for those cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="takeaway"&gt;Takeaway&lt;/h2&gt;
&lt;p&gt;Streaming is &lt;em&gt;not&lt;/em&gt; just an optimisation. It&amp;rsquo;s a UX primitive. Two-phase streaming buys you a free property: the &lt;em&gt;visible&lt;/em&gt; part of the interaction starts before the &lt;em&gt;slow&lt;/em&gt; part does. The user gets to watch the model think, which is — genuinely — more interesting than watching a spinner.&lt;/p&gt;
&lt;p&gt;If your agent app feels slow even though the answers are fast, look at &lt;em&gt;when&lt;/em&gt; tokens start flowing. The fix often isn&amp;rsquo;t a faster model.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NDJSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Newline-Delimited JSON&lt;/td&gt;
&lt;td&gt;A stream where each line is its own complete JSON object — what &lt;code&gt;/rag&lt;/code&gt; emits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The universal text format for structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User Experience&lt;/td&gt;
&lt;td&gt;How the product feels to use — the real beneficiary of two-phase streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User Interface&lt;/td&gt;
&lt;td&gt;The visible surface the stream renders into&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;The dense half of hybrid retrieval (previous post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;The keyword half of hybrid retrieval (previous post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;The rank-only formula that merges the two result lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree&lt;/td&gt;
&lt;td&gt;The parsed form of an expression — how the calculator evaluates maths without &lt;code&gt;eval()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HyperText Transfer Protocol&lt;/td&gt;
&lt;td&gt;The protocol carrying the stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-Sent Events&lt;/td&gt;
&lt;td&gt;The browser&amp;rsquo;s built-in GET-only streaming format — notably &lt;em&gt;not&lt;/em&gt; usable here, because &lt;code&gt;/rag&lt;/code&gt; is a POST&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The boundary the frontend calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt;
— how CogniVault re-ingests edited PDFs without re-embedding everything, and survives a kill -9 mid-pipeline.&lt;/p&gt;</description></item></channel></rss>