<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Agents |</title><link>https://aretascodes.dev/tags/agents/</link><atom:link href="https://aretascodes.dev/tags/agents/index.xml" rel="self" type="application/rss+xml"/><description>Agents</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 12 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://aretascodes.dev/media/icon_hu_2ab4f4763b27c75b.png</url><title>Agents</title><link>https://aretascodes.dev/tags/agents/</link></image><item><title>CogniVault Backend Explained, Part 3 · How a Question Becomes a Cited Answer</title><link>https://aretascodes.dev/blog/backend-explained-rag-agent/</link><pubDate>Fri, 12 Jun 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/backend-explained-rag-agent/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You type a question. A few seconds later you get an answer with footnotes — the exact documents and pages it came from. This part walks through everything that happens in between.&lt;/p&gt;
&lt;p&gt;In
we built the knowledge base: every document chunked, embedded, and indexed. Now we get to &lt;em&gt;use&lt;/em&gt; it — and this is where CogniVault stops being a pipeline and starts being interesting.&lt;/p&gt;
&lt;h2 id="two-librarians-because-one-keeps-failing-you"&gt;Two librarians, because one keeps failing you&lt;/h2&gt;
&lt;p&gt;Imagine a library with one librarian who organises everything by &lt;em&gt;vibe&lt;/em&gt;. Ask her about &amp;ldquo;server downtime procedures&amp;rdquo; and she&amp;rsquo;s brilliant — she understands what you mean and finds documents that discuss the concept, whatever words they use. But ask her for &amp;ldquo;Error Code 404B&amp;rdquo; and she shrugs, handing you general networking guides. She doesn&amp;rsquo;t do exact strings.&lt;/p&gt;
&lt;p&gt;Down the hall is a second librarian with a card catalogue. He finds the exact string &amp;ldquo;404B&amp;rdquo; instantly — but ask him a conceptual question phrased differently from the source text, and he finds nothing at all.&lt;/p&gt;
&lt;p&gt;These are the two halves of search:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Semantic search (FAISS)&lt;/strong&gt; — your question is embedded into a vector, and the index finds chunks whose vectors point the same way (technically: cosine similarity — how closely two arrows align). Great for meaning, blind to exact identifiers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keyword search (BM25)&lt;/strong&gt; — a scoring formula that rewards chunks containing your &lt;em&gt;exact&lt;/em&gt; words, weighted by how distinctive those words are. Great for identifiers, blind to synonyms.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;CogniVault asks &lt;strong&gt;both librarians every time&lt;/strong&gt;, then merges their answers with &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt; — a formula that combines ranked lists using only the &lt;em&gt;positions&lt;/em&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;score(chunk) = sum over both lists of 1 / (60 + rank)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A chunk ranked highly by either librarian scores well; a chunk both of them liked floats to the top. The elegance is what&amp;rsquo;s &lt;em&gt;missing&lt;/em&gt;: you never have to reconcile FAISS&amp;rsquo;s similarity scores with BM25&amp;rsquo;s completely different scale, because ranks are the only input. The constant 60 comes straight from the original 2009 research paper, and yes, it&amp;rsquo;s cited in the code.&lt;/p&gt;
&lt;p&gt;A few implementation details worth knowing: both searches deliberately over-fetch (at least 20 candidates each) so the fusion has material to work with; very weak semantic matches are dropped, but a keyword-perfect chunk can still be rescued through fusion; and the final answer uses the top 7 chunks. I benchmarked this whole setup against pure vector search in
if you want the war stories.&lt;/p&gt;
&lt;h2 id="the-agent-a-model-that-decides-for-itself"&gt;The agent: a model that decides for itself&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s the second idea that trips up beginners: CogniVault&amp;rsquo;s chat is not &amp;ldquo;paste chunks into a prompt, get an answer.&amp;rdquo; It&amp;rsquo;s an &lt;strong&gt;agent&lt;/strong&gt; — a model running in a loop where it can &lt;em&gt;choose&lt;/em&gt; to call tools, read their results, and only then answer.&lt;/p&gt;
&lt;p&gt;Built with the Strands Agents SDK, the agent gets six tools:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_knowledge_base&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The core RAG tool — runs the hybrid search above, returns chunks with source and page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_documents&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;See what&amp;rsquo;s in the vault&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyze_document&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Structured analysis of one document: topics, entities, facts, summary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compare_documents&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Answer a question by comparing two documents side by side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;calculator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe maths — the expression is parsed into a syntax tree and only whitelisted operators run. No &lt;code&gt;eval()&lt;/code&gt;, ever&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;current_time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The date and time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There is no hard-coded routing. The &lt;em&gt;model&lt;/em&gt; reads your question and decides which tools to call, guided by its system prompt. Ask &amp;ldquo;compare the two contracts on termination clauses&amp;rdquo; and it reaches for &lt;code&gt;compare_documents&lt;/code&gt;; ask &amp;ldquo;what&amp;rsquo;s 15% of 2,340&amp;rdquo; and it uses the calculator instead of hallucinating arithmetic.&lt;/p&gt;
&lt;p&gt;Two safety details I want beginners to notice, because they&amp;rsquo;re the difference between a toy and a product: a &lt;strong&gt;fresh agent is constructed for every request&lt;/strong&gt; (no shared state bleeding between concurrent chats), and the document-analysis tools call the model &lt;em&gt;directly&lt;/em&gt; rather than through the agent — otherwise an agent calling a tool that calls the agent could recurse forever.&lt;/p&gt;
&lt;h2 id="watching-the-model-think"&gt;Watching the model think&lt;/h2&gt;
&lt;p&gt;When you send a message, the response streams back as &lt;strong&gt;NDJSON&lt;/strong&gt; (Newline-Delimited JSON — each line of the stream is its own small JSON object). And it arrives in two phases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 1 — thinking.&lt;/strong&gt; Gemma&amp;rsquo;s reasoning chain streams first, rendered in the collapsible panel above the answer. It&amp;rsquo;s deliberately best-effort: if it fails for any reason, the answer still comes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Phase 2 — the agent answer.&lt;/strong&gt; Tools run, citations appear in the Sources panel the moment the search completes — &lt;em&gt;before&lt;/em&gt; the answer finishes writing — and the answer text streams in.&lt;/p&gt;
&lt;div class="mermaid"&gt;flowchart TB
Q["Your question&lt;br/&gt;(plus optional images, files, scope)"] --&gt; P1
subgraph STREAM["POST /rag — one NDJSON stream"]
P1["Phase 1: Thinking&lt;br/&gt;reasoning chunks stream first"]
P1 --&gt; P2["Phase 2: Agent&lt;br/&gt;fresh per request, history restored"]
P2 --&gt;|"decides to call"| T["search_knowledge_base"]
T --&gt; D["FAISS&lt;br/&gt;semantic"]
T --&gt; S["BM25&lt;br/&gt;keywords"]
D --&gt; RRF["RRF fusion — top 7 chunks"]
S --&gt; RRF
RRF --&gt;|"chunks + citations"| P2
P2 --&gt; OUT["citations, then answer text,&lt;br/&gt;then a memory-usage report"]
end
&lt;/div&gt;
&lt;p&gt;Each line in the stream is typed: &lt;code&gt;thinking&lt;/code&gt;, &lt;code&gt;metadata&lt;/code&gt; (a citation), &lt;code&gt;text&lt;/code&gt; (answer), &lt;code&gt;memory&lt;/code&gt; (how full the conversation budget is), or &lt;code&gt;error&lt;/code&gt;. The frontend just reads lines and routes them to the right panel. I dissected this design — and why thinking comes &lt;em&gt;before&lt;/em&gt; the tool calls — in
.&lt;/p&gt;
&lt;h2 id="a-memory-budget-not-a-bottomless-pit"&gt;A memory budget, not a bottomless pit&lt;/h2&gt;
&lt;p&gt;Gemma&amp;rsquo;s context window (the amount of text the model can consider at once) is 128K tokens, but CogniVault doesn&amp;rsquo;t let conversation history sprawl across all of it. Each chat session gets a budget of 48,000 characters — roughly 12,000 tokens. Exceed it, and the &lt;em&gt;oldest&lt;/em&gt; question-answer pair quietly drops out first, keeping the bulk of the window free for what matters: your current question and the retrieved chunks.&lt;/p&gt;
&lt;p&gt;Two resilience touches worth stealing for your own projects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Restart survival.&lt;/strong&gt; In-memory history dies with the process. So the first message in a session after a backend restart rebuilds its history from the chat log the frontend persists. Multi-turn memory survives reboots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edit and regenerate.&lt;/strong&gt; Editing an earlier message rewinds the stored history to that point before re-asking — the model genuinely forgets the timeline that no longer exists.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="scope-pinning-the-ai-to-specific-documents"&gt;Scope: pinning the AI to specific documents&lt;/h2&gt;
&lt;p&gt;One last feature, and a lesson about small local models. You can pin a chat to specific files or a category. The filter travels with the request &lt;em&gt;and&lt;/em&gt; a mandatory-search instruction is injected into both the system prompt and the user message itself.&lt;/p&gt;
&lt;p&gt;Why both? Because small models sometimes skip instructions that live only in the system prompt — but they can&amp;rsquo;t ignore what&amp;rsquo;s inside the question. Belt and braces. When you work with 4-billion-parameter models instead of frontier ones, you learn to make instructions impossible to miss rather than hoping they&amp;rsquo;re followed.&lt;/p&gt;
&lt;h2 id="the-takeaway"&gt;The takeaway&lt;/h2&gt;
&lt;p&gt;A cited answer is four systems cooperating: two retrievers covering each other&amp;rsquo;s blind spots, a fusion formula that needs nothing but ranks, an agent that picks its own tools, and a stream that shows its work. None of the four is exotic on its own — the product is the cooperation.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;Retrieve relevant passages from your own documents first; let the model answer from them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;The semantic (meaning-based) half of hybrid search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;The keyword half — a classic ranking formula from the Okapi information-retrieval system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;Merges the two ranked lists using only each chunk&amp;rsquo;s rank: &lt;code&gt;score = Σ 1/(60 + rank)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NDJSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Newline-Delimited JSON&lt;/td&gt;
&lt;td&gt;A stream where each line is its own complete JSON object — the chat response format&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The universal text format for structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree&lt;/td&gt;
&lt;td&gt;The parsed form of an expression — how the calculator does maths without &lt;code&gt;eval()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large Language Model&lt;/td&gt;
&lt;td&gt;A neural network trained on huge amounts of text that can read and generate language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Software Development Kit&lt;/td&gt;
&lt;td&gt;A library of building blocks — here, Strands, which provides the agent loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;K&lt;/strong&gt; (in 128K)&lt;/td&gt;
&lt;td&gt;Kilo (thousand)&lt;/td&gt;
&lt;td&gt;128K tokens ≈ 128,000 tokens — Gemma&amp;rsquo;s context window&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt;
— the same machinery pointed at generating quizzes, workshops, flashcards, and mindmaps, plus a table of every byte the app stores and exactly where it lives.&lt;/p&gt;</description></item><item><title>Part 3 · Two-Phase Streaming: Showing the Model Think Before It Acts</title><link>https://aretascodes.dev/blog/two-phase-streaming-strands-agents/</link><pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/two-phase-streaming-strands-agents/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building
. Previously:
.
All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When I first wired up Gemma 4 with
inside CogniVault, the chat felt slow. Not laggy — slow in a way that&amp;rsquo;s worse than laggy. The user types a question. The cursor sits there. Then, eventually, an answer drops out of the void.&lt;/p&gt;
&lt;p&gt;The model wasn&amp;rsquo;t idle. It was &lt;em&gt;thinking&lt;/em&gt;. Gemma 4 has a chain-of-thought mode that produces a (sometimes long) reasoning trace before its final reply. With a single-phase agent stream, all of that thinking is happening &lt;em&gt;inside the agent loop&lt;/em&gt; — silently — before any tool calls run, before any tokens get emitted to the UI.&lt;/p&gt;
&lt;p&gt;So I split the call into two phases.&lt;/p&gt;
&lt;h2 id="the-shape"&gt;The shape&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;POST /rag
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ├── Phase 1 — Direct Ollama call, thinking enabled
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ stream: {&amp;#34;type&amp;#34;:&amp;#34;thinking&amp;#34;,&amp;#34;data&amp;#34;:&amp;#34;...&amp;#34;} (reasoning tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── Phase 2 — Strands Agent (thinking disabled)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;metadata&amp;#34;,&amp;#34;data&amp;#34;:{...}} (citations, as soon as search runs)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;text&amp;#34;,&amp;#34;data&amp;#34;:&amp;#34;...&amp;#34;} (answer tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;memory&amp;#34;,&amp;#34;data&amp;#34;:{...}} (end-of-stream: session memory usage)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The endpoint streams &lt;strong&gt;newline-delimited JSON&lt;/strong&gt; (NDJSON): each line of the response body is one self-contained JSON envelope with a &lt;code&gt;type&lt;/code&gt; and a &lt;code&gt;data&lt;/code&gt;. The frontend dispatches on &lt;code&gt;type&lt;/code&gt; and renders accordingly: a &lt;strong&gt;collapsible reasoning panel&lt;/strong&gt; for the thinking tokens, the main message bubble for the text tokens, a sidebar card per citation.&lt;/p&gt;
&lt;p&gt;The user sees the model start thinking &lt;em&gt;immediately&lt;/em&gt;. Latency to first byte drops from &amp;ldquo;long enough to wonder if it crashed&amp;rdquo; to &amp;ldquo;instant.&amp;rdquo; Total time to final answer doesn&amp;rsquo;t change. Perceived speed does.&lt;/p&gt;
&lt;h2 id="phase-1--thinking-only"&gt;Phase 1 — Thinking only&lt;/h2&gt;
&lt;p&gt;Phase 1 is a single direct call to Ollama with thinking enabled. It gets exactly what Phase 2 will see — the same system prompt, the current question, and any attached images — so the reasoning reflects reality. Only the &lt;em&gt;reasoning&lt;/em&gt; tokens are consumed; whatever answer text Phase 1 starts to produce is discarded, because we don&amp;rsquo;t want a half-formed answer competing with the real one.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified from backend/services/rag_agent.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;system&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;images&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Phase 1 is deliberately &lt;strong&gt;best-effort&lt;/strong&gt;: any failure here is swallowed and logged, and the stream moves straight on to Phase 2. A broken reasoning panel should never cost the user their answer.&lt;/p&gt;
&lt;h2 id="phase-2--agent-with-tools"&gt;Phase 2 — Agent with tools&lt;/h2&gt;
&lt;p&gt;Phase 2 builds a &lt;strong&gt;fresh Strands &lt;code&gt;Agent&lt;/code&gt; per request&lt;/strong&gt; — no shared mutable state between concurrent chats — restores the session&amp;rsquo;s conversation history into it, and runs the tool loop with six tools registered:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_knowledge_base(query)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hybrid FAISS + BM25 search, top-7, RRF fusion. Scope-filter-aware.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_documents()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inventory of every indexed file with type and chunk count.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyze_document(filename)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inner Gemma call → structured summary (topics, entities, key facts).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compare_documents(doc_a, doc_b, question)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inner Gemma call answering across two documents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;calculator(expression)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe AST evaluator — no &lt;code&gt;eval()&lt;/code&gt;, no arbitrary code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;current_time()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Timestamp for time-aware queries.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The agent decides which tools to call and in what order. There&amp;rsquo;s no hard-coded router; the system prompt explains what&amp;rsquo;s available and Strands handles the loop. For most document questions the path is: &lt;code&gt;search_knowledge_base&lt;/code&gt; → answer. For comparisons: &lt;code&gt;compare_documents&lt;/code&gt; → answer. For &amp;ldquo;what files do I have?&amp;rdquo;: &lt;code&gt;list_documents&lt;/code&gt; → answer. For greetings and arithmetic, the system prompt tells the agent it may skip search entirely. The model picks.&lt;/p&gt;
&lt;p&gt;Two details that took debugging to get right:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 runs with thinking explicitly disabled.&lt;/strong&gt; Without that flag, Gemma&amp;rsquo;s default behaviour can leak &lt;code&gt;&amp;lt;think&amp;gt;…&amp;lt;/think&amp;gt;&lt;/code&gt; tags into the visible answer, and everything before the closing tag gets swallowed by the Markdown renderer. One model option — &lt;code&gt;options={&amp;quot;thinking&amp;quot;: False}&lt;/code&gt; — fixed a &amp;ldquo;truncated responses&amp;rdquo; bug that looked much scarier than it was.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Citations are flushed before the first answer token.&lt;/strong&gt; Tools run before text deltas arrive, so by the time the first visible token streams, every source the search found is already in the sidebar. The accumulator is a request-local &lt;code&gt;ContextVar&lt;/code&gt; the search tool appends to.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified — the real loop reads Strands&amp;#39; raw event dicts&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;event&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;contentBlockDelta&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;delta&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_citations&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="c1"&gt;# drain the ContextVar accumulator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="why-this-matters-more-than-it-sounds"&gt;Why this matters more than it sounds&lt;/h2&gt;
&lt;p&gt;You could implement similar behaviour with one agent call that interleaves &lt;code&gt;thinking&lt;/code&gt; events with &lt;code&gt;text&lt;/code&gt; events. The reasons I split it anyway:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The thinking model and the tool model can be different.&lt;/strong&gt; Right now they&amp;rsquo;re both &lt;code&gt;gemma4:e4b&lt;/code&gt;, but the architecture lets me swap a smaller, faster model in for Phase 1 reasoning and keep the big one for Phase 2 tool use. I&amp;rsquo;m not doing that yet — but I want the option.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Phase 1 always streams immediately.&lt;/strong&gt; A pure agent loop only starts producing tokens after the model has decided what to say. Two-phase guarantees the user sees activity almost as soon as they press Enter, regardless of how complex the Phase 2 tool work gets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Failures isolate.&lt;/strong&gt; If Phase 2 falls over (Ollama timeout, tool error), Phase 1&amp;rsquo;s reasoning is still visible — the user can see &lt;em&gt;what the model was trying to do&lt;/em&gt;, which makes the error far less frustrating than a blank &amp;ldquo;something went wrong.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="contextvar-isolation-again"&gt;ContextVar isolation, again&lt;/h2&gt;
&lt;p&gt;The same &lt;code&gt;ContextVar&lt;/code&gt; trick that scopes retrieval in
carries here. At the start of each &lt;code&gt;/rag&lt;/code&gt; stream, the handler sets two request-local variables: the &lt;strong&gt;document-scope filter&lt;/strong&gt; and the &lt;strong&gt;citation accumulator&lt;/strong&gt;. The agent&amp;rsquo;s tools read and write them implicitly. Conversation history itself lives in a per-session store guarded by per-session &lt;code&gt;asyncio&lt;/code&gt; locks, so two concurrent requests in the same chat can&amp;rsquo;t corrupt each other either.&lt;/p&gt;
&lt;p&gt;Tested with two browser tabs open on the same backend, scoped to different document categories, sending overlapping queries simultaneously. Zero cross-contamination. The test suite covers this explicitly in &lt;code&gt;test_thinking.py&lt;/code&gt; and &lt;code&gt;test_doc_scope_filter.py&lt;/code&gt; — see
for the broader story.&lt;/p&gt;
&lt;h2 id="the-frontend-side-of-the-contract"&gt;The frontend side of the contract&lt;/h2&gt;
&lt;p&gt;A detail that tripped me up: this is a &lt;code&gt;POST&lt;/code&gt; endpoint, so the browser&amp;rsquo;s &lt;code&gt;EventSource&lt;/code&gt; API (which only does GET) is out. The frontend uses &lt;code&gt;fetch&lt;/code&gt; and reads the response body incrementally, splitting on newlines and parsing each line as JSON:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-tsx" data-lang="tsx"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// Simplified from useRagStream.ts
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;/rag&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;POST&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;body&lt;/span&gt;: &lt;span class="kt"&gt;JSON.stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getReader&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;TextDecoder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;read&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;: &lt;span class="kt"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;\n&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// keep the trailing partial line
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;line&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kr"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;appendThinking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;appendText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;addCitation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;memory&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;updateMemoryMeter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The reasoning panel starts &lt;strong&gt;collapsed&lt;/strong&gt;, with a small pulsing indicator while thinking tokens are still streaming — enough to signal &amp;ldquo;the model is working&amp;rdquo; without shoving a wall of chain-of-thought at the user. One click expands the full trace, during or after the stream.&lt;/p&gt;
&lt;h2 id="what-id-revisit"&gt;What I&amp;rsquo;d revisit&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 reasons toward a full answer, and we throw the answer part away.&lt;/strong&gt; A dedicated &amp;ldquo;plan your approach, don&amp;rsquo;t answer yet&amp;rdquo; prompt for Phase 1 would make the reasoning trace tighter and cheaper. Today it shares the main system prompt — simpler, but the trace can ramble.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No interrupt yet.&lt;/strong&gt; Once Phase 1 starts, it runs to completion. If the user types a follow-up mid-stream we let it finish. A real cancel button would mean wiring an abort signal through Ollama&amp;rsquo;s HTTP client — feasible, not yet done.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 occasionally over-thinks.&lt;/strong&gt; Greetings and trivial questions still produce a paragraph of reasoning. A &amp;ldquo;should I think?&amp;rdquo; gate (probably a tiny classifier or even a heuristic on query length) would skip Phase 1 entirely for those cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="takeaway"&gt;Takeaway&lt;/h2&gt;
&lt;p&gt;Streaming is &lt;em&gt;not&lt;/em&gt; just an optimisation. It&amp;rsquo;s a UX primitive. Two-phase streaming buys you a free property: the &lt;em&gt;visible&lt;/em&gt; part of the interaction starts before the &lt;em&gt;slow&lt;/em&gt; part does. The user gets to watch the model think, which is — genuinely — more interesting than watching a spinner.&lt;/p&gt;
&lt;p&gt;If your agent app feels slow even though the answers are fast, look at &lt;em&gt;when&lt;/em&gt; tokens start flowing. The fix often isn&amp;rsquo;t a faster model.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NDJSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Newline-Delimited JSON&lt;/td&gt;
&lt;td&gt;A stream where each line is its own complete JSON object — what &lt;code&gt;/rag&lt;/code&gt; emits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The universal text format for structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User Experience&lt;/td&gt;
&lt;td&gt;How the product feels to use — the real beneficiary of two-phase streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User Interface&lt;/td&gt;
&lt;td&gt;The visible surface the stream renders into&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;The dense half of hybrid retrieval (previous post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;The keyword half of hybrid retrieval (previous post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;The rank-only formula that merges the two result lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree&lt;/td&gt;
&lt;td&gt;The parsed form of an expression — how the calculator evaluates maths without &lt;code&gt;eval()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HyperText Transfer Protocol&lt;/td&gt;
&lt;td&gt;The protocol carrying the stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-Sent Events&lt;/td&gt;
&lt;td&gt;The browser&amp;rsquo;s built-in GET-only streaming format — notably &lt;em&gt;not&lt;/em&gt; usable here, because &lt;code&gt;/rag&lt;/code&gt; is a POST&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The boundary the frontend calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt;
— how CogniVault re-ingests edited PDFs without re-embedding everything, and survives a kill -9 mid-pipeline.&lt;/p&gt;</description></item></channel></rss>