<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Streaming |</title><link>https://aretascodes.dev/tags/streaming/</link><atom:link href="https://aretascodes.dev/tags/streaming/index.xml" rel="self" type="application/rss+xml"/><description>Streaming</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 30 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://aretascodes.dev/media/icon_hu_2ab4f4763b27c75b.png</url><title>Streaming</title><link>https://aretascodes.dev/tags/streaming/</link></image><item><title>Part 3 · Two-Phase Streaming: Showing the Model Think Before It Acts</title><link>https://aretascodes.dev/blog/two-phase-streaming-strands-agents/</link><pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/two-phase-streaming-strands-agents/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building
. Previously:
.
All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When I first wired up Gemma 4 with
inside CogniVault, the chat felt slow. Not laggy — slow in a way that&amp;rsquo;s worse than laggy. The user types a question. The cursor sits there. Then, eventually, an answer drops out of the void.&lt;/p&gt;
&lt;p&gt;The model wasn&amp;rsquo;t idle. It was &lt;em&gt;thinking&lt;/em&gt;. Gemma 4 has a chain-of-thought mode that produces a (sometimes long) reasoning trace before its final reply. With a single-phase agent stream, all of that thinking is happening &lt;em&gt;inside the agent loop&lt;/em&gt; — silently — before any tool calls run, before any tokens get emitted to the UI.&lt;/p&gt;
&lt;p&gt;So I split the call into two phases.&lt;/p&gt;
&lt;h2 id="the-shape"&gt;The shape&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;POST /rag
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ├── Phase 1 — Direct Ollama call, thinking enabled
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │ stream: {&amp;#34;type&amp;#34;:&amp;#34;thinking&amp;#34;,&amp;#34;data&amp;#34;:&amp;#34;...&amp;#34;} (reasoning tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; │
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── Phase 2 — Strands Agent (thinking disabled)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;metadata&amp;#34;,&amp;#34;data&amp;#34;:{...}} (citations, as soon as search runs)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;text&amp;#34;,&amp;#34;data&amp;#34;:&amp;#34;...&amp;#34;} (answer tokens)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; stream: {&amp;#34;type&amp;#34;:&amp;#34;memory&amp;#34;,&amp;#34;data&amp;#34;:{...}} (end-of-stream: session memory usage)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The endpoint streams &lt;strong&gt;newline-delimited JSON&lt;/strong&gt; (NDJSON): each line of the response body is one self-contained JSON envelope with a &lt;code&gt;type&lt;/code&gt; and a &lt;code&gt;data&lt;/code&gt;. The frontend dispatches on &lt;code&gt;type&lt;/code&gt; and renders accordingly: a &lt;strong&gt;collapsible reasoning panel&lt;/strong&gt; for the thinking tokens, the main message bubble for the text tokens, a sidebar card per citation.&lt;/p&gt;
&lt;p&gt;The user sees the model start thinking &lt;em&gt;immediately&lt;/em&gt;. Latency to first byte drops from &amp;ldquo;long enough to wonder if it crashed&amp;rdquo; to &amp;ldquo;instant.&amp;rdquo; Total time to final answer doesn&amp;rsquo;t change. Perceived speed does.&lt;/p&gt;
&lt;h2 id="phase-1--thinking-only"&gt;Phase 1 — Thinking only&lt;/h2&gt;
&lt;p&gt;Phase 1 is a single direct call to Ollama with thinking enabled. It gets exactly what Phase 2 will see — the same system prompt, the current question, and any attached images — so the reasoning reflects reality. Only the &lt;em&gt;reasoning&lt;/em&gt; tokens are consumed; whatever answer text Phase 1 starts to produce is discarded, because we don&amp;rsquo;t want a half-formed answer competing with the real one.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified from backend/services/rag_agent.py&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;system&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;role&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;user&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;content&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;images&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Phase 1 is deliberately &lt;strong&gt;best-effort&lt;/strong&gt;: any failure here is swallowed and logged, and the stream moves straight on to Phase 2. A broken reasoning panel should never cost the user their answer.&lt;/p&gt;
&lt;h2 id="phase-2--agent-with-tools"&gt;Phase 2 — Agent with tools&lt;/h2&gt;
&lt;p&gt;Phase 2 builds a &lt;strong&gt;fresh Strands &lt;code&gt;Agent&lt;/code&gt; per request&lt;/strong&gt; — no shared mutable state between concurrent chats — restores the session&amp;rsquo;s conversation history into it, and runs the tool loop with six tools registered:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;search_knowledge_base(query)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hybrid FAISS + BM25 search, top-7, RRF fusion. Scope-filter-aware.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_documents()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inventory of every indexed file with type and chunk count.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;analyze_document(filename)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inner Gemma call → structured summary (topics, entities, key facts).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;compare_documents(doc_a, doc_b, question)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inner Gemma call answering across two documents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;calculator(expression)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe AST evaluator — no &lt;code&gt;eval()&lt;/code&gt;, no arbitrary code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;current_time()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Timestamp for time-aware queries.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The agent decides which tools to call and in what order. There&amp;rsquo;s no hard-coded router; the system prompt explains what&amp;rsquo;s available and Strands handles the loop. For most document questions the path is: &lt;code&gt;search_knowledge_base&lt;/code&gt; → answer. For comparisons: &lt;code&gt;compare_documents&lt;/code&gt; → answer. For &amp;ldquo;what files do I have?&amp;rdquo;: &lt;code&gt;list_documents&lt;/code&gt; → answer. For greetings and arithmetic, the system prompt tells the agent it may skip search entirely. The model picks.&lt;/p&gt;
&lt;p&gt;Two details that took debugging to get right:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 runs with thinking explicitly disabled.&lt;/strong&gt; Without that flag, Gemma&amp;rsquo;s default behaviour can leak &lt;code&gt;&amp;lt;think&amp;gt;…&amp;lt;/think&amp;gt;&lt;/code&gt; tags into the visible answer, and everything before the closing tag gets swallowed by the Markdown renderer. One model option — &lt;code&gt;options={&amp;quot;thinking&amp;quot;: False}&lt;/code&gt; — fixed a &amp;ldquo;truncated responses&amp;rdquo; bug that looked much scarier than it was.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Citations are flushed before the first answer token.&lt;/strong&gt; Tools run before text deltas arrive, so by the time the first visible token streams, every source the search found is already in the sidebar. The accumulator is a request-local &lt;code&gt;ContextVar&lt;/code&gt; the search tool appends to.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Simplified — the real loop reads Strands&amp;#39; raw event dicts&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;event&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;contentBlockDelta&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;delta&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_citations&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="c1"&gt;# drain the ContextVar accumulator&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="why-this-matters-more-than-it-sounds"&gt;Why this matters more than it sounds&lt;/h2&gt;
&lt;p&gt;You could implement similar behaviour with one agent call that interleaves &lt;code&gt;thinking&lt;/code&gt; events with &lt;code&gt;text&lt;/code&gt; events. The reasons I split it anyway:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The thinking model and the tool model can be different.&lt;/strong&gt; Right now they&amp;rsquo;re both &lt;code&gt;gemma4:e4b&lt;/code&gt;, but the architecture lets me swap a smaller, faster model in for Phase 1 reasoning and keep the big one for Phase 2 tool use. I&amp;rsquo;m not doing that yet — but I want the option.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Phase 1 always streams immediately.&lt;/strong&gt; A pure agent loop only starts producing tokens after the model has decided what to say. Two-phase guarantees the user sees activity almost as soon as they press Enter, regardless of how complex the Phase 2 tool work gets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Failures isolate.&lt;/strong&gt; If Phase 2 falls over (Ollama timeout, tool error), Phase 1&amp;rsquo;s reasoning is still visible — the user can see &lt;em&gt;what the model was trying to do&lt;/em&gt;, which makes the error far less frustrating than a blank &amp;ldquo;something went wrong.&amp;rdquo;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="contextvar-isolation-again"&gt;ContextVar isolation, again&lt;/h2&gt;
&lt;p&gt;The same &lt;code&gt;ContextVar&lt;/code&gt; trick that scopes retrieval in
carries here. At the start of each &lt;code&gt;/rag&lt;/code&gt; stream, the handler sets two request-local variables: the &lt;strong&gt;document-scope filter&lt;/strong&gt; and the &lt;strong&gt;citation accumulator&lt;/strong&gt;. The agent&amp;rsquo;s tools read and write them implicitly. Conversation history itself lives in a per-session store guarded by per-session &lt;code&gt;asyncio&lt;/code&gt; locks, so two concurrent requests in the same chat can&amp;rsquo;t corrupt each other either.&lt;/p&gt;
&lt;p&gt;Tested with two browser tabs open on the same backend, scoped to different document categories, sending overlapping queries simultaneously. Zero cross-contamination. The test suite covers this explicitly in &lt;code&gt;test_thinking.py&lt;/code&gt; and &lt;code&gt;test_doc_scope_filter.py&lt;/code&gt; — see
for the broader story.&lt;/p&gt;
&lt;h2 id="the-frontend-side-of-the-contract"&gt;The frontend side of the contract&lt;/h2&gt;
&lt;p&gt;A detail that tripped me up: this is a &lt;code&gt;POST&lt;/code&gt; endpoint, so the browser&amp;rsquo;s &lt;code&gt;EventSource&lt;/code&gt; API (which only does GET) is out. The frontend uses &lt;code&gt;fetch&lt;/code&gt; and reads the response body incrementally, splitting on newlines and parsing each line as JSON:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-tsx" data-lang="tsx"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;// Simplified from useRagStream.ts
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;/rag&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;POST&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;body&lt;/span&gt;: &lt;span class="kt"&gt;JSON.stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getReader&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;TextDecoder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;read&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;: &lt;span class="kt"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;\n&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// keep the trailing partial line
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;line&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kr"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;thinking&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;appendThinking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;appendText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;addCitation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;memory&amp;#34;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nx"&gt;updateMemoryMeter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The reasoning panel starts &lt;strong&gt;collapsed&lt;/strong&gt;, with a small pulsing indicator while thinking tokens are still streaming — enough to signal &amp;ldquo;the model is working&amp;rdquo; without shoving a wall of chain-of-thought at the user. One click expands the full trace, during or after the stream.&lt;/p&gt;
&lt;h2 id="what-id-revisit"&gt;What I&amp;rsquo;d revisit&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 reasons toward a full answer, and we throw the answer part away.&lt;/strong&gt; A dedicated &amp;ldquo;plan your approach, don&amp;rsquo;t answer yet&amp;rdquo; prompt for Phase 1 would make the reasoning trace tighter and cheaper. Today it shares the main system prompt — simpler, but the trace can ramble.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No interrupt yet.&lt;/strong&gt; Once Phase 1 starts, it runs to completion. If the user types a follow-up mid-stream we let it finish. A real cancel button would mean wiring an abort signal through Ollama&amp;rsquo;s HTTP client — feasible, not yet done.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 occasionally over-thinks.&lt;/strong&gt; Greetings and trivial questions still produce a paragraph of reasoning. A &amp;ldquo;should I think?&amp;rdquo; gate (probably a tiny classifier or even a heuristic on query length) would skip Phase 1 entirely for those cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="takeaway"&gt;Takeaway&lt;/h2&gt;
&lt;p&gt;Streaming is &lt;em&gt;not&lt;/em&gt; just an optimisation. It&amp;rsquo;s a UX primitive. Two-phase streaming buys you a free property: the &lt;em&gt;visible&lt;/em&gt; part of the interaction starts before the &lt;em&gt;slow&lt;/em&gt; part does. The user gets to watch the model think, which is — genuinely — more interesting than watching a spinner.&lt;/p&gt;
&lt;p&gt;If your agent app feels slow even though the answers are fast, look at &lt;em&gt;when&lt;/em&gt; tokens start flowing. The fix often isn&amp;rsquo;t a faster model.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NDJSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Newline-Delimited JSON&lt;/td&gt;
&lt;td&gt;A stream where each line is its own complete JSON object — what &lt;code&gt;/rag&lt;/code&gt; emits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The universal text format for structured data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User Experience&lt;/td&gt;
&lt;td&gt;How the product feels to use — the real beneficiary of two-phase streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User Interface&lt;/td&gt;
&lt;td&gt;The visible surface the stream renders into&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;The dense half of hybrid retrieval (previous post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;The keyword half of hybrid retrieval (previous post)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;The rank-only formula that merges the two result lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree&lt;/td&gt;
&lt;td&gt;The parsed form of an expression — how the calculator evaluates maths without &lt;code&gt;eval()&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HyperText Transfer Protocol&lt;/td&gt;
&lt;td&gt;The protocol carrying the stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-Sent Events&lt;/td&gt;
&lt;td&gt;The browser&amp;rsquo;s built-in GET-only streaming format — notably &lt;em&gt;not&lt;/em&gt; usable here, because &lt;code&gt;/rag&lt;/code&gt; is a POST&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The boundary the frontend calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt;
— how CogniVault re-ingests edited PDFs without re-embedding everything, and survives a kill -9 mid-pipeline.&lt;/p&gt;</description></item></channel></rss>