Part 3 · Two-Phase Streaming: Showing the Model Think Before It Acts
Part of a series on building Gemma CogniVault. Previously: Hybrid retrieval with FAISS + BM25 + RRF. All abbreviations are fully explained in the appendix at the bottom of the page.
When I first wired up Gemma 4 with Strands Agents inside CogniVault, the chat felt slow. Not laggy — slow in a way that’s worse than laggy. The user types a question. The cursor sits there. Then, eventually, an answer drops out of the void.
The model wasn’t idle. It was thinking. Gemma 4 has a chain-of-thought mode that produces a (sometimes long) reasoning trace before its final reply. With a single-phase agent stream, all of that thinking is happening inside the agent loop — silently — before any tool calls run, before any tokens get emitted to the UI.
So I split the call into two phases.
The shape
POST /rag
│
├── Phase 1: direct Ollama call, thinking enabled
│     stream: {"type":"thinking","data":"..."}   (reasoning tokens)
│
└── Phase 2: Strands Agent (thinking disabled)
      stream: {"type":"metadata","data":{...}}   (citations, as soon as search runs)
      stream: {"type":"text","data":"..."}       (answer tokens)
      stream: {"type":"memory","data":{...}}     (end-of-stream: session memory usage)
The endpoint streams newline-delimited JSON (NDJSON): each line of the response body is one self-contained JSON envelope with a type field and a data field. The frontend dispatches on type and renders accordingly: a collapsible reasoning panel for the thinking tokens, the main message bubble for the text tokens, a sidebar card per citation.
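For illustration, here is a minimal sketch of what an envelope helper could look like. The name envelope matches the snippets later in this post, but this body is my assumption of a reasonable implementation, not a copy of the CogniVault source, and the sample payload fields (filename, page) are made up.

```python
import json
from typing import Any


def envelope(kind: str, data: Any) -> str:
    """Serialise one stream event as a single NDJSON line."""
    # json.dumps escapes any newline inside `data` as \n, so the only raw
    # newline in the result is the terminator that separates envelopes.
    return json.dumps({"type": kind, "data": data}, ensure_ascii=False) + "\n"


if __name__ == "__main__":
    # Hypothetical payloads, only to show the wire format.
    print(envelope("thinking", "The user is asking about..."), end="")
    print(envelope("metadata", {"filename": "report.pdf", "page": 3}), end="")
```

Because every envelope is exactly one line, the client never needs a streaming JSON parser: splitting on newlines is enough.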
The user sees the model start thinking immediately. Latency to first byte drops from “long enough to wonder if it crashed” to “instant.” Total time to final answer doesn’t change. Perceived speed does.
Phase 1 — Thinking only
Phase 1 is a single direct call to Ollama with thinking enabled. It gets exactly what Phase 2 will see — the same system prompt, the current question, and any attached images — so the reasoning reflects reality. Only the reasoning tokens are consumed; whatever answer text Phase 1 starts to produce is discarded, because we don’t want a half-formed answer competing with the real one.
# Simplified from backend/services/rag_agent.py
import ollama

client = ollama.AsyncClient(host=settings.ollama_host)
stream = await client.chat(
    model=settings.llm_model,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query, "images": images},
    ],
    options={"thinking": True},
    stream=True,
)
# Forward only the reasoning tokens; any answer text in a chunk is ignored.
async for chunk in stream:
    if chunk.message.thinking:
        yield envelope("thinking", chunk.message.thinking)
Phase 1 is deliberately best-effort: any failure here is swallowed and logged, and the stream moves straight on to Phase 2. A broken reasoning panel should never cost the user their answer.
Phase 2 — Agent with tools
Phase 2 builds a fresh Strands Agent per request — no shared mutable state between concurrent chats — restores the session’s conversation history into it, and runs the tool loop with six tools registered:
| Tool | Purpose |
|---|---|
| search_knowledge_base(query) | Hybrid FAISS + BM25 search, top-7, RRF fusion. Scope-filter-aware. |
| list_documents() | Inventory of every indexed file with type and chunk count. |
| analyze_document(filename) | Inner Gemma call → structured summary (topics, entities, key facts). |
| compare_documents(doc_a, doc_b, question) | Inner Gemma call answering across two documents. |
| calculator(expression) | Safe AST evaluator: no eval(), no arbitrary code. |
| current_time() | Timestamp for time-aware queries. |
The agent decides which tools to call and in what order. There’s no hard-coded router; the system prompt explains what’s available and Strands handles the loop. For most document questions the path is: search_knowledge_base → answer. For comparisons: compare_documents → answer. For “what files do I have?”: list_documents → answer. For greetings and arithmetic, the system prompt tells the agent it may skip search entirely. The model picks.
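To make the "safe AST evaluator" idea from the table concrete, here is a minimal sketch of one way to evaluate arithmetic without eval(). It is an assumption about the general technique, not the CogniVault calculator itself: the operator whitelist and the exponent cap are choices I made for the example.

```python
import ast
import operator

# Whitelisted operators; any other node type is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST) -> float:
    # Plain numbers only (bool is excluded because type(True) is bool).
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        left, right = _evaluate(node.left), _evaluate(node.right)
        # Arbitrary cap so "9 ** 9 ** 9" cannot pin the CPU.
        if isinstance(node.op, ast.Pow) and abs(right) > 100:
            raise ValueError("exponent too large")
        return _BINARY[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_evaluate(node.operand))
    # Names, calls, attributes, subscripts, etc. all land here.
    raise ValueError(f"unsupported expression: {type(node).__name__}")


def calculator(expression: str) -> float:
    """Evaluate arithmetic by walking the parsed tree; nothing is executed."""
    tree = ast.parse(expression, mode="eval")  # may raise SyntaxError
    return _evaluate(tree.body)


if __name__ == "__main__":
    print(calculator("2 + 3 * 4"))
```

The expression is parsed but never run as Python, so something like `__import__('os').system(...)` is just a Call node that the walker refuses.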
Two details that took debugging to get right:
- Phase 2 runs with thinking explicitly disabled. Without that flag, Gemma’s default behaviour can leak <think>…</think> tags into the visible answer, and everything before the closing tag gets swallowed by the Markdown renderer. One model option, options={"thinking": False}, fixed a “truncated responses” bug that looked much scarier than it was.
- Citations are flushed before the first answer token. Tools run before text deltas arrive, so by the time the first visible token streams, every source the search found is already in the sidebar. The accumulator is a request-local ContextVar the search tool appends to.
# Simplified: the real loop reads Strands' raw event dicts
async for event in agent.stream_async(user_input):
    # Not every event carries an "event" key, so use .get() throughout.
    raw = event.get("event", {})
    delta = raw.get("contentBlockDelta", {}).get("delta", {}).get("text")
    if delta:
        for doc in new_citations():  # drain the ContextVar accumulator
            yield envelope("metadata", doc)
        yield envelope("text", delta)
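The accumulator behind new_citations() can be sketched like this. Only the name new_citations comes from the loop above; start_request, record_citation, and the citation dict shape are hypothetical names I chose for the example, and the real code may differ.

```python
from contextvars import ContextVar

# One list per request context; there is deliberately no shared default list.
_citations: ContextVar[list[dict]] = ContextVar("citations")


def start_request() -> None:
    """Give the current request its own empty accumulator."""
    _citations.set([])


def record_citation(doc: dict) -> None:
    """Called by the search tool for each source it returns."""
    # Mutate in place: a .set() made inside a child task would not be
    # visible to the request handler, but an append to the shared list is.
    _citations.get().append(doc)


def new_citations() -> list[dict]:
    """Return citations recorded since the last call, then clear them."""
    pending = _citations.get()
    drained = list(pending)
    pending.clear()
    return drained


if __name__ == "__main__":
    start_request()
    record_citation({"filename": "report.pdf", "chunk": 12})
    print(new_citations())
    print(new_citations())
```

Draining on every text delta is cheap: after the first flush the list is empty, so later deltas pass straight through unless a second search ran in between.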
Why this matters more than it sounds
You could implement similar behaviour with one agent call that interleaves thinking events with text events. The reasons I split it anyway:
- The thinking model and the tool model can be different. Right now they’re both gemma4:e4b, but the architecture lets me swap a smaller, faster model in for Phase 1 reasoning and keep the big one for Phase 2 tool use. I’m not doing that yet, but I want the option.
- Phase 1 always streams immediately. A pure agent loop only starts producing tokens after the model has decided what to say. Two-phase guarantees the user sees activity almost as soon as they press Enter, regardless of how complex the Phase 2 tool work gets.
- Failures isolate. If Phase 2 falls over (Ollama timeout, tool error), Phase 1’s reasoning is still visible, so the user can see what the model was trying to do, which makes the error far less frustrating than a blank “something went wrong.”
ContextVar isolation, again
The same ContextVar trick that scopes retrieval in the last post carries over here. At the start of each /rag stream, the handler sets two request-local variables: the document-scope filter and the citation accumulator. The agent’s tools read and write them implicitly. Conversation history itself lives in a per-session store guarded by per-session asyncio locks, so two concurrent requests in the same chat can’t corrupt each other either.
Tested with two browser tabs open on the same backend, scoped to different document categories, sending overlapping queries simultaneously. Zero cross-contamination. The test suite covers this explicitly in test_thinking.py and test_doc_scope_filter.py — see the testing post for the broader story.
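A toy version of that isolation can be reproduced with plain asyncio. The names below (doc_scope, session_locks, history, handle) are hypothetical stand-ins, assuming each request runs in its own task; the sketch only demonstrates the mechanism, not the actual handler.

```python
import asyncio
from collections import defaultdict
from contextvars import ContextVar

doc_scope: ContextVar[str] = ContextVar("doc_scope", default="all")
session_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
history: defaultdict[str, list[str]] = defaultdict(list)


async def search_tool(query: str) -> str:
    """Stand-in tool: reads the scope implicitly, like the real tools do."""
    await asyncio.sleep(0)  # yield control so the two requests interleave
    return f"{query} @ {doc_scope.get()}"


async def handle(session_id: str, scope: str, query: str) -> str:
    doc_scope.set(scope)  # visible only inside this task's context
    result = await search_tool(query)
    async with session_locks[session_id]:  # serialise writes per session
        history[session_id].append(query)
    return result


async def main() -> list[str]:
    # gather() wraps each coroutine in its own task, and each task runs in
    # a copy of the current context, so the two set() calls cannot collide.
    return await asyncio.gather(
        handle("chat-1", "finance", "q1"),
        handle("chat-2", "legal", "q2"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each request sees only the scope it set, even though both tasks interleave on the same event loop, and the outer context keeps the default.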
The frontend side of the contract
A detail that tripped me up: this is a POST endpoint, so the browser’s EventSource API (which only does GET) is out. The frontend uses fetch and reads the response body incrementally, splitting on newlines and parsing each line as JSON:
// Simplified from useRagStream.ts
const res = await fetch("/rag", {
  method: "POST",
  body: JSON.stringify(payload),
});
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop()!; // keep the trailing partial line
  for (const line of lines) {
    if (!line.trim()) continue;
    const { type, data } = JSON.parse(line);
    switch (type) {
      case "thinking":
        appendThinking(data);
        break;
      case "text":
        appendText(data);
        break;
      case "metadata":
        addCitation(data);
        break;
      case "memory":
        updateMemoryMeter(data);
        break;
    }
  }
}
The reasoning panel starts collapsed, with a small pulsing indicator while thinking tokens are still streaming — enough to signal “the model is working” without shoving a wall of chain-of-thought at the user. One click expands the full trace, during or after the stream.
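The same buffering logic is handy on the backend side too, for example in a test client that replays a captured stream. Here is a minimal Python sketch of it; iter_envelopes is a hypothetical helper, not part of CogniVault. It works on bytes, which sidesteps the multi-byte-character problem the TextDecoder stream flag solves in the browser.

```python
import json
from typing import Iterable, Iterator


def iter_envelopes(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Reassemble NDJSON envelopes from arbitrarily split byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the rest.
        # Splitting raw bytes is safe: 0x0A never occurs inside a
        # multi-byte UTF-8 character.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)
    # Flush a final line that arrived without a trailing newline.
    if buffer.strip():
        yield json.loads(buffer)


if __name__ == "__main__":
    raw = '{"type":"text","data":"héllo"}\n{"type":"memory","data":{"used":1}}'.encode()
    # Deliberately cut mid-character and mid-line.
    for env in iter_envelopes([raw[:25], raw[25:40], raw[40:]]):
        print(env["type"], env["data"])
```

The final flush matters only if the server ends the stream without a newline after the last envelope.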
What I’d revisit
- Phase 1 reasons toward a full answer, and we throw the answer part away. A dedicated “plan your approach, don’t answer yet” prompt for Phase 1 would make the reasoning trace tighter and cheaper. Today it shares the main system prompt — simpler, but the trace can ramble.
- No interrupt yet. Once Phase 1 starts, it runs to completion. If the user types a follow-up mid-stream we let it finish. A real cancel button would mean wiring an abort signal through Ollama’s HTTP client — feasible, not yet done.
- Phase 1 occasionally over-thinks. Greetings and trivial questions still produce a paragraph of reasoning. A “should I think?” gate (probably a tiny classifier or even a heuristic on query length) would skip Phase 1 entirely for those cases.
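As a starting point for the “should I think?” gate in the last item, a heuristic could look like the sketch below. It is purely hypothetical: the greeting list, the arithmetic pattern, and the three-word threshold are guesses I have not tuned against real traffic.

```python
import re

# Greetings and pleasantries that never need a reasoning trace.
_TRIVIAL = re.compile(
    r"^\s*(hi|hello|hey|thanks|thank you|good (morning|afternoon|evening))\b[\s!.?]*$",
    re.IGNORECASE,
)
# Bare arithmetic such as "2 + 2" goes straight to the calculator tool.
_ARITHMETIC = re.compile(r"^[\d\s+\-*/().%^]+$")


def should_think(query: str, min_words: int = 3) -> bool:
    """Return False for queries where Phase 1 would only add noise."""
    text = query.strip()
    if not text:
        return False
    if _TRIVIAL.match(text) or _ARITHMETIC.match(text):
        return False
    # Very short inputs rarely benefit from a reasoning trace.
    return len(text.split()) >= min_words


if __name__ == "__main__":
    for q in ("hello!", "2 + 2", "Compare the Q3 and Q4 revenue reports"):
        print(repr(q), should_think(q))
```

A gate like this fails open in one direction only: a misclassified query loses its reasoning panel but still gets a full Phase 2 answer.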
Takeaway
Streaming is not just an optimisation. It’s a UX primitive. Two-phase streaming buys you a free property: the visible part of the interaction starts before the slow part does. The user gets to watch the model think, which is — genuinely — more interesting than watching a spinner.
If your agent app feels slow even though the answers are fast, look at when tokens start flowing. The fix often isn’t a faster model.
Appendix: Abbreviations in this post
| Abbreviation | Full form | Meaning |
|---|---|---|
| NDJSON | Newline-Delimited JSON | A stream where each line is its own complete JSON object — what /rag emits |
| JSON | JavaScript Object Notation | The universal text format for structured data |
| UX | User Experience | How the product feels to use — the real beneficiary of two-phase streaming |
| UI | User Interface | The visible surface the stream renders into |
| FAISS | Facebook AI Similarity Search | The dense half of hybrid retrieval (previous post) |
| BM25 | Best Match 25 | The keyword half of hybrid retrieval (previous post) |
| RRF | Reciprocal Rank Fusion | The rank-only formula that merges the two result lists |
| AST | Abstract Syntax Tree | The parsed form of an expression — how the calculator evaluates maths without eval() |
| HTTP | HyperText Transfer Protocol | The protocol carrying the stream |
| SSE | Server-Sent Events | The browser’s built-in GET-only streaming format — notably not usable here, because /rag is a POST |
| API | Application Programming Interface | The boundary the frontend calls |
Next up: Crash-resumable ingestion with DBOS — how CogniVault re-ingests edited PDFs without re-embedding everything, and survives a kill -9 mid-pipeline.
