Agents |

CogniVault Backend Explained, Part 3 · How a Question Becomes a Cited Answer

Fri, 12 Jun 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

You type a question. A few seconds later you get an answer with footnotes — the exact documents and pages it came from. This part walks through everything that happens in between.

In we built the knowledge base: every document chunked, embedded, and indexed. Now we get to use it — and this is where CogniVault stops being a pipeline and starts being interesting.

Two librarians, because one keeps failing you

Imagine a library with one librarian who organises everything by vibe. Ask her about “server downtime procedures” and she’s brilliant — she understands what you mean and finds documents that discuss the concept, whatever words they use. But ask her for “Error Code 404B” and she shrugs, handing you general networking guides. She doesn’t do exact strings.

Down the hall is a second librarian with a card catalogue. He finds the exact string “404B” instantly — but ask him a conceptual question phrased differently from the source text, and he finds nothing at all.

These are the two halves of search:

Semantic search (FAISS) — your question is embedded into a vector, and the index finds chunks whose vectors point the same way (technically: cosine similarity — how closely two arrows align). Great for meaning, blind to exact identifiers.
Keyword search (BM25) — a scoring formula that rewards chunks containing your exact words, weighted by how distinctive those words are. Great for identifiers, blind to synonyms.

CogniVault asks both librarians every time, then merges their answers with Reciprocal Rank Fusion (RRF) — a formula that combines ranked lists using only the positions:

score(chunk) = sum over both lists of 1 / (60 + rank)

A chunk ranked highly by either librarian scores well; a chunk both of them liked floats to the top. The elegance is what’s missing: you never have to reconcile FAISS’s similarity scores with BM25’s completely different scale, because ranks are the only input. The constant 60 comes straight from the original 2009 research paper, and yes, it’s cited in the code.

A few implementation details worth knowing: both searches deliberately over-fetch (at least 20 candidates each) so the fusion has material to work with; very weak semantic matches are dropped, but a keyword-perfect chunk can still be rescued through fusion; and the final answer uses the top 7 chunks. I benchmarked this whole setup against pure vector search in if you want the war stories.

The agent: a model that decides for itself

Here’s the second idea that trips up beginners: CogniVault’s chat is not “paste chunks into a prompt, get an answer.” It’s an agent — a model running in a loop where it can choose to call tools, read their results, and only then answer.

Built with the Strands Agents SDK, the agent gets six tools:

Tool	Job
`search_knowledge_base`	The core RAG tool — runs the hybrid search above, returns chunks with source and page
`list_documents`	See what’s in the vault
`analyze_document`	Structured analysis of one document: topics, entities, facts, summary
`compare_documents`	Answer a question by comparing two documents side by side
`calculator`	Safe maths — the expression is parsed into a syntax tree and only whitelisted operators run. No `eval()`, ever
`current_time`	The date and time

There is no hard-coded routing. The model reads your question and decides which tools to call, guided by its system prompt. Ask “compare the two contracts on termination clauses” and it reaches for compare_documents; ask “what’s 15% of 2,340” and it uses the calculator instead of hallucinating arithmetic.

Two safety details I want beginners to notice, because they’re the difference between a toy and a product: a fresh agent is constructed for every request (no shared state bleeding between concurrent chats), and the document-analysis tools call the model directly rather than through the agent — otherwise an agent calling a tool that calls the agent could recurse forever.

Watching the model think

When you send a message, the response streams back as NDJSON (Newline-Delimited JSON — each line of the stream is its own small JSON object). And it arrives in two phases:

Phase 1 — thinking. Gemma’s reasoning chain streams first, rendered in the collapsible panel above the answer. It’s deliberately best-effort: if it fails for any reason, the answer still comes.

Phase 2 — the agent answer. Tools run, citations appear in the Sources panel the moment the search completes — before the answer finishes writing — and the answer text streams in.

flowchart TB Q["Your question
(plus optional images, files, scope)"] --> P1 subgraph STREAM["POST /rag — one NDJSON stream"] P1["Phase 1: Thinking
reasoning chunks stream first"] P1 --> P2["Phase 2: Agent
fresh per request, history restored"] P2 -->|"decides to call"| T["search_knowledge_base"] T --> D["FAISS
semantic"] T --> S["BM25
keywords"] D --> RRF["RRF fusion — top 7 chunks"] S --> RRF RRF -->|"chunks + citations"| P2 P2 --> OUT["citations, then answer text,
then a memory-usage report"] end

Each line in the stream is typed: thinking, metadata (a citation), text (answer), memory (how full the conversation budget is), or error. The frontend just reads lines and routes them to the right panel. I dissected this design — and why thinking comes before the tool calls — in .

A memory budget, not a bottomless pit

Gemma’s context window (the amount of text the model can consider at once) is 128K tokens, but CogniVault doesn’t let conversation history sprawl across all of it. Each chat session gets a budget of 48,000 characters — roughly 12,000 tokens. Exceed it, and the oldest question-answer pair quietly drops out first, keeping the bulk of the window free for what matters: your current question and the retrieved chunks.

Two resilience touches worth stealing for your own projects:

Restart survival. In-memory history dies with the process. So the first message in a session after a backend restart rebuilds its history from the chat log the frontend persists. Multi-turn memory survives reboots.
Edit and regenerate. Editing an earlier message rewinds the stored history to that point before re-asking — the model genuinely forgets the timeline that no longer exists.

Scope: pinning the AI to specific documents

One last feature, and a lesson about small local models. You can pin a chat to specific files or a category. The filter travels with the request and a mandatory-search instruction is injected into both the system prompt and the user message itself.

Why both? Because small models sometimes skip instructions that live only in the system prompt — but they can’t ignore what’s inside the question. Belt and braces. When you work with 4-billion-parameter models instead of frontier ones, you learn to make instructions impossible to miss rather than hoping they’re followed.

The takeaway

A cited answer is four systems cooperating: two retrievers covering each other’s blind spots, a fusion formula that needs nothing but ranks, an agent that picks its own tools, and a stream that shows its work. None of the four is exotic on its own — the product is the cooperation.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them
FAISS	Facebook AI Similarity Search	The semantic (meaning-based) half of hybrid search
BM25	Best Match 25	The keyword half — a classic ranking formula from the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	Merges the two ranked lists using only each chunk’s rank: `score = Σ 1/(60 + rank)`
NDJSON	Newline-Delimited JSON	A stream where each line is its own complete JSON object — the chat response format
JSON	JavaScript Object Notation	The universal text format for structured data
AST	Abstract Syntax Tree	The parsed form of an expression — how the calculator does maths without `eval()`
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
SDK	Software Development Kit	A library of building blocks — here, Strands, which provides the agent loop
K (in 128K)	Kilo (thousand)	128K tokens ≈ 128,000 tokens — Gemma’s context window

Next up: — the same machinery pointed at generating quizzes, workshops, flashcards, and mindmaps, plus a table of every byte the app stores and exactly where it lives.

Part 3 · Two-Phase Streaming: Showing the Model Think Before It Acts

Thu, 30 Apr 2026 00:00:00 +0000

Part of a series on building . Previously: . All abbreviations are fully explained in the appendix at the bottom of the page.

When I first wired up Gemma 4 with inside CogniVault, the chat felt slow. Not laggy — slow in a way that’s worse than laggy. The user types a question. The cursor sits there. Then, eventually, an answer drops out of the void.

The model wasn’t idle. It was thinking. Gemma 4 has a chain-of-thought mode that produces a (sometimes long) reasoning trace before its final reply. With a single-phase agent stream, all of that thinking is happening inside the agent loop — silently — before any tool calls run, before any tokens get emitted to the UI.

So I split the call into two phases.

The shape

POST /rag
 │
 ├── Phase 1 — Direct Ollama call, thinking enabled
 │ stream: {"type":"thinking","data":"..."} (reasoning tokens)
 │
 └── Phase 2 — Strands Agent (thinking disabled)
 stream: {"type":"metadata","data":{...}} (citations, as soon as search runs)
 stream: {"type":"text","data":"..."} (answer tokens)
 stream: {"type":"memory","data":{...}} (end-of-stream: session memory usage)

The endpoint streams newline-delimited JSON (NDJSON): each line of the response body is one self-contained JSON envelope with a type and a data. The frontend dispatches on type and renders accordingly: a collapsible reasoning panel for the thinking tokens, the main message bubble for the text tokens, a sidebar card per citation.

The user sees the model start thinking immediately. Latency to first byte drops from “long enough to wonder if it crashed” to “instant.” Total time to final answer doesn’t change. Perceived speed does.

Phase 1 — Thinking only

Phase 1 is a single direct call to Ollama with thinking enabled. It gets exactly what Phase 2 will see — the same system prompt, the current question, and any attached images — so the reasoning reflects reality. Only the reasoning tokens are consumed; whatever answer text Phase 1 starts to produce is discarded, because we don’t want a half-formed answer competing with the real one.

# Simplified from backend/services/rag_agent.py
client = ollama.AsyncClient(host=settings.ollama_host)
stream = await client.chat(
 model=settings.llm_model,
 messages=[
 {"role": "system", "content": system_prompt},
 {"role": "user", "content": query, "images": images},
 ],
 options={"thinking": True},
 stream=True,
)
async for chunk in stream:
 if chunk.message.thinking:
 yield envelope("thinking", chunk.message.thinking)

Phase 1 is deliberately best-effort: any failure here is swallowed and logged, and the stream moves straight on to Phase 2. A broken reasoning panel should never cost the user their answer.

Phase 2 — Agent with tools

Phase 2 builds a fresh Strands Agent per request — no shared mutable state between concurrent chats — restores the session’s conversation history into it, and runs the tool loop with six tools registered:

Tool	Purpose
`search_knowledge_base(query)`	Hybrid FAISS + BM25 search, top-7, RRF fusion. Scope-filter-aware.
`list_documents()`	Inventory of every indexed file with type and chunk count.
`analyze_document(filename)`	Inner Gemma call → structured summary (topics, entities, key facts).
`compare_documents(doc_a, doc_b, question)`	Inner Gemma call answering across two documents.
`calculator(expression)`	Safe AST evaluator — no `eval()`, no arbitrary code.
`current_time()`	Timestamp for time-aware queries.

The agent decides which tools to call and in what order. There’s no hard-coded router; the system prompt explains what’s available and Strands handles the loop. For most document questions the path is: search_knowledge_base → answer. For comparisons: compare_documents → answer. For “what files do I have?”: list_documents → answer. For greetings and arithmetic, the system prompt tells the agent it may skip search entirely. The model picks.

Two details that took debugging to get right:

Phase 2 runs with thinking explicitly disabled. Without that flag, Gemma’s default behaviour can leak <think>…</think> tags into the visible answer, and everything before the closing tag gets swallowed by the Markdown renderer. One model option — options={"thinking": False} — fixed a “truncated responses” bug that looked much scarier than it was.
Citations are flushed before the first answer token. Tools run before text deltas arrive, so by the time the first visible token streams, every source the search found is already in the sidebar. The accumulator is a request-local ContextVar the search tool appends to.

# Simplified — the real loop reads Strands' raw event dicts
async for event in agent.stream_async(user_input):
 delta = event["event"].get("contentBlockDelta", {}).get("delta", {}).get("text")
 if delta:
 for doc in new_citations(): # drain the ContextVar accumulator
 yield envelope("metadata", doc)
 yield envelope("text", delta)

Why this matters more than it sounds

You could implement similar behaviour with one agent call that interleaves thinking events with text events. The reasons I split it anyway:

The thinking model and the tool model can be different. Right now they’re both gemma4:e4b, but the architecture lets me swap a smaller, faster model in for Phase 1 reasoning and keep the big one for Phase 2 tool use. I’m not doing that yet — but I want the option.
Phase 1 always streams immediately. A pure agent loop only starts producing tokens after the model has decided what to say. Two-phase guarantees the user sees activity almost as soon as they press Enter, regardless of how complex the Phase 2 tool work gets.
Failures isolate. If Phase 2 falls over (Ollama timeout, tool error), Phase 1’s reasoning is still visible — the user can see what the model was trying to do, which makes the error far less frustrating than a blank “something went wrong.”

ContextVar isolation, again

The same ContextVar trick that scopes retrieval in carries here. At the start of each /rag stream, the handler sets two request-local variables: the document-scope filter and the citation accumulator. The agent’s tools read and write them implicitly. Conversation history itself lives in a per-session store guarded by per-session asyncio locks, so two concurrent requests in the same chat can’t corrupt each other either.

Tested with two browser tabs open on the same backend, scoped to different document categories, sending overlapping queries simultaneously. Zero cross-contamination. The test suite covers this explicitly in test_thinking.py and test_doc_scope_filter.py — see for the broader story.

The frontend side of the contract

A detail that tripped me up: this is a POST endpoint, so the browser’s EventSource API (which only does GET) is out. The frontend uses fetch and reads the response body incrementally, splitting on newlines and parsing each line as JSON:

// Simplified from useRagStream.ts
const res = await fetch("/rag", {
 method: "POST",
 body: JSON.stringify(payload),
});
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
 const { done, value } = await reader.read();
 if (done) break;
 buffer += decoder.decode(value, { stream: true });
 const lines = buffer.split("\n");
 buffer = lines.pop()!; // keep the trailing partial line
 for (const line of lines) {
 if (!line.trim()) continue;
 const { type, data } = JSON.parse(line);
 switch (type) {
 case "thinking":
 appendThinking(data);
 break;
 case "text":
 appendText(data);
 break;
 case "metadata":
 addCitation(data);
 break;
 case "memory":
 updateMemoryMeter(data);
 break;
 }
 }
}

The reasoning panel starts collapsed, with a small pulsing indicator while thinking tokens are still streaming — enough to signal “the model is working” without shoving a wall of chain-of-thought at the user. One click expands the full trace, during or after the stream.

What I’d revisit

Phase 1 reasons toward a full answer, and we throw the answer part away. A dedicated “plan your approach, don’t answer yet” prompt for Phase 1 would make the reasoning trace tighter and cheaper. Today it shares the main system prompt — simpler, but the trace can ramble.
No interrupt yet. Once Phase 1 starts, it runs to completion. If the user types a follow-up mid-stream we let it finish. A real cancel button would mean wiring an abort signal through Ollama’s HTTP client — feasible, not yet done.
Phase 1 occasionally over-thinks. Greetings and trivial questions still produce a paragraph of reasoning. A “should I think?” gate (probably a tiny classifier or even a heuristic on query length) would skip Phase 1 entirely for those cases.

Takeaway

Streaming is not just an optimisation. It’s a UX primitive. Two-phase streaming buys you a free property: the visible part of the interaction starts before the slow part does. The user gets to watch the model think, which is — genuinely — more interesting than watching a spinner.

If your agent app feels slow even though the answers are fast, look at when tokens start flowing. The fix often isn’t a faster model.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
NDJSON	Newline-Delimited JSON	A stream where each line is its own complete JSON object — what `/rag` emits
JSON	JavaScript Object Notation	The universal text format for structured data
UX	User Experience	How the product feels to use — the real beneficiary of two-phase streaming
UI	User Interface	The visible surface the stream renders into
FAISS	Facebook AI Similarity Search	The dense half of hybrid retrieval (previous post)
BM25	Best Match 25	The keyword half of hybrid retrieval (previous post)
RRF	Reciprocal Rank Fusion	The rank-only formula that merges the two result lists
AST	Abstract Syntax Tree	The parsed form of an expression — how the calculator evaluates maths without `eval()`
HTTP	HyperText Transfer Protocol	The protocol carrying the stream
SSE	Server-Sent Events	The browser’s built-in GET-only streaming format — notably not usable here, because `/rag` is a POST
API	Application Programming Interface	The boundary the frontend calls

Next up: — how CogniVault re-ingests edited PDFs without re-embedding everything, and survives a kill -9 mid-pipeline.