
CogniVault Backend Explained, Part 1 · Meet the Backend: Three Processes, Four Layers

Fri, 12 Jun 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

When people first open the CogniVault repository, the question I hear most is some version of: “Where do I even start?” There’s a RAG agent, a FAISS index, a DBOS workflow, an Ollama host — and if you’re transitioning into tech, every one of those words is a closed door.

This series opens the doors one at a time. No prior RAG knowledge assumed, every abbreviation spelled out, and every claim checkable against the . If you’ve already read my , think of this series as the guided tour that should have come first.

Let’s map this out.

The whole app is three processes

CogniVault lets you chat with your own documents and turn them into quizzes, workshops, flashcards, and mindmaps — and nothing ever leaves your machine. (The why behind that constraint is its own story: .)

You might expect an app like that to be a sprawl of microservices. It’s three processes:

Process	What it does
The Python backend	One FastAPI app on port 8000 — it also serves the compiled React frontend as static files
Ollama	The local model server on port 11434, running the AI models
PostgreSQL	One Docker container, used only for workflow checkpoints — never for your documents

Everything else — your files, the search index, your chat history, your quiz scores — is a plain file on disk. That’s not laziness; it’s the privacy argument made physical. You can open every byte the app stores with a text editor and a SQLite browser.

The four layers

Before we name technologies, here’s the mental model I want you to keep for the whole series. The backend is four layers, top to bottom:

Layer 1 — the web layer. A FastAPI application receives every HTTP request and routes it to one of six routers: chat (/rag), knowledge management (/upload, /ingest), study tools (/api/study/*), progress (/api/progress/*), voice (/api/transcribe), and chat history (/api/history). FastAPI (a modern Python web framework) also auto-generates interactive API documentation at /api/docs, which is the best way to explore the backend without reading a line of code.

Layer 2 — the intelligence layer. Two AI models with two different jobs. gemma4:e4b generates: chat answers, reasoning, image analysis, and tool calls. embeddinggemma embeds: it turns text into vectors (lists of numbers that capture meaning) so similar ideas can be found mathematically. Both run inside Ollama — think of Ollama as Docker, but for AI models.
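
To make "found mathematically" concrete, here is a minimal sketch of comparing vectors by cosine similarity. The three-number vectors are invented for illustration (a real embedding model outputs far more dimensions), and `cosine_similarity` is a hypothetical helper, not a CogniVault function.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", made up for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related ideas point in similar directions, unrelated ones do not.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

Semantic search boils down to this comparison repeated across every stored chunk, with the highest scores returned first.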

Layer 3 — the retrieval layer. A search engine over your documents that combines semantic search (find things that mean the same) with keyword search (find the exact string). Part 3 of this series is entirely about this layer.

Layer 4 — the persistence layer. Four storage systems, each picked for one job: a FAISS index plus a JSON file for searchable knowledge, SQLite for study data, PostgreSQL for workflow checkpoints, and plain JSON files for chat history.

One diagram, every major piece

flowchart TB
    subgraph CLIENT["Browser"]
        UI["React Frontend<br/>(compiled, served by FastAPI)"]
    end
    subgraph SERVER["FastAPI Backend (port 8000)"]
        ROUTERS["6 Routers<br/>rag · knowledge · study ·<br/>progress · audio · history"]
        AGENT["RAG Agent<br/>(Strands SDK, 6 tools)"]
        VDB["VectorDB<br/>FAISS + BM25 + RRF"]
        INGEST["Ingestion<br/>(DBOS durable workflow)"]
        GEN["Study generators<br/>quiz · workshop · cards · mindmap"]
        PROG["Progress tracker<br/>+ 25 achievements"]
    end
    subgraph OLLAMA["Ollama (port 11434)"]
        GEMMA["gemma4:e4b<br/>chat · thinking · vision · tools"]
        EMBED["embeddinggemma<br/>text to vectors"]
    end
    subgraph STORAGE["Local storage"]
        FAISSF["vector_store.faiss + .json"]
        SQLITE["progress.db (SQLite)"]
        PG["PostgreSQL<br/>workflow state only"]
        DOCS["docs/ folder + chat_history.json"]
    end
    UI --> ROUTERS
    ROUTERS --> AGENT --> VDB
    AGENT --> GEMMA
    VDB --> EMBED
    ROUTERS --> INGEST --> EMBED
    INGEST --> PG
    INGEST --> FAISSF
    VDB --- FAISSF
    ROUTERS --> GEN --> GEMMA
    GEN --> SQLITE
    ROUTERS --> PROG --> SQLITE
    ROUTERS --> DOCS

Keep this picture handy — Parts 2, 3, and 4 each zoom into one region of it.

The tech stack, and why each piece earned its place

The full dependency list lives in requirements.txt. Here’s what matters, grouped by job:

Serving requests. FastAPI defines the endpoints and validates every request and response with Pydantic (a data-validation library — think of it as a strict customs officer for JSON). Uvicorn is the ASGI server (Asynchronous Server Gateway Interface — the Python standard that lets one process juggle many simultaneous requests) that actually runs it.

Thinking. Ollama serves gemma4:e4b — the e4b tag is the roughly four-billion effective-parameter variant, about a 9.6 GB download — and embeddinggemma (about 622 MB). The agent behaviour is built with the Strands Agents SDK, which wraps the model in a loop where it can call tools, read the results, and only then answer. (Where I run Ollama relative to Docker is a deliberate choice with a story behind it: .)

Finding things. FAISS (Facebook AI Similarity Search — Meta’s vector search library) handles semantic lookups; rank-bm25 handles keyword lookups; a formula called Reciprocal Rank Fusion merges the two. Part 3 unpacks all of this.
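
Before Part 3 goes deep, a minimal sketch of Reciprocal Rank Fusion may help. It assumes the usual form of the formula, where each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is a commonly used default, and both it and the chunk IDs are assumptions for illustration, not values taken from CogniVault.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists using ranks only: score = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

semantic = ["chunk_a", "chunk_b", "chunk_c"]  # hypothetical vector-search order
keyword = ["chunk_b", "chunk_d", "chunk_a"]   # hypothetical keyword-search order

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

Because only ranks are used, the two searches never need comparable scores, which is what makes the merge so simple.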

Reading documents. pypdf for PDFs, with an OCR fallback (Optical Character Recognition — turning pictures of text into actual text) for scanned pages via pymupdf and Tesseract. Word, PowerPoint, and Excel each get their own extractor. trafilatura pulls clean article text out of web pages.

Not losing work. DBOS makes the ingestion pipeline durable — every step is checkpointed in PostgreSQL so a crash resumes instead of restarting. Part 2 shows this in action.
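
The idea of "resume instead of restart" can be sketched without DBOS at all. The example below is a toy, not DBOS's API: it records finished steps in an in-memory SQLite table (CogniVault uses PostgreSQL for this), and `run_steps`, the table name, and the step names are all hypothetical.

```python
import sqlite3

def run_steps(conn, workflow_id, steps):
    """Run named steps in order, skipping any already recorded as done."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "workflow_id TEXT, step TEXT, PRIMARY KEY (workflow_id, step))"
    )
    executed = []
    for name, func in steps:
        done = conn.execute(
            "SELECT 1 FROM checkpoints WHERE workflow_id = ? AND step = ?",
            (workflow_id, name),
        ).fetchone()
        if done:
            continue  # finished before the crash, so do not redo it
        func()
        conn.execute("INSERT INTO checkpoints VALUES (?, ?)", (workflow_id, name))
        conn.commit()  # the checkpoint survives a later failure
        executed.append(name)
    return executed

def crash():
    raise RuntimeError("simulated crash")

conn = sqlite3.connect(":memory:")
log = []

# First attempt: "extract" succeeds, "embed" crashes.
try:
    run_steps(conn, "doc-1", [("extract", lambda: log.append("extract")), ("embed", crash)])
except RuntimeError:
    pass

# Second attempt: "extract" is skipped, only "embed" runs.
resumed = run_steps(
    conn,
    "doc-1",
    [("extract", lambda: log.append("extract")), ("embed", lambda: log.append("embed"))],
)
print(resumed)  # ['embed']
print(log)      # ['extract', 'embed']
```

A real durable-workflow library also stores each step's output so later steps can reuse it; this sketch only tracks completion.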

Remembering. SQLite — a complete database engine that lives in a single file, progress.db — holds your study sessions, achievements, quizzes, workshops, flashcard decks, and mindmaps.

Appendix: Abbreviations in this post

This series’ promise is “no unexplained abbreviations,” so here is the table I wish every technical tutorial shipped with.

Abbreviation	Full form	Plain-English meaning
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
RAG	Retrieval-Augmented Generation	Fetch relevant passages from your documents first, then let the model answer from them — instead of from its training memory
API	Application Programming Interface	The set of URLs the frontend calls to talk to the backend
ASGI	Asynchronous Server Gateway Interface	The Python standard that lets the server handle many requests concurrently
JSON	JavaScript Object Notation	The universal text format for structured data
NDJSON	Newline-Delimited JSON	A stream where each line is its own JSON object — ideal for streaming AI answers chunk by chunk
FAISS	Facebook AI Similarity Search	Meta’s library for storing vectors and finding the most similar ones fast
BM25	Best Match 25	A classic keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	A formula for merging multiple ranked result lists using only the ranks
ANN	Approximate Nearest Neighbour	A speed shortcut many vector databases take. CogniVault deliberately uses an exact index instead — precise, and plenty fast at personal-library scale
DBOS	Database-Oriented Operating System (the research project it grew from)	A library that checkpoints workflow steps in a database so crashed jobs resume
SQL / SQLite	Structured Query Language / SQLite	The language of relational databases / a tiny database that lives in one file
OCR	Optical Character Recognition	Turning pictures of text (scans) into machine-readable text
SHA-256	Secure Hash Algorithm, 256-bit	A fingerprint function: identical bytes always give the same hash, and any edit gives (for all practical purposes) a different one, which is how changed files are detected
CORS	Cross-Origin Resource Sharing	Browser rules controlling which websites may call the API
SSRF	Server-Side Request Forgery	An attack where a server is tricked into fetching internal URLs — the URL-import endpoint guards against it
MCQ	Multiple-Choice Question	One of the two quiz question types
KB	Knowledge Base	All your ingested, searchable documents

(Every claim in this series can be checked directly against the — the relevant file is named whenever it matters, and the repository README maps the full architecture.)

The takeaway

Strip away the abbreviations and CogniVault is a small system: one web server, one model runtime, one durability database, and a handful of files. The sophistication isn’t in the part count — it’s in how a few well-chosen pieces cooperate. That cooperation is what the next three parts are about.

Next up: — how a 1,000-page scanned PDF becomes something the AI can search in seconds, and why the pipeline survives a crash at page 800.

Part 8 · Testing a Local-AI App: 351 Tests, Zero Infrastructure

Mon, 25 May 2026 00:00:00 +0000

Part of a series on building . Previously: . All abbreviations are fully explained in the appendix at the bottom of the page.

CogniVault has 351 tests across 22 files (at the time of writing — the suite grows with the app). None of them need Ollama. None of them need Postgres. None of them need a real PDF, a microphone, or an internet connection. The whole suite runs in about three seconds on my laptop.

That’s not because there isn’t much to test — the surface is wide. It’s because the test suite is built around one principle: mock at the edge, real everywhere else. This post is about what “the edge” means in a local-AI app, and how to draw the line so the suite stays useful instead of decorative.

The 22 test files

File	What it covers
`test_api.py`	The HTTP endpoints (upload, ingest, RAG, history, KB browsing)
`test_tools.py`	Calculator, clock, KB search tool
`test_thinking.py`	Two-phase stream, thinking tokens, session isolation
`test_chat_attachments.py`	Multi-file attach, PDF/DOCX extraction, size limits
`test_chat_memory.py`	Session history budget, trimming, restart rebuild
`test_doc_scope_filter.py`	Per-request ContextVar isolation, search filtering
`test_doc_tools.py`	`list_documents`, `analyze_document`, `compare_documents`
`test_edit_regenerate.py`	History rewind, trim_history_to_turns validation
`test_structure_chunking.py`	Markdown header splits, CSV row batches, doc types
`test_ocr_fallback.py`	OCR trigger threshold, graceful degradation
`test_new_formats.py`	PPTX, XLSX, HTML extractors, extension routing
`test_docx_url.py`	DOCX ingestion and URL import (with the SSRF guard)
`test_reingest.py`	SHA-256 change detection, idempotency
`test_vector_db.py`	BM25, FAISS, RRF fusion, hybrid search
`test_audio.py`	Whisper transcription endpoint
`test_progress.py`	Sessions, daily aggregation, achievement criteria
`test_prompts.py`	The prompt-template loader and custom overrides
`test_vault_stats.py`	The Privacy Vault Audit numbers
`test_quiz.py` / `test_workshop.py` / `test_flashcards.py` / `test_mindmaps.py`	Per-mode parsing, endpoints, achievements

Everything that can be tested in isolation is tested in isolation. Everything that needs to be tested through the FastAPI layer is, but the only things mocked are the calls that cross the process boundary.

What gets mocked, what doesn’t

The single most important question in a project like this: where do you stub?

[ React frontend ]        ←─ not in scope for backend tests
        │
        ▼
[ FastAPI handlers ]      ←─ tested directly with TestClient
        │
        ▼
[ services/ ]             ←─ tested directly (vector_db, rag_agent, generators)
        │
        ├─► [ FAISS + BM25 ]   ←─ real, in-memory, fast
        ├─► [ SQLite ]         ←─ real, against a tmp_path file
        ├─► [ DBOS ]           ←─ patched (no launch, no Postgres)
        ├─► [ Ollama ]         ←─ patched at each service's import site
        └─► [ Whisper ]        ←─ stubbed (no 145 MB model load)

The rule of thumb: anything that crosses a process or network boundary, mock. Anything in-process, run for real.

FAISS and BM25 are real because they’re libraries we link into the test process. SQLite is real because it’s a file. DBOS is patched because launching it expects a Postgres connection, and that’s network. Ollama is patched because it’s HTTP. Whisper is stubbed because loading a 145 MB model in a unit test is silly.

That principle keeps the test suite fast (no I/O the OS can’t handle in milliseconds) and meaningful (the real code paths through retrieval, chunking, parsing, scope filtering all execute).

Mocking Ollama

Most CogniVault tests need some model output, but they don’t care what model produced it. Each service imports the ollama module directly, so the tests patch that reference at the service’s own import site:

# Real pattern from test_quiz.py
import json
from unittest.mock import patch
from backend.services import quiz_generator

def test_quiz_parses_questions():
    # VALID_MCQ is a sample question dict that is not shown in this excerpt
    fake = {"message": {"content": json.dumps({"questions": [VALID_MCQ] * 5})}}
    with patch.object(quiz_generator, "ollama") as mock_ollama:
        mock_ollama.chat.return_value = fake
        result = quiz_generator.generate_quiz(
            difficulty="beginner", num_questions=5, question_types=["mcq"],
        )
    assert len(result.questions) == 5

A streaming variant feeds chunk sequences instead of a single response, used by the RAG and thinking tests. The key property: one patch.object against the module the service actually uses. No deep mock hierarchies, no fragile string paths into third-party internals. Easy to read in a code review, easy to debug when a test fails.
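
The streaming variant is not shown in this post, so here is a minimal sketch of the idea under stated assumptions: `collect_answer` is a hypothetical stand-in for a service function, and the chunk shape mirrors the dict used in the non-streaming example above. The mock's return value is simply an iterator of chunks.

```python
from unittest.mock import MagicMock

def collect_answer(client, prompt: str) -> str:
    """Hypothetical service function: join the content of every streamed chunk."""
    stream = client.chat(
        model="any-model",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return "".join(chunk["message"]["content"] for chunk in stream)

def test_collect_answer_joins_chunks():
    fake_chunks = [
        {"message": {"content": "Hel"}},
        {"message": {"content": "lo"}},
    ]
    mock_client = MagicMock()
    # A streamed response is just something iterable, so an iterator is enough.
    mock_client.chat.return_value = iter(fake_chunks)
    assert collect_answer(mock_client, "hi") == "Hello"
    mock_client.chat.assert_called_once()

test_collect_answer_joins_chunks()
```

The test never learns that no model was involved, which is the point: it pins how the code consumes a stream, not what the model says.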

Mocking DBOS

DBOS expects launch() to connect to Postgres. The shared client fixture in conftest.py simply patches the dbos instance before the app is exercised:

# Real pattern from conftest.py
from unittest.mock import MagicMock, patch

import pytest
from fastapi.testclient import TestClient

@pytest.fixture()
def client():
    """A FastAPI TestClient with DBOS launch mocked out, so no Postgres is needed."""
    with patch("backend.services.ingest.dbos") as mock_dbos:
        mock_dbos.launch = MagicMock()
        # Imported while the patch is active
        from backend.main import app
        with TestClient(app) as c:
            yield c

The decorated workflow steps still execute as ordinary Python functions — we lose the durability semantics, but the tests aren’t testing durability, they’re testing the business logic inside the steps (hash detection, extraction, chunking). The durability layer has its own tests upstream, in DBOS’s own suite.

There’s a second isolation layer that runs on every test automatically: an autouse fixture points the docs folder, FAISS index, and metadata file at a per-test tmp_path via environment variables, so no test can ever touch real data on disk.

Real SQLite, with one override

Progress tracking, achievements, quiz storage, deck CRUD — all SQLite. The progress tracker exposes a single test seam: a module-level path override.

# Real pattern from test_quiz.py
import pytest

@pytest.fixture(autouse=True)
def _isolate_progress_db(tmp_path, monkeypatch):
    # Point the tracker at a throwaway database file for this test only
    monkeypatch.setattr(progress_tracker, "_db_path_override",
                        str(tmp_path / "progress_test.db"))

Every test gets a fresh database file; the schema auto-creates on first use. No connection pooling drama, no leaked state between tests, no in-memory :memory: gymnastics. Just a temp file per test.

This is the kind of test that catches bugs an SQL-level mock would never see — a missing index, a botched migration, a constraint that doesn’t fire. SQLite is fast enough on every machine I’ve ever owned that “use the real database” isn’t even a trade-off.

The TestClient pattern

For HTTP tests, FastAPI’s TestClient runs the app in-process. The upload, the validation, the chunking, the vector-store update, the response serialisation — every layer runs for real. Only the calls that would leave the process (the Ollama embedding call inside ingestion, the model call inside generation) are patched. That’s the right line: the test verifies the integration of those layers, but doesn’t depend on an external service.

The streaming endpoint tests use a slightly different style — they iterate the response body and parse each NDJSON line (one JSON envelope per line, as described in ) — but the principle is identical.
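
A minimal sketch of that line-by-line parsing, assuming the envelope shape with a type and a data field. The `parse_ndjson` helper and the sample payloads (including the `source` key) are hypothetical; in a real test the lines would come from iterating the streamed response instead of a hard-coded list.

```python
import json

def parse_ndjson(lines):
    """Yield one decoded envelope per non-empty line."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        if line.strip():
            yield json.loads(line)

# Hard-coded stand-in for a streamed response body.
body = [
    '{"type": "thinking", "data": "Let me check the notes."}',
    '{"type": "metadata", "data": {"source": "notes.pdf"}}',
    "",
    '{"type": "text", "data": "Here is "}',
    '{"type": "text", "data": "the answer."}',
]

envelopes = list(parse_ndjson(body))
answer = "".join(e["data"] for e in envelopes if e["type"] == "text")
print([e["type"] for e in envelopes])  # ['thinking', 'metadata', 'text', 'text']
print(answer)                          # Here is the answer.
```

Assertions can then target the order of envelope types as well as the assembled answer text.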

Coverage gaps I accept

Three things the test suite doesn’t cover:

The frontend. No React testing in this suite — that’s a separate concern. Most failures show up in API tests anyway, because the frontend is a thin client over a typed API.
Real Ollama prompt quality. Whether gemma4:e4b actually produces useful quiz questions is not a thing tests can answer. That’s evaluation, not testing. It belongs in a separate harness with a real model running.
Race conditions across DBOS workflow restarts. The resume path is exercised at the logic level, but the full state space of “what happens if Postgres goes away at this exact instant” is too large to enumerate.

These are conscious gaps. The test suite is for catching regressions in code I wrote; it’s not a replacement for evaluation, integration testing, or actual chaos engineering.

What the suite is actually for

Two things, in order:

Refactor confidence. When I rip out the agent loop and put a new one in, do the tests still pass? If yes, the API contracts I care about haven’t drifted.
PR review surface. Every PR runs the suite in CI. A green run is a precondition for merge. The suite covers enough ground that a real regression fails loudly.

Notice what it isn’t for: proving the model works. It can’t. Tests can pin behaviour but they can’t pin quality. That’s a different muscle, and it belongs in a different harness.

What’s worth borrowing

If you’re building a local-AI app and your tests need Ollama running:

Patch the ollama module at each service’s import site with patch.object(service_module, "ollama") — one seam per service, no shims required.
Give your DB layer a path override and run against a tmp_path SQLite file.
Use an autouse fixture to redirect every on-disk artefact (docs folder, index files) to tmp_path, so no test can touch real data even by accident.
For each external service (model, audio, workflow engine), draw the seam at the process boundary. Test everything above it with real code.

The result is a suite where every test runs in any environment, finishes in milliseconds, and exercises the actual integration of every layer of code you wrote. 351 tests in about three seconds isn’t an optimisation, it’s a side-effect of mocking only at the edges.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
CI	Continuous Integration	Automatically running the test suite on every push/PR
PR	Pull Request	A proposed code change — merged only when the suite is green
API	Application Programming Interface	The HTTP surface the TestClient exercises in-process
HTTP	HyperText Transfer Protocol	The protocol the (in-process) endpoint tests speak
RAG	Retrieval-Augmented Generation	The retrieval-then-answer pipeline under test
KB	Knowledge Base	The indexed document collection
FAISS	Facebook AI Similarity Search	Real in tests — it’s an in-process library
BM25	Best Match 25	The keyword index — also real in tests
RRF	Reciprocal Rank Fusion	The rank-merging formula covered by `test_vector_db.py`
SQLite / SQL	(SQL = Structured Query Language)	The real, file-based database every progress test runs against
DBOS	Database-Oriented Operating System	The durable-workflow library — patched so no Postgres is needed
OCR	Optical Character Recognition	The scanned-PDF fallback with its own trigger-threshold tests
SSRF	Server-Side Request Forgery	The URL-import attack class covered in `test_docx_url.py`
NDJSON	Newline-Delimited JSON	The streaming format the endpoint tests parse line by line
SHA-256	Secure Hash Algorithm, 256-bit	The content fingerprint behind the re-ingest tests
CRUD	Create, Read, Update, Delete	The basic storage operations for decks, quizzes, and maps
PDF / DOCX / PPTX / XLSX / HTML	Portable Document Format / Word / PowerPoint / Excel / HyperText Markup Language	The extractor formats with dedicated tests

That’s the series. Eight posts on the parts of I’m most proud of — and a handful I’d build differently. If any of it was useful to you, the code is open source at , and the is on YouTube.

Your data. Your hardware. Your AI. Your vault.

Part 3 · Two-Phase Streaming: Showing the Model Think Before It Acts

Thu, 30 Apr 2026 00:00:00 +0000

Part of a series on building . Previously: . All abbreviations are fully explained in the appendix at the bottom of the page.

When I first wired up Gemma 4 with inside CogniVault, the chat felt slow. Not laggy — slow in a way that’s worse than laggy. The user types a question. The cursor sits there. Then, eventually, an answer drops out of the void.

The model wasn’t idle. It was thinking. Gemma 4 has a chain-of-thought mode that produces a (sometimes long) reasoning trace before its final reply. With a single-phase agent stream, all of that thinking is happening inside the agent loop — silently — before any tool calls run, before any tokens get emitted to the UI.

So I split the call into two phases.

The shape

POST /rag
 │
 ├── Phase 1: Direct Ollama call, thinking enabled
 │     stream: {"type":"thinking","data":"..."}   (reasoning tokens)
 │
 └── Phase 2: Strands Agent (thinking disabled)
       stream: {"type":"metadata","data":{...}}   (citations, as soon as search runs)
       stream: {"type":"text","data":"..."}       (answer tokens)
       stream: {"type":"memory","data":{...}}     (end-of-stream: session memory usage)

The endpoint streams newline-delimited JSON (NDJSON): each line of the response body is one self-contained JSON envelope with a type and a data. The frontend dispatches on type and renders accordingly: a collapsible reasoning panel for the thinking tokens, the main message bubble for the text tokens, a sidebar card per citation.
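
The snippets in this post call an `envelope` helper without showing its body. A minimal sketch of what such a helper could look like, assuming nothing beyond the type and data fields described above (the real implementation may differ):

```python
import json
from typing import Any

def envelope(kind: str, data: Any) -> str:
    """Serialise one event as a single NDJSON line: a JSON object plus a newline."""
    return json.dumps({"type": kind, "data": data}, ensure_ascii=False) + "\n"

line = envelope("thinking", "Searching the knowledge base")
print(line, end="")  # {"type": "thinking", "data": "Searching the knowledge base"}

# Newlines inside the payload are escaped, so one envelope never spans two lines.
multi = envelope("text", "first\nsecond")
print(multi.count("\n"))  # 1
```

That escaping property is what makes splitting the stream on newlines safe for the client.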

The user sees the model start thinking immediately. Latency to first byte drops from “long enough to wonder if it crashed” to “instant.” Total time to final answer doesn’t change. Perceived speed does.

Phase 1 — Thinking only

Phase 1 is a single direct call to Ollama with thinking enabled. It gets exactly what Phase 2 will see — the same system prompt, the current question, and any attached images — so the reasoning reflects reality. Only the reasoning tokens are consumed; whatever answer text Phase 1 starts to produce is discarded, because we don’t want a half-formed answer competing with the real one.

# Simplified from backend/services/rag_agent.py
import ollama

client = ollama.AsyncClient(host=settings.ollama_host)
stream = await client.chat(
    model=settings.llm_model,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query, "images": images},
    ],
    options={"thinking": True},
    stream=True,
)
async for chunk in stream:
    # Forward only reasoning tokens; any answer text from Phase 1 is dropped
    if chunk.message.thinking:
        yield envelope("thinking", chunk.message.thinking)

Phase 1 is deliberately best-effort: any failure here is swallowed and logged, and the stream moves straight on to Phase 2. A broken reasoning panel should never cost the user their answer.
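
The best-effort behaviour can be sketched as a wrapper around two async generators. Everything here is a stand-in: `failing_phase_one`, `phase_two`, and `two_phase_stream` are hypothetical names, and the simulated failure replaces whatever might go wrong in a real model call.

```python
import asyncio
import logging

logger = logging.getLogger("two_phase_demo")

async def failing_phase_one():
    """Stand-in for Phase 1: emits one chunk, then fails."""
    yield {"type": "thinking", "data": "Considering the question"}
    raise ConnectionError("model server unreachable")

async def phase_two():
    """Stand-in for Phase 2: the real answer."""
    yield {"type": "text", "data": "Here is the answer."}

async def two_phase_stream(first, second):
    try:
        async for item in first():
            yield item
    except Exception:
        # Swallow and log: a broken reasoning panel must not cost the answer.
        logger.warning("Phase 1 failed; continuing with Phase 2", exc_info=True)
    async for item in second():
        yield item

async def main():
    return [item async for item in two_phase_stream(failing_phase_one, phase_two)]

events = asyncio.run(main())
print([e["type"] for e in events])  # ['thinking', 'text']
```

Whatever reasoning arrived before the failure stays on screen, and the answer still follows.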

Phase 2 — Agent with tools

Phase 2 builds a fresh Strands Agent per request — no shared mutable state between concurrent chats — restores the session’s conversation history into it, and runs the tool loop with six tools registered:

Tool	Purpose
`search_knowledge_base(query)`	Hybrid FAISS + BM25 search, top-7, RRF fusion. Scope-filter-aware.
`list_documents()`	Inventory of every indexed file with type and chunk count.
`analyze_document(filename)`	Inner Gemma call → structured summary (topics, entities, key facts).
`compare_documents(doc_a, doc_b, question)`	Inner Gemma call answering across two documents.
`calculator(expression)`	Safe AST evaluator — no `eval()`, no arbitrary code.
`current_time()`	Timestamp for time-aware queries.

The agent decides which tools to call and in what order. There’s no hard-coded router; the system prompt explains what’s available and Strands handles the loop. For most document questions the path is: search_knowledge_base → answer. For comparisons: compare_documents → answer. For “what files do I have?”: list_documents → answer. For greetings and arithmetic, the system prompt tells the agent it may skip search entirely. The model picks.
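
The calculator row in the table above mentions a safe AST evaluator. A minimal sketch of that technique follows; it is not the CogniVault tool itself, and `safe_eval` and the set of supported operators are assumptions chosen to keep the example short.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_eval(expression: str) -> float:
    """Evaluate plain arithmetic; any other syntax raises ValueError."""
    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes and everything else are rejected.
        raise ValueError(f"unsupported expression: {type(node).__name__}")

    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError("not a valid expression") from exc
    return _eval(tree.body)

print(safe_eval("2 * (3 + 4)"))  # 14
```

The expression is parsed but never executed: only node types on the allow-list are walked, so something like a function call fails with `ValueError` instead of running.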

Two details that took debugging to get right:

Phase 2 runs with thinking explicitly disabled. Without that flag, Gemma’s default behaviour can leak <think>…</think> tags into the visible answer, and everything before the closing tag gets swallowed by the Markdown renderer. One model option — options={"thinking": False} — fixed a “truncated responses” bug that looked much scarier than it was.
Citations are flushed before the first answer token. Tools run before text deltas arrive, so by the time the first visible token streams, every source the search found is already in the sidebar. The accumulator is a request-local ContextVar the search tool appends to.

# Simplified: the real loop reads Strands' raw event dicts
async for event in agent.stream_async(user_input):
    # Not every event carries a model delta, so default to empty dicts
    delta = event.get("event", {}).get("contentBlockDelta", {}).get("delta", {}).get("text")
    if delta:
        for doc in new_citations():  # drain the ContextVar accumulator
            yield envelope("metadata", doc)
        yield envelope("text", delta)

Why this matters more than it sounds

You could implement similar behaviour with one agent call that interleaves thinking events with text events. The reasons I split it anyway:

The thinking model and the tool model can be different. Right now they’re both gemma4:e4b, but the architecture lets me swap a smaller, faster model in for Phase 1 reasoning and keep the big one for Phase 2 tool use. I’m not doing that yet — but I want the option.
Phase 1 always streams immediately. A pure agent loop only starts producing tokens after the model has decided what to say. Two-phase guarantees the user sees activity almost as soon as they press Enter, regardless of how complex the Phase 2 tool work gets.
Failures isolate. If Phase 2 falls over (Ollama timeout, tool error), Phase 1’s reasoning is still visible — the user can see what the model was trying to do, which makes the error far less frustrating than a blank “something went wrong.”

ContextVar isolation, again

The same ContextVar trick that scopes retrieval in carries here. At the start of each /rag stream, the handler sets two request-local variables: the document-scope filter and the citation accumulator. The agent’s tools read and write them implicitly. Conversation history itself lives in a per-session store guarded by per-session asyncio locks, so two concurrent requests in the same chat can’t corrupt each other either.
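
A minimal sketch of why this isolates requests, using only the standard library. The variable names, `search_tool`, and `handle_request` are hypothetical stand-ins; the mechanism shown is that each asyncio task runs in its own copy of the context, so values set in one request are invisible to the other.

```python
import asyncio
from contextvars import ContextVar

doc_scope: ContextVar[str] = ContextVar("doc_scope", default="all")
citations: ContextVar[list] = ContextVar("citations")

def search_tool(query: str) -> None:
    """Stand-in for a tool: reads the scope and records a citation implicitly."""
    citations.get().append(f"{doc_scope.get()}:{query}")

async def handle_request(scope: str, query: str) -> list:
    doc_scope.set(scope)
    citations.set([])       # fresh accumulator for this request
    await asyncio.sleep(0)  # yield control so the two requests interleave
    search_tool(query)
    return citations.get()

async def main():
    # gather() wraps each coroutine in a task with its own context copy.
    return await asyncio.gather(
        handle_request("biology", "cells"),
        handle_request("history", "rome"),
    )

results = asyncio.run(main())
print(results)  # [['biology:cells'], ['history:rome']]
```

Neither request sees the other's scope or citations, even though both ran on the same event loop and the tool takes no scope argument.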

Tested with two browser tabs open on the same backend, scoped to different document categories, sending overlapping queries simultaneously. Zero cross-contamination. The test suite covers this explicitly in test_thinking.py and test_doc_scope_filter.py — see for the broader story.

The frontend side of the contract

A detail that tripped me up: this is a POST endpoint, so the browser’s EventSource API (which only does GET) is out. The frontend uses fetch and reads the response body incrementally, splitting on newlines and parsing each line as JSON:

// Simplified from useRagStream.ts
const res = await fetch("/rag", {
  method: "POST",
  headers: { "Content-Type": "application/json" }, // declare the body as JSON
  body: JSON.stringify(payload),
});
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop()!; // keep the trailing partial line
  for (const line of lines) {
    if (!line.trim()) continue;
    const { type, data } = JSON.parse(line);
    switch (type) {
      case "thinking":
        appendThinking(data);
        break;
      case "text":
        appendText(data);
        break;
      case "metadata":
        addCitation(data);
        break;
      case "memory":
        updateMemoryMeter(data);
        break;
    }
  }
}

The reasoning panel starts collapsed, with a small pulsing indicator while thinking tokens are still streaming — enough to signal “the model is working” without shoving a wall of chain-of-thought at the user. One click expands the full trace, during or after the stream.

What I’d revisit

Phase 1 reasons toward a full answer, and we throw the answer part away. A dedicated “plan your approach, don’t answer yet” prompt for Phase 1 would make the reasoning trace tighter and cheaper. Today it shares the main system prompt — simpler, but the trace can ramble.
No interrupt yet. Once Phase 1 starts, it runs to completion. If the user types a follow-up mid-stream we let it finish. A real cancel button would mean wiring an abort signal through Ollama’s HTTP client — feasible, not yet done.
Phase 1 occasionally over-thinks. Greetings and trivial questions still produce a paragraph of reasoning. A “should I think?” gate (probably a tiny classifier or even a heuristic on query length) would skip Phase 1 entirely for those cases.
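
As a rough illustration of the heuristic option, here is one way such a gate could look. It is purely hypothetical: the greeting list, the four-word threshold, and the `should_think` name are all assumptions, and nothing like this exists in CogniVault yet.

```python
import re

GREETINGS = {"hi", "hello", "hey", "thanks", "thank you", "bye"}

def should_think(query: str, min_words: int = 4) -> bool:
    """Return False for greetings and very short queries, True otherwise."""
    # Drop punctuation so "Hello!" and "hello" are treated the same.
    normalised = re.sub(r"[^\w\s]", "", query).strip().lower()
    if normalised in GREETINGS:
        return False
    return len(normalised.split()) >= min_words

print(should_think("Hello!"))                                # False
print(should_think("What is 2+2?"))                          # False
print(should_think("Compare the two onboarding documents"))  # True
```

A length check like this will misjudge short but hard questions, which is one reason a small classifier might be the better long-term choice.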

Takeaway

Streaming is not just an optimisation. It’s a UX primitive. Two-phase streaming buys you a free property: the visible part of the interaction starts before the slow part does. The user gets to watch the model think, which is — genuinely — more interesting than watching a spinner.

If your agent app feels slow even though the answers are fast, look at when tokens start flowing. The fix often isn’t a faster model.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
NDJSON	Newline-Delimited JSON	A stream where each line is its own complete JSON object — what `/rag` emits
JSON	JavaScript Object Notation	The universal text format for structured data
UX	User Experience	How the product feels to use — the real beneficiary of two-phase streaming
UI	User Interface	The visible surface the stream renders into
FAISS	Facebook AI Similarity Search	The dense half of hybrid retrieval (previous post)
BM25	Best Match 25	The keyword half of hybrid retrieval (previous post)
RRF	Reciprocal Rank Fusion	The rank-only formula that merges the two result lists
AST	Abstract Syntax Tree	The parsed form of an expression — how the calculator evaluates maths without `eval()`
HTTP	HyperText Transfer Protocol	The protocol carrying the stream
SSE	Server-Sent Events	The browser’s built-in GET-only streaming format — notably not usable here, because `/rag` is a POST
API	Application Programming Interface	The boundary the frontend calls

Next up: — how CogniVault re-ingests edited PDFs without re-embedding everything, and survives a kill -9 mid-pipeline.