RAG |

CogniVault Backend Explained, Part 1 · Meet the Backend: Three Processes, Four Layers

Fri, 12 Jun 2026 00:00:00 +0000

When people first open the CogniVault repository, the question I hear most is some version of: “Where do I even start?” There’s a RAG agent, a FAISS index, a DBOS workflow, an Ollama host — and if you’re transitioning into tech, every one of those words is a closed door.

This series opens the doors one at a time. No prior RAG knowledge assumed, every abbreviation spelled out, and every claim checkable against the . If you’ve already read my , think of this series as the guided tour that should have come first.

Let’s map this out.

The whole app is three processes

CogniVault lets you chat with your own documents and turn them into quizzes, workshops, flashcards, and mindmaps — and nothing ever leaves your machine. (The why behind that constraint is its own story: .)

You might expect an app like that to be a sprawl of microservices. It’s three processes:

Process	What it does
The Python backend	One FastAPI app on port 8000 — it also serves the compiled React frontend as static files
Ollama	The local model server on port 11434, running the AI models
PostgreSQL	One Docker container, used only for workflow checkpoints — never for your documents

Everything else — your files, the search index, your chat history, your quiz scores — is a plain file on disk. That’s not laziness; it’s the privacy argument made physical. You can open every byte the app stores with a text editor and a SQLite browser.

The four layers

Before we name technologies, here’s the mental model I want you to keep for the whole series. The backend is four layers, top to bottom:

Layer 1 — the web layer. A FastAPI application receives every HTTP request and routes it to one of six routers: chat (/rag), knowledge management (/upload, /ingest), study tools (/api/study/*), progress (/api/progress/*), voice (/api/transcribe), and chat history (/api/history). FastAPI (a modern Python web framework) also auto-generates interactive API documentation at /api/docs, which is the best way to explore the backend without reading a line of code.

Layer 2 — the intelligence layer. Two AI models with two different jobs. gemma4:e4b generates: chat answers, reasoning, image analysis, and tool calls. embeddinggemma embeds: it turns text into vectors (lists of numbers that capture meaning) so similar ideas can be found mathematically. Both run inside Ollama — think of Ollama as Docker, but for AI models.

Layer 3 — the retrieval layer. A search engine over your documents that combines semantic search (find things that mean the same) with keyword search (find the exact string). Part 3 of this series is entirely about this layer.

Layer 4 — the persistence layer. Four storage systems, each picked for one job: a FAISS index plus a JSON file for searchable knowledge, SQLite for study data, PostgreSQL for workflow checkpoints, and plain JSON files for chat history.

One diagram, every major piece

flowchart TB subgraph CLIENT["Browser"] UI["React Frontend
(compiled, served by FastAPI)"] end subgraph SERVER["FastAPI Backend — port 8000"] ROUTERS["6 Routers
rag · knowledge · study ·
progress · audio · history"] AGENT["RAG Agent
(Strands SDK, 6 tools)"] VDB["VectorDB
FAISS + BM25 + RRF"] INGEST["Ingestion
(DBOS durable workflow)"] GEN["Study generators
quiz · workshop · cards · mindmap"] PROG["Progress tracker
+ 25 achievements"] end subgraph OLLAMA["Ollama — port 11434"] GEMMA["gemma4:e4b
chat · thinking · vision · tools"] EMBED["embeddinggemma
text to vectors"] end subgraph STORAGE["Local storage"] FAISSF["vector_store.faiss + .json"] SQLITE["progress.db (SQLite)"] PG["PostgreSQL
workflow state only"] DOCS["docs/ folder + chat_history.json"] end UI --> ROUTERS ROUTERS --> AGENT --> VDB AGENT --> GEMMA VDB --> EMBED ROUTERS --> INGEST --> EMBED INGEST --> PG INGEST --> FAISSF VDB --- FAISSF ROUTERS --> GEN --> GEMMA GEN --> SQLITE ROUTERS --> PROG --> SQLITE ROUTERS --> DOCS

Keep this picture handy — Parts 2, 3, and 4 each zoom into one region of it.

The tech stack, and why each piece earned its place

The full dependency list lives in requirements.txt. Here’s what matters, grouped by job:

Serving requests. FastAPI defines the endpoints and validates every request and response with Pydantic (a data-validation library — think of it as a strict customs officer for JSON). Uvicorn is the ASGI server (Asynchronous Server Gateway Interface — the Python standard that lets one process juggle many simultaneous requests) that actually runs it.

Thinking. Ollama serves gemma4:e4b — the e4b tag is the roughly four-billion effective-parameter variant, about a 9.6 GB download — and embeddinggemma (about 622 MB). The agent behaviour is built with the Strands Agents SDK, which wraps the model in a loop where it can call tools, read the results, and only then answer. (Where I run Ollama relative to Docker is a deliberate choice with a story behind it: .)

Finding things. FAISS (Facebook AI Similarity Search — Meta’s vector search library) handles semantic lookups; rank-bm25 handles keyword lookups; a formula called Reciprocal Rank Fusion merges the two. Part 3 unpacks all of this.

Reading documents. pypdf for PDFs, with an OCR fallback (Optical Character Recognition — turning pictures of text into actual text) for scanned pages via pymupdf and Tesseract. Word, PowerPoint, and Excel each get their own extractor. trafilatura pulls clean article text out of web pages.

Not losing work. DBOS makes the ingestion pipeline durable — every step is checkpointed in PostgreSQL so a crash resumes instead of restarting. Part 2 shows this in action.

Remembering. SQLite — a complete database engine that lives in a single file, progress.db — holds your study sessions, achievements, quizzes, workshops, flashcard decks, and mindmaps.

Appendix: Abbreviations in this post

This series’ promise is “no unexplained abbreviations,” so here is the table I wish every technical tutorial shipped with.

Abbreviation	Full form	Plain-English meaning
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
RAG	Retrieval-Augmented Generation	Fetch relevant passages from your documents first, then let the model answer from them — instead of from its training memory
API	Application Programming Interface	The set of URLs the frontend calls to talk to the backend
ASGI	Asynchronous Server Gateway Interface	The Python standard that lets the server handle many requests concurrently
JSON	JavaScript Object Notation	The universal text format for structured data
NDJSON	Newline-Delimited JSON	A stream where each line is its own JSON object — ideal for streaming AI answers chunk by chunk
FAISS	Facebook AI Similarity Search	Meta’s library for storing vectors and finding the most similar ones fast
BM25	Best Match 25	A classic keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	A formula for merging multiple ranked result lists using only the ranks
ANN	Approximate Nearest Neighbour	A speed shortcut many vector databases take. CogniVault deliberately uses an exact index instead — precise, and plenty fast at personal-library scale
DBOS	Database-Oriented Operating System (the research project it grew from)	A library that checkpoints workflow steps in a database so crashed jobs resume
SQL / SQLite	Structured Query Language / SQLite	The language of relational databases / a tiny database that lives in one file
OCR	Optical Character Recognition	Turning pictures of text (scans) into machine-readable text
SHA-256	Secure Hash Algorithm, 256-bit	A fingerprint function — any file maps to a unique hash, used to detect changed files
CORS	Cross-Origin Resource Sharing	Browser rules controlling which websites may call the API
SSRF	Server-Side Request Forgery	An attack where a server is tricked into fetching internal URLs — the URL-import endpoint guards against it
MCQ	Multiple-Choice Question	One of the two quiz question types
KB	Knowledge Base	All your ingested, searchable documents

(Every claim in this series can be checked directly against the — the relevant file is named whenever it matters, and the repository README maps the full architecture.)

The takeaway

Strip away the abbreviations and CogniVault is a small system: one web server, one model runtime, one durability database, and a handful of files. The sophistication isn’t in the part count — it’s in how a few well-chosen pieces cooperate. That cooperation is what the next three parts are about.

Next up: — how a 1,000-page scanned PDF becomes something the AI can search in seconds, and why the pipeline survives a crash at page 800.

CogniVault Backend Explained, Part 2 · From File to Searchable Knowledge

Fri, 12 Jun 2026 00:00:00 +0000

An LLM cannot “open” your PDF. That sentence surprises a lot of newcomers, so let’s sit with it for a second: when you chat with your documents in CogniVault, the model never touches the original files. Something has to happen between “I dropped a file into the browser” and “the AI just quoted page 47 back at me.”

That something is ingestion, and it’s the subject of this part. In we drew the whole map; today we zoom into one region — the conveyor belt that turns files into searchable knowledge.

The conveyor belt

Think of ingestion as a four-station assembly line:

Extract the text out of each file — even scanned ones.
Chunk it into pieces small enough to fit into a prompt.
Embed each chunk — turn it into a vector (a list of numbers that captures its meaning) so similar ideas land near each other in vector space.
Store vectors and metadata so they can be searched later.

flowchart TD A["Upload
POST /upload
saved to docs/"] --> B subgraph WF["DBOS durable workflow"] B["Step 1
Which files changed?
SHA-256 fingerprints"] --> C["Step 2
Extract text
per-format + OCR fallback"] C --> D["Chunk
1000 chars, 100 overlap"] D --> E["Step 3
Embed
embeddinggemma, batches of 5"] E --> F["Step 4
Save
FAISS index + metadata JSON"] end F --> G["Reload in-memory index
instantly searchable"]

Simple enough. The interesting engineering is in the failure cases — so let’s start there.

The factory ledger: why the pipeline can’t lose work

Embedding a large library takes minutes. What happens when your laptop goes to sleep at page 800 of a 1,000-page manual? With a plain Python script: everything restarts from page 1.

CogniVault instead writes the pipeline as a DBOS durable workflow. Picture a factory where every station stamps a permanent ledger the moment it finishes a box. If the power cuts out, nobody rebuilds finished boxes — the workers read the ledger and resume at the first unstamped entry.

DBOS is that ledger, and PostgreSQL is the book it’s written in. Each pipeline station is a checkpointed step; on restart, completed steps return their recorded results instantly and execution continues from the first unfinished one. A failed embedding batch is simply retried.

This is also what powers the live progress timeline in the UI: starting an ingestion returns a workflow_id, and the frontend polls a status endpoint that reports which steps have completed, which are running, and which are still waiting.

I wrote a whole deep dive on this mechanism — including what happens when you kill -9 the process mid-ingest — in .

Fingerprints, not faith: SHA-256 change detection

Re-embedding your whole library every time you add one file would be wasteful. So before any work happens, the pipeline computes each file’s SHA-256 hash (a content fingerprint — change one character in the file and the fingerprint changes completely) and compares it to the fingerprint stored with the file’s existing chunks:

Never seen before → ingest it.
Fingerprint changed → the old chunks are soft-deleted and the file is re-ingested.
Fingerprint identical → skip it entirely.

Why “soft”-deleted? Because the FAISS index type CogniVault uses cannot remove individual vectors. Stale chunks are just marked deleted: true in the metadata; their vectors stay in the index but every search filters them out. It’s an honest, boring solution — and it never corrupts the index.

Every format gets its own treatment

Here’s a detail that separates a demo from a product. A naive pipeline extracts “all the text” and calls it a day. CogniVault gives each format an extractor that preserves the structure that retrieval will need later:

Format	Strategy
PDF	Page by page, keeping page numbers (those become citations later). Any page yielding fewer than 50 characters is presumed scanned and sent to OCR
Scanned page	The page is rendered to an image at roughly 144 dpi, then Tesseract OCR (Optical Character Recognition — reading text out of images) extracts the words
Markdown	Split on headings; each section chunk gets a breadcrumb prefix like `[Section: Intro > Setup]` so its embedding carries the document hierarchy
CSV	Rows grouped 20 per chunk — and every chunk is prefixed with the header row, so the model always knows the column names
Excel	Same row-group idea per sheet, prefixed `[Sheet: name]`
PowerPoint	One chunk per slide
Word	Paragraphs plus table cells
Web pages	Fetched on request and stripped to clean article text — behind an SSRF guard (Server-Side Request Forgery protection: the server refuses to fetch private or internal addresses)

Ask yourself why the CSV detail matters. If chunk 14 of a spreadsheet is just twenty naked rows of numbers, no search will ever connect it to the question “what was the Q3 budget?” Prefix it with the header row, and the chunk knows it contains budget columns. Structure is retrieval fuel.

Chunking: 1,000 characters with a 100-character safety overlap

Long text is split into pieces of about 1,000 characters, with neighbouring pieces overlapping by 100. The overlap is insurance: a sentence sliced at a chunk boundary still appears whole in one of the two neighbours, so no idea falls into the gap between chunks.

Embedding and saving

Chunks are embedded by embeddinggemma (via Ollama) in batches of five — each chunk becomes one vector. The vectors are normalised and appended to a FAISS index; alongside it, a JSON file records each chunk’s source filename, page number, category, fingerprint, and the text itself. The index holds the numbers; the JSON holds the meaning.

One choice worth highlighting for beginners: this is an exact index, not an approximate one. Many vector databases use ANN (Approximate Nearest Neighbour) shortcuts that trade a little accuracy for speed at massive scale. At personal-library scale you don’t need the trade — CogniVault checks every vector on every search and is still fast.

The whole journey, end to end

%%{init: {'sequence': {'actorFontSize': 28, 'messageFontSize': 24, 'loopTextFontSize': 22, 'noteFontSize': 22}}}%% sequenceDiagram actor U as You participant F as Frontend participant B as FastAPI participant W as DBOS Workflow participant O as Ollama (embeddinggemma) participant V as FAISS + metadata U->>F: Drag and drop a file, pick a category F->>B: POST /upload B->>B: Validate type and size, save to docs/ F->>B: POST /ingest B->>W: Start durable workflow B-->>F: workflow_id loop Poll status F->>B: GET /ingest/status/{workflow_id} B-->>F: Step list (drives the progress timeline) end W->>W: SHA-256 change detection W->>W: Extract text (per format, OCR if scanned) W->>W: Chunk (1000 chars / 100 overlap) W->>O: Embed in batches of 5 O-->>W: Vectors W->>V: Append vectors + metadata B-->>F: SUCCESS — index reloaded F-->>U: "Knowledge Sync Complete"

The takeaway

Ingestion is where most RAG quality is actually won or lost — long before any clever prompting. Page numbers preserved, headers carried into every spreadsheet chunk, scans rescued by OCR, and a ledger that makes the whole thing crash-proof: none of it is glamorous, all of it shows up later as answers that cite the right page.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
DBOS	Database-Oriented Operating System	The library that checkpoints workflow steps in PostgreSQL so crashed jobs resume
SHA-256	Secure Hash Algorithm, 256-bit	A content fingerprint — change one byte of a file and the hash changes completely
OCR	Optical Character Recognition	Reading text out of images — the rescue path for scanned PDF pages
SSRF	Server-Side Request Forgery	An attack where a server is tricked into fetching internal URLs; the URL importer blocks it
FAISS	Facebook AI Similarity Search	The vector index the embeddings are appended to
ANN	Approximate Nearest Neighbour	The accuracy-for-speed shortcut CogniVault deliberately does not take
dpi	Dots Per Inch	Image resolution — scanned pages are rendered at ~144 dpi before OCR
JSON	JavaScript Object Notation	The format of the chunk-metadata file beside the FAISS index
PDF / CSV	Portable Document Format / Comma-Separated Values	Two of the eight-plus supported file formats
API	Application Programming Interface	The endpoints (`/upload`, `/ingest`, `/ingest/status/…`) driving the flow

Next up: — hybrid retrieval, the six-tool agent, and the two-phase stream that shows the model think before it answers.

CogniVault Backend Explained, Part 3 · How a Question Becomes a Cited Answer

Fri, 12 Jun 2026 00:00:00 +0000

You type a question. A few seconds later you get an answer with footnotes — the exact documents and pages it came from. This part walks through everything that happens in between.

In we built the knowledge base: every document chunked, embedded, and indexed. Now we get to use it — and this is where CogniVault stops being a pipeline and starts being interesting.

Two librarians, because one keeps failing you

Imagine a library with one librarian who organises everything by vibe. Ask her about “server downtime procedures” and she’s brilliant — she understands what you mean and finds documents that discuss the concept, whatever words they use. But ask her for “Error Code 404B” and she shrugs, handing you general networking guides. She doesn’t do exact strings.

Down the hall is a second librarian with a card catalogue. He finds the exact string “404B” instantly — but ask him a conceptual question phrased differently from the source text, and he finds nothing at all.

These are the two halves of search:

Semantic search (FAISS) — your question is embedded into a vector, and the index finds chunks whose vectors point the same way (technically: cosine similarity — how closely two arrows align). Great for meaning, blind to exact identifiers.
Keyword search (BM25) — a scoring formula that rewards chunks containing your exact words, weighted by how distinctive those words are. Great for identifiers, blind to synonyms.

CogniVault asks both librarians every time, then merges their answers with Reciprocal Rank Fusion (RRF) — a formula that combines ranked lists using only the positions:

score(chunk) = sum over both lists of 1 / (60 + rank)

A chunk ranked highly by either librarian scores well; a chunk both of them liked floats to the top. The elegance is what’s missing: you never have to reconcile FAISS’s similarity scores with BM25’s completely different scale, because ranks are the only input. The constant 60 comes straight from the original 2009 research paper, and yes, it’s cited in the code.

A few implementation details worth knowing: both searches deliberately over-fetch (at least 20 candidates each) so the fusion has material to work with; very weak semantic matches are dropped, but a keyword-perfect chunk can still be rescued through fusion; and the final answer uses the top 7 chunks. I benchmarked this whole setup against pure vector search in if you want the war stories.

The agent: a model that decides for itself

Here’s the second idea that trips up beginners: CogniVault’s chat is not “paste chunks into a prompt, get an answer.” It’s an agent — a model running in a loop where it can choose to call tools, read their results, and only then answer.

Built with the Strands Agents SDK, the agent gets six tools:

Tool	Job
`search_knowledge_base`	The core RAG tool — runs the hybrid search above, returns chunks with source and page
`list_documents`	See what’s in the vault
`analyze_document`	Structured analysis of one document: topics, entities, facts, summary
`compare_documents`	Answer a question by comparing two documents side by side
`calculator`	Safe maths — the expression is parsed into a syntax tree and only whitelisted operators run. No `eval()`, ever
`current_time`	The date and time

There is no hard-coded routing. The model reads your question and decides which tools to call, guided by its system prompt. Ask “compare the two contracts on termination clauses” and it reaches for compare_documents; ask “what’s 15% of 2,340” and it uses the calculator instead of hallucinating arithmetic.

Two safety details I want beginners to notice, because they’re the difference between a toy and a product: a fresh agent is constructed for every request (no shared state bleeding between concurrent chats), and the document-analysis tools call the model directly rather than through the agent — otherwise an agent calling a tool that calls the agent could recurse forever.

Watching the model think

When you send a message, the response streams back as NDJSON (Newline-Delimited JSON — each line of the stream is its own small JSON object). And it arrives in two phases:

Phase 1 — thinking. Gemma’s reasoning chain streams first, rendered in the collapsible panel above the answer. It’s deliberately best-effort: if it fails for any reason, the answer still comes.

Phase 2 — the agent answer. Tools run, citations appear in the Sources panel the moment the search completes — before the answer finishes writing — and the answer text streams in.

flowchart TB Q["Your question
(plus optional images, files, scope)"] --> P1 subgraph STREAM["POST /rag — one NDJSON stream"] P1["Phase 1: Thinking
reasoning chunks stream first"] P1 --> P2["Phase 2: Agent
fresh per request, history restored"] P2 -->|"decides to call"| T["search_knowledge_base"] T --> D["FAISS
semantic"] T --> S["BM25
keywords"] D --> RRF["RRF fusion — top 7 chunks"] S --> RRF RRF -->|"chunks + citations"| P2 P2 --> OUT["citations, then answer text,
then a memory-usage report"] end

Each line in the stream is typed: thinking, metadata (a citation), text (answer), memory (how full the conversation budget is), or error. The frontend just reads lines and routes them to the right panel. I dissected this design — and why thinking comes before the tool calls — in .

A memory budget, not a bottomless pit

Gemma’s context window (the amount of text the model can consider at once) is 128K tokens, but CogniVault doesn’t let conversation history sprawl across all of it. Each chat session gets a budget of 48,000 characters — roughly 12,000 tokens. Exceed it, and the oldest question-answer pair quietly drops out first, keeping the bulk of the window free for what matters: your current question and the retrieved chunks.

Two resilience touches worth stealing for your own projects:

Restart survival. In-memory history dies with the process. So the first message in a session after a backend restart rebuilds its history from the chat log the frontend persists. Multi-turn memory survives reboots.
Edit and regenerate. Editing an earlier message rewinds the stored history to that point before re-asking — the model genuinely forgets the timeline that no longer exists.

Scope: pinning the AI to specific documents

One last feature, and a lesson about small local models. You can pin a chat to specific files or a category. The filter travels with the request and a mandatory-search instruction is injected into both the system prompt and the user message itself.

Why both? Because small models sometimes skip instructions that live only in the system prompt — but they can’t ignore what’s inside the question. Belt and braces. When you work with 4-billion-parameter models instead of frontier ones, you learn to make instructions impossible to miss rather than hoping they’re followed.

The takeaway

A cited answer is four systems cooperating: two retrievers covering each other’s blind spots, a fusion formula that needs nothing but ranks, an agent that picks its own tools, and a stream that shows its work. None of the four is exotic on its own — the product is the cooperation.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them
FAISS	Facebook AI Similarity Search	The semantic (meaning-based) half of hybrid search
BM25	Best Match 25	The keyword half — a classic ranking formula from the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	Merges the two ranked lists using only each chunk’s rank: `score = Σ 1/(60 + rank)`
NDJSON	Newline-Delimited JSON	A stream where each line is its own complete JSON object — the chat response format
JSON	JavaScript Object Notation	The universal text format for structured data
AST	Abstract Syntax Tree	The parsed form of an expression — how the calculator does maths without `eval()`
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
SDK	Software Development Kit	A library of building blocks — here, Strands, which provides the agent loop
K (in 128K)	Kilo (thousand)	128K tokens ≈ 128,000 tokens — Gemma’s context window

Next up: — the same machinery pointed at generating quizzes, workshops, flashcards, and mindmaps, plus a table of every byte the app stores and exactly where it lives.

Part 1 · CogniVault Architecture: Why Standard RAG Isn't Enough (Hybrid Search)

Mon, 01 Jun 2026 00:00:00 +0000

Vector search is the process of finding the most similar items in a dataset based on their vector embeddings. This is how RAG systems usually work. But what happens when you need to find the most similar items in a dataset based not only on their semantic meaning but also on the exact wording of the query?

This becomes critical when the information you’re looking for isn’t just related but must match a specific string or keyword exactly.

Two ways of finding a book

Picture a good local bookshop. The owner has read everything, and she recommends by feel. Tell her you loved The Martian and she hands you Project Hail Mary — different title, different plot, but the same DNA: a lone scientist, an impossible survival problem, jokes under pressure. Ask for “something like Pride and Prejudice” and you’ll walk out with Emma. She isn’t matching words. She’s matching meaning.

Now ask her a different kind of question: “I need the book with ISBN 978-0-553-41802-6,” or “the manual that mentions error code 404B on the cover.” Her superpower is useless here. No amount of literary intuition finds an exact string. For that, you walk to the till and check the catalogue — a boring, literal index that knows exactly which shelf holds which identifier, and nothing about vibes.

A well-run bookshop needs both. So does a well-run RAG system:

FAISS — Facebook AI Similarity Search (the well-read owner): a vector index that finds chunks of text whose meaning is mathematically close to your prompt. Brilliant for “how is the practical exam structured?”, blind to “§3 Absatz 2”.
BM25 — Best Match 25 (the catalogue): a classic keyword-scoring algorithm that rewards exact word matches, weighted by how rare and distinctive those words are. Brilliant for identifiers and quoted phrases, blind to paraphrase.

CogniVault runs both retrievers on every search — this is Hybrid Search — and then merges the two ranked lists with a formula called Reciprocal Rank Fusion (RRF). RRF scores each chunk purely by its position in each list: a chunk ranked highly by either retriever scores well, and a chunk both retrievers agree on rises to the top. Because only ranks are used, the two retrievers’ incompatible scoring scales never have to be reconciled.

The agent decides when to search

Here’s the part most diagrams get backwards (mine included, in an earlier draft): retrieval doesn’t happen before the model gets involved. It happens inside the model’s own loop.

CogniVault wraps Gemma in the Strands Agents SDK. The model receives your question along with a set of Tools (pre-written Python functions like search_knowledge_base, calculator, or compare_documents). It then reasons about the question and decides for itself whether — and which — tools to call. For most document questions it calls search_knowledge_base, reads the retrieved chunks, and only then writes its answer, grounded in what it found.

Here is the architectural blueprint of that loop:

graph TD Client[📱 User Query] --> App[🖥️ FastAPI Server] subgraph AgentLoop["The Strands Agent Loop (powered by Gemma 4)"] App --> Agent[🧠 Agent reasons about the question] Agent -->|Decides to search| Search[search_knowledge_base] subgraph Hybrid Search Engine Search -->|Semantic| FAISS[(FAISS Vector)] Search -->|Exact match| BM25[(BM25 Keyword)] FAISS --> RRF{RRF Fusion} BM25 --> RRF end RRF -->|Best chunks + citations| Agent Agent -->|Grounded answer| Answer[Streamed response] end Answer --> Client

One subtlety worth noting: the agent is Gemma. There is no separate “formatting model” at the end — the same model that decided to search also writes the final answer, now with the retrieved chunks in front of it.

What’s Next?

Building a toy RAG app is easy, but building one that actually retrieves the exact document you need requires hybrid engines and an agent that knows when to use them.

Want to see how this system safely ingests massive documents without losing work when something crashes? Read Part 2: Durable Ingestion with DBOS

Or, if you prefer to jump straight into the code, the hybrid search lives in backend/services/vector_db.py of the .

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory
FAISS	Facebook AI Similarity Search	Meta’s library for storing vectors and finding the most similar ones fast
BM25	Best Match 25	A keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	A formula that merges multiple ranked lists using only each item’s rank: `score = Σ 1/(k + rank)`
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
SDK	Software Development Kit	A library of building blocks — here, Strands, which provides the agent loop
API	Application Programming Interface	The set of URLs the frontend calls to talk to the backend
ISBN	International Standard Book Number	The unique identifier printed on every published book — the catalogue’s best friend

Gemma CogniVault

Mon, 25 May 2026 00:00:00 +0000

Overview

Gemma CogniVault is a 100% local, privacy-first AI study companion. Your documents stay on your hardware. Inference runs via Ollama on localhost. No telemetry, no embeddings sent to third parties, no exceptions. A live Privacy Vault Audit Panel confirms zero external connections at runtime.

It’s also genuinely capable — Gemma 4’s full surface (completion, vision, tools, reasoning) running on your laptop, wrapped in an app that turns your documents into quizzes, multi-lesson workshops, flashcard decks, and visual mindmaps, with a learning-progress dashboard and 25 achievement badges.

What’s inside

Layer	Technology
LLM & Embeddings	Ollama · `gemma4:e4b` · `embeddinggemma`
Agent Framework	Strands Agents SDK
Backend	FastAPI · Python 3.10+ · Pydantic
Vector Search	FAISS IndexFlatIP + BM25Okapi · Reciprocal Rank Fusion
Document Parsing	pypdf · python-docx · python-pptx · openpyxl · trafilatura
OCR	pytesseract · pymupdf · Pillow
Audio	faster-whisper
Workflow Engine	DBOS + PostgreSQL
Frontend	React 19 · TypeScript · Vite · Tailwind v4 · Framer Motion · TanStack Query

Four sections

Section	What it’s for
💬 Chat	Ask anything about your documents. Cited answers, scope filter, voice, attachments.
📚 Knowledge Base	Upload, categorise, and manage your documents. SHA-256 change detection on re-upload.
🎓 Study Hub	Four AI-powered study modes: Quiz · Workshop · Flashcards · Mindmaps.
📊 Dashboard	Total study time, current streak, 25 achievement badges, 90-day activity heatmap.

Highlights

🧠 Thinking Mode — collapsible reasoning panel streams Gemma 4’s chain of thought before the answer
🔍 Hybrid Retrieval — FAISS dense + BM25 keyword fused with Reciprocal Rank Fusion
🖼️ Multimodal — attach images, PDFs, and DOCX inline in chat
🛟 Durable workflows — DBOS-checkpointed ingestion; crash-safe and resumable
🏆 25 achievement badges — auto-tracked across chat, quizzes, workshops, flashcards, mindmaps
🔒 Vault Audit Panel — live “zero external connections” indicator

Writing about it

I’m publishing a series of posts unpacking the engineering decisions behind CogniVault — privacy framing, the retrieval stack, the agent loop, ingestion durability, getting JSON out of a local model, drawing mindmaps without a graph library, the gamification layer, and how the test suite avoids needing any infrastructure to run.

Try it

git clone https://github.com/ndimoforaretas/local-gemma-rag.git
cd local-gemma-rag
./scripts/setup.sh # one-time
./scripts/start.sh

Then open .

Part 4 · Crash-Resumable Ingestion: DBOS, SHA-256, and Surviving a kill -9

Tue, 05 May 2026 00:00:00 +0000

There are two things you absolutely don’t want your RAG ingestion pipeline to do:

Re-embed a 200-page PDF because you fixed a typo on page 12.
Lose its progress if you close the laptop lid halfway through.

The first wastes time and compute resources. The second leads to distrust in the system. Both have the same root: ingestion is treated like a fire-and-forget function, when it’s actually a long-running pipeline with intermediate state worth preserving.

CogniVault treats ingestion as a durable workflow. Specifically, a workflow checkpointed in Postgres, with content hashing for incremental work. This post walks through both pieces.

The pipeline

1. Scan docs/ → SHA-256 hash per file
 ├── New file → queue for embedding
 ├── Changed file → soft-delete old chunks, re-embed
 └── Unchanged → skip (idempotent)

2. Extract text → per-format extractor (PDF/OCR, DOCX, PPTX, XLSX, MD, CSV, TXT, HTML)
3. Chunk → RecursiveCharacterTextSplitter (1000 chars, 100 overlap)
4. Embed → embeddinggemma via Ollama, batches of 5
5. Save → append to FAISS IndexFlatIP + JSON metadata on disk

The heavy stages run as DBOS steps inside one parent workflow, each one checkpointed: if the process dies between steps, the next start picks up at the last completed one.

SHA-256 as the source of truth

The naive approach is to track ingestion by filename. That breaks the first time someone edits a file in place. Filename is the same; content isn’t. The vector store quietly carries stale chunks.

The fix is content-addressed: hash the file bytes, store the hash alongside the chunks. Every ingestion run:

current_hash = hashlib.sha256(file_bytes).hexdigest()
stored_hash = chunk_metadata_for(filename).get("file_hash")

if stored_hash is None:
 schedule_ingest(filename) # new file
elif stored_hash == current_hash:
 skip(filename) # unchanged
else:
 soft_delete_chunks_for(filename) # changed
 schedule_ingest(filename)

This gives ingestion an idempotent property that’s worth its weight in gold: running the pipeline twice in a row does almost nothing the second time. That’s not just an optimisation — it’s what makes the next section possible.

DBOS workflows

is a Python library that turns regular functions into checkpointed workflows backed by Postgres. The model is dead simple: decorate a function with @DBOS.workflow(), mark each long-running call inside it as a @DBOS.step(), and DBOS records each step’s input, output, and status in Postgres as it runs.

If the workflow crashes — process killed, OS reboot, Postgres connection drop — the next start sees there’s an unfinished workflow with the same ID, replays the recorded step outputs from Postgres (without re-running them), and resumes from the first incomplete step.

Here’s the actual step structure (slightly simplified from backend/services/ingest.py):

@DBOS.workflow()
def ingest_workflow() -> int:
 filenames = list_document_files() # @DBOS.step — scan + hash check
 docs = []
 for name in filenames:
 docs += process_single_document(name) # @DBOS.step — extract text, one file each
 chunks = chunk(docs) # plain Python — fast, re-runs freely
 embeddings = []
 for batch in batches_of_5(chunks):
 embeddings += embed_batch(batch) # @DBOS.step — the slow one, retried on failure
 save_vector_store(embeddings, chunks) # @DBOS.step — append to FAISS + metadata
 return len(chunks)

The granularity of @DBOS.step is the granularity of crash recovery, and it’s chosen deliberately. Extraction is one step per file, so a crash during file 9 of 10 doesn’t re-read the first eight. Embedding is one step per batch of five chunks, for one specific reason: embed_batch is the slow one. If the laptop dies during embeddings, we resume the embedding loop at the failed batch, not at PDF extraction.

Notice what isn’t a step: chunking. Splitting text is fast pure-Python work — checkpointing it would cost more ledger bookkeeping than simply redoing it on a resume.

There’s a related sizing trick hiding in the batch number. DBOS records each step’s output in Postgres, and embed_batch returns its vectors — so each ledger entry contains five embeddings’ worth of floats. Small batches keep each checkpoint record small and each retry cheap. One giant “embed everything” step would mean one giant ledger row and zero resume granularity.

The format extractors

Step 2 (process_single_document) is a dispatch on file extension. Each extractor is small and obvious; the interesting choices are in the chunking strategy each one feeds downstream.

Format	Library	Chunking note
PDF	`pypdf` page-by-page; `pytesseract` OCR fallback for image-only pages	Recursive splitter, 1000/100
DOCX	`python-docx` (paragraphs + table rows joined as text)	Recursive splitter
PPTX	`python-pptx`	One chunk per slide (title + body text)
XLSX	`openpyxl`	Header + 20-row batches, per sheet
MD	`MarkdownHeaderTextSplitter`	One chunk per H1/H2/H3 section, breadcrumb prepended
CSV	manual reader	Header row + 20-row batches
TXT	raw UTF-8 read	Recursive splitter
HTML	`trafilatura` clean text	Recursive splitter

The OCR fallback is the one worth pausing on. PDFs come in two flavours: ones with a real text layer, and ones that are basically scanned images wearing a PDF costume. pypdf returns nothing useful for the second kind, but it doesn’t raise — it just hands back empty strings. Without a fallback, your “ingestion succeeded” log is lying to you.

The detector is a heuristic: if pypdf returns fewer than 50 characters for a page, route the page through pymupdf → Pillow → pytesseract OCR. Slower, but at least produces text. The threshold is tuned to be sensitive enough to catch scanned pages while not punishing legitimately short pages (a chapter cover, a colophon).

Soft delete, not hard delete

When a file changes and we re-ingest, the old chunks need to go. The temptation is to physically remove them from the FAISS index, but FAISS IndexFlatIP doesn’t support efficient delete — you’d have to rebuild.

Soft delete instead: changed files get their old chunks marked with a deleted: true flag in the metadata; new chunks are appended without it. Search filters on the flag at query time, so stale vectors sit harmlessly in the index. If enough dead weight ever accumulates, the escape valve is obvious — rebuild the index from active chunks only — but in practice I haven’t needed it.

This is the same pattern most append-only systems use. It pairs naturally with content hashing — flag-and-append is much cheaper than remove-and-rebuild. One subtlety: the keyword index has to follow suit. CogniVault’s VectorDB.delete_by_source() flips the flags and rebuilds BM25 over the remaining active chunks, so the two retrievers never disagree about what exists.

What the user sees

Starting an ingestion (POST /ingest) returns a workflow_id, and the frontend polls GET /ingest/status/{workflow_id} to draw a live timeline of the workflow’s steps — scanning, per-file extraction (“Reading pages… 3 of 21”), embedding (“Calibrating batch 4 of 12”), saving. If the user closes the tab mid-ingest, comes back five minutes later, and reopens — the workflow finished in the background regardless. The next call to GET /api/vault/stats reflects the new chunk count. No “click to resume” button, no manual recovery dance.

The first time I closed the lid mid-embedding and watched the workflow pick itself up from the next step on resume, I’ll admit I was a little smug. That’s exactly the property I wanted, with surprisingly little code.

Pitfalls and edges

A few things I had to learn the hard way:

Don’t make embed_batch too big. Ollama isn’t great at backpressure. Batches of 5 are a sweet spot for embeddinggemma on a 16 GB machine — bigger batches stall on memory, smaller ones waste round-trip overhead. (And as noted above, the batch size doubles as your checkpoint-record size.)
Be careful with file deletion. Soft-deleted chunks must also disappear from BM25’s corpus, or keyword search will keep returning text that dense search no longer sees. Rebuilding BM25 inside delete_by_source() keeps the two in lockstep.
OCR is slow. A 50-page scan can take a minute or more. Surface that latency to the user; otherwise they think it’s hanging.

Takeaway

Durable workflows aren’t only for distributed systems. A single-user local app benefits from them in exactly the same ways: incremental work, crash recovery, idempotent retries. DBOS makes the cost of opting in trivially low — decorate your function, run Postgres locally, and you get a pipeline that survives lid-closes, OS updates, and your own Ctrl-C.

Combined with content-addressed hashing, ingestion stops being a thing you avoid touching for fear of having to wait 20 minutes. It becomes a thing you re-run whenever you feel like it — because re-running is free when nothing has changed.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
DBOS	Database-Oriented Operating System	A library that checkpoints workflow steps in Postgres so crashed jobs resume instead of restarting
SHA-256	Secure Hash Algorithm, 256-bit	A content fingerprint: change one byte of a file and the hash changes completely
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them
OCR	Optical Character Recognition	Turning pictures of text (scanned pages) into machine-readable text
FAISS	Facebook AI Similarity Search	The vector index the embeddings are appended to
IP (in `IndexFlatIP`)	Inner Product	FAISS’s similarity measure; equals cosine similarity on normalised vectors
BM25	Best Match 25	The keyword index that must stay in lockstep with FAISS on deletes
PDF / DOCX / PPTX / XLSX / MD / CSV / TXT / HTML	Portable Document Format / Word / PowerPoint / Excel / Markdown / Comma-Separated Values / plain text / HyperText Markup Language	The formats the per-extension extractors handle
JSON	JavaScript Object Notation	The format of the chunk-metadata file next to the FAISS index
UTF-8	Unicode Transformation Format, 8-bit	The text encoding used when reading plain-text files
OS	Operating System	What reboots underneath you mid-ingest

Next up: — what happens after Gemma 4 enthusiastically returns {"questions": [{"text": "..."},}].

Part 2 · Hybrid Retrieval in Practice: FAISS + BM25, Fused with RRF

Sat, 25 Apr 2026 00:00:00 +0000

The first version of CogniVault used pure dense retrieval — embed the query with embeddinggemma, search a FAISS index, pass the top-7 chunks to the model. It worked. It worked beautifully — until a user uploaded a PDF containing some German legal text and asked for “§3 Absatz 2.”

The model couldn’t find it.

The chunk was right there. The PDF was indexed. But “§3 Absatz 2” doesn’t embed into anything semantically meaningful — it’s a token-level identifier, not a concept. The dense vector for the query landed nowhere near the dense vector for the chunk, even though the chunk literally contains the string the user asked for.

That bug killed pure dense retrieval for me. This post is about what replaced it.

Two kinds of “similar”

You already use both kinds of search every day. When Spotify builds a “song radio” from a track you like, it’s matching feel — tempo, mood, genre — and it will happily play you a song whose title shares no words with the original. But when you type Bohemian Rhapsody remastered 2011 into the search box, you don’t want feel. You want that exact string, and “a similar operatic rock epic” is a wrong answer.

Search systems formalise that split into two notions of similarity:

Lexical similarity — “do these strings share rare words?” This is what TF-IDF and BM25 model. They thrive on identifiers, names, code, technical terminology, and direct quotes.
Semantic similarity — “do these passages talk about the same idea, even with different words?” This is what embeddings model. They thrive on paraphrase, conceptual queries, and natural-language questions.

Neither subsumes the other. A user asking “how is the practical exam structured?” needs semantic search — the document doesn’t say “structure of practical exam.” A user asking "§3 Absatz 2" needs lexical search — there’s no concept to embed, just a literal string.

Production RAG has to do both. CogniVault does both, and then fuses the result lists with Reciprocal Rank Fusion (RRF).

The stack

Query
 ├── embed via embeddinggemma ──► FAISS IndexFlatIP ──► top-K dense
 └── tokenize + lowercase ──► BM25Okapi ──► top-K sparse
 │
 Reciprocal Rank Fusion ◄──┘
 │
 top-7 fused chunks

Both indexes live in memory, fronted by a VectorDB singleton. FAISS does inner-product search over normalised embeddings (so dot product = cosine). BM25 is rank_bm25’s BM25Okapi, fed the same chunks tokenised by a simple lowercase-and-split tokenizer.

The corpora are kept in lockstep: soft-deleting a file’s chunks triggers a BM25 rebuild over the remaining active chunks, and the singleton reloads both indexes from vector_store.faiss + vector_store.json (chunk metadata + raw text) after every ingestion run and on app start.

Why FAISS `IndexFlatIP`, not HNSW or IVF?

IndexFlatIP is brute-force exact search. It scans every vector, every query. For tens of thousands of chunks that’s fine — sub-millisecond on a laptop. CogniVault is a single-user, local-first app; the index is never going to be billions of vectors. Trading recall for speed via HNSW or IVF would buy nothing here and lose the “exact” guarantee. Boring, correct, fast enough.

When the corpus grows large enough that brute-force gets sticky, switching is a one-line change. Until then, the simplest index wins.

Reciprocal Rank Fusion

The naive way to combine two ranked lists is to score them and add. That sounds reasonable until you remember FAISS returns inner-product scores in some bounded range and BM25 returns scores in an unbounded one — they aren’t comparable without normalisation, and any normalisation you pick is somewhat arbitrary.

RRF sidesteps the problem entirely. It only looks at ranks, not scores. For each result list, an item at rank r contributes 1 / (k + r) to its final score (with k = 60 by convention — large enough to flatten the tail, small enough that the top items still dominate). Items that appear in both lists get summed.

# Simplified — the real implementation also de-duplicates chunks
# by (source, chunk_id, page) before scoring.
def reciprocal_rank_fusion(result_lists, k=60):
 scores = defaultdict(float)
 for results in result_lists:
 for rank, chunk_id in enumerate(results, start=1):
 scores[chunk_id] += 1.0 / (k + rank)
 return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

That’s the whole algorithm. No tuning, no calibration, no per-corpus weights. A chunk that’s #1 in BM25 and #4 in FAISS easily beats a chunk that’s #2 in only one of them. A chunk that both indexes agree on rises to the top deterministically.

The result for the “§3 Absatz 2” query: BM25 finds the literal match and lands it at rank 1. FAISS finds nothing useful (its top hits are about exam regulations in general). RRF surfaces the BM25 hit at the top of the fused list. Problem solved.

Scope filtering with ContextVar isolation

One detail that’s easy to get wrong: the retriever has to be scope-aware. CogniVault lets users limit a question to a single category or specific files. The scope is set by the request, but the search is called from deep inside the Strands agent loop, which is called from a streaming FastAPI handler, possibly with multiple concurrent requests in flight per worker.

Threading the scope through every function call would be ugly. A global is unsafe. The right primitive is Python’s , which gives you per-task isolated state that asyncio and threads both respect.

from contextvars import ContextVar

_doc_scope: ContextVar[DocScope | None] = ContextVar("doc_scope", default=None)

def set_doc_scope(scope: DocScope | None) -> None:
 _doc_scope.set(scope)

def current_doc_scope() -> DocScope | None:
 return _doc_scope.get()

The /rag request handler sets the scope at the very start of each streaming response; the search tool reads it; because the value is task-local, it dies with the request. No globals, no parameter drilling, no race conditions across concurrent users.

This is one of those design choices that looks like over-engineering until you have two browser tabs open and realise that without it, tab A’s scope filter would leak into tab B’s question.

Chunking choices that pay off downstream

Hybrid retrieval is only as good as the chunks. CogniVault uses a RecursiveCharacterTextSplitter with 1,000 characters, 100 overlap for unstructured text — small enough to keep retrieval precise, large enough to carry context for the model.

For structured formats it switches strategy:

Markdown → MarkdownHeaderTextSplitter emits one chunk per H1/H2/H3 section with the heading hierarchy prepended as a breadcrumb (“Privacy > Vault Audit > Indicators”). BM25 loves breadcrumbs — they make heading-keyword queries match cleanly.
CSV → header row + 20-row batches per chunk, so a query for a column name lands on the right block.
PPTX → one chunk per slide, title and body text together.
XLSX → header + row batches, per sheet, with a [Sheet: name] prefix.

Tiny fragments get filtered: unstructured text needs at least 100 characters to become a chunk, while the structured formats drop the bar to 20 — a two-line Markdown section or a header-only sheet is short but still meaningful. The recursive splitter is well-trodden territory, but the per-format strategies matter much more than people give them credit for.

What I’d do differently

A few things I’d revisit if I were starting over:

Stop tokenising for BM25 with str.split(). It’s fine, but a real tokenizer that handles punctuation and German compounds would meaningfully improve recall on the legal docs.
Add a small reranker. RRF gets the right set, but a cross-encoder rerank on the top 20 would polish the order. Locally-served, of course — there are good small ones now.
Query expansion for thin queries. Two-word questions like “§3 exam” could be expanded via a quick gemma4 call before retrieval. Latency cost, recall gain.

None of those are in the box yet. RRF over FAISS + BM25 is already so much better than either alone that I haven’t felt the pull to optimise further.

The takeaway

If your retrieval is “embed + cosine + top-k,” it will fail in exactly the way mine did — on the queries that contain literal identifiers your model has no embedding for. The fix isn’t a better embedding model. It’s a second retriever that doesn’t pretend everything is a concept.

FAISS for ideas. BM25 for strings. RRF to decide which one was right today.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them
FAISS	Facebook AI Similarity Search	Meta’s library for storing vectors and finding the most similar ones fast
BM25	Best Match 25	A keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	Merges ranked lists using only ranks: each item scores `Σ 1/(k + rank)` across lists
TF-IDF	Term Frequency–Inverse Document Frequency	BM25’s ancestor: score words by how often they appear here vs how rare they are everywhere
IP (in `IndexFlatIP`)	Inner Product	The similarity measure FAISS computes; on normalised vectors it equals cosine similarity
HNSW	Hierarchical Navigable Small World	A popular approximate vector-index structure — deliberately not used here
IVF	Inverted File Index	Another approximate FAISS index type — also deliberately not used
AEVO	Ausbildereignungsverordnung	The German trainer-aptitude regulation whose “§3 Absatz 2” query broke pure dense retrieval
CSV / PPTX / XLSX	Comma-Separated Values / PowerPoint / Excel (Office Open XML)	Structured formats with their own chunking strategies
H1/H2/H3	Heading levels 1–3	The Markdown heading tiers used to split sections

Next up: — how CogniVault’s /rag endpoint streams Gemma 4’s thinking before any tool calls run.

Part 1 · Why I Built a Local-First RAG

Mon, 20 Apr 2026 00:00:00 +0000

I’ve spent the last few years in front of virtual classrooms full of career-changers in Germany, walking them through programming basics, web development, and introductory AI courses. Most of the information we deal with is fine to paste into cloud-based AI tools. Some of it really isn’t.

Exam materials under confidentiality. A trainee’s portfolio with personal details. Other private documents that should never end up training someone else’s model.

So I built — a fully local AI study and productivity tool. No cloud. No telemetry. No “we may use this data to improve our service.” Just Gemma 4 running on Ollama, on my laptop, talking to my files.

The leaky abstraction

The pitch for cloud AI is great: a giant model, available instantly, billed by the token. The fine print is where it gets uncomfortable:

Where does the data physically live during inference?
Whose jurisdiction governs that hardware this afternoon?
Does the audit trail stop at the API boundary, or can you actually trace what happened to your bytes?
When you tick “do not train on my data,” are you trusting a control, a contract, or both?

For most consumer use cases, those questions are fine to wave away. For education, healthcare, finance, legal, public administration — the answer “trust us” isn’t an answer.

What “local-first” actually means here

Lots of products say “private.” I wanted three concrete properties:

The model lives on your machine. Gemma 4 (gemma4:e4b) and embeddinggemma are pulled via Ollama. Inference is a localhost HTTP call.
Your documents never leave. Vectors, chunks, chat history, study sessions, achievements — all on disk on your computer.
You can verify it. Gemma CogniVault ships a Privacy Audit Panel that shows a live “zero external connections” indicator alongside document counts and the Ollama host. It’s not a promise — it’s a status light.

If a future build of Gemma CogniVault ever made an outbound call, that panel would be the first thing to scream.

What you get back

Going local sounds like a trade-off — surely you lose the magic of the giant frontier models? In practice, with Gemma 4 you get more than enough:

Thinking mode — Gemma 4’s chain-of-thought streams into a collapsible panel before the answer. Watching the model reason about your documents is genuinely useful as a teaching tool.
Tool use — through the , the model decides when to search the knowledge base, summarise a document, compare two files, or check the time.
Vision — attach images and PDFs straight into a chat turn.
Generation that’s actually structured — quizzes, multi-lesson workshops, flashcard decks, and interactive mindmaps, generated with format="json" so the output parses reliably.

Cognivault doesn’t try to be a giant ecosystem. It’s a single-purpose tool that does one thing well: use your own documents with a capable local model in a private environment. I must admit that it was inspired to a great extent by , which I’ve found incredibly useful but not private enough for my needs.

The shape of the app

CogniVault is split into four sections that map to how I actually work with information on cloud-based AI tools:

Section	What it’s for
Chat	Ask anything about your documents. Cited answers, scope filter, voice in.
Knowledge Base	Upload, categorise, manage. SHA-256 detects edits on re-upload.
Study Hub	Quiz · Workshop · Flashcards · Mindmaps — four ways to drill into the source.
Dashboard	Total study time, streak, 25 badges, GitHub-style 90-day heatmap.

Everything reachable from a sidebar that remembers where you left off, on a stack that fits in your ~/Documents folder.

What comes next

This is the first in a short series. Over the next few posts I’ll dig into the parts I’m most proud of — and a few I’d build differently next time:

Hybrid retrieval — why FAISS and BM25, fused with Reciprocal Rank Fusion
Two-phase streaming with Gemma 4 and Strands Agents
Crash-resumable ingestion with DBOS, hash-aware re-ingest, OCR fallback
Getting reliable JSON out of a local LLM (and what to do when it fails)
The mindmap renderer — what hand-rolling SVG taught me, and why v2 uses React Flow
Gamifying learning — 25 badges, idle-gap sessions, 90-day heatmap
Testing a local-AI app with 350+ tests and zero infrastructure

If you want to skip ahead, the code is open source at , and there’s a .

Your data. Your hardware. Your AI. Your vault.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory
AI	Artificial Intelligence	Software performing tasks that normally need human intelligence
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
HTTP	HyperText Transfer Protocol	The protocol browsers and APIs use to exchange requests and responses
API	Application Programming Interface	The boundary where you call someone else’s software — and where cloud audit trails stop
IHK	Industrie- und Handelskammer	The German Chamber of Commerce and Industry, which administers trainer certification
AEVO	Ausbildereignungsverordnung	The German trainer-aptitude regulation — the exam material that motivated this project
FAISS	Facebook AI Similarity Search	Meta’s vector-search library (covered in the next post)
BM25	Best Match 25	A classic keyword-ranking formula (also next post)
SDK	Software Development Kit	A library of building blocks — here, Strands, which provides the agent loop
JSON	JavaScript Object Notation	The universal text format for structured data
PDF	Portable Document Format	One of the eight-plus file types CogniVault ingests
SHA-256	Secure Hash Algorithm, 256-bit	A content fingerprint used to detect edited files on re-upload
OCR	Optical Character Recognition	Turning pictures of text (scans) into machine-readable text
DBOS	Database-Oriented Operating System	The durable-workflow library behind crash-resumable ingestion
SVG	Scalable Vector Graphics	The browser’s built-in vector drawing format

RAG |

CogniVault Backend Explained, Part 1 · Meet the Backend: Three Processes, Four Layers

The whole app is three processes

The four layers

One diagram, every major piece

The tech stack, and why each piece earned its place

Appendix: Abbreviations in this post

The takeaway

CogniVault Backend Explained, Part 2 · From File to Searchable Knowledge

The conveyor belt

The factory ledger: why the pipeline can’t lose work

Fingerprints, not faith: SHA-256 change detection

Every format gets its own treatment

Chunking: 1,000 characters with a 100-character safety overlap

Embedding and saving

The whole journey, end to end

The takeaway

Appendix: Abbreviations in this post

CogniVault Backend Explained, Part 3 · How a Question Becomes a Cited Answer

Two librarians, because one keeps failing you

The agent: a model that decides for itself

Watching the model think

A memory budget, not a bottomless pit

Scope: pinning the AI to specific documents

The takeaway

Appendix: Abbreviations in this post

Part 1 · CogniVault Architecture: Why Standard RAG Isn't Enough (Hybrid Search)

Two ways of finding a book

The agent decides when to search

What’s Next?

Appendix: Abbreviations in this post

Gemma CogniVault

Overview

What’s inside

Four sections

Highlights

Writing about it

Try it

Part 4 · Crash-Resumable Ingestion: DBOS, SHA-256, and Surviving a kill -9

The pipeline

SHA-256 as the source of truth

DBOS workflows

The format extractors

Soft delete, not hard delete

What the user sees

Pitfalls and edges

Takeaway

Appendix: Abbreviations in this post

Part 2 · Hybrid Retrieval in Practice: FAISS + BM25, Fused with RRF

Two kinds of “similar”

The stack

Why FAISS IndexFlatIP, not HNSW or IVF?

Reciprocal Rank Fusion

Scope filtering with ContextVar isolation

Chunking choices that pay off downstream

What I’d do differently

The takeaway

Appendix: Abbreviations in this post

Part 1 · Why I Built a Local-First RAG

The leaky abstraction

What “local-first” actually means here

What you get back

The shape of the app

What comes next

Appendix: Abbreviations in this post

Why FAISS `IndexFlatIP`, not HNSW or IVF?