Beginner Guides |

CogniVault Backend Explained, Part 1 · Meet the Backend: Three Processes, Four Layers

Fri, 12 Jun 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

When people first open the CogniVault repository, the question I hear most is some version of: “Where do I even start?” There’s a RAG agent, a FAISS index, a DBOS workflow, an Ollama host — and if you’re transitioning into tech, every one of those words is a closed door.

This series opens the doors one at a time. No prior RAG knowledge assumed, every abbreviation spelled out, and every claim checkable against the . If you’ve already read my , think of this series as the guided tour that should have come first.

Let’s map this out.

The whole app is three processes

CogniVault lets you chat with your own documents and turn them into quizzes, workshops, flashcards, and mindmaps — and nothing ever leaves your machine. (The why behind that constraint is its own story: .)

You might expect an app like that to be a sprawl of microservices. It’s three processes:

Process	What it does
The Python backend	One FastAPI app on port 8000 — it also serves the compiled React frontend as static files
Ollama	The local model server on port 11434, running the AI models
PostgreSQL	One Docker container, used only for workflow checkpoints — never for your documents

Everything else — your files, the search index, your chat history, your quiz scores — is a plain file on disk. That’s not laziness; it’s the privacy argument made physical. You can open every byte the app stores with a text editor and a SQLite browser.

The four layers

Before we name technologies, here’s the mental model I want you to keep for the whole series. The backend is four layers, top to bottom:

Layer 1 — the web layer. A FastAPI application receives every HTTP request and routes it to one of six routers: chat (/rag), knowledge management (/upload, /ingest), study tools (/api/study/*), progress (/api/progress/*), voice (/api/transcribe), and chat history (/api/history). FastAPI (a modern Python web framework) also auto-generates interactive API documentation at /api/docs, which is the best way to explore the backend without reading a line of code.

Layer 2 — the intelligence layer. Two AI models with two different jobs. gemma4:e4b generates: chat answers, reasoning, image analysis, and tool calls. embeddinggemma embeds: it turns text into vectors (lists of numbers that capture meaning) so similar ideas can be found mathematically. Both run inside Ollama — think of Ollama as Docker, but for AI models.

Layer 3 — the retrieval layer. A search engine over your documents that combines semantic search (find things that mean the same) with keyword search (find the exact string). Part 3 of this series is entirely about this layer.

Layer 4 — the persistence layer. Four storage systems, each picked for one job: a FAISS index plus a JSON file for searchable knowledge, SQLite for study data, PostgreSQL for workflow checkpoints, and plain JSON files for chat history.

One diagram, every major piece

flowchart TB subgraph CLIENT["Browser"] UI["React Frontend
(compiled, served by FastAPI)"] end subgraph SERVER["FastAPI Backend — port 8000"] ROUTERS["6 Routers
rag · knowledge · study ·
progress · audio · history"] AGENT["RAG Agent
(Strands SDK, 6 tools)"] VDB["VectorDB
FAISS + BM25 + RRF"] INGEST["Ingestion
(DBOS durable workflow)"] GEN["Study generators
quiz · workshop · cards · mindmap"] PROG["Progress tracker
+ 25 achievements"] end subgraph OLLAMA["Ollama — port 11434"] GEMMA["gemma4:e4b
chat · thinking · vision · tools"] EMBED["embeddinggemma
text to vectors"] end subgraph STORAGE["Local storage"] FAISSF["vector_store.faiss + .json"] SQLITE["progress.db (SQLite)"] PG["PostgreSQL
workflow state only"] DOCS["docs/ folder + chat_history.json"] end UI --> ROUTERS ROUTERS --> AGENT --> VDB AGENT --> GEMMA VDB --> EMBED ROUTERS --> INGEST --> EMBED INGEST --> PG INGEST --> FAISSF VDB --- FAISSF ROUTERS --> GEN --> GEMMA GEN --> SQLITE ROUTERS --> PROG --> SQLITE ROUTERS --> DOCS

Keep this picture handy — Parts 2, 3, and 4 each zoom into one region of it.

The tech stack, and why each piece earned its place

The full dependency list lives in requirements.txt. Here’s what matters, grouped by job:

Serving requests. FastAPI defines the endpoints and validates every request and response with Pydantic (a data-validation library — think of it as a strict customs officer for JSON). Uvicorn is the ASGI server (Asynchronous Server Gateway Interface — the Python standard that lets one process juggle many simultaneous requests) that actually runs it.

Thinking. Ollama serves gemma4:e4b — the e4b tag is the roughly four-billion effective-parameter variant, about a 9.6 GB download — and embeddinggemma (about 622 MB). The agent behaviour is built with the Strands Agents SDK, which wraps the model in a loop where it can call tools, read the results, and only then answer. (Where I run Ollama relative to Docker is a deliberate choice with a story behind it: .)

Finding things. FAISS (Facebook AI Similarity Search — Meta’s vector search library) handles semantic lookups; rank-bm25 handles keyword lookups; a formula called Reciprocal Rank Fusion merges the two. Part 3 unpacks all of this.

Reading documents. pypdf for PDFs, with an OCR fallback (Optical Character Recognition — turning pictures of text into actual text) for scanned pages via pymupdf and Tesseract. Word, PowerPoint, and Excel each get their own extractor. trafilatura pulls clean article text out of web pages.

Not losing work. DBOS makes the ingestion pipeline durable — every step is checkpointed in PostgreSQL so a crash resumes instead of restarting. Part 2 shows this in action.

Remembering. SQLite — a complete database engine that lives in a single file, progress.db — holds your study sessions, achievements, quizzes, workshops, flashcard decks, and mindmaps.

Appendix: Abbreviations in this post

This series’ promise is “no unexplained abbreviations,” so here is the table I wish every technical tutorial shipped with.

Abbreviation	Full form	Plain-English meaning
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
RAG	Retrieval-Augmented Generation	Fetch relevant passages from your documents first, then let the model answer from them — instead of from its training memory
API	Application Programming Interface	The set of URLs the frontend calls to talk to the backend
ASGI	Asynchronous Server Gateway Interface	The Python standard that lets the server handle many requests concurrently
JSON	JavaScript Object Notation	The universal text format for structured data
NDJSON	Newline-Delimited JSON	A stream where each line is its own JSON object — ideal for streaming AI answers chunk by chunk
FAISS	Facebook AI Similarity Search	Meta’s library for storing vectors and finding the most similar ones fast
BM25	Best Match 25	A classic keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	A formula for merging multiple ranked result lists using only the ranks
ANN	Approximate Nearest Neighbour	A speed shortcut many vector databases take. CogniVault deliberately uses an exact index instead — precise, and plenty fast at personal-library scale
DBOS	Database-Oriented Operating System (the research project it grew from)	A library that checkpoints workflow steps in a database so crashed jobs resume
SQL / SQLite	Structured Query Language / SQLite	The language of relational databases / a tiny database that lives in one file
OCR	Optical Character Recognition	Turning pictures of text (scans) into machine-readable text
SHA-256	Secure Hash Algorithm, 256-bit	A fingerprint function — any file maps to a unique hash, used to detect changed files
CORS	Cross-Origin Resource Sharing	Browser rules controlling which websites may call the API
SSRF	Server-Side Request Forgery	An attack where a server is tricked into fetching internal URLs — the URL-import endpoint guards against it
MCQ	Multiple-Choice Question	One of the two quiz question types
KB	Knowledge Base	All your ingested, searchable documents

(Every claim in this series can be checked directly against the — the relevant file is named whenever it matters, and the repository README maps the full architecture.)

The takeaway

Strip away the abbreviations and CogniVault is a small system: one web server, one model runtime, one durability database, and a handful of files. The sophistication isn’t in the part count — it’s in how a few well-chosen pieces cooperate. That cooperation is what the next three parts are about.

Next up: — how a 1,000-page scanned PDF becomes something the AI can search in seconds, and why the pipeline survives a crash at page 800.

CogniVault Backend Explained, Part 2 · From File to Searchable Knowledge

Fri, 12 Jun 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

An LLM cannot “open” your PDF. That sentence surprises a lot of newcomers, so let’s sit with it for a second: when you chat with your documents in CogniVault, the model never touches the original files. Something has to happen between “I dropped a file into the browser” and “the AI just quoted page 47 back at me.”

That something is ingestion, and it’s the subject of this part. In we drew the whole map; today we zoom into one region — the conveyor belt that turns files into searchable knowledge.

The conveyor belt

Think of ingestion as a four-station assembly line:

Extract the text out of each file — even scanned ones.
Chunk it into pieces small enough to fit into a prompt.
Embed each chunk — turn it into a vector (a list of numbers that captures its meaning) so similar ideas land near each other in vector space.
Store vectors and metadata so they can be searched later.

flowchart TD A["Upload
POST /upload
saved to docs/"] --> B subgraph WF["DBOS durable workflow"] B["Step 1
Which files changed?
SHA-256 fingerprints"] --> C["Step 2
Extract text
per-format + OCR fallback"] C --> D["Chunk
1000 chars, 100 overlap"] D --> E["Step 3
Embed
embeddinggemma, batches of 5"] E --> F["Step 4
Save
FAISS index + metadata JSON"] end F --> G["Reload in-memory index
instantly searchable"]

Simple enough. The interesting engineering is in the failure cases — so let’s start there.

The factory ledger: why the pipeline can’t lose work

Embedding a large library takes minutes. What happens when your laptop goes to sleep at page 800 of a 1,000-page manual? With a plain Python script: everything restarts from page 1.

CogniVault instead writes the pipeline as a DBOS durable workflow. Picture a factory where every station stamps a permanent ledger the moment it finishes a box. If the power cuts out, nobody rebuilds finished boxes — the workers read the ledger and resume at the first unstamped entry.

DBOS is that ledger, and PostgreSQL is the book it’s written in. Each pipeline station is a checkpointed step; on restart, completed steps return their recorded results instantly and execution continues from the first unfinished one. A failed embedding batch is simply retried.

This is also what powers the live progress timeline in the UI: starting an ingestion returns a workflow_id, and the frontend polls a status endpoint that reports which steps have completed, which are running, and which are still waiting.

I wrote a whole deep dive on this mechanism — including what happens when you kill -9 the process mid-ingest — in .

Fingerprints, not faith: SHA-256 change detection

Re-embedding your whole library every time you add one file would be wasteful. So before any work happens, the pipeline computes each file’s SHA-256 hash (a content fingerprint — change one character in the file and the fingerprint changes completely) and compares it to the fingerprint stored with the file’s existing chunks:

Never seen before → ingest it.
Fingerprint changed → the old chunks are soft-deleted and the file is re-ingested.
Fingerprint identical → skip it entirely.

Why “soft”-deleted? Because the FAISS index type CogniVault uses cannot remove individual vectors. Stale chunks are just marked deleted: true in the metadata; their vectors stay in the index but every search filters them out. It’s an honest, boring solution — and it never corrupts the index.

Every format gets its own treatment

Here’s a detail that separates a demo from a product. A naive pipeline extracts “all the text” and calls it a day. CogniVault gives each format an extractor that preserves the structure that retrieval will need later:

Format	Strategy
PDF	Page by page, keeping page numbers (those become citations later). Any page yielding fewer than 50 characters is presumed scanned and sent to OCR
Scanned page	The page is rendered to an image at roughly 144 dpi, then Tesseract OCR (Optical Character Recognition — reading text out of images) extracts the words
Markdown	Split on headings; each section chunk gets a breadcrumb prefix like `[Section: Intro > Setup]` so its embedding carries the document hierarchy
CSV	Rows grouped 20 per chunk — and every chunk is prefixed with the header row, so the model always knows the column names
Excel	Same row-group idea per sheet, prefixed `[Sheet: name]`
PowerPoint	One chunk per slide
Word	Paragraphs plus table cells
Web pages	Fetched on request and stripped to clean article text — behind an SSRF guard (Server-Side Request Forgery protection: the server refuses to fetch private or internal addresses)

Ask yourself why the CSV detail matters. If chunk 14 of a spreadsheet is just twenty naked rows of numbers, no search will ever connect it to the question “what was the Q3 budget?” Prefix it with the header row, and the chunk knows it contains budget columns. Structure is retrieval fuel.

Chunking: 1,000 characters with a 100-character safety overlap

Long text is split into pieces of about 1,000 characters, with neighbouring pieces overlapping by 100. The overlap is insurance: a sentence sliced at a chunk boundary still appears whole in one of the two neighbours, so no idea falls into the gap between chunks.

Embedding and saving

Chunks are embedded by embeddinggemma (via Ollama) in batches of five — each chunk becomes one vector. The vectors are normalised and appended to a FAISS index; alongside it, a JSON file records each chunk’s source filename, page number, category, fingerprint, and the text itself. The index holds the numbers; the JSON holds the meaning.

One choice worth highlighting for beginners: this is an exact index, not an approximate one. Many vector databases use ANN (Approximate Nearest Neighbour) shortcuts that trade a little accuracy for speed at massive scale. At personal-library scale you don’t need the trade — CogniVault checks every vector on every search and is still fast.

The whole journey, end to end

%%{init: {'sequence': {'actorFontSize': 28, 'messageFontSize': 24, 'loopTextFontSize': 22, 'noteFontSize': 22}}}%% sequenceDiagram actor U as You participant F as Frontend participant B as FastAPI participant W as DBOS Workflow participant O as Ollama (embeddinggemma) participant V as FAISS + metadata U->>F: Drag and drop a file, pick a category F->>B: POST /upload B->>B: Validate type and size, save to docs/ F->>B: POST /ingest B->>W: Start durable workflow B-->>F: workflow_id loop Poll status F->>B: GET /ingest/status/{workflow_id} B-->>F: Step list (drives the progress timeline) end W->>W: SHA-256 change detection W->>W: Extract text (per format, OCR if scanned) W->>W: Chunk (1000 chars / 100 overlap) W->>O: Embed in batches of 5 O-->>W: Vectors W->>V: Append vectors + metadata B-->>F: SUCCESS — index reloaded F-->>U: "Knowledge Sync Complete"

The takeaway

Ingestion is where most RAG quality is actually won or lost — long before any clever prompting. Page numbers preserved, headers carried into every spreadsheet chunk, scans rescued by OCR, and a ledger that makes the whole thing crash-proof: none of it is glamorous, all of it shows up later as answers that cite the right page.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
DBOS	Database-Oriented Operating System	The library that checkpoints workflow steps in PostgreSQL so crashed jobs resume
SHA-256	Secure Hash Algorithm, 256-bit	A content fingerprint — change one byte of a file and the hash changes completely
OCR	Optical Character Recognition	Reading text out of images — the rescue path for scanned PDF pages
SSRF	Server-Side Request Forgery	An attack where a server is tricked into fetching internal URLs; the URL importer blocks it
FAISS	Facebook AI Similarity Search	The vector index the embeddings are appended to
ANN	Approximate Nearest Neighbour	The accuracy-for-speed shortcut CogniVault deliberately does not take
dpi	Dots Per Inch	Image resolution — scanned pages are rendered at ~144 dpi before OCR
JSON	JavaScript Object Notation	The format of the chunk-metadata file beside the FAISS index
PDF / CSV	Portable Document Format / Comma-Separated Values	Two of the eight-plus supported file formats
API	Application Programming Interface	The endpoints (`/upload`, `/ingest`, `/ingest/status/…`) driving the flow

Next up: — hybrid retrieval, the six-tool agent, and the two-phase stream that shows the model think before it answers.

CogniVault Backend Explained, Part 3 · How a Question Becomes a Cited Answer

Fri, 12 Jun 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

You type a question. A few seconds later you get an answer with footnotes — the exact documents and pages it came from. This part walks through everything that happens in between.

In we built the knowledge base: every document chunked, embedded, and indexed. Now we get to use it — and this is where CogniVault stops being a pipeline and starts being interesting.

Two librarians, because one keeps failing you

Imagine a library with one librarian who organises everything by vibe. Ask her about “server downtime procedures” and she’s brilliant — she understands what you mean and finds documents that discuss the concept, whatever words they use. But ask her for “Error Code 404B” and she shrugs, handing you general networking guides. She doesn’t do exact strings.

Down the hall is a second librarian with a card catalogue. He finds the exact string “404B” instantly — but ask him a conceptual question phrased differently from the source text, and he finds nothing at all.

These are the two halves of search:

Semantic search (FAISS) — your question is embedded into a vector, and the index finds chunks whose vectors point the same way (technically: cosine similarity — how closely two arrows align). Great for meaning, blind to exact identifiers.
Keyword search (BM25) — a scoring formula that rewards chunks containing your exact words, weighted by how distinctive those words are. Great for identifiers, blind to synonyms.

CogniVault asks both librarians every time, then merges their answers with Reciprocal Rank Fusion (RRF) — a formula that combines ranked lists using only the positions:

score(chunk) = sum over both lists of 1 / (60 + rank)

A chunk ranked highly by either librarian scores well; a chunk both of them liked floats to the top. The elegance is what’s missing: you never have to reconcile FAISS’s similarity scores with BM25’s completely different scale, because ranks are the only input. The constant 60 comes straight from the original 2009 research paper, and yes, it’s cited in the code.

A few implementation details worth knowing: both searches deliberately over-fetch (at least 20 candidates each) so the fusion has material to work with; very weak semantic matches are dropped, but a keyword-perfect chunk can still be rescued through fusion; and the final answer uses the top 7 chunks. I benchmarked this whole setup against pure vector search in if you want the war stories.

The agent: a model that decides for itself

Here’s the second idea that trips up beginners: CogniVault’s chat is not “paste chunks into a prompt, get an answer.” It’s an agent — a model running in a loop where it can choose to call tools, read their results, and only then answer.

Built with the Strands Agents SDK, the agent gets six tools:

Tool	Job
`search_knowledge_base`	The core RAG tool — runs the hybrid search above, returns chunks with source and page
`list_documents`	See what’s in the vault
`analyze_document`	Structured analysis of one document: topics, entities, facts, summary
`compare_documents`	Answer a question by comparing two documents side by side
`calculator`	Safe maths — the expression is parsed into a syntax tree and only whitelisted operators run. No `eval()`, ever
`current_time`	The date and time

There is no hard-coded routing. The model reads your question and decides which tools to call, guided by its system prompt. Ask “compare the two contracts on termination clauses” and it reaches for compare_documents; ask “what’s 15% of 2,340” and it uses the calculator instead of hallucinating arithmetic.

Two safety details I want beginners to notice, because they’re the difference between a toy and a product: a fresh agent is constructed for every request (no shared state bleeding between concurrent chats), and the document-analysis tools call the model directly rather than through the agent — otherwise an agent calling a tool that calls the agent could recurse forever.

Watching the model think

When you send a message, the response streams back as NDJSON (Newline-Delimited JSON — each line of the stream is its own small JSON object). And it arrives in two phases:

Phase 1 — thinking. Gemma’s reasoning chain streams first, rendered in the collapsible panel above the answer. It’s deliberately best-effort: if it fails for any reason, the answer still comes.

Phase 2 — the agent answer. Tools run, citations appear in the Sources panel the moment the search completes — before the answer finishes writing — and the answer text streams in.

flowchart TB Q["Your question
(plus optional images, files, scope)"] --> P1 subgraph STREAM["POST /rag — one NDJSON stream"] P1["Phase 1: Thinking
reasoning chunks stream first"] P1 --> P2["Phase 2: Agent
fresh per request, history restored"] P2 -->|"decides to call"| T["search_knowledge_base"] T --> D["FAISS
semantic"] T --> S["BM25
keywords"] D --> RRF["RRF fusion — top 7 chunks"] S --> RRF RRF -->|"chunks + citations"| P2 P2 --> OUT["citations, then answer text,
then a memory-usage report"] end

Each line in the stream is typed: thinking, metadata (a citation), text (answer), memory (how full the conversation budget is), or error. The frontend just reads lines and routes them to the right panel. I dissected this design — and why thinking comes before the tool calls — in .

A memory budget, not a bottomless pit

Gemma’s context window (the amount of text the model can consider at once) is 128K tokens, but CogniVault doesn’t let conversation history sprawl across all of it. Each chat session gets a budget of 48,000 characters — roughly 12,000 tokens. Exceed it, and the oldest question-answer pair quietly drops out first, keeping the bulk of the window free for what matters: your current question and the retrieved chunks.

Two resilience touches worth stealing for your own projects:

Restart survival. In-memory history dies with the process. So the first message in a session after a backend restart rebuilds its history from the chat log the frontend persists. Multi-turn memory survives reboots.
Edit and regenerate. Editing an earlier message rewinds the stored history to that point before re-asking — the model genuinely forgets the timeline that no longer exists.

Scope: pinning the AI to specific documents

One last feature, and a lesson about small local models. You can pin a chat to specific files or a category. The filter travels with the request and a mandatory-search instruction is injected into both the system prompt and the user message itself.

Why both? Because small models sometimes skip instructions that live only in the system prompt — but they can’t ignore what’s inside the question. Belt and braces. When you work with 4-billion-parameter models instead of frontier ones, you learn to make instructions impossible to miss rather than hoping they’re followed.

The takeaway

A cited answer is four systems cooperating: two retrievers covering each other’s blind spots, a fusion formula that needs nothing but ranks, an agent that picks its own tools, and a stream that shows its work. None of the four is exotic on its own — the product is the cooperation.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them
FAISS	Facebook AI Similarity Search	The semantic (meaning-based) half of hybrid search
BM25	Best Match 25	The keyword half — a classic ranking formula from the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	Merges the two ranked lists using only each chunk’s rank: `score = Σ 1/(60 + rank)`
NDJSON	Newline-Delimited JSON	A stream where each line is its own complete JSON object — the chat response format
JSON	JavaScript Object Notation	The universal text format for structured data
AST	Abstract Syntax Tree	The parsed form of an expression — how the calculator does maths without `eval()`
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
SDK	Software Development Kit	A library of building blocks — here, Strands, which provides the agent loop
K (in 128K)	Kilo (thousand)	128K tokens ≈ 128,000 tokens — Gemma’s context window

Next up: — the same machinery pointed at generating quizzes, workshops, flashcards, and mindmaps, plus a table of every byte the app stores and exactly where it lives.

CogniVault Backend Explained, Part 4 · Study Tools, Progress, and the Privacy Receipts

Fri, 12 Jun 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

In we followed a question through hybrid retrieval and the agent loop to a cited answer. In this final part, the same machinery gets pointed at a different goal: teaching you — and then we close the series by auditing the project’s central promise: nothing leaves your machine.

One recipe, four study tools

CogniVault generates quizzes, multi-lesson workshops, flashcard decks, and mindmaps from your documents. Four different outputs — but under the hood, one shared five-step recipe:

Retrieve. The same hybrid search from Part 3, but instead of your question, the probe is a broad query like “key concepts, definitions, important facts, main ideas”, scoped to the documents you selected. Up to 15 representative chunks come back.
Prompt from a template. The instructions sent to Gemma are not buried in Python — they’re editable Markdown files in backend/prompts/ (quiz.md, flashcards.md, and so on). Drop a modified copy into backend/prompts/custom/ and it overrides the shipped version on the very next request. No restart, no code change. Prompt engineering as configuration.
Constrain the output. Asking a small local model to “please return JSON” works most of the time — and most of the time is a production bug. CogniVault uses Ollama’s grammar-constrained generation (format="json"), which makes invalid JSON impossible rather than unlikely, plus low temperature for consistency. The full saga of getting reliable structure out of a 4-billion-parameter model is in .
Validate defensively. Every generated item is checked field by field, and malformed items are dropped rather than failing the whole batch. Small models occasionally fumble one question out of ten; a product shouldn’t collapse because of it.
Persist. Everything lands in SQLite, so quizzes are resumable, workshop progress survives restarts, and flashcard statuses are remembered per deck.

Here’s the recipe in motion for a quiz:

%%{init: {'sequence': {'actorFontSize': 28, 'messageFontSize': 24, 'loopTextFontSize': 22, 'noteFontSize': 22}}}%% sequenceDiagram actor U as You participant F as Study Hub UI participant B as FastAPI participant V as VectorDB participant O as Ollama (gemma4:e4b) participant S as SQLite U->>F: Pick scope, difficulty, question count F->>B: POST /api/study/quiz/generate B->>V: Hybrid search, scoped to your documents V-->>B: Up to 15 representative chunks B->>B: Render the quiz.md prompt template B->>O: chat(format="json", low temperature) O-->>B: Grammar-constrained JSON B->>B: Validate each question, drop bad ones B->>S: Save quiz (resumable later) B-->>F: Typed response F-->>U: Play, submit, score — and maybe a new badge

The four tools differ only in their template and their shape: quizzes produce multiple-choice and true/false questions with explanations; workshops produce an outline first and then write each lesson on demand when you open it; flashcards produce front/back pairs; mindmaps produce a topic tree that the frontend renders as an interactive diagram. (That renderer is its own adventure: .)

Sessions that track themselves

Most study apps make you press a start button, and most people forget. CogniVault takes a different stance: study sessions are inferred, not declared.

Every chat message either extends the current session or — after a 15-minute idle gap — quietly starts a new one. Walk away for coffee, come back, keep working: same session. Come back tomorrow: new session. No buttons, no forgetting.

Each message also records a tiny event (timestamp, whether you used a scope filter or attachments) into progress.db — a SQLite database, which is a complete relational database living in a single file. Eleven tables hold everything: sessions, message events, earned badges, quiz attempts and saved quizzes, workshops and lessons, decks and cards, and mindmaps.

One engineering note worth copying: the tracking call inside the chat endpoint is wrapped so that it can never block or break the chat. Analytics must be a passenger, never a driver.

25 badges, defined as data

The achievements aren’t scattered through the code as if statements. They live in one JSON file — 25 entries, each with a code, a name, an icon, the metric it watches, and a target. After each relevant action, an evaluator checks every definition against the database and persists anything newly earned. Some badges form ladders, each pointing to its next level.

Declarative beats imperative here for a simple reason: adding badge number 26 means adding a JSON entry, not writing new logic. The design behind the streaks, the idle-gap rule, and the 90-day heatmap got its own post: .

Voice input, without a cloud microphone

The microphone button is powered by faster-whisper — OpenAI’s Whisper speech-recognition model re-implemented on a faster inference engine — running on your CPU with int8 quantisation (8-bit numbers instead of 32-bit: smaller, faster, accurate enough). No audio ever leaves the machine.

The model is lazy-loaded on the first transcription so app startup stays instant, and if faster-whisper isn’t installed at all, the frontend simply hides the mic button. Features should degrade, not detonate.

The privacy receipts

The series began with a promise: nothing leaves your machine. Promises are cheap — here’s the audit. Every byte CogniVault stores, and where it lives:

Data	Location	Format
Your uploaded files	`docs/` folder	The original files
Search vectors	`vector_store.faiss`	FAISS binary index
Chunk text and metadata	`vector_store.json`	JSON
File-to-category map	`categories.json`	JSON
Chat sessions	`chat_history.json`	JSON
Sessions, badges, quizzes, workshops, decks, mindmaps	`progress.db`	SQLite
Ingestion checkpoints	PostgreSQL (local Docker volume)	DBOS system tables
The AI models themselves	Ollama’s local model store	Model weights

Nothing in that table is on someone else’s computer. Inference goes to localhost. Embeddings go to localhost. The only outbound request the backend ever makes is the URL-import feature — at your explicit request, and guarded against fetching private addresses. The app even surfaces these stats live in its Privacy Vault Audit panel.

And because trust needs more than a table: the whole backend is covered by a pytest suite you can run yourself — the approach is documented in .

Series wrap-up

Four parts, one architecture:

— three processes, four layers, and a decoder ring for the jargon
— a durable, format-aware pipeline that turns any document into searchable vectors
— two retrievers covering each other’s blind spots, fused by rank, driven by an agent
Part 4 — the same machinery generating study materials, tracking progress without buttons, and a storage map with no cloud rows in it

If there’s one theme, it’s this: boring, verifiable choices in service of privacy. Exact search instead of approximate. SQLite files instead of hosted databases. Grammar-constrained JSON instead of hopeful parsing. Soft deletes instead of clever index surgery. Every piece is something you can open, read, and check — which is exactly the point.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
JSON	JavaScript Object Notation	The structured format the generators force the model to produce
SQLite / SQL	(SQL = Structured Query Language)	A complete relational database living in one file, `progress.db`
MCQ	Multiple-Choice Question	One of the two quiz question types (the other is true/false)
CPU	Central Processing Unit	Where Whisper runs — no graphics card required
int8	8-bit integer (quantisation)	Storing model weights as small integers: smaller, faster, accurate enough
AI	Artificial Intelligence	Software performing tasks that normally need human intelligence
API	Application Programming Interface	The endpoints the Study Hub and dashboard call
FAISS	Facebook AI Similarity Search	The vector index in the privacy-receipts table
DBOS	Database-Oriented Operating System	The durable-workflow library whose checkpoints live in PostgreSQL
SSRF	Server-Side Request Forgery	The attack class the URL importer guards against
PNG / PDF	Portable Network Graphics / Portable Document Format	Two of the mindmap export formats (plus Markdown)
SVG	Scalable Vector Graphics	The browser drawing format behind the interactive mindmap rendering

Next steps: clone and read along — the README maps the full architecture, and every claim in this series can be checked directly against the code in backend/. And if you want the deep-dive versions of these topics, the picks up where this tour ends.