Part 1 · Why I Built a Local-First RAG

All abbreviations are fully explained in the appendix at the bottom of the page.

I’ve spent the last few years in front of virtual classrooms full of career-changers in Germany, walking them through programming basics, web development, and introductory AI courses. Most of the information we deal with is fine to paste into cloud-based AI tools. Some of it really isn’t.

Exam materials under confidentiality. A trainee’s portfolio with personal details. Other private documents that should never end up training someone else’s model.

So I built Gemma CogniVault — a fully local AI study and productivity tool. No cloud. No telemetry. No “we may use this data to improve our service.” Just Gemma 4 running on Ollama, on my laptop, talking to my files.

The leaky abstraction

The pitch for cloud AI is great: a giant model, available instantly, billed by the token. The fine print is where it gets uncomfortable:

Where does the data physically live during inference?
Whose jurisdiction governs that hardware this afternoon?
Does the audit trail stop at the API boundary, or can you actually trace what happened to your bytes?
When you tick “do not train on my data,” are you trusting a control, a contract, or both?

For most consumer use cases, those questions are fine to wave away. For education, healthcare, finance, legal, public administration — the answer “trust us” isn’t an answer.

What “local-first” actually means here

Lots of products say “private.” I wanted three concrete properties:

The model lives on your machine. Gemma 4 (gemma4:e4b) and embeddinggemma are pulled via Ollama. Inference is a localhost HTTP call.
Your documents never leave. Vectors, chunks, chat history, study sessions, achievements — all on disk on your computer.
You can verify it. Gemma CogniVault ships a Privacy Audit Panel that shows a live “zero external connections” indicator alongside document counts and the Ollama host. It’s not a promise — it’s a status light.

If a future build of Gemma CogniVault ever made an outbound call, that panel would be the first thing to scream.

What you get back

Going local sounds like a trade-off — surely you lose the magic of the giant frontier models? In practice, with Gemma 4 you get more than enough:

Thinking mode — Gemma 4’s chain-of-thought streams into a collapsible panel before the answer. Watching the model reason about your documents is genuinely useful as a teaching tool.
Tool use — through the Strands Agents SDK, the model decides when to search the knowledge base, summarise a document, compare two files, or check the time.
Vision — attach images and PDFs straight into a chat turn.
Generation that’s actually structured — quizzes, multi-lesson workshops, flashcard decks, and interactive mindmaps, generated with format="json" so the output parses reliably.

Cognivault doesn’t try to be a giant ecosystem. It’s a single-purpose tool that does one thing well: use your own documents with a capable local model in a private environment. I must admit that it was inspired to a great extent by NotebookLM, which I’ve found incredibly useful but not private enough for my needs.

The shape of the app

CogniVault is split into four sections that map to how I actually work with information on cloud-based AI tools:

Section	What it’s for
Chat	Ask anything about your documents. Cited answers, scope filter, voice in.
Knowledge Base	Upload, categorise, manage. SHA-256 detects edits on re-upload.
Study Hub	Quiz · Workshop · Flashcards · Mindmaps — four ways to drill into the source.
Dashboard	Total study time, streak, 25 badges, GitHub-style 90-day heatmap.

Everything reachable from a sidebar that remembers where you left off, on a stack that fits in your ~/Documents folder.

What comes next

This is the first in a short series. Over the next few posts I’ll dig into the parts I’m most proud of — and a few I’d build differently next time:

Hybrid retrieval — why FAISS and BM25, fused with Reciprocal Rank Fusion
Two-phase streaming with Gemma 4 and Strands Agents
Crash-resumable ingestion with DBOS, hash-aware re-ingest, OCR fallback
Getting reliable JSON out of a local LLM (and what to do when it fails)
The mindmap renderer — what hand-rolling SVG taught me, and why v2 uses React Flow
Gamifying learning — 25 badges, idle-gap sessions, 90-day heatmap
Testing a local-AI app with 350+ tests and zero infrastructure

If you want to skip ahead, the code is open source at github.com/ndimoforaretas/local-gemma-rag, and there’s a demo walkthrough on YouTube.

Your data. Your hardware. Your AI. Your vault.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory
AI	Artificial Intelligence	Software performing tasks that normally need human intelligence
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
HTTP	HyperText Transfer Protocol	The protocol browsers and APIs use to exchange requests and responses
API	Application Programming Interface	The boundary where you call someone else’s software — and where cloud audit trails stop
IHK	Industrie- und Handelskammer	The German Chamber of Commerce and Industry, which administers trainer certification
AEVO	Ausbildereignungsverordnung	The German trainer-aptitude regulation — the exam material that motivated this project
FAISS	Facebook AI Similarity Search	Meta’s vector-search library (covered in the next post)
BM25	Best Match 25	A classic keyword-ranking formula (also next post)
SDK	Software Development Kit	A library of building blocks — here, Strands, which provides the agent loop
JSON	JavaScript Object Notation	The universal text format for structured data
PDF	Portable Document Format	One of the eight-plus file types CogniVault ingests
SHA-256	Secure Hash Algorithm, 256-bit	A content fingerprint used to detect edited files on re-upload
OCR	Optical Character Recognition	Turning pictures of text (scans) into machine-readable text
DBOS	Database-Oriented Operating System	The durable-workflow library behind crash-resumable ingestion
SVG	Scalable Vector Graphics	The browser’s built-in vector drawing format

No results found