Part 1 · Why I Built a Local-First RAG
All abbreviations are fully explained in the appendix at the bottom of the page.
I’ve spent the last few years in front of virtual classrooms full of career-changers in Germany, walking them through programming basics, web development, and introductory AI courses. Most of the information we deal with is fine to paste into cloud-based AI tools. Some of it really isn’t.
Exam materials under confidentiality. A trainee’s portfolio with personal details. Other private documents that should never end up training someone else’s model.
So I built Gemma CogniVault — a fully local AI study and productivity tool. No cloud. No telemetry. No “we may use this data to improve our service.” Just Gemma 4 running on Ollama, on my laptop, talking to my files.
The leaky abstraction
The pitch for cloud AI is great: a giant model, available instantly, billed by the token. The fine print is where it gets uncomfortable:
- Where does the data physically live during inference?
- Whose jurisdiction governs that hardware this afternoon?
- Does the audit trail stop at the API boundary, or can you actually trace what happened to your bytes?
- When you tick “do not train on my data,” are you trusting a control, a contract, or both?
For most consumer use cases, those questions are fine to wave away. For education, healthcare, finance, legal, public administration — the answer “trust us” isn’t an answer.
What “local-first” actually means here
Lots of products say “private.” I wanted three concrete properties:
- The model lives on your machine. Gemma 4 (
gemma4:e4b) andembeddinggemmaare pulled via Ollama. Inference is a localhost HTTP call. - Your documents never leave. Vectors, chunks, chat history, study sessions, achievements — all on disk on your computer.
- You can verify it. Gemma CogniVault ships a Privacy Audit Panel that shows a live “zero external connections” indicator alongside document counts and the Ollama host. It’s not a promise — it’s a status light.
If a future build of Gemma CogniVault ever made an outbound call, that panel would be the first thing to scream.
What you get back
Going local sounds like a trade-off — surely you lose the magic of the giant frontier models? In practice, with Gemma 4 you get more than enough:
- Thinking mode — Gemma 4’s chain-of-thought streams into a collapsible panel before the answer. Watching the model reason about your documents is genuinely useful as a teaching tool.
- Tool use — through the Strands Agents SDK, the model decides when to search the knowledge base, summarise a document, compare two files, or check the time.
- Vision — attach images and PDFs straight into a chat turn.
- Generation that’s actually structured — quizzes, multi-lesson workshops, flashcard decks, and interactive mindmaps, generated with
format="json"so the output parses reliably.
Cognivault doesn’t try to be a giant ecosystem. It’s a single-purpose tool that does one thing well: use your own documents with a capable local model in a private environment. I must admit that it was inspired to a great extent by NotebookLM, which I’ve found incredibly useful but not private enough for my needs.
The shape of the app
CogniVault is split into four sections that map to how I actually work with information on cloud-based AI tools:
| Section | What it’s for |
|---|---|
| Chat | Ask anything about your documents. Cited answers, scope filter, voice in. |
| Knowledge Base | Upload, categorise, manage. SHA-256 detects edits on re-upload. |
| Study Hub | Quiz · Workshop · Flashcards · Mindmaps — four ways to drill into the source. |
| Dashboard | Total study time, streak, 25 badges, GitHub-style 90-day heatmap. |
Everything reachable from a sidebar that remembers where you left off, on a stack that fits in your ~/Documents folder.
What comes next
This is the first in a short series. Over the next few posts I’ll dig into the parts I’m most proud of — and a few I’d build differently next time:
- Hybrid retrieval — why FAISS and BM25, fused with Reciprocal Rank Fusion
- Two-phase streaming with Gemma 4 and Strands Agents
- Crash-resumable ingestion with DBOS, hash-aware re-ingest, OCR fallback
- Getting reliable JSON out of a local LLM (and what to do when it fails)
- The mindmap renderer — what hand-rolling SVG taught me, and why v2 uses React Flow
- Gamifying learning — 25 badges, idle-gap sessions, 90-day heatmap
- Testing a local-AI app with 350+ tests and zero infrastructure
If you want to skip ahead, the code is open source at github.com/ndimoforaretas/local-gemma-rag, and there’s a demo walkthrough on YouTube.
Your data. Your hardware. Your AI. Your vault.
Appendix: Abbreviations in this post
| Abbreviation | Full form | Meaning |
|---|---|---|
| RAG | Retrieval-Augmented Generation | Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory |
| AI | Artificial Intelligence | Software performing tasks that normally need human intelligence |
| LLM | Large Language Model | A neural network trained on huge amounts of text that can read and generate language |
| HTTP | HyperText Transfer Protocol | The protocol browsers and APIs use to exchange requests and responses |
| API | Application Programming Interface | The boundary where you call someone else’s software — and where cloud audit trails stop |
| IHK | Industrie- und Handelskammer | The German Chamber of Commerce and Industry, which administers trainer certification |
| AEVO | Ausbildereignungsverordnung | The German trainer-aptitude regulation — the exam material that motivated this project |
| FAISS | Facebook AI Similarity Search | Meta’s vector-search library (covered in the next post) |
| BM25 | Best Match 25 | A classic keyword-ranking formula (also next post) |
| SDK | Software Development Kit | A library of building blocks — here, Strands, which provides the agent loop |
| JSON | JavaScript Object Notation | The universal text format for structured data |
| Portable Document Format | One of the eight-plus file types CogniVault ingests | |
| SHA-256 | Secure Hash Algorithm, 256-bit | A content fingerprint used to detect edited files on re-upload |
| OCR | Optical Character Recognition | Turning pictures of text (scans) into machine-readable text |
| DBOS | Database-Oriented Operating System | The durable-workflow library behind crash-resumable ingestion |
| SVG | Scalable Vector Graphics | The browser’s built-in vector drawing format |
