Part 1 · Why I Built a Local-First RAG

Apr 20, 2026·
Ndimofor Aretas
Ndimofor Aretas
· 5 min read
blog AI Engineering

All abbreviations are fully explained in the appendix at the bottom of the page.

I’ve spent the last few years in front of virtual classrooms full of career-changers in Germany, walking them through programming basics, web development, and introductory AI courses. Most of the information we deal with is fine to paste into cloud-based AI tools. Some of it really isn’t.

Exam materials under confidentiality. A trainee’s portfolio with personal details. Other private documents that should never end up training someone else’s model.

So I built Gemma CogniVault — a fully local AI study and productivity tool. No cloud. No telemetry. No “we may use this data to improve our service.” Just Gemma 4 running on Ollama, on my laptop, talking to my files.

The leaky abstraction

The pitch for cloud AI is great: a giant model, available instantly, billed by the token. The fine print is where it gets uncomfortable:

  • Where does the data physically live during inference?
  • Whose jurisdiction governs that hardware this afternoon?
  • Does the audit trail stop at the API boundary, or can you actually trace what happened to your bytes?
  • When you tick “do not train on my data,” are you trusting a control, a contract, or both?

For most consumer use cases, those questions are fine to wave away. For education, healthcare, finance, legal, public administration — the answer “trust us” isn’t an answer.

What “local-first” actually means here

Lots of products say “private.” I wanted three concrete properties:

  1. The model lives on your machine. Gemma 4 (gemma4:e4b) and embeddinggemma are pulled via Ollama. Inference is a localhost HTTP call.
  2. Your documents never leave. Vectors, chunks, chat history, study sessions, achievements — all on disk on your computer.
  3. You can verify it. Gemma CogniVault ships a Privacy Audit Panel that shows a live “zero external connections” indicator alongside document counts and the Ollama host. It’s not a promise — it’s a status light.

If a future build of Gemma CogniVault ever made an outbound call, that panel would be the first thing to scream.

What you get back

Going local sounds like a trade-off — surely you lose the magic of the giant frontier models? In practice, with Gemma 4 you get more than enough:

  • Thinking mode — Gemma 4’s chain-of-thought streams into a collapsible panel before the answer. Watching the model reason about your documents is genuinely useful as a teaching tool.
  • Tool use — through the Strands Agents SDK, the model decides when to search the knowledge base, summarise a document, compare two files, or check the time.
  • Vision — attach images and PDFs straight into a chat turn.
  • Generation that’s actually structured — quizzes, multi-lesson workshops, flashcard decks, and interactive mindmaps, generated with format="json" so the output parses reliably.

Cognivault doesn’t try to be a giant ecosystem. It’s a single-purpose tool that does one thing well: use your own documents with a capable local model in a private environment. I must admit that it was inspired to a great extent by NotebookLM, which I’ve found incredibly useful but not private enough for my needs.

The shape of the app

CogniVault is split into four sections that map to how I actually work with information on cloud-based AI tools:

SectionWhat it’s for
ChatAsk anything about your documents. Cited answers, scope filter, voice in.
Knowledge BaseUpload, categorise, manage. SHA-256 detects edits on re-upload.
Study HubQuiz · Workshop · Flashcards · Mindmaps — four ways to drill into the source.
DashboardTotal study time, streak, 25 badges, GitHub-style 90-day heatmap.

Everything reachable from a sidebar that remembers where you left off, on a stack that fits in your ~/Documents folder.

What comes next

This is the first in a short series. Over the next few posts I’ll dig into the parts I’m most proud of — and a few I’d build differently next time:

  • Hybrid retrieval — why FAISS and BM25, fused with Reciprocal Rank Fusion
  • Two-phase streaming with Gemma 4 and Strands Agents
  • Crash-resumable ingestion with DBOS, hash-aware re-ingest, OCR fallback
  • Getting reliable JSON out of a local LLM (and what to do when it fails)
  • The mindmap renderer — what hand-rolling SVG taught me, and why v2 uses React Flow
  • Gamifying learning — 25 badges, idle-gap sessions, 90-day heatmap
  • Testing a local-AI app with 350+ tests and zero infrastructure

If you want to skip ahead, the code is open source at github.com/ndimoforaretas/local-gemma-rag, and there’s a demo walkthrough on YouTube.

Your data. Your hardware. Your AI. Your vault.


Appendix: Abbreviations in this post

AbbreviationFull formMeaning
RAGRetrieval-Augmented GenerationRetrieve relevant passages from your own documents first; let the model answer from them instead of from training memory
AIArtificial IntelligenceSoftware performing tasks that normally need human intelligence
LLMLarge Language ModelA neural network trained on huge amounts of text that can read and generate language
HTTPHyperText Transfer ProtocolThe protocol browsers and APIs use to exchange requests and responses
APIApplication Programming InterfaceThe boundary where you call someone else’s software — and where cloud audit trails stop
IHKIndustrie- und HandelskammerThe German Chamber of Commerce and Industry, which administers trainer certification
AEVOAusbildereignungsverordnungThe German trainer-aptitude regulation — the exam material that motivated this project
FAISSFacebook AI Similarity SearchMeta’s vector-search library (covered in the next post)
BM25Best Match 25A classic keyword-ranking formula (also next post)
SDKSoftware Development KitA library of building blocks — here, Strands, which provides the agent loop
JSONJavaScript Object NotationThe universal text format for structured data
PDFPortable Document FormatOne of the eight-plus file types CogniVault ingests
SHA-256Secure Hash Algorithm, 256-bitA content fingerprint used to detect edited files on re-upload
OCROptical Character RecognitionTurning pictures of text (scans) into machine-readable text
DBOSDatabase-Oriented Operating SystemThe durable-workflow library behind crash-resumable ingestion
SVGScalable Vector GraphicsThe browser’s built-in vector drawing format