Architecture Deep Dives

Part 3 · CogniVault Architecture: Why We Keep Ollama Out of Docker

Wed, 03 Jun 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

The golden rule of modern software deployment is containerization. Put everything in Docker to isolate the dependencies, and it will run the exact same way on every machine.

When we first designed CogniVault, the impulse was to put the FastAPI server, the PostgreSQL database, and the Ollama LLM engine all inside a single, secure Docker network.

But we didn’t. We left Ollama running natively on the host machine. Let’s break down why.

The GPU Passthrough Problem

Think of your GPU like the kitchen in a restaurant. The chefs (your AI models) need to be in the kitchen — standing at the stove, hands on the equipment. Now imagine telling the chefs they must cook from a sealed meeting room down the hall, passing instructions through a serving hatch. Technically food might still come out. It will not come out fast.

That sealed room is a container. Large Language Models like Gemma 4 need direct, unhindered access to your hardware’s GPU (like Apple Silicon’s Unified Memory or a dedicated Nvidia card) to generate text fast enough for a real-time chat interface. And the picture varies by platform:

On macOS, Docker runs containers inside a lightweight virtual machine — and there is currently no GPU (Metal) passthrough at all. An Ollama container on a Mac runs CPU-only. For a chat app, that’s disqualifying on its own.
On Linux, Nvidia GPU passthrough exists and works, but it requires installing and configuring the NVIDIA Container Toolkit on the host, which breaks the “it just works” philosophy of local development.

Running Ollama natively sidesteps the whole category of problems.

The Bridge Solution

CogniVault uses a split deployment model, separating the application logic from the heavy AI processing.

The Secure Rooms (Docker): PostgreSQL, which holds the DBOS workflow ledger from Part 2, lives in a Docker bridge network (a private virtual network). Isolated, clean, reproducible.
The Main Building (Native Host): Ollama runs directly on your Mac, Windows, or Linux host OS, giving it direct metal access to your GPU.

CogniVault actually ships two run modes, and it’s worth being precise about them:

The default (scripts/start.sh): only PostgreSQL runs in Docker. The FastAPI backend runs natively too (python -m backend.main), right next to Ollama. Simplest possible loop for local development.
The fully containerized mode (docker-compose.yaml): the FastAPI app joins Postgres inside the compose network. In this mode the app container reaches the native Ollama engine through a special Docker routing address: host.docker.internal:11434.

Either way, the rule stays the same: the model never goes in the box.
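To make the two run modes concrete, here is a minimal sketch of how a backend could pick the Ollama address. The environment variable names OLLAMA_BASE_URL and RUNNING_IN_DOCKER and the function resolve_ollama_url are hypothetical, chosen for illustration; only the two addresses come from the setup described above (11434 on localhost is Ollama's default port).

```python
import os
from typing import Mapping, Optional

# Addresses from the two run modes described above.
NATIVE_URL = "http://localhost:11434"
CONTAINER_URL = "http://host.docker.internal:11434"


def resolve_ollama_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the base URL the backend should use to reach the native Ollama engine.

    OLLAMA_BASE_URL and RUNNING_IN_DOCKER are hypothetical variable names,
    not confirmed CogniVault settings.
    """
    env = os.environ if env is None else env
    # An explicit override always wins.
    explicit = env.get("OLLAMA_BASE_URL")
    if explicit:
        return explicit.rstrip("/")
    # Inside a container, localhost is the container itself,
    # so the host has to be reached through Docker's host alias.
    if env.get("RUNNING_IN_DOCKER", "").lower() in {"1", "true", "yes"}:
        return CONTAINER_URL
    return NATIVE_URL
```

One caveat worth knowing: on plain Docker Engine for Linux, host.docker.internal is not defined automatically and usually has to be mapped to host-gateway (for example through an extra_hosts entry in the compose file). Docker Desktop on macOS and Windows provides it out of the box.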

graph TD
    Client[📱 Browser / User] -->|HTTP: 8000| App
    subgraph Host Machine [Host OS: Native GPU Access]
        Ollama[🧠 Ollama Engine]
        Models[(gemma4:e4b)]
        Ollama <--> Models
        subgraph Docker Compose Network
            App[🖥️ FastAPI App Container]
            Postgres[(🐘 PostgreSQL)]
            App <-->|Internal Port 5432| Postgres
        end
        App <-->|host.docker.internal:11434| Ollama
    end

What about the Vector Database?

You might notice FAISS isn’t a container here. Unlike massive SQL databases, FAISS is extremely lightweight. In CogniVault, FAISS runs directly inside the FastAPI Python process’s memory and saves its data to a local folder. It doesn’t need its own container.

By keeping the heavy LLM lifting on the metal and the bookkeeping in containers, we get a balance that local AI setups often struggle to strike: isolated, reproducible dependencies for the services, and full-speed GPU inference for the model.

See It In Action

That wraps up the CogniVault architecture series! If you want to run this 100% local, privacy-first Study Companion on your own hardware:

Grab the code:
Watch the walkthrough:

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
GPU	Graphics Processing Unit	The hardware that makes local model inference fast; containers struggle to reach it
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
AI	Artificial Intelligence	Software performing tasks that normally need human intelligence
API	Application Programming Interface	The set of URLs the frontend calls to talk to the backend
HTTP	HyperText Transfer Protocol	The protocol browsers and APIs use to exchange requests and responses
OS	Operating System	macOS, Windows, or Linux — where Ollama runs natively
DBOS	Database-Oriented Operating System	The durable-workflow library whose ledger lives in the Postgres container (see Part 2)
SQL	Structured Query Language	The language of relational databases like PostgreSQL
FAISS	Facebook AI Similarity Search	The in-process vector index — deliberately not a separate container
VM	Virtual Machine	The hidden layer Docker uses on macOS — and the reason Mac containers can’t reach the GPU

Part 2 · CogniVault Architecture: Durable Ingestion with DBOS

Tue, 02 Jun 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

In a basic local AI setup, adding documents to your database is usually just a simple Python script. You open a PDF, chop the text into chunks, turn those chunks into math (embeddings), and save them.

This works great for a five-page essay. But what happens when you are ingesting a 1,000-page technical manual and your laptop goes to sleep at page 800?

The script dies. When you wake your laptop up, you have to start all over from page 1, wasting time and compute power. A simple script wasn’t going to cut it for CogniVault. We needed a Durable Workflow.

The Factory Ledger (DBOS)

Think of data ingestion like a factory assembly line. If the power goes out, the workers shouldn’t have to rebuild every product from scratch. They should just look at a permanent ledger, see exactly which box they were packing when the lights went out, and resume from there.

CogniVault uses a framework called DBOS (Database-Oriented Operating System) backed by a PostgreSQL database to act as this ledger.

Every step of the ingestion process records its completion in Postgres. If the server crashes mid-way, nothing dramatic happens in the moment — the magic is on restart: DBOS reads the ledger, sees which steps already finished, replays their recorded results instantly, and resumes from the first unfinished step.
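The replay behaviour can be illustrated with a toy ledger. This is a sketch of the idea only, not the DBOS API: the Ledger class, run_step, and ingest are hypothetical names, SQLite stands in for Postgres, and the step bodies are placeholders.

```python
import json
import sqlite3
from typing import Any, Callable, List, Optional


class Ledger:
    """Toy step ledger. SQLite stands in for Postgres; this is not the DBOS API."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "workflow_id TEXT, step_name TEXT, result TEXT, "
            "PRIMARY KEY (workflow_id, step_name))"
        )

    def run_step(self, workflow_id: str, step_name: str, fn: Callable[[], Any]) -> Any:
        row = self.db.execute(
            "SELECT result FROM steps WHERE workflow_id = ? AND step_name = ?",
            (workflow_id, step_name),
        ).fetchone()
        if row is not None:
            # The step finished in an earlier run: replay its recorded result.
            return json.loads(row[0])
        result = fn()
        # Record completion only after the step has actually succeeded.
        self.db.execute(
            "INSERT INTO steps VALUES (?, ?, ?)",
            (workflow_id, step_name, json.dumps(result)),
        )
        self.db.commit()
        return result


def ingest(ledger: Ledger, log: List[str], fail_at: Optional[str] = None) -> int:
    """Placeholder pipeline; `fail_at` simulates a crash inside the named step."""

    def step(name: str, fn: Callable[[], Any]) -> Any:
        def wrapped() -> Any:
            if name == fail_at:
                raise RuntimeError(f"simulated crash in {name}")
            log.append(name)  # Only steps that really execute are logged.
            return fn()

        return ledger.run_step("wf-1", name, wrapped)

    files = step("scan", lambda: ["manual.pdf"])
    text = step("extract", lambda: f"text of {files[0]}")
    vectors = step("embed", lambda: [[float(len(text))]])
    return step("save", lambda: len(vectors))
```

Calling ingest once with fail_at="embed" and then again without it executes scan and extract only once. The second call reads their results back from the ledger and picks up at embed.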

One important boundary: Postgres holds only the ledger — which steps ran and what they returned. Your documents, chunks, and vectors never live there. They go to a FAISS index plus a JSON metadata file on disk.

SHA-256 Hashing: The Idempotency Trick

The system also needs to be smart about re-runs. If you fix a typo in one document and ingest the folder again, you don’t want the system to waste 10 minutes re-embedding every file that did not change.

CogniVault achieves Idempotency (the ability to run the same operation multiple times without changing the result beyond the initial application) with the workflow’s very first step: it scans the docs/ folder and generates a SHA-256 hash (a unique digital fingerprint) for every file.

If the hash is new, it processes the file.
If the hash has changed (because you edited the file), it soft-deletes the old text chunks and re-embeds only the new version.
If the hash is identical, it skips the file entirely.
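The three cases above can be sketched with the standard library. The function names and the `known` mapping (file name to the hash stored on the previous run) are assumptions made for illustration; how CogniVault actually stores earlier hashes is not shown here.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path) -> str:
    """Hash a file in blocks so large documents never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def classify(folder: Path, known: Dict[str, str]) -> Dict[str, str]:
    """Label every file in `folder` as 'new', 'changed', or 'unchanged'.

    `known` is a hypothetical record of file name -> hash from the last run.
    """
    status: Dict[str, str] = {}
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        current = sha256_of(path)
        previous = known.get(path.name)
        if previous is None:
            status[path.name] = "new"        # process the file
        elif previous != current:
            status[path.name] = "changed"    # soft-delete old chunks, re-embed
        else:
            status[path.name] = "unchanged"  # skip entirely
    return status
```

Running classify twice on an untouched folder yields only "unchanged" entries the second time, which is exactly what makes a repeated ingestion cheap.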

Here is how the pipeline flows:

graph TD
    Raw[📄 Uploaded Document] --> DBOS[🐘 DBOS Workflow Starts]
    subgraph Durable Ingestion Pipeline
        DBOS -->|Step 1| Hash{Hash Check SHA-256}
        Hash -->|Unchanged| Skip[Skip Processing]
        Hash -->|New / Changed| Extract[✂️ Step 2: Extract Text per Document]
        Extract --> Chunk[Chunk: 1000 chars, 100 overlap]
        Chunk -->|Step 3, batches of 5| Embed[🔢 embeddinggemma Embeddings]
        Embed -->|Step 4| Save[(💾 FAISS Index + Metadata JSON)]
    end
    Save -->|Workflow Complete| Done[✅ Ready for Search]

(A detail for the curious: the checkpointed steps are the scan, the per-document extraction, each embedding batch, and the save. The chunking in between is fast pure-Python work, so it simply re-runs as part of the workflow body — checkpointing it would cost more than redoing it.)
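The chunking and batching numbers from the diagram (1000 characters, 100 overlap, batches of 5) can be sketched as follows. The function names are hypothetical, and a plain character window is an assumption: the real splitter may respect sentence or paragraph boundaries.

```python
from typing import Iterator, List


def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # The last window already reached the end of the text.
    return chunks


def batches(items: List[str], batch_size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` chunks for embedding."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Because this is pure string slicing with no I/O, re-running it after a crash is cheap, which matches the reasoning for leaving it outside the checkpointed steps.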

What’s Next?

By wrapping the ingestion pipeline in DBOS, the system transforms from a fragile script into a resilient, production-grade state machine.

Now that our data is safely ingested, how do we deploy this entire pipeline without melting our laptop’s GPU? Read Part 3: Why We Keep Ollama Out of Docker

You can also explore the DBOS implementation directly in the project’s backend/services/ingest.py file.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
DBOS	Database-Oriented Operating System	A library that checkpoints workflow steps in a database so crashed jobs resume instead of restarting
SHA-256	Secure Hash Algorithm, 256-bit	A fingerprint function: any file maps to a 64-character hexadecimal hash that is unique for all practical purposes; change one byte and the hash changes completely
PDF	Portable Document Format	The document format whose text (and scans) the pipeline extracts
FAISS	Facebook AI Similarity Search	Meta’s vector-search library — where the embeddings actually live
JSON	JavaScript Object Notation	The text format used for the chunk-metadata file stored next to the FAISS index
AI	Artificial Intelligence	Software performing tasks that normally need human intelligence
GPU	Graphics Processing Unit	The hardware that makes local model inference fast — the subject of Part 3

Part 1 · CogniVault Architecture: Why Standard RAG Isn't Enough (Hybrid Search)

Mon, 01 Jun 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

Vector search finds the items in a dataset whose embeddings lie closest to the query’s embedding, and it is how most RAG systems retrieve context. But what happens when a match has to depend not only on semantic meaning but also on the exact wording of the query?

This becomes critical when the information you’re looking for isn’t just related but must match a specific string or keyword exactly.

Two ways of finding a book

Picture a good local bookshop. The owner has read everything, and she recommends by feel. Tell her you loved The Martian and she hands you Project Hail Mary — different title, different plot, but the same DNA: a lone scientist, an impossible survival problem, jokes under pressure. Ask for “something like Pride and Prejudice” and you’ll walk out with Emma. She isn’t matching words. She’s matching meaning.

Now ask her a different kind of question: “I need the book with ISBN 978-0-553-41802-6,” or “the manual that mentions error code 404B on the cover.” Her superpower is useless here. No amount of literary intuition finds an exact string. For that, you walk to the till and check the catalogue — a boring, literal index that knows exactly which shelf holds which identifier, and nothing about vibes.

A well-run bookshop needs both. So does a well-run RAG system:

FAISS — Facebook AI Similarity Search (the well-read owner): a vector index that finds chunks of text whose meaning is mathematically close to your prompt. Brilliant for “how is the practical exam structured?”, blind to “§3 Absatz 2”.
BM25 — Best Match 25 (the catalogue): a classic keyword-scoring algorithm that rewards exact word matches, weighted by how rare and distinctive those words are. Brilliant for identifiers and quoted phrases, blind to paraphrase.
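To show what "weighted by how rare and distinctive those words are" means in practice, here is a compact sketch of Okapi BM25 scoring. It is illustrative only: whitespace tokenization, the parameter values k1 = 1.5 and b = 0.75, and the non-negative IDF variant are assumptions, and CogniVault may well rely on a library rather than hand-written scoring.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores: List[float] = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a high IDF, terms found everywhere a low one.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A query such as "404B" scores zero for every document that lacks that exact token, which is the literal behaviour the catalogue analogy describes.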

CogniVault runs both retrievers on every search — this is Hybrid Search — and then merges the two ranked lists with a formula called Reciprocal Rank Fusion (RRF). RRF scores each chunk purely by its position in each list: a chunk ranked highly by either retriever scores well, and a chunk both retrievers agree on rises to the top. Because only ranks are used, the two retrievers’ incompatible scoring scales never have to be reconciled.
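The fusion step is small enough to sketch in full. The function name rrf_fuse and the chunk ids are hypothetical, and k = 60 is the constant commonly used with RRF, not a confirmed CogniVault setting.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Merge ranked lists of chunk ids: score = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


faiss_ranking = ["c2", "c1", "c3"]  # semantic retriever, best match first
bm25_ranking = ["c4", "c2", "c5"]   # keyword retriever, best match first
fused = rrf_fuse([faiss_ranking, bm25_ranking])
```

In this example c2 comes out on top because both retrievers rank it, even though only one of them placed it first. The raw FAISS distances and BM25 scores never enter the calculation.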

The agent decides when to search

Here’s the part most diagrams get backwards (mine included, in an earlier draft): retrieval doesn’t happen before the model gets involved. It happens inside the model’s own loop.

CogniVault wraps Gemma in the Strands Agents SDK. The model receives your question along with a set of Tools (pre-written Python functions like search_knowledge_base, calculator, or compare_documents). It then reasons about the question and decides for itself whether — and which — tools to call. For most document questions it calls search_knowledge_base, reads the retrieved chunks, and only then writes its answer, grounded in what it found.
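The shape of that loop can be sketched without any SDK. Everything below is a stand-in: stub_model replaces Gemma with two hard-coded rules, the tools are stubs that only borrow the names mentioned above, and none of this is the Strands Agents API.

```python
from typing import Callable, Dict, List


def search_knowledge_base(query: str) -> str:
    """Stub tool: a real version would run the hybrid search and return chunks."""
    return f"[chunk about {query!r}]"


def calculator(expression: str) -> str:
    """Stub tool: only understands 'a + b'."""
    left, right = expression.split("+")
    return str(float(left) + float(right))


TOOLS: Dict[str, Callable[[str], str]] = {
    "search_knowledge_base": search_knowledge_base,
    "calculator": calculator,
}


def stub_model(question: str, observations: List[str]) -> dict:
    """Stand-in for the LLM: returns either a tool call or a final answer."""
    if observations:
        # Tool output is available, so write the answer from it.
        return {"answer": f"Answer grounded in: {observations[-1]}"}
    if "+" in question:
        return {"tool": "calculator", "input": question}
    return {"tool": "search_knowledge_base", "input": question}


def run_agent(question: str, model: Callable[[str, List[str]], dict] = stub_model,
              max_turns: int = 3) -> str:
    """Let the model choose tools until it produces an answer or runs out of turns."""
    observations: List[str] = []
    for _ in range(max_turns):
        decision = model(question, observations)
        if "answer" in decision:
            return decision["answer"]
        tool = TOOLS[decision["tool"]]
        observations.append(tool(decision["input"]))
    return "No answer within the turn limit."
```

The point of the sketch is the control flow: the caller never decides to search. It only executes whichever tool the model asks for and feeds the result back in.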

Here is the architectural blueprint of that loop:

graph TD
    Client[📱 User Query] --> App[🖥️ FastAPI Server]
    subgraph AgentLoop["The Strands Agent Loop (powered by Gemma 4)"]
        App --> Agent[🧠 Agent reasons about the question]
        Agent -->|Decides to search| Search[search_knowledge_base]
        subgraph Hybrid Search Engine
            Search -->|Semantic| FAISS[(FAISS Vector)]
            Search -->|Exact match| BM25[(BM25 Keyword)]
            FAISS --> RRF{RRF Fusion}
            BM25 --> RRF
        end
        RRF -->|Best chunks + citations| Agent
        Agent -->|Grounded answer| Answer[Streamed response]
    end
    Answer --> Client

One subtlety worth noting: the agent is Gemma. There is no separate “formatting model” at the end — the same model that decided to search also writes the final answer, now with the retrieved chunks in front of it.

What’s Next?

Building a toy RAG app is easy, but building one that actually retrieves the exact document you need requires hybrid engines and an agent that knows when to use them.

Want to see how this system safely ingests massive documents without losing work when something crashes? Read Part 2: Durable Ingestion with DBOS

Or, if you prefer to jump straight into the code, the hybrid search lives in the project’s backend/services/vector_db.py file.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory
FAISS	Facebook AI Similarity Search	Meta’s library for storing vectors and finding the most similar ones fast
BM25	Best Match 25	A keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system
RRF	Reciprocal Rank Fusion	A formula that merges multiple ranked lists using only each item’s rank: `score = Σ 1/(k + rank)`
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
SDK	Software Development Kit	A library of building blocks — here, Strands, which provides the agent loop
API	Application Programming Interface	The set of URLs the frontend calls to talk to the backend
ISBN	International Standard Book Number	The unique identifier printed on every published book — the catalogue’s best friend