<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Architecture Deep Dives |</title><link>https://aretascodes.dev/categories/architecture-deep-dives/</link><atom:link href="https://aretascodes.dev/categories/architecture-deep-dives/index.xml" rel="self" type="application/rss+xml"/><description>Architecture Deep Dives</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Wed, 03 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://aretascodes.dev/media/icon_hu_2ab4f4763b27c75b.png</url><title>Architecture Deep Dives</title><link>https://aretascodes.dev/categories/architecture-deep-dives/</link></image><item><title>Part 3 · CogniVault Architecture: Why We Keep Ollama Out of Docker</title><link>https://aretascodes.dev/blog/cognivault-deployment-architecture/</link><pubDate>Wed, 03 Jun 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/cognivault-deployment-architecture/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The golden rule of modern software deployment is containerization. Put everything in Docker to isolate the dependencies, and it will run the exact same way on every machine.&lt;/p&gt;
&lt;p&gt;When we first designed CogniVault, the impulse was to put the FastAPI server, the PostgreSQL database, and the Ollama LLM engine all inside a single, secure Docker network.&lt;/p&gt;
&lt;p&gt;But we didn&amp;rsquo;t. We left Ollama running natively on the host machine. Let&amp;rsquo;s break down why.&lt;/p&gt;
&lt;h2 id="the-gpu-passthrough-problem"&gt;The GPU Passthrough Problem&lt;/h2&gt;
&lt;p&gt;Think of your GPU like the kitchen in a restaurant. The chefs (your AI models) need to &lt;em&gt;be in the kitchen&lt;/em&gt; — standing at the stove, hands on the equipment. Now imagine telling the chefs they must cook from a sealed meeting room down the hall, passing instructions through a serving hatch. Technically food might still come out. It will not come out fast.&lt;/p&gt;
&lt;p&gt;That sealed room is a container. Large Language Models like Gemma 4 need direct, unhindered access to your hardware&amp;rsquo;s GPU (an Apple Silicon GPU with its unified memory, or a dedicated Nvidia card) to generate text fast enough for a real-time chat interface. And the picture varies by platform:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;On macOS&lt;/strong&gt;, Docker runs containers inside a lightweight virtual machine — and there is currently &lt;strong&gt;no GPU (Metal) passthrough at all&lt;/strong&gt;. An Ollama container on a Mac runs CPU-only. For a chat app, that&amp;rsquo;s disqualifying on its own.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;On Linux&lt;/strong&gt;, Nvidia GPU passthrough exists and works, but it requires installing and configuring the NVIDIA Container Toolkit on the host, which breaks the &amp;ldquo;it just works&amp;rdquo; philosophy of local development.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Running Ollama natively sidesteps the whole category of problems.&lt;/p&gt;
&lt;h2 id="the-bridge-solution"&gt;The Bridge Solution&lt;/h2&gt;
&lt;p&gt;CogniVault uses a split deployment model, separating the application logic from the heavy AI processing.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The Secure Rooms (Docker):&lt;/strong&gt; PostgreSQL, which holds the DBOS workflow ledger from Part 2, lives in a &lt;strong&gt;Docker Bridge Network&lt;/strong&gt; (a private virtual network). Isolated, clean, reproducible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Main Building (Native Host):&lt;/strong&gt; Ollama runs directly on your Mac, Windows, or Linux host OS, giving it direct access to your GPU with no virtualization layer in between.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;CogniVault actually ships &lt;strong&gt;two run modes&lt;/strong&gt;, and it&amp;rsquo;s worth being precise about them:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The default (&lt;code&gt;scripts/start.sh&lt;/code&gt;):&lt;/strong&gt; only PostgreSQL runs in Docker. The FastAPI backend runs natively too (&lt;code&gt;python -m backend.main&lt;/code&gt;), right next to Ollama. Simplest possible loop for local development.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The fully containerized mode (&lt;code&gt;docker-compose.yaml&lt;/code&gt;):&lt;/strong&gt; the FastAPI app joins Postgres inside the compose network. In this mode the app container reaches the native Ollama engine through a special Docker routing address: &lt;code&gt;host.docker.internal:11434&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Either way, the rule stays the same: &lt;strong&gt;the model never goes in the box.&lt;/strong&gt;&lt;/p&gt;
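&lt;p&gt;The only thing that changes between the two run modes is the address the backend uses to reach Ollama. A minimal sketch of that switch follows. The &lt;code&gt;COGNIVAULT_IN_DOCKER&lt;/code&gt; flag and the &lt;code&gt;ollama_base_url&lt;/code&gt; helper are hypothetical names invented for this illustration, not taken from the CogniVault code; port 11434 is the Ollama default.&lt;/p&gt;

```python
import os

OLLAMA_PORT = 11434  # Ollama's default listening port


def ollama_base_url(environ=None):
    """Pick the Ollama address for the current run mode.

    COGNIVAULT_IN_DOCKER is a hypothetical flag used only in this sketch.
    """
    env = os.environ if environ is None else environ
    in_container = env.get("COGNIVAULT_IN_DOCKER", "0") == "1"
    # Inside the compose network, localhost is the container itself,
    # so the app has to use Docker's special host address instead.
    host = "host.docker.internal" if in_container else "localhost"
    return f"http://{host}:{OLLAMA_PORT}"


if __name__ == "__main__":
    print(ollama_base_url({}))
    print(ollama_base_url({"COGNIVAULT_IN_DOCKER": "1"}))
```

&lt;p&gt;One caveat for Linux hosts: Docker Engine does not define &lt;code&gt;host.docker.internal&lt;/code&gt; by default there, so a compose file typically needs an &lt;code&gt;extra_hosts&lt;/code&gt; entry that maps it to &lt;code&gt;host-gateway&lt;/code&gt;.&lt;/p&gt;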
&lt;div class="mermaid"&gt;graph TD
Client[📱 Browser / User] --&gt;|HTTP: 8000| App
subgraph Host Machine [Host OS: Native GPU Access]
Ollama[🧠 Ollama Engine]
Models[(gemma4:e4b)]
Ollama &lt;--&gt; Models
subgraph Docker Compose Network
App[🖥️ FastAPI App Container]
Postgres[(🐘 PostgreSQL)]
App &lt;--&gt;|Internal Port 5432| Postgres
end
App &lt;--&gt;|host.docker.internal:11434| Ollama
end
&lt;/div&gt;
&lt;h3 id="what-about-the-vector-database"&gt;What about the Vector Database?&lt;/h3&gt;
&lt;p&gt;You might notice FAISS isn&amp;rsquo;t a container here. Unlike massive SQL databases, FAISS is extremely lightweight. In CogniVault, FAISS runs directly inside the FastAPI Python process&amp;rsquo;s memory and saves its data to a local folder. It doesn&amp;rsquo;t need its own container.&lt;/p&gt;
&lt;p&gt;By keeping the heavy LLM lifting on the metal and the bookkeeping in containers, we strike a balance that local AI setups often miss: isolated, reproducible dependencies for the supporting services, and full GPU performance for the model.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id="see-it-in-action"&gt;See It In Action&lt;/h3&gt;
&lt;p&gt;That wraps up the CogniVault architecture series! If you want to run this 100% local, privacy-first Study Companion on your own hardware:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Grab the code:&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Watch the walkthrough:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graphics Processing Unit&lt;/td&gt;
&lt;td&gt;The hardware that makes local model inference fast; containers struggle to reach it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large Language Model&lt;/td&gt;
&lt;td&gt;A neural network trained on huge amounts of text that can read and generate language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Artificial Intelligence&lt;/td&gt;
&lt;td&gt;Software performing tasks that normally need human intelligence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The set of URLs the frontend calls to talk to the backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HyperText Transfer Protocol&lt;/td&gt;
&lt;td&gt;The protocol browsers and APIs use to exchange requests and responses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Operating System&lt;/td&gt;
&lt;td&gt;macOS, Windows, or Linux — where Ollama runs natively&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database-Oriented Operating System&lt;/td&gt;
&lt;td&gt;The durable-workflow library whose ledger lives in the Postgres container (see Part 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured Query Language&lt;/td&gt;
&lt;td&gt;The language of relational databases like PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;The in-process vector index — deliberately &lt;em&gt;not&lt;/em&gt; a separate container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Virtual Machine&lt;/td&gt;
&lt;td&gt;The hidden layer Docker uses on macOS — and the reason Mac containers can&amp;rsquo;t reach the GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;</description></item><item><title>Part 2 · CogniVault Architecture: Durable Ingestion with DBOS</title><link>https://aretascodes.dev/blog/cognivault-ingestion-pipeline/</link><pubDate>Tue, 02 Jun 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/cognivault-ingestion-pipeline/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In a basic local AI setup, adding documents to your database is usually just a simple Python script. You open a PDF, chop the text into chunks, turn those chunks into math (embeddings), and save them.&lt;/p&gt;
&lt;p&gt;This works great for a five-page essay. But what happens when you are ingesting a 1,000-page technical manual and your laptop goes to sleep at page 800?&lt;/p&gt;
&lt;p&gt;The script dies. When you wake your laptop up, you have to start all over from page 1, wasting time and compute power. A simple script wasn&amp;rsquo;t going to cut it for CogniVault. We needed a &lt;strong&gt;Durable Workflow&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="the-factory-ledger-dbos"&gt;The Factory Ledger (DBOS)&lt;/h2&gt;
&lt;p&gt;Think of data ingestion like a factory assembly line. If the power goes out, the workers shouldn&amp;rsquo;t have to rebuild every product from scratch. They should just look at a permanent ledger, see exactly which box they were packing when the lights went out, and resume from there.&lt;/p&gt;
&lt;p&gt;CogniVault uses a framework called &lt;strong&gt;DBOS (Database-Oriented Operating System)&lt;/strong&gt; backed by a PostgreSQL database to act as this ledger.&lt;/p&gt;
&lt;p&gt;Every step of the ingestion process records its completion in Postgres. If the server crashes mid-way, nothing dramatic happens in the moment — the magic is on restart: DBOS reads the ledger, sees which steps already finished, replays their recorded results instantly, and resumes from the first unfinished step.&lt;/p&gt;
&lt;p&gt;One important boundary: Postgres holds &lt;strong&gt;only the ledger&lt;/strong&gt; — which steps ran and what they returned. Your documents, chunks, and vectors never live there. They go to a FAISS index plus a JSON metadata file on disk.&lt;/p&gt;
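&lt;p&gt;The checkpoint-and-replay idea described above can be sketched in a few lines. This is a simplified illustration, not the DBOS API: it uses SQLite from the Python standard library as a stand-in for Postgres, and &lt;code&gt;StepLedger&lt;/code&gt;, &lt;code&gt;run_step&lt;/code&gt;, and &lt;code&gt;ingest&lt;/code&gt; are hypothetical names chosen for the example.&lt;/p&gt;

```python
import json
import sqlite3


class StepLedger:
    """Records each finished step so a restarted workflow can skip it."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS steps (name TEXT PRIMARY KEY, result TEXT)"
        )

    def run_step(self, name, func):
        row = self.db.execute(
            "SELECT result FROM steps WHERE name = ?", (name,)
        ).fetchone()
        if row is not None:
            # Step already finished in an earlier run: replay its result.
            return json.loads(row[0])
        result = func()
        self.db.execute(
            "INSERT INTO steps (name, result) VALUES (?, ?)",
            (name, json.dumps(result)),
        )
        self.db.commit()
        return result


def ingest(ledger, calls):
    """A three-step workflow; `calls` logs which steps really executed."""
    def step(name, value):
        def body():
            calls.append(name)
            return value
        return ledger.run_step(name, body)

    files = step("scan", ["manual.pdf"])
    text = step("extract", "page text")
    return step("save", {"files": files, "chars": len(text)})
```

&lt;p&gt;Running &lt;code&gt;ingest&lt;/code&gt; twice against the same ledger executes every step the first time and none the second time, yet both runs return the same result. Note that the ledger stores only step names and return values, mirroring the boundary described above.&lt;/p&gt;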
&lt;h2 id="sha-256-hashing-the-idempotency-trick"&gt;SHA-256 Hashing: The Idempotency Trick&lt;/h2&gt;
&lt;p&gt;The system also needs to be smart about re-uploads. If you fix a typo in one document and run ingestion again, you don&amp;rsquo;t want the system to waste 10 minutes re-embedding every other file that hasn&amp;rsquo;t changed.&lt;/p&gt;
&lt;p&gt;CogniVault achieves &lt;strong&gt;Idempotency&lt;/strong&gt; (the ability to run the same operation multiple times without changing the result beyond the initial application) with the workflow&amp;rsquo;s very first step: it scans the &lt;code&gt;docs/&lt;/code&gt; folder and generates a &lt;strong&gt;SHA-256 hash&lt;/strong&gt; (a unique digital fingerprint) for every file.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the hash is new, it processes the file.&lt;/li&gt;
&lt;li&gt;If the hash has changed (because you edited the file), it soft-deletes the old text chunks and only re-embeds the new version.&lt;/li&gt;
&lt;li&gt;If the hash is identical, it skips the file entirely.&lt;/li&gt;
&lt;/ul&gt;
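&lt;p&gt;The three cases above map onto a small amount of code. The sketch below uses only the Python standard library; &lt;code&gt;sha256_of&lt;/code&gt;, &lt;code&gt;classify&lt;/code&gt;, and the &lt;code&gt;known_hashes&lt;/code&gt; dictionary are hypothetical names for this illustration, and how CogniVault actually stores the previous hashes is not shown here.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB blocks so large documents never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def classify(folder, known_hashes):
    """Label every file in `folder` as new, changed, or unchanged.

    `known_hashes` maps a file name to the hash stored on the previous run.
    """
    decisions = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        current = sha256_of(path)
        previous = known_hashes.get(path.name)
        if previous is None:
            decisions[path.name] = "new"
        elif previous != current:
            decisions[path.name] = "changed"
        else:
            decisions[path.name] = "unchanged"
    return decisions
```

&lt;p&gt;Only files labelled &lt;code&gt;new&lt;/code&gt; or &lt;code&gt;changed&lt;/code&gt; would move on to extraction and embedding; everything else is skipped.&lt;/p&gt;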
&lt;p&gt;Here is how the whole pipeline flows:&lt;/p&gt;
&lt;div class="mermaid"&gt;graph TD
Raw[📄 Uploaded Document] --&gt; DBOS[🐘 DBOS Workflow Starts]
subgraph Durable Ingestion Pipeline
DBOS --&gt;|Step 1| Hash{Hash Check SHA-256}
Hash --&gt;|Unchanged| Skip[Skip Processing]
Hash --&gt;|New / Changed| Extract[✂️ Step 2: Extract Text per Document]
Extract --&gt; Chunk[Chunk: 1000 chars, 100 overlap]
Chunk --&gt;|Step 3, batches of 5| Embed[🔢 embeddinggemma Embeddings]
Embed --&gt;|Step 4| Save[(💾 FAISS Index + Metadata JSON)]
end
Save --&gt;|Workflow Complete| Done[✅ Ready for Search]
&lt;/div&gt;
&lt;p&gt;(A detail for the curious: the checkpointed &lt;em&gt;steps&lt;/em&gt; are the scan, the per-document extraction, each embedding batch, and the save. The chunking in between is fast pure-Python work, so it simply re-runs as part of the workflow body — checkpointing it would cost more than redoing it.)&lt;/p&gt;
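&lt;p&gt;For reference, the chunking and batching numbers from the diagram (1000 characters, 100 overlap, batches of 5) could look like the sketch below. It assumes plain character-based windows; the function names are hypothetical and the real splitter may treat sentence or page boundaries differently.&lt;/p&gt;

```python
def chunk_text(text, size=1000, overlap=100):
    """Split text into windows of `size` characters; neighbours share `overlap`."""
    # range(size) holds exactly the valid overlaps: 0 up to size - 1.
    if overlap not in range(size):
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]


def batches(items, batch_size=5):
    """Group items into lists of at most `batch_size`, one list per embedding call."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

&lt;p&gt;Because both functions are deterministic and cheap, re-running them after a crash produces the same chunks and the same batches, which is why only the embedding calls themselves need a checkpoint.&lt;/p&gt;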
&lt;hr&gt;
&lt;h3 id="whats-next"&gt;What&amp;rsquo;s Next?&lt;/h3&gt;
&lt;p&gt;By wrapping the ingestion pipeline in DBOS, the system transforms from a fragile script into a resilient, production-grade state machine.&lt;/p&gt;
&lt;p&gt;Now that our data is safely ingested, how do we deploy this entire pipeline without melting our laptop&amp;rsquo;s GPU?
&lt;strong&gt;Read Part 3: Why We Keep Ollama Out of Docker&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;You can also explore the DBOS implementation directly in the &lt;code&gt;backend/services/ingest.py&lt;/code&gt; file of the project repository.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database-Oriented Operating System&lt;/td&gt;
&lt;td&gt;A library that checkpoints workflow steps in a database so crashed jobs resume instead of restarting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SHA-256&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure Hash Algorithm, 256-bit&lt;/td&gt;
&lt;td&gt;A fingerprint function: any file maps to a 64-character hexadecimal hash that is unique for all practical purposes; change one byte and the hash changes completely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Portable Document Format&lt;/td&gt;
&lt;td&gt;The document format whose text (and scans) the pipeline extracts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;Meta&amp;rsquo;s vector-search library — where the embeddings actually live&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The text format used for the chunk-metadata file stored next to the FAISS index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Artificial Intelligence&lt;/td&gt;
&lt;td&gt;Software performing tasks that normally need human intelligence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graphics Processing Unit&lt;/td&gt;
&lt;td&gt;The hardware that makes local model inference fast — the subject of Part 3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;</description></item><item><title>Part 1 · CogniVault Architecture: Why Standard RAG Isn't Enough (Hybrid Search)</title><link>https://aretascodes.dev/blog/cognivault-retrieval-loop/</link><pubDate>Mon, 01 Jun 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/cognivault-retrieval-loop/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Vector search finds the items in a dataset whose vector embeddings are closest to the embedding of the query. It is the retrieval step most RAG systems rely on.
But what happens when a match has to depend not only on semantic meaning but also on the exact wording of the query?&lt;/p&gt;
&lt;p&gt;This becomes critical when the information you&amp;rsquo;re looking for isn&amp;rsquo;t just related but must match a specific string or keyword exactly.&lt;/p&gt;
&lt;h2 id="two-ways-of-finding-a-book"&gt;Two ways of finding a book&lt;/h2&gt;
&lt;p&gt;Picture a good local bookshop. The owner has read everything, and she recommends by &lt;em&gt;feel&lt;/em&gt;. Tell her you loved &lt;em&gt;The Martian&lt;/em&gt; and she hands you &lt;em&gt;Project Hail Mary&lt;/em&gt; — different title, different plot, but the same DNA: a lone scientist, an impossible survival problem, jokes under pressure. Ask for &amp;ldquo;something like &lt;em&gt;Pride and Prejudice&lt;/em&gt;&amp;rdquo; and you&amp;rsquo;ll walk out with &lt;em&gt;Emma&lt;/em&gt;. She isn&amp;rsquo;t matching words. She&amp;rsquo;s matching &lt;em&gt;meaning&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Now ask her a different kind of question: &amp;ldquo;I need the book with ISBN 978-0-553-41802-6,&amp;rdquo; or &amp;ldquo;the manual that mentions error code 404B on the cover.&amp;rdquo; Her superpower is useless here. No amount of literary intuition finds an exact string. For that, you walk to the till and check the &lt;strong&gt;catalogue&lt;/strong&gt; — a boring, literal index that knows exactly which shelf holds which identifier, and nothing about vibes.&lt;/p&gt;
&lt;p&gt;A well-run bookshop needs both. So does a well-run RAG system:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;FAISS — Facebook AI Similarity Search (the well-read owner):&lt;/strong&gt; a vector index that finds chunks of text whose &lt;em&gt;meaning&lt;/em&gt; is mathematically close to your prompt. Brilliant for &amp;ldquo;how is the practical exam structured?&amp;rdquo;, blind to &amp;ldquo;§3 Absatz 2&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BM25 — Best Match 25 (the catalogue):&lt;/strong&gt; a classic keyword-scoring algorithm that rewards exact word matches, weighted by how rare and distinctive those words are. Brilliant for identifiers and quoted phrases, blind to paraphrase.&lt;/li&gt;
&lt;/ol&gt;
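&lt;p&gt;To make the catalogue side concrete, here is a compact sketch of Okapi BM25 scoring in plain Python. It assumes whitespace tokenization and the commonly used parameters &lt;code&gt;k1=1.5&lt;/code&gt; and &lt;code&gt;b=0.75&lt;/code&gt;; those values, the function name, and the sample sentences are assumptions for illustration, not taken from the CogniVault code.&lt;/p&gt;

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    total = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / total
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for word in query.lower().split():
            freq = counts[word]
            if freq == 0:
                continue
            # Rare words earn a higher inverse document frequency.
            idf = math.log((total - doc_freq[word] + 0.5) / (doc_freq[word] + 0.5) + 1)
            # Longer-than-average documents are penalized through b.
            norm = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores
```

&lt;p&gt;A document that contains none of the query words scores exactly zero, no matter how closely related its meaning is. That is the blind spot the vector index covers.&lt;/p&gt;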
&lt;p&gt;CogniVault runs &lt;strong&gt;both&lt;/strong&gt; retrievers on every search — this is &lt;strong&gt;Hybrid Search&lt;/strong&gt; — and then merges the two ranked lists with a formula called &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;. RRF scores each chunk purely by its &lt;em&gt;position&lt;/em&gt; in each list: a chunk ranked highly by either retriever scores well, and a chunk both retrievers agree on rises to the top. Because only ranks are used, the two retrievers&amp;rsquo; incompatible scoring scales never have to be reconciled.&lt;/p&gt;
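&lt;p&gt;Reciprocal Rank Fusion as described above fits in a handful of lines. The sketch below assumes each retriever returns a list of chunk ids ordered best first, and it uses &lt;code&gt;k=60&lt;/code&gt;, a widely used default that is an assumption here rather than a confirmed CogniVault setting. The chunk ids are made up.&lt;/p&gt;

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of chunk ids using score = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the chunk id for stable output.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


if __name__ == "__main__":
    faiss_ranking = ["c7", "c2", "c9"]   # semantic retriever, best first
    bm25_ranking = ["c2", "c4", "c7"]    # keyword retriever, best first
    print(reciprocal_rank_fusion([faiss_ranking, bm25_ranking]))
```

&lt;p&gt;In this example &lt;code&gt;c2&lt;/code&gt; and &lt;code&gt;c7&lt;/code&gt; appear in both lists, so they finish ahead of &lt;code&gt;c4&lt;/code&gt; and &lt;code&gt;c9&lt;/code&gt;, which only one retriever found. No raw FAISS distance or BM25 score enters the calculation.&lt;/p&gt;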
&lt;h2 id="the-agent-decides-when-to-search"&gt;The agent decides when to search&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s the part most diagrams get backwards (mine included, in an earlier draft): retrieval doesn&amp;rsquo;t happen &lt;em&gt;before&lt;/em&gt; the model gets involved. It happens &lt;em&gt;inside&lt;/em&gt; the model&amp;rsquo;s own loop.&lt;/p&gt;
&lt;p&gt;CogniVault wraps Gemma in the &lt;strong&gt;Strands Agents SDK&lt;/strong&gt;. The model receives your question along with a set of &lt;strong&gt;Tools&lt;/strong&gt; (pre-written Python functions like &lt;code&gt;search_knowledge_base&lt;/code&gt;, &lt;code&gt;calculator&lt;/code&gt;, or &lt;code&gt;compare_documents&lt;/code&gt;). It then reasons about the question and &lt;em&gt;decides for itself&lt;/em&gt; whether — and which — tools to call. For most document questions it calls &lt;code&gt;search_knowledge_base&lt;/code&gt;, reads the retrieved chunks, and only then writes its answer, grounded in what it found.&lt;/p&gt;
&lt;p&gt;Here is the architectural blueprint of that loop:&lt;/p&gt;
&lt;div class="mermaid"&gt;graph TD
Client[📱 User Query] --&gt; App[🖥️ FastAPI Server]
subgraph AgentLoop["The Strands Agent Loop (powered by Gemma 4)"]
App --&gt; Agent[🧠 Agent reasons about the question]
Agent --&gt;|Decides to search| Search[search_knowledge_base]
subgraph Hybrid Search Engine
Search --&gt;|Semantic| FAISS[(FAISS Vector)]
Search --&gt;|Exact match| BM25[(BM25 Keyword)]
FAISS --&gt; RRF{RRF Fusion}
BM25 --&gt; RRF
end
RRF --&gt;|Best chunks + citations| Agent
Agent --&gt;|Grounded answer| Answer[Streamed response]
end
Answer --&gt; Client
&lt;/div&gt;
&lt;p&gt;One subtlety worth noting: the agent &lt;em&gt;is&lt;/em&gt; Gemma. There is no separate &amp;ldquo;formatting model&amp;rdquo; at the end — the same model that decided to search also writes the final answer, now with the retrieved chunks in front of it.&lt;/p&gt;
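&lt;p&gt;The shape of that loop can be shown without any SDK at all. The sketch below is a toy stand-in, not the Strands Agents API: &lt;code&gt;fake_model&lt;/code&gt; plays the role of Gemma with hard-coded rules, the tool bodies return invented sample data, and only the tool names &lt;code&gt;search_knowledge_base&lt;/code&gt; and &lt;code&gt;calculator&lt;/code&gt; come from the description above.&lt;/p&gt;

```python
def search_knowledge_base(query):
    """Stand-in for the hybrid search tool; returns an invented sample chunk."""
    return "[doc.pdf p.3] The practical exam lasts 90 minutes."


def calculator(expression):
    """Stand-in arithmetic tool limited to 'a + b' style input."""
    left, right = expression.split("+")
    return str(float(left) + float(right))


TOOLS = {"search_knowledge_base": search_knowledge_base, "calculator": calculator}


def fake_model(question, observations):
    """Plays the role of the LLM: either request a tool or give the final answer."""
    if observations:
        return {"answer": "Based on " + observations[-1]}
    if "+" in question:
        return {"tool": "calculator", "input": question}
    return {"tool": "search_knowledge_base", "input": question}


def run_agent(question, model=fake_model, max_turns=4):
    """Let the model choose tools until it produces an answer."""
    observations = []
    for _ in range(max_turns):
        decision = model(question, observations)
        if "answer" in decision:
            return decision["answer"]
        # The model asked for a tool: run it and feed the result back in.
        tool = TOOLS[decision["tool"]]
        observations.append(tool(decision["input"]))
    return "No answer within the turn limit."
```

&lt;p&gt;The point to notice is that the same &lt;code&gt;model&lt;/code&gt; callable is consulted both before and after the tool runs, which is the single-model behaviour described above.&lt;/p&gt;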
&lt;hr&gt;
&lt;h3 id="whats-next"&gt;What&amp;rsquo;s Next?&lt;/h3&gt;
&lt;p&gt;Building a toy RAG app is easy, but building one that actually retrieves the exact document you need requires hybrid engines and an agent that knows when to use them.&lt;/p&gt;
&lt;p&gt;Want to see how this system safely ingests massive documents without losing work when something crashes?
&lt;strong&gt;Read Part 2: Durable Ingestion with DBOS&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Or, if you prefer to jump straight into the code, the hybrid search lives in &lt;code&gt;backend/services/vector_db.py&lt;/code&gt; of the project repository.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;Meta&amp;rsquo;s library for storing vectors and finding the most similar ones fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;A keyword-ranking formula — the 25th ranking function developed in the Okapi information-retrieval system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RRF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reciprocal Rank Fusion&lt;/td&gt;
&lt;td&gt;A formula that merges multiple ranked lists using only each item&amp;rsquo;s rank: &lt;code&gt;score = Σ 1/(k + rank)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large Language Model&lt;/td&gt;
&lt;td&gt;A neural network trained on huge amounts of text that can read and generate language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Software Development Kit&lt;/td&gt;
&lt;td&gt;A library of building blocks — here, Strands, which provides the agent loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application Programming Interface&lt;/td&gt;
&lt;td&gt;The set of URLs the frontend calls to talk to the backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ISBN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;International Standard Book Number&lt;/td&gt;
&lt;td&gt;The unique identifier assigned to each edition of a published book, and the catalogue&amp;rsquo;s best friend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;</description></item></channel></rss>