<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PostgreSQL |</title><link>https://aretascodes.dev/tags/postgresql/</link><atom:link href="https://aretascodes.dev/tags/postgresql/index.xml" rel="self" type="application/rss+xml"/><description>PostgreSQL</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Tue, 02 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://aretascodes.dev/media/icon_hu_2ab4f4763b27c75b.png</url><title>PostgreSQL</title><link>https://aretascodes.dev/tags/postgresql/</link></image><item><title>Part 2 · CogniVault Architecture: Durable Ingestion with DBOS</title><link>https://aretascodes.dev/blog/cognivault-ingestion-pipeline/</link><pubDate>Tue, 02 Jun 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/cognivault-ingestion-pipeline/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In a basic local AI setup, adding documents to your database is usually just a simple Python script. You open a PDF, chop the text into chunks, turn those chunks into math (embeddings), and save them.&lt;/p&gt;
&lt;p&gt;This works great for a five-page essay. But what happens when you are ingesting a 1,000-page technical manual and your laptop goes to sleep at page 800?&lt;/p&gt;
&lt;p&gt;The script dies. When you wake your laptop up, you have to start all over from page 1, wasting time and compute power. A simple script wasn&amp;rsquo;t going to cut it for CogniVault. We needed a &lt;strong&gt;Durable Workflow&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="the-factory-ledger-dbos"&gt;The Factory Ledger (DBOS)&lt;/h2&gt;
&lt;p&gt;Think of data ingestion like a factory assembly line. If the power goes out, the workers shouldn&amp;rsquo;t have to rebuild every product from scratch. They should just look at a permanent ledger, see exactly which box they were packing when the lights went out, and resume from there.&lt;/p&gt;
&lt;p&gt;CogniVault uses a framework called &lt;strong&gt;DBOS (Database-Oriented Operating System)&lt;/strong&gt; backed by a PostgreSQL database to act as this ledger.&lt;/p&gt;
&lt;p&gt;Every step of the ingestion process records its completion in Postgres. If the server crashes mid-way, nothing dramatic happens in the moment — the magic is on restart: DBOS reads the ledger, sees which steps already finished, replays their recorded results instantly, and resumes from the first unfinished step.&lt;/p&gt;
&lt;p&gt;One important boundary: Postgres holds &lt;strong&gt;only the ledger&lt;/strong&gt; — which steps ran and what they returned. Your documents, chunks, and vectors never live there. They go to a FAISS index plus a JSON metadata file on disk.&lt;/p&gt;
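&lt;p&gt;The ledger idea can be sketched in a few lines of plain Python. This is a toy illustration, not how DBOS is implemented: the &lt;code&gt;ledger&lt;/code&gt; dict stands in for the Postgres table, and &lt;code&gt;run_step&lt;/code&gt;, &lt;code&gt;pipeline&lt;/code&gt;, and the step names are hypothetical.&lt;/p&gt;

```python
# Toy checkpoint ledger: a dict stands in for the Postgres table.
# run_step, pipeline and the step names are hypothetical, for illustration only.
calls = []  # records which steps really executed


def run_step(ledger, name, func):
    """Return the recorded result if the step finished, else run and record it."""
    if name in ledger:
        return ledger[name]  # replay: no work is redone
    result = func()
    ledger[name] = result  # checkpoint the result
    return result


def pipeline(ledger, crash_after=None):
    def step(name, func):
        value = run_step(ledger, name, func)
        if crash_after == name:
            raise RuntimeError("simulated crash after " + name)
        return value

    step("extract", lambda: calls.append("extract") or "some text")
    vector = step("embed", lambda: calls.append("embed") or [0.1, 0.2])
    return step("save", lambda: calls.append("save") or len(vector))


ledger = {}
try:
    pipeline(ledger, crash_after="extract")  # first run dies after step 1
except RuntimeError:
    pass
result = pipeline(ledger)  # the restart replays "extract" from the ledger
```

&lt;p&gt;After the restart, &lt;code&gt;calls&lt;/code&gt; contains each step name exactly once: the extraction ran only during the first attempt, and the second attempt read its result back from the ledger.&lt;/p&gt;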
&lt;h2 id="sha-256-hashing-the-idempotency-trick"&gt;SHA-256 Hashing: The Idempotency Trick&lt;/h2&gt;
&lt;p&gt;The system also needs to be smart about re-runs. If you fix a typo in one document and run ingestion again, you don&amp;rsquo;t want the system to waste 10 minutes re-embedding every other file that hasn&amp;rsquo;t changed.&lt;/p&gt;
&lt;p&gt;CogniVault achieves &lt;strong&gt;Idempotency&lt;/strong&gt; (the ability to run the same operation multiple times without changing the result beyond the initial application) with the workflow&amp;rsquo;s very first step: it scans the &lt;code&gt;docs/&lt;/code&gt; folder and generates a &lt;strong&gt;SHA-256 hash&lt;/strong&gt; (a unique digital fingerprint) for every file.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the hash is new, it processes the file.&lt;/li&gt;
&lt;li&gt;If the hash has changed (because you edited the file), it soft-deletes the old text chunks and only re-embeds the new version.&lt;/li&gt;
&lt;li&gt;If the hash is identical, it skips the file entirely.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The diagram below shows how this flows:&lt;/p&gt;
&lt;div class="mermaid"&gt;graph TD
Raw[📄 Uploaded Document] --&gt; DBOS[🐘 DBOS Workflow Starts]
subgraph Durable Ingestion Pipeline
DBOS --&gt;|Step 1| Hash{Hash Check SHA-256}
Hash --&gt;|Unchanged| Skip[Skip Processing]
Hash --&gt;|New / Changed| Extract[✂️ Step 2: Extract Text per Document]
Extract --&gt; Chunk[Chunk: 1000 chars, 100 overlap]
Chunk --&gt;|Step 3, batches of 5| Embed[🔢 embeddinggemma Embeddings]
Embed --&gt;|Step 4| Save[(💾 FAISS Index + Metadata JSON)]
end
Save --&gt;|Workflow Complete| Done[✅ Ready for Search]
&lt;/div&gt;
&lt;p&gt;(A detail for the curious: the checkpointed &lt;em&gt;steps&lt;/em&gt; are the scan, the per-document extraction, each embedding batch, and the save. The chunking in between is fast pure-Python work, so it simply re-runs as part of the workflow body — checkpointing it would cost more than redoing it.)&lt;/p&gt;
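&lt;p&gt;To make the &amp;ldquo;1000 chars, 100 overlap&amp;rdquo; setting concrete, here is a minimal fixed-window sketch. It is a simplification: the real pipeline uses a recursive splitter that prefers natural boundaries, whereas the hypothetical &lt;code&gt;chunk_text&lt;/code&gt; below cuts purely by character count.&lt;/p&gt;

```python
def chunk_text(text, size=1000, overlap=100):
    """Split text into windows of `size` chars; each shares `overlap` chars with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)
```

&lt;p&gt;The overlap means the last 100 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole by at least one chunk.&lt;/p&gt;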
&lt;hr&gt;
&lt;h3 id="whats-next"&gt;What&amp;rsquo;s Next?&lt;/h3&gt;
&lt;p&gt;By wrapping the ingestion pipeline in DBOS, the system transforms from a fragile script into a resilient, production-grade state machine.&lt;/p&gt;
&lt;p&gt;Now that our data is safely ingested, how do we deploy this entire pipeline without melting our laptop&amp;rsquo;s GPU?
&lt;strong&gt;Read Part 3: Why We Keep Ollama Out of Docker&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;You can also explore the DBOS implementation directly in the &lt;code&gt;backend/services/ingest.py&lt;/code&gt; file.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database-Oriented Operating System&lt;/td&gt;
&lt;td&gt;A library that checkpoints workflow steps in a database so crashed jobs resume instead of restarting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SHA-256&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure Hash Algorithm, 256-bit&lt;/td&gt;
&lt;td&gt;A fingerprint function: any file maps to a 64-character hexadecimal hash that is unique for all practical purposes; change one byte and the hash changes completely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Portable Document Format&lt;/td&gt;
&lt;td&gt;The document format whose text (and scans) the pipeline extracts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;Meta&amp;rsquo;s vector-search library — where the embeddings actually live&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The text format used for the chunk-metadata file stored next to the FAISS index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Artificial Intelligence&lt;/td&gt;
&lt;td&gt;Software performing tasks that normally need human intelligence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graphics Processing Unit&lt;/td&gt;
&lt;td&gt;The hardware that makes local model inference fast — the subject of Part 3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;</description></item><item><title>Part 4 · Crash-Resumable Ingestion: DBOS, SHA-256, and Surviving a kill -9</title><link>https://aretascodes.dev/blog/crash-resumable-ingestion-dbos/</link><pubDate>Tue, 05 May 2026 00:00:00 +0000</pubDate><guid>https://aretascodes.dev/blog/crash-resumable-ingestion-dbos/</guid><description>
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;Part of a series on building CogniVault.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;All abbreviations are fully explained in the appendix at the bottom of the page.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There are two things you absolutely don&amp;rsquo;t want your RAG ingestion pipeline to do:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Re-embed an entire document library because you fixed a typo on page 12 of one PDF.&lt;/li&gt;
&lt;li&gt;Lose its progress if you close the laptop lid halfway through.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The first wastes time and compute resources. The second leads to distrust in the system. Both have the same root: ingestion is treated like a fire-and-forget function, when it&amp;rsquo;s actually a long-running pipeline with intermediate state worth preserving.&lt;/p&gt;
&lt;p&gt;CogniVault treats ingestion as a &lt;strong&gt;durable workflow&lt;/strong&gt;: specifically, a DBOS workflow checkpointed in Postgres, with content hashing for incremental work. This post walks through both pieces.&lt;/p&gt;
&lt;h2 id="the-pipeline"&gt;The pipeline&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;1. Scan docs/ → SHA-256 hash per file
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ├── New file → queue for embedding
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; ├── Changed file → soft-delete old chunks, re-embed
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; └── Unchanged → skip (idempotent)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;2. Extract text → per-format extractor (PDF/OCR, DOCX, PPTX, XLSX, MD, CSV, TXT, HTML)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;3. Chunk → RecursiveCharacterTextSplitter (1000 chars, 100 overlap)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;4. Embed → embeddinggemma via Ollama, batches of 5
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;5. Save → append to FAISS IndexFlatIP + JSON metadata on disk
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The heavy stages run as DBOS steps inside one parent workflow, each one checkpointed: if the process dies between steps, the next start resumes right after the last completed one.&lt;/p&gt;
&lt;h2 id="sha-256-as-the-source-of-truth"&gt;SHA-256 as the source of truth&lt;/h2&gt;
&lt;p&gt;The naive approach is to track ingestion by filename. That breaks the first time someone edits a file in place. Filename is the same; content isn&amp;rsquo;t. The vector store quietly carries stale chunks.&lt;/p&gt;
&lt;p&gt;The fix is content-addressed: hash the file bytes, store the hash alongside the chunks. Every ingestion run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;stored_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_metadata_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;file_hash&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stored_hash&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;schedule_ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# new file&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;stored_hash&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# unchanged&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;soft_delete_chunks_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# changed&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;schedule_ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This gives ingestion an &lt;strong&gt;idempotent&lt;/strong&gt; property that&amp;rsquo;s worth its weight in gold: running the pipeline twice in a row does almost nothing the second time. That&amp;rsquo;s not just an optimisation — it&amp;rsquo;s what makes the next section possible.&lt;/p&gt;
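&lt;p&gt;The snippet above leans on helpers that aren&amp;rsquo;t shown. A self-contained sketch of the same decision, under stated assumptions, might look like this: &lt;code&gt;file_sha256&lt;/code&gt; and &lt;code&gt;classify&lt;/code&gt; are hypothetical names, and a plain &lt;code&gt;stored_hashes&lt;/code&gt; dict (filename to hash) stands in for the lookup in the chunk metadata.&lt;/p&gt;

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """Hash a file's bytes in 64 KiB blocks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def classify(folder, stored_hashes):
    """Map each filename in `folder` to 'new', 'changed' or 'unchanged'."""
    plan = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        stored = stored_hashes.get(path.name)
        current = file_sha256(path)
        if stored is None:
            plan[path.name] = "new"
        elif stored == current:
            plan[path.name] = "unchanged"
        else:
            plan[path.name] = "changed"
    return plan


with tempfile.TemporaryDirectory() as docs:
    Path(docs, "a.txt").write_text("alpha")
    Path(docs, "b.txt").write_text("beta")
    first = classify(docs, {})  # nothing stored yet: everything is new
    stored = {name: file_sha256(Path(docs, name)) for name in first}
    Path(docs, "b.txt").write_text("beta, with the typo fixed")
    Path(docs, "c.txt").write_text("gamma")
    second = classify(docs, stored)  # only b.txt and c.txt need work
```

&lt;p&gt;Note that the hash covers the whole file, so an edit anywhere in a document marks that entire document as changed; the saving comes from every other file being skipped.&lt;/p&gt;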
&lt;h2 id="dbos-workflows"&gt;DBOS workflows&lt;/h2&gt;
&lt;p&gt;DBOS is a Python library that turns regular functions into checkpointed workflows backed by Postgres. The model is dead simple: decorate a function with &lt;code&gt;@DBOS.workflow()&lt;/code&gt;, mark each long-running call inside it as a &lt;code&gt;@DBOS.step()&lt;/code&gt;, and DBOS records the workflow&amp;rsquo;s inputs and each step&amp;rsquo;s output and status in Postgres as it runs.&lt;/p&gt;
&lt;p&gt;If the workflow crashes — process killed, OS reboot, Postgres connection drop — the next start sees there&amp;rsquo;s an unfinished workflow with the same ID, replays the &lt;em&gt;recorded&lt;/em&gt; step outputs from Postgres (without re-running them), and resumes from the first incomplete step.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the actual step structure (slightly simplified from &lt;code&gt;backend/services/ingest.py&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nd"&gt;@DBOS.workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ingest_workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;filenames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;list_document_files&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# @DBOS.step — scan + hash check&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;process_single_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# @DBOS.step — extract text, one file each&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# plain Python — fast, re-runs freely&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batches_of_5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;embed_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# @DBOS.step — the slow one, retried on failure&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;save_vector_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# @DBOS.step — append to FAISS + metadata&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The granularity of &lt;code&gt;@DBOS.step&lt;/code&gt; is the granularity of crash recovery, and it&amp;rsquo;s chosen deliberately. Extraction is one step &lt;strong&gt;per file&lt;/strong&gt;, so a crash during file 9 of 10 doesn&amp;rsquo;t re-read the first eight. Embedding is one step &lt;strong&gt;per batch of five chunks&lt;/strong&gt;, for one specific reason: &lt;strong&gt;&lt;code&gt;embed_batch&lt;/code&gt; is the slow one.&lt;/strong&gt; If the laptop dies during embeddings, we resume the embedding loop at the failed batch, not at PDF extraction.&lt;/p&gt;
&lt;p&gt;Notice what &lt;em&gt;isn&amp;rsquo;t&lt;/em&gt; a step: chunking. Splitting text is fast pure-Python work — checkpointing it would cost more ledger bookkeeping than simply redoing it on a resume.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a related sizing trick hiding in the batch number. DBOS records each step&amp;rsquo;s output in Postgres, and &lt;code&gt;embed_batch&lt;/code&gt; returns its vectors — so each ledger entry contains five embeddings&amp;rsquo; worth of floats. Small batches keep each checkpoint record small and each retry cheap. One giant &amp;ldquo;embed everything&amp;rdquo; step would mean one giant ledger row and zero resume granularity.&lt;/p&gt;
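&lt;p&gt;A rough way to reason about that trade-off is to batch the chunks and measure what one batch of vectors would weigh. The sketch below is an estimate under loud assumptions: &lt;code&gt;batches_of&lt;/code&gt; is a hypothetical helper, the vector width of 768 is assumed rather than taken from the project, and the vectors are measured as JSON text, which may not match how DBOS actually serializes step outputs.&lt;/p&gt;

```python
import json


def batches_of(items, size=5):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


EMBEDDING_DIM = 768  # assumed vector width; check your model's actual output size

chunks = ["chunk %d" % i for i in range(12)]
batches = list(batches_of(chunks))

# Rough size of one checkpoint record if the vectors were stored as JSON text.
fake_vectors = [[0.123456] * EMBEDDING_DIM for _ in batches[0]]
record_bytes = len(json.dumps(fake_vectors).encode("utf-8"))
```

&lt;p&gt;Doubling the batch size roughly doubles &lt;code&gt;record_bytes&lt;/code&gt; and halves the number of resume points, which is the tension the batch number is balancing.&lt;/p&gt;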
&lt;h2 id="the-format-extractors"&gt;The format extractors&lt;/h2&gt;
&lt;p&gt;Step 2 (&lt;code&gt;process_single_document&lt;/code&gt;) is a dispatch on file extension. Each extractor is small and obvious; the interesting choices are in the chunking strategy each one feeds downstream.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Chunking note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pypdf&lt;/code&gt; page-by-page; &lt;code&gt;pytesseract&lt;/code&gt; OCR fallback for image-only pages&lt;/td&gt;
&lt;td&gt;Recursive splitter, 1000/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DOCX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;python-docx&lt;/code&gt; (paragraphs + table rows joined as text)&lt;/td&gt;
&lt;td&gt;Recursive splitter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PPTX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;python-pptx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One chunk per slide (title + body text)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XLSX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;openpyxl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Header + 20-row batches, per sheet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MarkdownHeaderTextSplitter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One chunk per H1/H2/H3 section, breadcrumb prepended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;manual reader&lt;/td&gt;
&lt;td&gt;Header row + 20-row batches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TXT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;raw UTF-8 read&lt;/td&gt;
&lt;td&gt;Recursive splitter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trafilatura&lt;/code&gt; clean text&lt;/td&gt;
&lt;td&gt;Recursive splitter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The OCR fallback is the one worth pausing on. PDFs come in two flavours: ones with a real text layer, and ones that are basically scanned images wearing a PDF costume. &lt;code&gt;pypdf&lt;/code&gt; returns &lt;em&gt;nothing useful&lt;/em&gt; for the second kind, but it doesn&amp;rsquo;t raise — it just hands back empty strings. Without a fallback, your &amp;ldquo;ingestion succeeded&amp;rdquo; log is lying to you.&lt;/p&gt;
&lt;p&gt;The detector is a heuristic: if &lt;code&gt;pypdf&lt;/code&gt; returns fewer than 50 characters for a page, route the page through &lt;code&gt;pymupdf&lt;/code&gt; → &lt;code&gt;Pillow&lt;/code&gt; → &lt;code&gt;pytesseract&lt;/code&gt; OCR. Slower, but at least produces text. The threshold is tuned to be sensitive enough to catch scanned pages while not punishing legitimately short pages (a chapter cover, a colophon).&lt;/p&gt;
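&lt;p&gt;The routing decision itself is tiny. Here is a minimal sketch, with the OCR chain replaced by a stub so the example runs without any imaging libraries; &lt;code&gt;needs_ocr&lt;/code&gt;, &lt;code&gt;extract_page&lt;/code&gt;, and &lt;code&gt;fake_ocr&lt;/code&gt; are hypothetical names, not the project&amp;rsquo;s own.&lt;/p&gt;

```python
MIN_TEXT_CHARS = 50  # threshold described in the text above


def needs_ocr(page_text, min_chars=MIN_TEXT_CHARS):
    """True when the text layer is too thin to trust, e.g. a scanned page."""
    return min_chars > len((page_text or "").strip())


def extract_page(page_text, ocr):
    """Use the text layer when it is usable, else fall back to the `ocr` callable."""
    if needs_ocr(page_text):
        return ocr(), "ocr"
    return page_text, "text-layer"


def fake_ocr():
    # Stand-in for the real pymupdf / Pillow / pytesseract chain.
    return "text recovered from the page image"


scanned = extract_page("", fake_ocr)
normal = extract_page(
    "A page with a real text layer, comfortably over fifty characters.", fake_ocr
)
```

&lt;p&gt;Returning the route alongside the text makes it easy to log how many pages went through OCR, which helps when explaining why one document took much longer than another.&lt;/p&gt;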
&lt;h2 id="soft-delete-not-hard-delete"&gt;Soft delete, not hard delete&lt;/h2&gt;
&lt;p&gt;When a file changes and we re-ingest, the old chunks need to go. The temptation is to physically remove them from the FAISS index, but FAISS &lt;code&gt;IndexFlatIP&lt;/code&gt; doesn&amp;rsquo;t support efficient delete — you&amp;rsquo;d have to rebuild.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Soft delete&lt;/strong&gt; instead: changed files get their old chunks marked with a &lt;code&gt;deleted: true&lt;/code&gt; flag in the metadata; new chunks are appended without it. Search filters on the flag at query time, so stale vectors sit harmlessly in the index. If enough dead weight ever accumulates, the escape valve is obvious — rebuild the index from active chunks only — but in practice I haven&amp;rsquo;t needed it.&lt;/p&gt;
&lt;p&gt;This is the same pattern most append-only systems use. It pairs naturally with content hashing — flag-and-append is much cheaper than remove-and-rebuild. One subtlety: the keyword index has to follow suit. CogniVault&amp;rsquo;s &lt;code&gt;VectorDB.delete_by_source()&lt;/code&gt; flips the flags &lt;strong&gt;and rebuilds BM25&lt;/strong&gt; over the remaining active chunks, so the two retrievers never disagree about what exists.&lt;/p&gt;
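&lt;p&gt;The flag-and-filter pattern can be sketched without FAISS at all. In this simplified illustration the metadata is a list whose positions mirror the vector rows in the index; &lt;code&gt;soft_delete&lt;/code&gt; and &lt;code&gt;active_ids&lt;/code&gt; are hypothetical helpers, and the BM25 rebuild is left out.&lt;/p&gt;

```python
def soft_delete(metadata, source):
    """Flag every active chunk from `source` as deleted; return how many were flagged."""
    flagged = 0
    for entry in metadata:
        if entry["source"] == source and not entry.get("deleted"):
            entry["deleted"] = True
            flagged += 1
    return flagged


def active_ids(metadata, candidate_ids):
    """Drop search hits whose metadata carries the deleted flag."""
    return [i for i in candidate_ids if not metadata[i].get("deleted")]


# Position in this list mirrors the vector's row in the index.
metadata = [
    {"source": "manual.pdf", "text": "old intro"},
    {"source": "notes.md", "text": "meeting notes"},
    {"source": "manual.pdf", "text": "old appendix"},
]

flagged = soft_delete(metadata, "manual.pdf")
metadata.append({"source": "manual.pdf", "text": "new intro"})  # re-ingested chunk
hits = active_ids(metadata, [0, 1, 2, 3])  # pretend the index returned all four rows
```

&lt;p&gt;Because filtering happens after retrieval, a query may need to ask the index for a few more neighbours than it wants to return, so that enough active chunks survive the filter.&lt;/p&gt;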
&lt;h2 id="what-the-user-sees"&gt;What the user sees&lt;/h2&gt;
&lt;p&gt;Starting an ingestion (&lt;code&gt;POST /ingest&lt;/code&gt;) returns a &lt;code&gt;workflow_id&lt;/code&gt;, and the frontend polls &lt;code&gt;GET /ingest/status/{workflow_id}&lt;/code&gt; to draw a live timeline of the workflow&amp;rsquo;s steps: scanning, per-file extraction (&amp;ldquo;Reading pages… 3 of 21&amp;rdquo;), embedding (&amp;ldquo;Calibrating batch 4 of 12&amp;rdquo;), saving. If the user closes the tab mid-ingest and reopens it five minutes later, the workflow has carried on in the background regardless. The next call to &lt;code&gt;GET /api/vault/stats&lt;/code&gt; reflects the new chunk count. No &amp;ldquo;click to resume&amp;rdquo; button, no manual recovery dance.&lt;/p&gt;
&lt;p&gt;The first time I closed the lid mid-embedding and watched the workflow pick itself up from the next step on resume, I&amp;rsquo;ll admit I was a little smug. That&amp;rsquo;s exactly the property I wanted, with surprisingly little code.&lt;/p&gt;
&lt;h2 id="pitfalls-and-edges"&gt;Pitfalls and edges&lt;/h2&gt;
&lt;p&gt;A few things I had to learn the hard way:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Don&amp;rsquo;t make &lt;code&gt;embed_batch&lt;/code&gt; too big.&lt;/strong&gt; Ollama isn&amp;rsquo;t great at backpressure. Batches of 5 are a sweet spot for &lt;code&gt;embeddinggemma&lt;/code&gt; on a 16 GB machine — bigger batches stall on memory, smaller ones waste round-trip overhead. (And as noted above, the batch size doubles as your checkpoint-record size.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Be careful with file deletion.&lt;/strong&gt; Soft-deleted chunks must also disappear from BM25&amp;rsquo;s corpus, or keyword search will keep returning text that dense search no longer sees. Rebuilding BM25 inside &lt;code&gt;delete_by_source()&lt;/code&gt; keeps the two in lockstep.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OCR is slow.&lt;/strong&gt; A 50-page scan can take a minute or more. Surface that latency to the user; otherwise they think it&amp;rsquo;s hanging.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="takeaway"&gt;Takeaway&lt;/h2&gt;
&lt;p&gt;Durable workflows aren&amp;rsquo;t only for distributed systems. A single-user local app benefits from them in &lt;em&gt;exactly the same ways&lt;/em&gt;: incremental work, crash recovery, idempotent retries. DBOS makes the cost of opting in trivially low — decorate your function, run Postgres locally, and you get a pipeline that survives lid-closes, OS updates, and your own &lt;code&gt;Ctrl-C&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Combined with content-addressed hashing, ingestion stops being a thing you avoid touching for fear of having to wait 20 minutes. It becomes a thing you re-run whenever you feel like it — because re-running is free when nothing has changed.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="appendix-abbreviations-in-this-post"&gt;Appendix: Abbreviations in this post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Abbreviation&lt;/th&gt;
&lt;th&gt;Full form&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database-Oriented Operating System&lt;/td&gt;
&lt;td&gt;A library that checkpoints workflow steps in Postgres so crashed jobs resume instead of restarting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SHA-256&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure Hash Algorithm, 256-bit&lt;/td&gt;
&lt;td&gt;A content fingerprint: change one byte of a file and the hash changes completely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Generation&lt;/td&gt;
&lt;td&gt;Retrieve relevant passages from your own documents first; let the model answer from them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optical Character Recognition&lt;/td&gt;
&lt;td&gt;Turning pictures of text (scanned pages) into machine-readable text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FAISS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Facebook AI Similarity Search&lt;/td&gt;
&lt;td&gt;The vector index the embeddings are appended to&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IP&lt;/strong&gt; (in &lt;code&gt;IndexFlatIP&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Inner Product&lt;/td&gt;
&lt;td&gt;FAISS&amp;rsquo;s similarity measure; equals cosine similarity on normalised vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best Match 25&lt;/td&gt;
&lt;td&gt;The keyword index that must stay in lockstep with FAISS on deletes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF / DOCX / PPTX / XLSX / MD / CSV / TXT / HTML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Portable Document Format / Word / PowerPoint / Excel / Markdown / Comma-Separated Values / plain text / HyperText Markup Language&lt;/td&gt;
&lt;td&gt;The formats the per-extension extractors handle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JavaScript Object Notation&lt;/td&gt;
&lt;td&gt;The format of the chunk-metadata file next to the FAISS index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UTF-8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unicode Transformation Format, 8-bit&lt;/td&gt;
&lt;td&gt;The text encoding used when reading plain-text files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Operating System&lt;/td&gt;
&lt;td&gt;What reboots underneath you mid-ingest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt; what happens after Gemma 4 enthusiastically returns &lt;code&gt;{&amp;quot;questions&amp;quot;: [{&amp;quot;text&amp;quot;: &amp;quot;...&amp;quot;},}]&lt;/code&gt;.&lt;/p&gt;</description></item></channel></rss>