CogniVault Backend Explained, Part 2 · From File to Searchable Knowledge

All abbreviations are fully explained in the appendix at the bottom of the page.

An LLM cannot “open” your PDF. That sentence surprises a lot of newcomers, so let’s sit with it for a second: when you chat with your documents in CogniVault, the model never touches the original files. Something has to happen between “I dropped a file into the browser” and “the AI just quoted page 47 back at me.”

That something is ingestion, and it’s the subject of this part. In Part 1 we drew the whole map; today we zoom into one region — the conveyor belt that turns files into searchable knowledge.

The conveyor belt

Think of ingestion as a four-station assembly line:

Extract the text out of each file — even scanned ones.
Chunk it into pieces small enough to fit into a prompt.
Embed each chunk — turn it into a vector (a list of numbers that captures its meaning) so similar ideas land near each other in vector space.
Store vectors and metadata so they can be searched later.

flowchart TD
    A["Upload<br/>POST /upload<br/>saved to docs/"] --> B
    subgraph WF["DBOS durable workflow"]
        B["Step 1<br/>Which files changed?<br/>SHA-256 fingerprints"] --> C["Step 2<br/>Extract text<br/>per-format + OCR fallback"]
        C --> D["Chunk<br/>1000 chars, 100 overlap"]
        D --> E["Step 3<br/>Embed<br/>embeddinggemma, batches of 5"]
        E --> F["Step 4<br/>Save<br/>FAISS index + metadata JSON"]
    end
    F --> G["Reload in-memory index<br/>instantly searchable"]

Simple enough. The interesting engineering is in the failure cases — so let’s start there.

The factory ledger: why the pipeline can’t lose work

Embedding a large library takes minutes. What happens when your laptop goes to sleep at page 800 of a 1,000-page manual? With a plain Python script, everything restarts from page 1.

CogniVault instead writes the pipeline as a DBOS durable workflow. Picture a factory where every station stamps a permanent ledger the moment it finishes a box. If the power cuts out, nobody rebuilds finished boxes — the workers read the ledger and resume at the first unstamped entry.

DBOS is that ledger, and PostgreSQL is the book it’s written in. Each pipeline station is a checkpointed step; on restart, completed steps return their recorded results instantly and execution continues from the first unfinished one. A failed embedding batch is simply retried.
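To make the ledger idea concrete, here is a toy sketch in plain Python. It is not the DBOS API: `Ledger`, `run_step`, and `make_step` are hypothetical names, and an in-memory dict stands in for PostgreSQL. It only illustrates the behaviour described above, where a completed step returns its recorded result and a failed one runs again.

```python
from typing import Any, Callable, Dict, List


class Ledger:
    """Toy checkpoint store. DBOS keeps this in PostgreSQL; a dict stands in here."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}

    def run_step(self, name: str, fn: Callable[[], Any]) -> Any:
        # A stamped step returns its recorded result without running again.
        if name in self.results:
            return self.results[name]
        result = fn()                # may raise; nothing is stamped on failure
        self.results[name] = result  # stamp the ledger only after success
        return result


calls: List[str] = []  # records which steps actually executed


def make_step(name: str, value: Any, fail: bool = False) -> Callable[[], Any]:
    """Build a hypothetical pipeline step that logs its own execution."""
    def step() -> Any:
        calls.append(name)
        if fail:
            raise RuntimeError(f"{name} crashed")
        return value
    return step


ledger = Ledger()

# First run: extraction succeeds, embedding crashes.
ledger.run_step("extract", make_step("extract", "text"))
try:
    ledger.run_step("embed", make_step("embed", None, fail=True))
except RuntimeError:
    pass

# Restart: extraction is served from the ledger, embedding runs again.
text = ledger.run_step("extract", make_step("extract", "text"))
vectors = ledger.run_step("embed", make_step("embed", [0.1, 0.2]))
```

After the restart, `calls` shows that extraction executed once and embedding twice, which is the whole point: finished boxes are never rebuilt.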

This is also what powers the live progress timeline in the UI: starting an ingestion returns a workflow_id, and the frontend polls a status endpoint that reports which steps have completed, which are running, and which are still waiting.

I wrote a whole deep dive on this mechanism — including what happens when you kill -9 the process mid-ingest — in Crash-Resumable Ingestion: DBOS, SHA-256, and Surviving a kill -9.

Fingerprints, not faith: SHA-256 change detection

Re-embedding your whole library every time you add one file would be wasteful. So before any work happens, the pipeline computes each file’s SHA-256 hash (a content fingerprint — change one character in the file and the fingerprint changes completely) and compares it to the fingerprint stored with the file’s existing chunks:

Never seen before → ingest it.
Fingerprint changed → the old chunks are soft-deleted and the file is re-ingested.
Fingerprint identical → skip it entirely.
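The three-way decision above can be sketched with the standard library's `hashlib`. The function names, the `known` mapping, and the returned labels are assumptions made for illustration; only the SHA-256 comparison itself comes from the description.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def classify(data: bytes, stored: Optional[str]) -> str:
    """Decide what to do with a file, given the fingerprint stored with its chunks."""
    current = fingerprint(data)
    if stored is None:
        return "ingest"    # never seen before
    if stored != current:
        return "reingest"  # soft-delete the old chunks, then ingest again
    return "skip"          # identical content, nothing to do


# Hypothetical store: filename -> fingerprint recorded at the last ingestion.
known: Dict[str, str] = {"manual.txt": fingerprint(b"version 1")}

decisions = {
    "manual.txt": classify(b"version 1", known.get("manual.txt")),
    "notes.txt": classify(b"hello", known.get("notes.txt")),
}
edited = classify(b"version 2", known.get("manual.txt"))
```

For large files, a real implementation would likely feed the file to `hashlib.sha256()` in blocks via `update()` rather than reading it into memory at once.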

Why “soft”-deleted? Because the FAISS index type CogniVault uses cannot remove individual vectors. Stale chunks are just marked deleted: true in the metadata; their vectors stay in the index but every search filters them out. It’s an honest, boring solution — and it never corrupts the index.
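A minimal sketch of the soft-delete idea, assuming metadata entry `i` describes the vector at position `i` in the index. The field names other than `deleted`, and the helpers `soft_delete` and `live_hits`, are hypothetical; the search results are hard-coded rather than coming from FAISS.

```python
from typing import Dict, List, Tuple

# Position i in the index corresponds to metadata[i]; vectors are never removed.
metadata: List[Dict] = [
    {"source": "a.pdf", "text": "old intro", "deleted": False},
    {"source": "a.pdf", "text": "old setup", "deleted": False},
    {"source": "b.md", "text": "budget notes", "deleted": False},
]


def soft_delete(meta: List[Dict], source: str) -> int:
    """Mark every live chunk of `source` as deleted; return how many were marked."""
    count = 0
    for entry in meta:
        if entry["source"] == source and not entry["deleted"]:
            entry["deleted"] = True
            count += 1
    return count


def live_hits(meta: List[Dict], ranked: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Drop search hits whose metadata entry is marked deleted."""
    return [(i, score) for i, score in ranked if not meta[i]["deleted"]]


marked = soft_delete(metadata, "a.pdf")
# Pretend the index returned these (position, score) pairs, best first.
hits = live_hits(metadata, [(0, 0.91), (2, 0.80), (1, 0.42)])
```

One consequence of filtering after the search: if many chunks are soft-deleted, a search may need to request more than k candidates to end up with k live ones.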

Every format gets its own treatment

Here’s a detail that separates a demo from a product. A naive pipeline extracts “all the text” and calls it a day. CogniVault gives each format an extractor that preserves the structure that retrieval will need later:

Format	Strategy
PDF	Page by page, keeping page numbers (those become citations later). Any page yielding fewer than 50 characters is presumed scanned and sent to OCR
Scanned page	The page is rendered to an image at roughly 144 dpi, then Tesseract OCR (Optical Character Recognition — reading text out of images) extracts the words
Markdown	Split on headings; each section chunk gets a breadcrumb prefix like `[Section: Intro > Setup]` so its embedding carries the document hierarchy
CSV	Rows grouped 20 per chunk — and every chunk is prefixed with the header row, so the model always knows the column names
Excel	Same row-group idea per sheet, prefixed `[Sheet: name]`
PowerPoint	One chunk per slide
Word	Paragraphs plus table cells
Web pages	Fetched on request and stripped to clean article text — behind an SSRF guard (Server-Side Request Forgery protection: the server refuses to fetch private or internal addresses)
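As one example from the table, the Markdown breadcrumb strategy might look like the sketch below. The function name and the regular expression are assumptions, and the sketch deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_sections(text: str) -> List[str]:
    """Split Markdown on headings and prefix each section with its heading path."""
    path: List[str] = []    # current heading trail, e.g. ["Intro", "Setup"]
    body: List[str] = []    # lines collected for the section in progress
    chunks: List[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            prefix = f"[Section: {' > '.join(path)}] " if path else ""
            chunks.append(prefix + content)
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                   # close the previous section
            level = len(match.group(1))
            del path[level - 1:]      # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Intro\nWelcome.\n## Setup\nInstall it.\n# Usage\nRun it."
sections = markdown_sections(doc)
```

Here the second section comes out as `[Section: Intro > Setup] Install it.`, so even a two-word chunk carries its place in the document into the embedding.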

Ask yourself why the CSV detail matters. If chunk 14 of a spreadsheet is just twenty naked rows of numbers, no search will ever connect it to the question “what was the Q3 budget?” Prefix it with the header row, and the chunk knows it contains budget columns. Structure is retrieval fuel.
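The header-prefix trick is short enough to sketch in full. The function name is hypothetical, and re-joining cells with a plain comma is a simplification that would lose quoting for cells containing commas.

```python
import csv
import io
from typing import List


def csv_chunks(raw: str, rows_per_chunk: int = 20) -> List[str]:
    """Group data rows and repeat the header row at the top of every chunk."""
    rows = list(csv.reader(io.StringIO(raw)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    chunks = []
    for start in range(0, len(data), rows_per_chunk):
        group = [header] + data[start:start + rows_per_chunk]
        # Simplification: cells are re-joined with commas, without re-quoting.
        chunks.append("\n".join(",".join(row) for row in group))
    return chunks


# 45 data rows under a two-column header.
raw = "quarter,budget\n" + "\n".join(f"Q{i},{i * 100}" for i in range(1, 46))
chunks = csv_chunks(raw)
```

With 45 data rows this yields three chunks (20, 20, and 5 rows), and every one of them starts with `quarter,budget`.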

Chunking: 1,000 characters with a 100-character safety overlap

Long text is split into pieces of about 1,000 characters, with neighbouring pieces overlapping by 100 characters. The overlap is insurance: the last 100 characters of one chunk are repeated at the start of the next, so a sentence that straddles a chunk boundary (as long as it is shorter than the overlap) still appears whole in one of the two neighbours.
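A minimal sliding-window chunker, assuming a purely character-based split. The real splitter may well prefer paragraph or sentence boundaries; only the 1,000 and 100 figures come from the description above.

```python
from typing import List


def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(text)
```

For a 2,500-character input, the windows start at 0, 900, and 1,800, and the last 100 characters of each piece are the first 100 of the next.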

Embedding and saving

Chunks are embedded by embeddinggemma (via Ollama) in batches of five — each chunk becomes one vector. The vectors are normalised and appended to a FAISS index; alongside it, a JSON file records each chunk’s source filename, page number, category, fingerprint, and the text itself. The index holds the numbers; the JSON holds the meaning.
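The batching and normalisation steps can be sketched without a running model. In the sketch below, `fake_embed` is a placeholder for the real call to embeddinggemma through Ollama, and all function names are assumptions. Normalising to unit length matters because the inner product of two unit vectors equals their cosine similarity.

```python
import math
from typing import Callable, Iterator, List

Vector = List[float]


def batches(items: List[str], size: int = 5) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def normalise(vec: Vector) -> Vector:
    """Scale a vector to unit length (L2 norm of 1); zero vectors pass through."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def embed_all(chunks: List[str],
              embed_batch: Callable[[List[str]], List[Vector]]) -> List[Vector]:
    """Embed chunks batch by batch and return one unit vector per chunk."""
    vectors: List[Vector] = []
    for group in batches(chunks):
        vectors.extend(normalise(v) for v in embed_batch(group))
    return vectors


def fake_embed(texts: List[str]) -> List[Vector]:
    # Placeholder for the real model call; returns a 2-d vector per text.
    return [[float(len(t)), 1.0] for t in texts]


chunks = [f"chunk {i}" for i in range(12)]
vectors = embed_all(chunks, fake_embed)
```

Twelve chunks go through as batches of 5, 5, and 2, and each of the twelve resulting vectors has length 1.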

One choice worth highlighting for beginners: this is an exact index, not an approximate one. Many vector databases use ANN (Approximate Nearest Neighbour) shortcuts that trade a little accuracy for speed at massive scale. At personal-library scale you don’t need the trade — CogniVault checks every vector on every search and is still fast.
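"Checks every vector on every search" is literally a loop. Here is a pure-Python sketch of exact search, assuming unit-length vectors scored by inner product; FAISS does the same comparison in optimised native code, and the function names here are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_search(index: List[Vector], query: Vector, k: int = 3) -> List[Tuple[int, float]]:
    """Score the query against every stored vector and return the top k."""
    scored = [(i, dot(vec, query)) for i, vec in enumerate(index)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


index = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
top = exact_search(index, [0.0, 1.0], k=2)
```

The cost grows linearly with the number of vectors, which is exactly the trade ANN indexes avoid and a personal library can afford.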

The whole journey, end to end

%%{init: {'sequence': {'actorFontSize': 28, 'messageFontSize': 24, 'loopTextFontSize': 22, 'noteFontSize': 22}}}%%
sequenceDiagram
    actor U as You
    participant F as Frontend
    participant B as FastAPI
    participant W as DBOS Workflow
    participant O as Ollama (embeddinggemma)
    participant V as FAISS + metadata
    U->>F: Drag and drop a file, pick a category
    F->>B: POST /upload
    B->>B: Validate type and size, save to docs/
    F->>B: POST /ingest
    B->>W: Start durable workflow
    B-->>F: workflow_id
    loop Poll status
        F->>B: GET /ingest/status/{workflow_id}
        B-->>F: Step list (drives the progress timeline)
    end
    W->>W: SHA-256 change detection
    W->>W: Extract text (per format, OCR if scanned)
    W->>W: Chunk (1000 chars / 100 overlap)
    W->>O: Embed in batches of 5
    O-->>W: Vectors
    W->>V: Append vectors + metadata
    B-->>F: SUCCESS (index reloaded)
    F-->>U: "Knowledge Sync Complete"

The takeaway

Ingestion is where most RAG quality is actually won or lost — long before any clever prompting. Page numbers preserved, headers carried into every spreadsheet chunk, scans rescued by OCR, and a ledger that makes the whole thing crash-proof: none of it is glamorous, all of it shows up later as answers that cite the right page.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
DBOS	Database-Oriented Operating System	The library that checkpoints workflow steps in PostgreSQL so crashed jobs resume
SHA-256	Secure Hash Algorithm, 256-bit	A content fingerprint — change one byte of a file and the hash changes completely
OCR	Optical Character Recognition	Reading text out of images — the rescue path for scanned PDF pages
SSRF	Server-Side Request Forgery	An attack where a server is tricked into fetching internal URLs; the URL importer blocks it
FAISS	Facebook AI Similarity Search	The vector index the embeddings are appended to
ANN	Approximate Nearest Neighbour	The accuracy-for-speed shortcut CogniVault deliberately does not take
dpi	Dots Per Inch	Image resolution — scanned pages are rendered at ~144 dpi before OCR
JSON	JavaScript Object Notation	The format of the chunk-metadata file beside the FAISS index
PDF / CSV	Portable Document Format / Comma-Separated Values	Two of the eight-plus supported file formats
API	Application Programming Interface	The endpoints (`/upload`, `/ingest`, `/ingest/status/…`) driving the flow

Next up: Part 3 · How a Question Becomes a Cited Answer, covering hybrid retrieval, the six-tool agent, and the two-phase stream that shows the model thinking before it answers.
