CogniVault Backend Explained, Part 2 · From File to Searchable Knowledge
All abbreviations are fully explained in the appendix at the bottom of the page.
An LLM cannot “open” your PDF. That sentence surprises a lot of newcomers, so let’s sit with it for a second: when you chat with your documents in CogniVault, the model never touches the original files. Something has to happen between “I dropped a file into the browser” and “the AI just quoted page 47 back at me.”
That something is ingestion, and it’s the subject of this part. In Part 1 we drew the whole map; today we zoom into one region — the conveyor belt that turns files into searchable knowledge.
The conveyor belt
Think of ingestion as a four-station assembly line:
- Extract the text out of each file — even scanned ones.
- Chunk it into pieces small enough to fit into a prompt.
- Embed each chunk — turn it into a vector (a list of numbers that captures its meaning) so similar ideas land near each other in vector space.
- Store vectors and metadata so they can be searched later.
Pipeline diagram: POST /upload (files saved to docs/) → DBOS durable workflow [Step 1: Which files changed? (SHA-256 fingerprints) → Step 2: Extract text (per-format + OCR fallback) → Chunk (1000 chars, 100 overlap) → Step 3: Embed (embeddinggemma, batches of 5) → Step 4: Save (FAISS index + metadata JSON)] → Reload in-memory index (instantly searchable)
Simple enough. The interesting engineering is in the failure cases — so let’s start there.
The factory ledger: why the pipeline can’t lose work
Embedding a large library takes minutes. What happens when your laptop goes to sleep at page 800 of a 1,000-page manual? With a plain Python script: everything restarts from page 1.
CogniVault instead writes the pipeline as a DBOS durable workflow. Picture a factory where every station stamps a permanent ledger the moment it finishes a box. If the power cuts out, nobody rebuilds finished boxes — the workers read the ledger and resume at the first unstamped entry.
DBOS is that ledger, and PostgreSQL is the book it’s written in. Each pipeline station is a checkpointed step; on restart, completed steps return their recorded results instantly and execution continues from the first unfinished one. A failed embedding batch is simply retried.
This is also what powers the live progress timeline in the UI: starting an ingestion returns a workflow_id, and the frontend polls a status endpoint that reports which steps have completed, which are running, and which are still waiting.
I wrote a whole deep dive on this mechanism — including what happens when you kill -9 the process mid-ingest — in Crash-Resumable Ingestion: DBOS, SHA-256, and Surviving a kill -9.
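To make the ledger idea concrete, here is a toy sketch in plain Python. It is not the DBOS API: DBOS records steps in PostgreSQL, while this stand-in uses a JSON file, and the names `Ledger` and `run_step` are invented for illustration. It only shows the behaviour described above: a stamped step returns its recorded result instead of running again.

```python
import json
from pathlib import Path
from typing import Any, Callable


class Ledger:
    """Toy stand-in for a durable step log (hypothetical; not the DBOS API)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load results stamped by an earlier run, if the file survived.
        self.done: dict[str, Any] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def run_step(self, name: str, fn: Callable[[], Any]) -> Any:
        # A stamped step returns its recorded result without re-running.
        if name in self.done:
            return self.done[name]
        result = fn()  # must be JSON-serialisable in this sketch
        self.done[name] = result
        # Persist immediately so a crash after this line keeps the stamp.
        self.path.write_text(json.dumps(self.done))
        return result
```

If the process dies during a later step, a fresh `Ledger` pointed at the same file skips everything already stamped and resumes at the first unfinished step.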
Fingerprints, not faith: SHA-256 change detection
Re-embedding your whole library every time you add one file would be wasteful. So before any work happens, the pipeline computes each file’s SHA-256 hash (a content fingerprint — change one character in the file and the fingerprint changes completely) and compares it to the fingerprint stored with the file’s existing chunks:
- Never seen before → ingest it.
- Fingerprint changed → the old chunks are soft-deleted and the file is re-ingested.
- Fingerprint identical → skip it entirely.
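The three-way decision above can be sketched with the standard library's `hashlib`. This is a minimal illustration, not CogniVault's actual code: the `known` mapping from filename to stored fingerprint and the function names are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in 64 KiB blocks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def classify(path: Path, known: dict[str, str]) -> str:
    """Return 'new', 'changed', or 'unchanged' for one file.

    `known` maps filename -> fingerprint stored with the existing chunks
    (a hypothetical structure for this sketch).
    """
    previous = known.get(path.name)
    if previous is None:
        return "new"
    return "unchanged" if previous == sha256_of(path) else "changed"
```

Only files classified as "new" or "changed" need to travel further down the conveyor belt.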
Why “soft”-deleted? Because the FAISS index type CogniVault uses cannot remove individual vectors. Stale chunks are just marked deleted: true in the metadata; their vectors stay in the index but every search filters them out. It’s an honest, boring solution — and it never corrupts the index.
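A rough sketch of what soft deletion could look like on the metadata side. The field names `source` and `deleted`, and the assumption that position i in the index corresponds to entry i in the metadata list, are illustrative guesses rather than CogniVault's confirmed layout.

```python
from typing import Any


def soft_delete(metadata: list[dict[str, Any]], source: str) -> int:
    """Mark every live chunk of `source` as deleted; return how many were marked."""
    marked = 0
    for entry in metadata:
        if entry["source"] == source and not entry.get("deleted", False):
            entry["deleted"] = True
            marked += 1
    return marked


def live_hits(
    hits: list[tuple[int, float]], metadata: list[dict[str, Any]]
) -> list[tuple[int, float]]:
    """Drop search hits whose metadata entry is soft-deleted.

    Each hit is (vector position, score); position i in the index is
    assumed to correspond to metadata[i].
    """
    return [(i, s) for i, s in hits if not metadata[i].get("deleted", False)]
```

The vectors themselves are never touched, which is why this approach cannot damage the index.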
Every format gets its own treatment
Here’s a detail that separates a demo from a product. A naive pipeline extracts “all the text” and calls it a day. CogniVault gives each format an extractor that preserves the structure that retrieval will need later:
| Format | Strategy |
|---|---|
| PDF | Page by page, keeping page numbers (those become citations later). Any page yielding fewer than 50 characters is presumed scanned and sent to OCR |
| Scanned page | The page is rendered to an image at roughly 144 dpi, then Tesseract OCR (Optical Character Recognition — reading text out of images) extracts the words |
| Markdown | Split on headings; each section chunk gets a breadcrumb prefix like [Section: Intro > Setup] so its embedding carries the document hierarchy |
| CSV | Rows grouped 20 per chunk — and every chunk is prefixed with the header row, so the model always knows the column names |
| Excel | Same row-group idea per sheet, prefixed [Sheet: name] |
| PowerPoint | One chunk per slide |
| Word | Paragraphs plus table cells |
| Web pages | Fetched on request and stripped to clean article text — behind an SSRF guard (Server-Side Request Forgery protection: the server refuses to fetch private or internal addresses) |
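As one illustration of the last row, the core of an SSRF guard can be sketched with the standard `ipaddress` module. This is an assumption about the general technique, not CogniVault's actual check, and the function name is hypothetical. A real guard would also have to resolve the hostname and test every address it resolves to, including after redirects.

```python
import ipaddress


def is_fetchable_ip(ip_text: str) -> bool:
    """Return True only for addresses outside private and internal ranges."""
    ip = ipaddress.ip_address(ip_text)
    return not (
        ip.is_private        # e.g. 10.0.0.0/8, 192.168.0.0/16
        or ip.is_loopback    # e.g. 127.0.0.1, ::1
        or ip.is_link_local  # e.g. 169.254.0.0/16
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```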
Ask yourself why the CSV detail matters. If chunk 14 of a spreadsheet is just twenty naked rows of numbers, no search will ever connect it to the question “what was the Q3 budget?” Prefix it with the header row, and the chunk knows it contains budget columns. Structure is retrieval fuel.
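Here is a minimal sketch of the header-prefixed row grouping, using only the standard `csv` module. The function name and the choice to re-serialise rows as CSV text are assumptions for illustration; only the 20-row grouping and the repeated header come from the description above.

```python
import csv
import io


def csv_chunks(text: str, rows_per_chunk: int = 20) -> list[str]:
    """Group data rows and repeat the header row at the top of every chunk."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    chunks = []
    for start in range(0, len(data), rows_per_chunk):
        group = [header] + data[start:start + rows_per_chunk]
        buf = io.StringIO()
        # csv.writer keeps quoting correct for cells that contain commas.
        csv.writer(buf, lineterminator="\n").writerows(group)
        chunks.append(buf.getvalue().rstrip("\n"))
    return chunks
```

A 45-row sheet would come out as three chunks, each of which starts with the column names.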
Chunking: 1,000 characters with a 100-character safety overlap
Long text is split into pieces of about 1,000 characters, with neighbouring pieces overlapping by 100. The overlap is insurance: a sentence sliced at a chunk boundary still appears whole in one of the two neighbours, provided it is no longer than the overlap, so short ideas do not fall into the gap between chunks.
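A sliding-window splitter captures the idea. This is a simplified sketch that cuts purely by character count; whether CogniVault also respects sentence or paragraph boundaries is not stated here, and `chunk_text` is a name chosen for the example.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts 900 characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, the last 100 characters of each chunk are also the first 100 characters of the next one.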
Embedding and saving
Chunks are embedded by embeddinggemma (via Ollama) in batches of five — each chunk becomes one vector. The vectors are normalised and appended to a FAISS index; alongside it, a JSON file records each chunk’s source filename, page number, category, fingerprint, and the text itself. The index holds the numbers; the JSON holds the meaning.
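Two small pieces of that step can be shown without any external library: grouping chunks into batches of five, and normalising a vector to unit length. The call to the embedding model is deliberately left out, since its exact interface is not described here; both helper names are hypothetical.

```python
import math
from typing import Iterator


def batches(items: list[str], size: int = 5) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` chunks."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def l2_normalise(vector: list[float]) -> list[float]:
    """Scale a vector to length 1 so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]
```

Normalising before storage means a later search can compare vectors with a plain inner product.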
One choice worth highlighting for beginners: this is an exact index, not an approximate one. Many vector databases use ANN (Approximate Nearest Neighbour) shortcuts that trade a little accuracy for speed at massive scale. At personal-library scale you don’t need the trade — CogniVault checks every vector on every search and is still fast.
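"Checks every vector" is easy to picture as code. The sketch below is a pure-Python stand-in for an exact search, not FAISS itself: it scores the query against every stored vector and keeps the best k. It assumes all vectors are already normalised, so the inner product is the cosine similarity.

```python
def exact_search(
    query: list[float], vectors: list[list[float]], k: int = 3
) -> list[tuple[int, float]]:
    """Score every stored vector against the query; return the top-k (position, score)."""
    scores = [
        (i, sum(q * v for q, v in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    # Highest similarity first; no vector is skipped, so the result is exact.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

The cost grows linearly with the number of stored chunks, which is the price of exactness and the reason approximate indexes exist at much larger scales.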
The whole journey, end to end
The takeaway
Ingestion is where most RAG quality is actually won or lost, long before any clever prompting. Page numbers preserved, headers carried into every spreadsheet chunk, scans rescued by OCR, and a ledger that makes the whole thing crash-proof: none of it is glamorous, but all of it shows up later as answers that cite the right page.
Appendix: Abbreviations in this post
| Abbreviation | Full form | Meaning |
|---|---|---|
| LLM | Large Language Model | A neural network trained on huge amounts of text that can read and generate language |
| DBOS | Database-Oriented Operating System | The library that checkpoints workflow steps in PostgreSQL so crashed jobs resume |
| SHA-256 | Secure Hash Algorithm, 256-bit | A content fingerprint — change one byte of a file and the hash changes completely |
| OCR | Optical Character Recognition | Reading text out of images — the rescue path for scanned PDF pages |
| SSRF | Server-Side Request Forgery | An attack where a server is tricked into fetching internal URLs; the URL importer blocks it |
| FAISS | Facebook AI Similarity Search | The vector index the embeddings are appended to |
| ANN | Approximate Nearest Neighbour | The accuracy-for-speed shortcut CogniVault deliberately does not take |
| dpi | Dots Per Inch | Image resolution — scanned pages are rendered at ~144 dpi before OCR |
| JSON | JavaScript Object Notation | The format of the chunk-metadata file beside the FAISS index |
| PDF / CSV | Portable Document Format / Comma-Separated Values | Two of the eight-plus supported file formats |
| API | Application Programming Interface | The endpoints (/upload, /ingest, /ingest/status/…) driving the flow |
| RAG | Retrieval-Augmented Generation | Answering a question by first retrieving relevant chunks and then handing them to the LLM |
Next up: Part 3 · How a Question Becomes a Cited Answer — hybrid retrieval, the six-tool agent, and the two-phase stream that shows the model think before it answers.
