Reliability |

Part 5 · Getting Reliable JSON Out of a Local LLM

Sun, 10 May 2026 00:00:00 +0000

Part of a series on building . Previously: .

All abbreviations are fully explained in the appendix at the bottom of the page.

CogniVault’s Study Hub generates four kinds of structured artefacts from your documents: quizzes, multi-lesson workshops, flashcard decks, and mindmaps. All four need the model to return structured JSON, not prose. All four ride on Gemma 4 running locally via Ollama. And all four would fail far too often if I trusted the model to “just return JSON.”

Here’s the defensive pattern that brings that failure rate close to zero — and what to do about the cases that still get through.

The pattern

1. Retrieve → hybrid search restricted by user-selected scope
2. Prompt → strict schema-by-example with explicit count + shape rules
3. Generate → ollama.chat with format="json" (grammar-constrained)
4. Parse → json.loads, tolerant of object / array / fenced shapes,
 with a trailing-comma repair pass
5. Validate → drop malformed items rather than fail the whole batch
6. Retry → the workshop outline retries once with a stronger prompt
7. Persist → SQLite (progress.db) so the user can come back later

Every generator in CogniVault follows it. The interesting moves are 2, 4, and 5.

Step 3: `format="json"` does real work

Ollama exposes a format="json" option that puts the model under a grammar constraint during sampling. The decoder won’t emit tokens that would make the output invalid JSON. It’s not perfect — schemas are bigger than “valid JSON,” and the model can still produce well-formed garbage — but it eliminates the entire class of “the model started writing prose before the closing brace” failures.

If your local-LLM stack supports a grammar option (Ollama, llama.cpp, vLLM, etc.), turn it on. It’s not free (sampling is slightly slower) but the failure-mode improvement is enormous. Without it, you’ll spend most of your error budget on truncated objects.

Step 2: schema-in-prompt that the model can actually obey

format="json" guarantees the shape of the output is JSON. It says nothing about whether the JSON matches your domain schema. That’s the prompt’s job.

The pattern that works for me: instead of dumping a formal JSON Schema and saying “obey this,” include a filled-in example that shows the model the exact shape, plus explicit counts. Here’s the heart of CogniVault’s real quiz template (it lives as an editable Markdown file in backend/prompts/quiz.md):

Output ONLY a single JSON object — no prose, no markdown fences,
no text outside the JSON.

NUMBER OF QUESTIONS: EXACTLY $num_questions. This is a hard requirement.

OUTPUT SCHEMA:
{
 "questions": [
 {
 "type": one of [$types_csv],
 "question": the question text (string, no leading numbering),
 "options": array of strings (length 4 for mcq, length 2 for true_false),
 "correct_index": integer index into options (0-based),
 "explanation": 1-2 sentence explanation of the correct answer
 },
 ... exactly $num_questions entries
 ]
}

A few choices that matter:

Show the shape, don’t describe it. “Each item has a type field” gets ignored more often than the literal example.
Pin the count. “EXACTLY 10” — repeated, in capitals, as a hard requirement — is much more reliable than “around 10.”
Index, don’t repeat. The correct answer is correct_index, an integer pointing into options — not the answer text again. Repeated text invites paraphrase drift (“Paris” vs “Paris, France”), and then your grading comparison breaks.
One artefact per call. I tried generating a full workshop (outline + every lesson) in one call. The model’s quality degrades sharply as the response grows. Splitting into outline-first, lesson-on-demand is the two-pass strategy below.

Step 4: parse, tolerantly

Even with format="json", two parsing problems survive in practice.

The shape surprise. This one bit me in production: I’d assumed the model would return a bare JSON array of questions. With format="json", Gemma consistently returns an object — {"questions": [...]} — and for a while the parser only accepted the array. Result: a 502 on every quiz generation until I found it. The fix is a parser that meets the model where it is:

# Simplified from backend/services/quiz_generator.py
def extract_items(raw: str) -> list | None:
 for candidate in (raw, extract_json_object(raw), extract_json_array(raw)):
 if candidate is None:
 continue
 data = load_json_lenient(candidate)
 if isinstance(data, list):
 return data # bare array
 if isinstance(data, dict):
 items = data.get("questions") # the expected object shape
 if isinstance(items, list):
 return items
 return None

Lexical glitches. Occasionally a trailing comma slips through. The repair is deliberately narrow — one regex pass, then give up:

def load_json_lenient(text: str):
 try:
 return json.loads(text)
 except json.JSONDecodeError:
 repaired = re.sub(r",(\s*[\]}])", r"\1", text) # strip trailing commas
 try:
 return json.loads(repaired)
 except json.JSONDecodeError:
 return None

I don’t try to balance brackets, complete truncated strings, or guess at missing fields. Either the output is fixable with a trailing-comma pass and some substring extraction, or it isn’t, and we move to step 5.

Step 5: drop malformed items, don’t fail the batch

This is the call that took me a while to make peace with.

When the model returns 10 quiz questions but #7 is missing its options field, the temptation is to error out and regenerate the whole batch. Don’t. Validate each item independently and drop the ones that fail.

# CogniVault does this with explicit field checks into a dataclass;
# pydantic works just as well.
questions = []
for raw_item in parsed_items:
 q = validate_item(raw_item, allowed_types) # returns None if malformed
 if q is not None:
 questions.append(q)

The user gets 9 questions instead of 10. They don’t notice. Re-running the whole generation to fix question #7 takes 30 seconds and might introduce new failures in questions 1-6. The dropped-item approach is strictly better UX. (The model also sometimes overshoots the count — the validated list is simply trimmed back to what was asked for.)

Step 6: the outline retries once

Workshops are the exception that proves the rule. A workshop is a structured outline (title, summary, lesson list) plus each lesson’s content. The outline must parse — there’s no partial success for a table of contents — so a parse failure there triggers exactly one retry, with the prompt re-sent plus a stern reminder: “Your previous response was unparseable. Output ONLY a single valid JSON object.” If the second attempt fails too, the user gets a clear error suggesting a narrower scope.

One retry, not three. Three retries when the model is consistently confused is just wasted seconds and watts.

The lessons themselves, interestingly, are not JSON at all. A lesson body is prose — forcing it into a JSON string would buy nothing and cost escaping headaches. Lessons are generated as plain Markdown, then run through a small cleanup pass that strips chat-isms the model sometimes adds despite instructions (“I hope this helps!”, “Let me know if…”). Different output, different contract.

Two-pass: outline first, lessons on demand

Workshops use a two-pass generation pattern:

Pass 1 — generate outline: {"title": ..., "lessons": [{"title": ...}, ...]} (cheap, JSON)
Pass 2 — for each lesson: a full Markdown lesson body (on demand)

The outline is fast and lets the user see the shape of the workshop immediately. Each lesson is generated when the user opens it — meaning the user is reading lesson 1 while deciding whether they even want lesson 5. The total wall-clock time to “first useful content” is small even for a 10-lesson workshop.

This is the same architectural move the chat side makes with : split a slow operation into a tiny fast part and a larger slow part, hand the user the fast part immediately.

What I learned so far putting those generators together

A few principles distilled from the four generators:

Use the grammar option in your inference stack. Don’t try to coax JSON out of a free-form decoder.
Pin every quantifier in the prompt. “Exactly 10,” “exactly 4 options,” “one or two sentences.” Vague counts = inconsistent output.
Don’t assume the top-level shape. Grammar-constrained Gemma likes objects; your code might expect arrays. Accept both — the parser is cheaper than relying on the model to return the expected shape.
Drop, don’t fail. Lossy success beats brittle perfection.
One retry, never more. If two tries can’t produce valid output, the prompt is wrong, not the model.
Split large generations. Outline + lessons. Skeleton + body. Two small calls beat one big one almost every time. And if a part of the output is naturally prose, let it be prose.

Local LLMs in 2026 are good enough that structured generation is genuinely usable for production-shaped features. They are not so good that you can skip the defensive scaffolding. The scaffolding above is maybe 80 lines of code total across all four generators, and it’s the difference between “demo-quality” and “I trust this enough to ship.”

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
JSON	JavaScript Object Notation	The structured text format the generators must produce
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
AI	Artificial Intelligence	Software performing tasks that normally need human intelligence
MCQ	Multiple-Choice Question	One of the two quiz question types (the other is true/false)
UX	User Experience	Why 9 valid questions beat a regeneration error
SQLite	(SQL = Structured Query Language)	The single-file database where generated artefacts persist
DBOS	Database-Oriented Operating System	The durable-workflow library from the previous post
HTTP 502	Bad Gateway (HyperText Transfer Protocol status code)	The error my array-only parser produced until I accepted Gemma’s object shape

Next up: — what hand-rolling an SVG radial layout taught me, and why version two uses React Flow anyway.

Part 4 · Crash-Resumable Ingestion: DBOS, SHA-256, and Surviving a kill -9

Tue, 05 May 2026 00:00:00 +0000

Part of a series on building . Previously: .

All abbreviations are fully explained in the appendix at the bottom of the page.

There are two things you absolutely don’t want your RAG ingestion pipeline to do:

Re-embed a 200-page PDF because you fixed a typo on page 12.
Lose its progress if you close the laptop lid halfway through.

The first wastes time and compute resources. The second leads to distrust in the system. Both have the same root: ingestion is treated like a fire-and-forget function, when it’s actually a long-running pipeline with intermediate state worth preserving.

CogniVault treats ingestion as a durable workflow. Specifically, a workflow checkpointed in Postgres, with content hashing for incremental work. This post walks through both pieces.

The pipeline

1. Scan docs/ → SHA-256 hash per file
 ├── New file → queue for embedding
 ├── Changed file → soft-delete old chunks, re-embed
 └── Unchanged → skip (idempotent)

2. Extract text → per-format extractor (PDF/OCR, DOCX, PPTX, XLSX, MD, CSV, TXT, HTML)
3. Chunk → RecursiveCharacterTextSplitter (1000 chars, 100 overlap)
4. Embed → embeddinggemma via Ollama, batches of 5
5. Save → append to FAISS IndexFlatIP + JSON metadata on disk

The heavy stages run as DBOS steps inside one parent workflow, each one checkpointed: if the process dies between steps, the next start picks up at the last completed one.

SHA-256 as the source of truth

The naive approach is to track ingestion by filename. That breaks the first time someone edits a file in place. Filename is the same; content isn’t. The vector store quietly carries stale chunks.

The fix is content-addressed: hash the file bytes, store the hash alongside the chunks. Every ingestion run:

current_hash = hashlib.sha256(file_bytes).hexdigest()
stored_hash = chunk_metadata_for(filename).get("file_hash")

if stored_hash is None:
 schedule_ingest(filename) # new file
elif stored_hash == current_hash:
 skip(filename) # unchanged
else:
 soft_delete_chunks_for(filename) # changed
 schedule_ingest(filename)

This gives ingestion an idempotent property that’s worth its weight in gold: running the pipeline twice in a row does almost nothing the second time. That’s not just an optimisation — it’s what makes the next section possible.

DBOS workflows

is a Python library that turns regular functions into checkpointed workflows backed by Postgres. The model is dead simple: decorate a function with @DBOS.workflow(), mark each long-running call inside it as a @DBOS.step(), and DBOS records each step’s input, output, and status in Postgres as it runs.

If the workflow crashes — process killed, OS reboot, Postgres connection drop — the next start sees there’s an unfinished workflow with the same ID, replays the recorded step outputs from Postgres (without re-running them), and resumes from the first incomplete step.

Here’s the actual step structure (slightly simplified from backend/services/ingest.py):

@DBOS.workflow()
def ingest_workflow() -> int:
 filenames = list_document_files() # @DBOS.step — scan + hash check
 docs = []
 for name in filenames:
 docs += process_single_document(name) # @DBOS.step — extract text, one file each
 chunks = chunk(docs) # plain Python — fast, re-runs freely
 embeddings = []
 for batch in batches_of_5(chunks):
 embeddings += embed_batch(batch) # @DBOS.step — the slow one, retried on failure
 save_vector_store(embeddings, chunks) # @DBOS.step — append to FAISS + metadata
 return len(chunks)

The granularity of @DBOS.step is the granularity of crash recovery, and it’s chosen deliberately. Extraction is one step per file, so a crash during file 9 of 10 doesn’t re-read the first eight. Embedding is one step per batch of five chunks, for one specific reason: embed_batch is the slow one. If the laptop dies during embeddings, we resume the embedding loop at the failed batch, not at PDF extraction.

Notice what isn’t a step: chunking. Splitting text is fast pure-Python work — checkpointing it would cost more ledger bookkeeping than simply redoing it on a resume.

There’s a related sizing trick hiding in the batch number. DBOS records each step’s output in Postgres, and embed_batch returns its vectors — so each ledger entry contains five embeddings’ worth of floats. Small batches keep each checkpoint record small and each retry cheap. One giant “embed everything” step would mean one giant ledger row and zero resume granularity.

The format extractors

Step 2 (process_single_document) is a dispatch on file extension. Each extractor is small and obvious; the interesting choices are in the chunking strategy each one feeds downstream.

Format	Library	Chunking note
PDF	`pypdf` page-by-page; `pytesseract` OCR fallback for image-only pages	Recursive splitter, 1000/100
DOCX	`python-docx` (paragraphs + table rows joined as text)	Recursive splitter
PPTX	`python-pptx`	One chunk per slide (title + body text)
XLSX	`openpyxl`	Header + 20-row batches, per sheet
MD	`MarkdownHeaderTextSplitter`	One chunk per H1/H2/H3 section, breadcrumb prepended
CSV	manual reader	Header row + 20-row batches
TXT	raw UTF-8 read	Recursive splitter
HTML	`trafilatura` clean text	Recursive splitter

The OCR fallback is the one worth pausing on. PDFs come in two flavours: ones with a real text layer, and ones that are basically scanned images wearing a PDF costume. pypdf returns nothing useful for the second kind, but it doesn’t raise — it just hands back empty strings. Without a fallback, your “ingestion succeeded” log is lying to you.

The detector is a heuristic: if pypdf returns fewer than 50 characters for a page, route the page through pymupdf → Pillow → pytesseract OCR. Slower, but at least produces text. The threshold is tuned to be sensitive enough to catch scanned pages while not punishing legitimately short pages (a chapter cover, a colophon).

Soft delete, not hard delete

When a file changes and we re-ingest, the old chunks need to go. The temptation is to physically remove them from the FAISS index, but FAISS IndexFlatIP doesn’t support efficient delete — you’d have to rebuild.

Soft delete instead: changed files get their old chunks marked with a deleted: true flag in the metadata; new chunks are appended without it. Search filters on the flag at query time, so stale vectors sit harmlessly in the index. If enough dead weight ever accumulates, the escape valve is obvious — rebuild the index from active chunks only — but in practice I haven’t needed it.

This is the same pattern most append-only systems use. It pairs naturally with content hashing — flag-and-append is much cheaper than remove-and-rebuild. One subtlety: the keyword index has to follow suit. CogniVault’s VectorDB.delete_by_source() flips the flags and rebuilds BM25 over the remaining active chunks, so the two retrievers never disagree about what exists.

What the user sees

Starting an ingestion (POST /ingest) returns a workflow_id, and the frontend polls GET /ingest/status/{workflow_id} to draw a live timeline of the workflow’s steps — scanning, per-file extraction (“Reading pages… 3 of 21”), embedding (“Calibrating batch 4 of 12”), saving. If the user closes the tab mid-ingest, comes back five minutes later, and reopens — the workflow finished in the background regardless. The next call to GET /api/vault/stats reflects the new chunk count. No “click to resume” button, no manual recovery dance.

The first time I closed the lid mid-embedding and watched the workflow pick itself up from the next step on resume, I’ll admit I was a little smug. That’s exactly the property I wanted, with surprisingly little code.

Pitfalls and edges

A few things I had to learn the hard way:

Don’t make embed_batch too big. Ollama isn’t great at backpressure. Batches of 5 are a sweet spot for embeddinggemma on a 16 GB machine — bigger batches stall on memory, smaller ones waste round-trip overhead. (And as noted above, the batch size doubles as your checkpoint-record size.)
Be careful with file deletion. Soft-deleted chunks must also disappear from BM25’s corpus, or keyword search will keep returning text that dense search no longer sees. Rebuilding BM25 inside delete_by_source() keeps the two in lockstep.
OCR is slow. A 50-page scan can take a minute or more. Surface that latency to the user; otherwise they think it’s hanging.

Takeaway

Durable workflows aren’t only for distributed systems. A single-user local app benefits from them in exactly the same ways: incremental work, crash recovery, idempotent retries. DBOS makes the cost of opting in trivially low — decorate your function, run Postgres locally, and you get a pipeline that survives lid-closes, OS updates, and your own Ctrl-C.

Combined with content-addressed hashing, ingestion stops being a thing you avoid touching for fear of having to wait 20 minutes. It becomes a thing you re-run whenever you feel like it — because re-running is free when nothing has changed.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
DBOS	Database-Oriented Operating System	A library that checkpoints workflow steps in Postgres so crashed jobs resume instead of restarting
SHA-256	Secure Hash Algorithm, 256-bit	A content fingerprint: change one byte of a file and the hash changes completely
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them
OCR	Optical Character Recognition	Turning pictures of text (scanned pages) into machine-readable text
FAISS	Facebook AI Similarity Search	The vector index the embeddings are appended to
IP (in `IndexFlatIP`)	Inner Product	FAISS’s similarity measure; equals cosine similarity on normalised vectors
BM25	Best Match 25	The keyword index that must stay in lockstep with FAISS on deletes
PDF / DOCX / PPTX / XLSX / MD / CSV / TXT / HTML	Portable Document Format / Word / PowerPoint / Excel / Markdown / Comma-Separated Values / plain text / HyperText Markup Language	The formats the per-extension extractors handle
JSON	JavaScript Object Notation	The format of the chunk-metadata file next to the FAISS index
UTF-8	Unicode Transformation Format, 8-bit	The text encoding used when reading plain-text files
OS	Operating System	What reboots underneath you mid-ingest

Next up: — what happens after Gemma 4 enthusiastically returns {"questions": [{"text": "..."},}].

Reliability |

Part 5 · Getting Reliable JSON Out of a Local LLM

The pattern

Step 3: format="json" does real work

Step 2: schema-in-prompt that the model can actually obey

Step 4: parse, tolerantly

Step 5: drop malformed items, don’t fail the batch

Step 6: the outline retries once

Two-pass: outline first, lessons on demand

What I learned so far putting those generators together

Appendix: Abbreviations in this post

Part 4 · Crash-Resumable Ingestion: DBOS, SHA-256, and Surviving a kill -9

The pipeline

SHA-256 as the source of truth

DBOS workflows

The format extractors

Soft delete, not hard delete

What the user sees

Pitfalls and edges

Takeaway

Appendix: Abbreviations in this post

Step 3: `format="json"` does real work