Gemma |

Part 5 · Getting Reliable JSON Out of a Local LLM

Sun, 10 May 2026 00:00:00 +0000

Part of a series on building CogniVault. Previously: Part 4, on crash-resumable ingestion with DBOS.

All abbreviations are fully explained in the appendix at the bottom of the page.

CogniVault’s Study Hub generates four kinds of structured artefacts from your documents: quizzes, multi-lesson workshops, flashcard decks, and mindmaps. All four need the model to return structured JSON, not prose. All four ride on Gemma 4 running locally via Ollama. And all four would fail far too often if I trusted the model to “just return JSON.”

Here’s the defensive pattern that brings that failure rate close to zero — and what to do about the cases that still get through.

The pattern

1. Retrieve → hybrid search restricted by user-selected scope
2. Prompt → strict schema-by-example with explicit count + shape rules
3. Generate → ollama.chat with format="json" (grammar-constrained)
4. Parse → json.loads, tolerant of object / array / fenced shapes,
 with a trailing-comma repair pass
5. Validate → drop malformed items rather than fail the whole batch
6. Retry → the workshop outline retries once with a stronger prompt
7. Persist → SQLite (progress.db) so the user can come back later

Every generator in CogniVault follows it. The interesting moves are 2, 4, and 5.

Step 3: `format="json"` does real work

Ollama exposes a format="json" option that puts the model under a grammar constraint during sampling. The decoder won’t emit tokens that would make the output invalid JSON. It’s not perfect — schemas are bigger than “valid JSON,” and the model can still produce well-formed garbage — but it eliminates the entire class of “the model started writing prose before the closing brace” failures.

If your local-LLM stack supports a grammar option (Ollama, llama.cpp, vLLM, etc.), turn it on. It’s not free (sampling is slightly slower) but the failure-mode improvement is enormous. Without it, you’ll spend most of your error budget on truncated objects.
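As a minimal sketch of what this step looks like in code, assuming the `ollama` Python client: `generate_json` and `fake_chat` are hypothetical names used only for illustration, and the chat function is passed in so the snippet runs without a live server. In the app you would pass `ollama.chat` itself.

```python
import json
from typing import Any, Callable


def generate_json(chat: Callable[..., Any], model: str, prompt: str) -> Any:
    """Ask the model for JSON and parse the reply.

    `chat` is expected to behave like `ollama.chat`: it accepts `model`,
    `messages` and `format`, and returns a response whose text lives at
    response["message"]["content"].
    """
    response = chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        format="json",  # constrain decoding to syntactically valid JSON
    )
    return json.loads(response["message"]["content"])


def fake_chat(**kwargs: Any) -> dict:
    """Stand-in for ollama.chat so the sketch runs offline (assumption)."""
    assert kwargs["format"] == "json"
    return {"message": {"content": '{"questions": []}'}}


result = generate_json(fake_chat, "gemma4:e4b", "Return an empty quiz.")
```

Note that `json.loads` is still called on the reply: the constraint covers syntax only, so the parsed value can still be the wrong shape for your schema.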

Step 2: schema-in-prompt that the model can actually obey

format="json" guarantees the shape of the output is JSON. It says nothing about whether the JSON matches your domain schema. That’s the prompt’s job.

The pattern that works for me: instead of dumping a formal JSON Schema and saying “obey this,” include a filled-in example that shows the model the exact shape, plus explicit counts. Here’s the heart of CogniVault’s real quiz template (it lives as an editable Markdown file in backend/prompts/quiz.md):

Output ONLY a single JSON object — no prose, no markdown fences,
no text outside the JSON.

NUMBER OF QUESTIONS: EXACTLY $num_questions. This is a hard requirement.

OUTPUT SCHEMA:
{
  "questions": [
    {
      "type": one of [$types_csv],
      "question": the question text (string, no leading numbering),
      "options": array of strings (length 4 for mcq, length 2 for true_false),
      "correct_index": integer index into options (0-based),
      "explanation": 1-2 sentence explanation of the correct answer
    },
    ... exactly $num_questions entries
  ]
}

A few choices that matter:

Show the shape, don’t describe it. “Each item has a type field” gets ignored more often than the literal example.
Pin the count. “EXACTLY 10” — repeated, in capitals, as a hard requirement — is much more reliable than “around 10.”
Index, don’t repeat. The correct answer is correct_index, an integer pointing into options — not the answer text again. Repeated text invites paraphrase drift (“Paris” vs “Paris, France”), and then your grading comparison breaks.
One artefact per call. I tried generating a full workshop (outline + every lesson) in one call. The model’s quality degrades sharply as the response grows. Splitting into outline-first, lesson-on-demand is the two-pass strategy below.
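The `$num_questions` and `$types_csv` placeholders in the template happen to match Python's `string.Template` syntax. Assuming that is how the file gets filled in (the post does not say), a rendering step could look like the sketch below. The shortened template text, the `render_quiz_prompt` name, and the quoting of the type names are all assumptions.

```python
from string import Template

# Shortened stand-in for backend/prompts/quiz.md (assumption: the real file
# is loaded from disk and is much longer).
QUIZ_TEMPLATE = Template(
    "NUMBER OF QUESTIONS: EXACTLY $num_questions. This is a hard requirement.\n"
    '"type": one of [$types_csv]\n'
    "... exactly $num_questions entries\n"
)


def render_quiz_prompt(num_questions: int, types: list[str]) -> str:
    """Fill the placeholders; substitute() raises KeyError if one is missing."""
    types_csv = ", ".join(f'"{t}"' for t in types)
    return QUIZ_TEMPLATE.substitute(
        num_questions=num_questions,
        types_csv=types_csv,
    )


prompt = render_quiz_prompt(10, ["mcq", "true_false"])
```

Using `substitute` rather than `safe_substitute` means a renamed placeholder fails loudly at render time instead of leaking a literal `$name` into the prompt.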

Step 4: parse, tolerantly

Even with format="json", two parsing problems survive in practice.

The shape surprise. This one bit me in production: I’d assumed the model would return a bare JSON array of questions. With format="json", Gemma consistently returns an object — {"questions": [...]} — and for a while the parser only accepted the array. Result: a 502 on every quiz generation until I found it. The fix is a parser that meets the model where it is:

# Simplified from backend/services/quiz_generator.py
def extract_items(raw: str) -> list | None:
    for candidate in (raw, extract_json_object(raw), extract_json_array(raw)):
        if candidate is None:
            continue
        data = load_json_lenient(candidate)
        if isinstance(data, list):
            return data  # bare array
        if isinstance(data, dict):
            items = data.get("questions")  # the expected object shape
            if isinstance(items, list):
                return items
    return None

Lexical glitches. Occasionally a trailing comma slips through. The repair is deliberately narrow — one regex pass, then give up:

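import json
import re

def load_json_lenient(text: str):
    """Parse JSON, retrying once after stripping trailing commas."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Narrow repair: drop a comma that directly precedes ] or }
        repaired = re.sub(r",(\s*[\]}])", r"\1", text)
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None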

I don’t try to balance brackets, complete truncated strings, or guess at missing fields. Either the output is fixable with a trailing-comma pass and some substring extraction, or it isn’t, and we move to step 5.
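The two substring helpers that `extract_items` calls are not shown in the post. A minimal sketch, assuming they simply slice from the first opening bracket to the last closing one (the real implementations may well differ):

```python
from typing import Optional


def _slice_between(text: str, opener: str, closer: str) -> Optional[str]:
    """Return the span from the first `opener` to the last `closer`, if any."""
    start = text.find(opener)
    end = text.rfind(closer)
    if start == -1 or end < start:
        return None
    return text[start : end + 1]


def extract_json_object(raw: str) -> Optional[str]:
    """Pull out the outermost {...} span, ignoring prose or fences around it."""
    return _slice_between(raw, "{", "}")


def extract_json_array(raw: str) -> Optional[str]:
    """Pull out the outermost [...] span, ignoring prose or fences around it."""
    return _slice_between(raw, "[", "]")
```

Slicing this way also covers fenced output, because the fence markers sit outside the outermost brackets. It makes no attempt to check that the span is valid JSON; that stays the job of `load_json_lenient`.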

Step 5: drop malformed items, don’t fail the batch

This is the call that took me a while to make peace with.

When the model returns 10 quiz questions but #7 is missing its options field, the temptation is to error out and regenerate the whole batch. Don’t. Validate each item independently and drop the ones that fail.

# CogniVault does this with explicit field checks into a dataclass;
# pydantic works just as well.
questions = []
for raw_item in parsed_items:
    q = validate_item(raw_item, allowed_types)  # returns None if malformed
    if q is not None:
        questions.append(q)

The user gets 9 questions instead of 10. They don’t notice. Re-running the whole generation to fix question #7 takes 30 seconds and might introduce new failures in the nine questions that were already fine. The dropped-item approach is strictly better UX. (The model also sometimes overshoots the count; the validated list is simply trimmed back to what was asked for.)
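For illustration, here is roughly what per-item validation can look like. This is a sketch under assumptions, not the project's code: the `Question` dataclass, the `EXPECTED_OPTIONS` table, and `validate_batch` are hypothetical names, and the rules are read off the prompt schema shown earlier.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Option counts taken from the prompt schema: 4 for mcq, 2 for true_false.
EXPECTED_OPTIONS = {"mcq": 4, "true_false": 2}


@dataclass
class Question:
    type: str
    question: str
    options: list[str]
    correct_index: int
    explanation: str


def validate_item(raw_item: Any, allowed_types: set[str]) -> Optional[Question]:
    """Return a Question, or None if any field is missing or malformed."""
    if not isinstance(raw_item, dict):
        return None
    qtype = raw_item.get("type")
    text = raw_item.get("question")
    options = raw_item.get("options")
    index = raw_item.get("correct_index")
    explanation = raw_item.get("explanation", "")
    if not isinstance(qtype, str) or qtype not in allowed_types:
        return None
    if not isinstance(text, str) or not text.strip():
        return None
    if not isinstance(options, list) or not all(isinstance(o, str) for o in options):
        return None
    if len(options) != EXPECTED_OPTIONS.get(qtype, len(options)):
        return None
    # bool is a subclass of int, so reject it explicitly
    if isinstance(index, bool) or not isinstance(index, int):
        return None
    if not 0 <= index < len(options):
        return None
    if not isinstance(explanation, str):
        return None
    return Question(qtype, text.strip(), options, index, explanation)


def validate_batch(parsed_items: list, allowed_types: set[str], num_questions: int) -> list[Question]:
    """Keep the valid items, then trim any overshoot back to the requested count."""
    checked = (validate_item(item, allowed_types) for item in parsed_items)
    valid = [q for q in checked if q is not None]
    return valid[:num_questions]
```

Each check returns early with `None`, so one bad field never raises and never affects the neighbouring items.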

Step 6: the outline retries once

Workshops are the exception that proves the rule. A workshop is a structured outline (title, summary, lesson list) plus each lesson’s content. The outline must parse — there’s no partial success for a table of contents — so a parse failure there triggers exactly one retry, with the prompt re-sent plus a stern reminder: “Your previous response was unparseable. Output ONLY a single valid JSON object.” If the second attempt fails too, the user gets a clear error suggesting a narrower scope.

One retry, not three. Three retries when the model is consistently confused is just wasted seconds and watts.
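The retry policy fits in a few lines. In this sketch the `generate` callable stands in for the actual model call and `generate_outline` is a hypothetical name; only the reminder text comes from the description above.

```python
import json
from typing import Callable, Optional

RETRY_REMINDER = (
    "Your previous response was unparseable. "
    "Output ONLY a single valid JSON object."
)


def generate_outline(generate: Callable[[str], str], prompt: str) -> Optional[dict]:
    """Try once, then once more with the reminder appended. Never a third time."""
    attempts = (prompt, f"{prompt}\n\n{RETRY_REMINDER}")
    for attempt_prompt in attempts:
        raw = generate(attempt_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            return data
    return None  # caller turns this into the "try a narrower scope" error
```

Writing the attempts as a fixed two-element tuple makes the retry budget part of the code's shape rather than a counter someone can quietly bump.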

The lessons themselves, interestingly, are not JSON at all. A lesson body is prose — forcing it into a JSON string would buy nothing and cost escaping headaches. Lessons are generated as plain Markdown, then run through a small cleanup pass that strips chat-isms the model sometimes adds despite instructions (“I hope this helps!”, “Let me know if…”). Different output, different contract.
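A cleanup pass of that kind can be as small as a couple of line-level patterns. The pattern list and the `strip_chatisms` name below are hypothetical, since the post only quotes two example phrases.

```python
import re

# Hypothetical patterns built from the two phrases quoted above.
_CHATISM_PATTERNS = [
    re.compile(r"^\s*i hope this helps\b.*$", re.IGNORECASE),
    re.compile(r"^\s*let me know if\b.*$", re.IGNORECASE),
]


def strip_chatisms(markdown: str) -> str:
    """Drop whole lines that start with a known chat-ism, keep everything else."""
    kept = [
        line
        for line in markdown.splitlines()
        if not any(pattern.match(line) for pattern in _CHATISM_PATTERNS)
    ]
    return "\n".join(kept).strip()
```

Matching only at the start of a line keeps the pass conservative: a lesson that legitimately quotes one of these phrases mid-sentence is left alone.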

Two-pass: outline first, lessons on demand

Workshops use a two-pass generation pattern:

Pass 1 — generate outline: {"title": ..., "lessons": [{"title": ...}, ...]} (cheap, JSON)
Pass 2 — for each lesson: a full Markdown lesson body (on demand)

The outline is fast and lets the user see the shape of the workshop immediately. Each lesson is generated when the user opens it — meaning the user is reading lesson 1 while deciding whether they even want lesson 5. The total wall-clock time to “first useful content” is small even for a 10-lesson workshop.
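Here is a sketch of the on-demand half, assuming an in-memory cache for brevity (the app persists to SQLite, which this example does not attempt). `Workshop` and `generate_lesson` are hypothetical names.

```python
from typing import Callable


class Workshop:
    """Holds the pass-1 outline and generates lesson bodies lazily."""

    def __init__(self, outline: dict, generate_lesson: Callable[[str], str]) -> None:
        self.outline = outline
        self._generate_lesson = generate_lesson
        self._lessons: dict[int, str] = {}  # lesson index -> Markdown body

    def lesson(self, index: int) -> str:
        """Generate the lesson on first open, reuse it on every later open."""
        if index not in self._lessons:
            title = self.outline["lessons"][index]["title"]
            self._lessons[index] = self._generate_lesson(title)
        return self._lessons[index]
```

Nothing is generated until `lesson()` is called, so a lesson the user never opens never costs a model call.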

This is the same architectural move the chat side makes with two-phase streaming: split a slow operation into a tiny fast part and a larger slow part, hand the user the fast part immediately.

What I learned so far putting those generators together

A few principles distilled from the four generators:

Use the grammar option in your inference stack. Don’t try to coax JSON out of a free-form decoder.
Pin every quantifier in the prompt. “Exactly 10,” “exactly 4 options,” “one or two sentences.” Vague counts = inconsistent output.
Don’t assume the top-level shape. Grammar-constrained Gemma likes objects; your code might expect arrays. Accept both — the parser is cheaper than relying on the model to return the expected shape.
Drop, don’t fail. Lossy success beats brittle perfection.
One retry, never more. If two tries can’t produce valid output, the prompt is wrong, not the model.
Split large generations. Outline + lessons. Skeleton + body. Two small calls beat one big one almost every time. And if a part of the output is naturally prose, let it be prose.

Local LLMs in 2026 are good enough that structured generation is genuinely usable for production-shaped features. They are not so good that you can skip the defensive scaffolding. The scaffolding above is maybe 80 lines of code total across all four generators, and it’s the difference between “demo-quality” and “I trust this enough to ship.”

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
JSON	JavaScript Object Notation	The structured text format the generators must produce
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
AI	Artificial Intelligence	Software performing tasks that normally need human intelligence
MCQ	Multiple-Choice Question	One of the two quiz question types (the other is true/false)
UX	User Experience	Why 9 valid questions beat a regeneration error
SQLite	(SQL = Structured Query Language)	The single-file database where generated artefacts persist
DBOS	Database-Oriented Operating System	The durable-workflow library from the previous post
HTTP 502	Bad Gateway (HyperText Transfer Protocol status code)	The error my array-only parser produced until I accepted Gemma’s object shape

Next up: what hand-rolling an SVG radial layout taught me, and why version two uses React Flow anyway.

Part 3 · Two-Phase Streaming: Showing the Model Think Before It Acts

Thu, 30 Apr 2026 00:00:00 +0000

Part of a series on building CogniVault. Previously: Part 2, on hybrid retrieval with FAISS and BM25. All abbreviations are fully explained in the appendix at the bottom of the page.

When I first wired up Gemma 4 with Strands Agents inside CogniVault, the chat felt slow. Not laggy: slow in a way that’s worse than laggy. The user types a question. The cursor sits there. Then, eventually, an answer drops out of the void.

The model wasn’t idle. It was thinking. Gemma 4 has a chain-of-thought mode that produces a (sometimes long) reasoning trace before its final reply. With a single-phase agent stream, all of that thinking is happening inside the agent loop — silently — before any tool calls run, before any tokens get emitted to the UI.

So I split the call into two phases.

The shape

POST /rag
 │
 ├── Phase 1 — Direct Ollama call, thinking enabled
 │ stream: {"type":"thinking","data":"..."} (reasoning tokens)
 │
 └── Phase 2 — Strands Agent (thinking disabled)
 stream: {"type":"metadata","data":{...}} (citations, as soon as search runs)
 stream: {"type":"text","data":"..."} (answer tokens)
 stream: {"type":"memory","data":{...}} (end-of-stream: session memory usage)

The endpoint streams newline-delimited JSON (NDJSON): each line of the response body is one self-contained JSON envelope with a type and a data. The frontend dispatches on type and renders accordingly: a collapsible reasoning panel for the thinking tokens, the main message bubble for the text tokens, a sidebar card per citation.
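The `envelope` helper that the later snippets call is not shown in the post. A plausible minimal version, offered as an assumption rather than the project's actual code:

```python
import json
from typing import Any


def envelope(kind: str, data: Any) -> str:
    """Serialise one NDJSON line: a JSON object followed by a single newline."""
    return json.dumps({"type": kind, "data": data}, ensure_ascii=False) + "\n"
```

Because `json.dumps` escapes newlines inside strings, a multi-line reasoning chunk still ends up on exactly one line, which is what lets the frontend split the stream on `"\n"`.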

The user sees the model start thinking immediately. Latency to first byte drops from “long enough to wonder if it crashed” to “instant.” Total time to final answer doesn’t change. Perceived speed does.

Phase 1 — Thinking only

Phase 1 is a single direct call to Ollama with thinking enabled. It gets exactly what Phase 2 will see — the same system prompt, the current question, and any attached images — so the reasoning reflects reality. Only the reasoning tokens are consumed; whatever answer text Phase 1 starts to produce is discarded, because we don’t want a half-formed answer competing with the real one.

# Simplified from backend/services/rag_agent.py
# Runs inside the async generator that backs the /rag stream.
import ollama

client = ollama.AsyncClient(host=settings.ollama_host)
stream = await client.chat(
    model=settings.llm_model,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query, "images": images},
    ],
    think=True,  # the ollama client takes thinking as a top-level argument
    stream=True,
)
async for chunk in stream:
    if chunk.message.thinking:
        yield envelope("thinking", chunk.message.thinking)

Phase 1 is deliberately best-effort: any failure here is swallowed and logged, and the stream moves straight on to Phase 2. A broken reasoning panel should never cost the user their answer.
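One way to get that best-effort behaviour is a small wrapper around the Phase 1 generator. The sketch below uses stand-in generators instead of real Ollama and Strands calls; `best_effort`, `rag_stream`, and the demo generators are hypothetical names.

```python
import asyncio
import logging
from typing import AsyncIterator

logger = logging.getLogger(__name__)


async def best_effort(phase: AsyncIterator[str]) -> AsyncIterator[str]:
    """Relay items from `phase`; on failure, log it and simply stop."""
    try:
        async for item in phase:
            yield item
    except Exception:  # cancellation is a BaseException and still propagates
        logger.exception("Phase 1 failed; continuing with Phase 2")


async def rag_stream(phase_one: AsyncIterator[str], phase_two: AsyncIterator[str]) -> AsyncIterator[str]:
    async for item in best_effort(phase_one):
        yield item
    async for item in phase_two:  # Phase 2 errors are NOT swallowed
        yield item


async def broken_phase_one() -> AsyncIterator[str]:
    yield "thinking-1"
    raise RuntimeError("simulated Ollama timeout")


async def working_phase_two() -> AsyncIterator[str]:
    yield "text-1"


async def collect() -> list:
    return [item async for item in rag_stream(broken_phase_one(), working_phase_two())]


items = asyncio.run(collect())
```

The reasoning chunk that arrived before the failure is still delivered, and the answer follows as if nothing had happened.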

Phase 2 — Agent with tools

Phase 2 builds a fresh Strands Agent per request — no shared mutable state between concurrent chats — restores the session’s conversation history into it, and runs the tool loop with six tools registered:

Tool	Purpose
`search_knowledge_base(query)`	Hybrid FAISS + BM25 search, top-7, RRF fusion. Scope-filter-aware.
`list_documents()`	Inventory of every indexed file with type and chunk count.
`analyze_document(filename)`	Inner Gemma call → structured summary (topics, entities, key facts).
`compare_documents(doc_a, doc_b, question)`	Inner Gemma call answering across two documents.
`calculator(expression)`	Safe AST evaluator — no `eval()`, no arbitrary code.
`current_time()`	Timestamp for time-aware queries.

The agent decides which tools to call and in what order. There’s no hard-coded router; the system prompt explains what’s available and Strands handles the loop. For most document questions the path is: search_knowledge_base → answer. For comparisons: compare_documents → answer. For “what files do I have?”: list_documents → answer. For greetings and arithmetic, the system prompt tells the agent it may skip search entirely. The model picks.
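Of the six tools, the calculator is the easiest to show in isolation. The following is a sketch of the general "safe AST evaluator" technique, not CogniVault's implementation: the set of supported operators is an assumption, and exponentiation is left out here on purpose because it makes it easy to request enormous numbers.

```python
import ast
import operator
from typing import Union

Number = Union[int, float]

# Whitelisted operators (assumption: the real tool may support more).
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculator(expression: str) -> Number:
    """Evaluate plain arithmetic by walking the AST; anything else is rejected."""

    def evaluate(node: ast.AST) -> Number:
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](evaluate(node.operand))
        raise ValueError(f"unsupported expression: {ast.dump(node)}")

    return evaluate(ast.parse(expression, mode="eval"))
```

Names, attribute access, and function calls all fall through to the `ValueError`, so an input like `__import__('os')` is parsed but never executed.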

Two details that took debugging to get right:

Phase 2 runs with thinking explicitly disabled. Without that flag, Gemma’s default behaviour can leak <think>…</think> tags into the visible answer, and everything before the closing tag gets swallowed by the Markdown renderer. One model option — options={"thinking": False} — fixed a “truncated responses” bug that looked much scarier than it was.
Citations are flushed before the first answer token. Tools run before text deltas arrive, so by the time the first visible token streams, every source the search found is already in the sidebar. The accumulator is a request-local ContextVar the search tool appends to.

# Simplified — the real loop reads Strands' raw event dicts
async for event in agent.stream_async(user_input):
    # Not every Strands event carries a raw "event" payload, so default to {}
    raw = event.get("event", {})
    delta = raw.get("contentBlockDelta", {}).get("delta", {}).get("text")
    if delta:
        for doc in new_citations():  # drain the ContextVar accumulator
            yield envelope("metadata", doc)
        yield envelope("text", delta)
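The accumulator behind `new_citations()` could be built along these lines. Everything except the `new_citations` name is an assumption made for the sketch.

```python
from contextvars import ContextVar
from typing import Optional

# One list per request; None means "no request in progress".
_citations: ContextVar[Optional[list]] = ContextVar("citations", default=None)


def start_request() -> None:
    """Called by the handler at the start of each /rag stream."""
    _citations.set([])


def add_citation(doc: dict) -> None:
    """Called by the search tool for every source it returns."""
    bucket = _citations.get()
    if bucket is not None:
        bucket.append(doc)


def new_citations() -> list:
    """Return the citations collected since the last call, then empty the bucket."""
    bucket = _citations.get()
    if not bucket:
        return []
    drained = list(bucket)
    bucket.clear()
    return drained
```

The list is mutated in place rather than replaced with `set()`. Context copies are shallow, so a tool running in a copied context still appends to the same list the handler drains.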

Why this matters more than it sounds

You could implement similar behaviour with one agent call that interleaves thinking events with text events. The reasons I split it anyway:

The thinking model and the tool model can be different. Right now they’re both gemma4:e4b, but the architecture lets me swap a smaller, faster model in for Phase 1 reasoning and keep the big one for Phase 2 tool use. I’m not doing that yet — but I want the option.
Phase 1 always streams immediately. A pure agent loop only starts producing tokens after the model has decided what to say. Two-phase guarantees the user sees activity almost as soon as they press Enter, regardless of how complex the Phase 2 tool work gets.
Failures isolate. If Phase 2 falls over (Ollama timeout, tool error), Phase 1’s reasoning is still visible — the user can see what the model was trying to do, which makes the error far less frustrating than a blank “something went wrong.”

ContextVar isolation, again

The same ContextVar trick that scopes retrieval carries over here. At the start of each /rag stream, the handler sets two request-local variables: the document-scope filter and the citation accumulator. The agent’s tools read and write them implicitly. Conversation history itself lives in a per-session store guarded by per-session asyncio locks, so two concurrent requests in the same chat can’t corrupt each other either.

Tested with two browser tabs open on the same backend, scoped to different document categories, sending overlapping queries simultaneously. Zero cross-contamination. The test suite covers this explicitly in test_thinking.py and test_doc_scope_filter.py; the testing post in this series tells the broader story.
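The isolation property itself is easy to reproduce in miniature. In this sketch two concurrent "requests" set different scopes and each tool call sees only its own; the names `doc_scope`, `search_tool`, and `handle_request` are hypothetical.

```python
import asyncio
from contextvars import ContextVar

doc_scope: ContextVar[str] = ContextVar("doc_scope", default="all")


async def search_tool(query: str) -> str:
    """Reads the scope implicitly, the way the agent's tools do."""
    await asyncio.sleep(0)  # yield control so the two requests interleave
    return f"{query} @ {doc_scope.get()}"


async def handle_request(scope: str, query: str) -> str:
    doc_scope.set(scope)  # visible only inside this task's context
    await asyncio.sleep(0)
    return await search_tool(query)


async def main() -> list:
    # gather() wraps each coroutine in its own Task, and every Task runs
    # in its own copy of the current context.
    return await asyncio.gather(
        handle_request("contracts", "deadline"),
        handle_request("lecture-notes", "deadline"),
    )


results = asyncio.run(main())
```

The isolation comes from asyncio giving each task a copy of the context, so there is no lock and no request ID being threaded through the tool signatures.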

The frontend side of the contract

A detail that tripped me up: this is a POST endpoint, so the browser’s EventSource API (which only does GET) is out. The frontend uses fetch and reads the response body incrementally, splitting on newlines and parsing each line as JSON:

// Simplified from useRagStream.ts
const res = await fetch("/rag", {
  method: "POST",
  headers: { "Content-Type": "application/json" }, // tell the backend the body is JSON
  body: JSON.stringify(payload),
});
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop()!; // keep the trailing partial line
  for (const line of lines) {
    if (!line.trim()) continue;
    const { type, data } = JSON.parse(line);
    switch (type) {
      case "thinking":
        appendThinking(data);
        break;
      case "text":
        appendText(data);
        break;
      case "metadata":
        addCitation(data);
        break;
      case "memory":
        updateMemoryMeter(data);
        break;
    }
  }
}

The reasoning panel starts collapsed, with a small pulsing indicator while thinking tokens are still streaming — enough to signal “the model is working” without shoving a wall of chain-of-thought at the user. One click expands the full trace, during or after the stream.

What I’d revisit

Phase 1 reasons toward a full answer, and we throw the answer part away. A dedicated “plan your approach, don’t answer yet” prompt for Phase 1 would make the reasoning trace tighter and cheaper. Today it shares the main system prompt — simpler, but the trace can ramble.
No interrupt yet. Once Phase 1 starts, it runs to completion. If the user types a follow-up mid-stream we let it finish. A real cancel button would mean wiring an abort signal through Ollama’s HTTP client — feasible, not yet done.
Phase 1 occasionally over-thinks. Greetings and trivial questions still produce a paragraph of reasoning. A “should I think?” gate (probably a tiny classifier or even a heuristic on query length) would skip Phase 1 entirely for those cases.
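To make the last idea concrete, here is the crudest possible version of such a gate. It is purely a sketch of the heuristic floated above: the greeting list, the word threshold, and the `should_think` name are all invented for illustration, and none of it exists in the app today.

```python
# Hypothetical greeting list and threshold; both would need tuning on real queries.
GREETINGS = {"hi", "hello", "hey", "thanks", "thank you"}


def should_think(query: str, min_words: int = 4) -> bool:
    """Skip Phase 1 for greetings and very short queries."""
    normalized = query.strip().lower().rstrip("!.?")
    if normalized in GREETINGS:
        return False
    return len(normalized.split()) >= min_words
```

A length cut-off will misfire on short but hard questions, which is why a tiny classifier is the more likely end state.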

Takeaway

Streaming is not just an optimisation. It’s a UX primitive. Two-phase streaming buys you a free property: the visible part of the interaction starts before the slow part does. The user gets to watch the model think, which is — genuinely — more interesting than watching a spinner.

If your agent app feels slow even though the answers are fast, look at when tokens start flowing. The fix often isn’t a faster model.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
NDJSON	Newline-Delimited JSON	A stream where each line is its own complete JSON object — what `/rag` emits
JSON	JavaScript Object Notation	The universal text format for structured data
UX	User Experience	How the product feels to use — the real beneficiary of two-phase streaming
UI	User Interface	The visible surface the stream renders into
FAISS	Facebook AI Similarity Search	The dense half of hybrid retrieval (previous post)
BM25	Best Match 25	The keyword half of hybrid retrieval (previous post)
RRF	Reciprocal Rank Fusion	The rank-only formula that merges the two result lists
AST	Abstract Syntax Tree	The parsed form of an expression — how the calculator evaluates maths without `eval()`
HTTP	HyperText Transfer Protocol	The protocol carrying the stream
SSE	Server-Sent Events	The browser’s built-in GET-only streaming format — notably not usable here, because `/rag` is a POST
API	Application Programming Interface	The boundary the frontend calls

Next up: how CogniVault re-ingests edited PDFs without re-embedding everything, and survives a kill -9 mid-pipeline.

Part 1 · Why I Built a Local-First RAG

Mon, 20 Apr 2026 00:00:00 +0000

All abbreviations are fully explained in the appendix at the bottom of the page.

I’ve spent the last few years in front of virtual classrooms full of career-changers in Germany, walking them through programming basics, web development, and introductory AI courses. Most of the information we deal with is fine to paste into cloud-based AI tools. Some of it really isn’t.

Exam materials under confidentiality. A trainee’s portfolio with personal details. Other private documents that should never end up training someone else’s model.

So I built Gemma CogniVault, a fully local AI study and productivity tool. No cloud. No telemetry. No “we may use this data to improve our service.” Just Gemma 4 running on Ollama, on my laptop, talking to my files.

The leaky abstraction

The pitch for cloud AI is great: a giant model, available instantly, billed by the token. The fine print is where it gets uncomfortable:

Where does the data physically live during inference?
Whose jurisdiction governs that hardware this afternoon?
Does the audit trail stop at the API boundary, or can you actually trace what happened to your bytes?
When you tick “do not train on my data,” are you trusting a control, a contract, or both?

For most consumer use cases, those questions are fine to wave away. For education, healthcare, finance, legal, public administration — the answer “trust us” isn’t an answer.

What “local-first” actually means here

Lots of products say “private.” I wanted three concrete properties:

The model lives on your machine. Gemma 4 (gemma4:e4b) and embeddinggemma are pulled via Ollama. Inference is a localhost HTTP call.
Your documents never leave. Vectors, chunks, chat history, study sessions, achievements — all on disk on your computer.
You can verify it. Gemma CogniVault ships a Privacy Audit Panel that shows a live “zero external connections” indicator alongside document counts and the Ollama host. It’s not a promise — it’s a status light.

If a future build of Gemma CogniVault ever made an outbound call, that panel would be the first thing to scream.

What you get back

Going local sounds like a trade-off — surely you lose the magic of the giant frontier models? In practice, with Gemma 4 you get more than enough:

Thinking mode — Gemma 4’s chain-of-thought streams into a collapsible panel before the answer. Watching the model reason about your documents is genuinely useful as a teaching tool.
Tool use: through the Strands Agents SDK, the model decides when to search the knowledge base, summarise a document, compare two files, or check the time.
Vision — attach images and PDFs straight into a chat turn.
Generation that’s actually structured — quizzes, multi-lesson workshops, flashcard decks, and interactive mindmaps, generated with format="json" so the output parses reliably.

CogniVault doesn’t try to be a giant ecosystem. It’s a single-purpose tool that does one thing well: use your own documents with a capable local model in a private environment. I must admit that it was inspired to a great extent by , which I’ve found incredibly useful but not private enough for my needs.

The shape of the app

CogniVault is split into four sections that map to how I actually work with information on cloud-based AI tools:

Section	What it’s for
Chat	Ask anything about your documents. Cited answers, scope filter, voice in.
Knowledge Base	Upload, categorise, manage. SHA-256 detects edits on re-upload.
Study Hub	Quiz · Workshop · Flashcards · Mindmaps — four ways to drill into the source.
Dashboard	Total study time, streak, 25 badges, GitHub-style 90-day heatmap.

Everything reachable from a sidebar that remembers where you left off, on a stack that fits in your ~/Documents folder.
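The SHA-256 edit detection in the Knowledge Base row comes down to comparing content fingerprints. A minimal sketch, assuming a plain filename-to-hash mapping (the app's actual storage and function names are not described here):

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 fingerprint of the file contents, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def upload_status(filename: str, data: bytes, known_hashes: dict[str, str]) -> str:
    """Classify an upload as new, unchanged, or edited, and record its hash."""
    digest = content_hash(data)
    previous = known_hashes.get(filename)
    known_hashes[filename] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "edited"
```

Hashing the bytes rather than trusting the modification time means a file that was merely touched or copied is not re-ingested.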

What comes next

This is the first in a short series. Over the next few posts I’ll dig into the parts I’m most proud of — and a few I’d build differently next time:

Hybrid retrieval — why FAISS and BM25, fused with Reciprocal Rank Fusion
Two-phase streaming with Gemma 4 and Strands Agents
Crash-resumable ingestion with DBOS, hash-aware re-ingest, OCR fallback
Getting reliable JSON out of a local LLM (and what to do when it fails)
The mindmap renderer — what hand-rolling SVG taught me, and why v2 uses React Flow
Gamifying learning — 25 badges, idle-gap sessions, 90-day heatmap
Testing a local-AI app with 350+ tests and zero infrastructure

If you want to skip ahead, the code is open source at , and there’s a .

Your data. Your hardware. Your AI. Your vault.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
RAG	Retrieval-Augmented Generation	Retrieve relevant passages from your own documents first; let the model answer from them instead of from training memory
AI	Artificial Intelligence	Software performing tasks that normally need human intelligence
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
HTTP	HyperText Transfer Protocol	The protocol browsers and APIs use to exchange requests and responses
API	Application Programming Interface	The boundary where you call someone else’s software — and where cloud audit trails stop
IHK	Industrie- und Handelskammer	The German Chamber of Commerce and Industry, which administers trainer certification
AEVO	Ausbildereignungsverordnung	The German trainer-aptitude regulation — the exam material that motivated this project
FAISS	Facebook AI Similarity Search	Meta’s vector-search library (covered in the next post)
BM25	Best Match 25	A classic keyword-ranking formula (also next post)
SDK	Software Development Kit	A library of building blocks — here, Strands, which provides the agent loop
JSON	JavaScript Object Notation	The universal text format for structured data
PDF	Portable Document Format	One of the eight-plus file types CogniVault ingests
SHA-256	Secure Hash Algorithm, 256-bit	A content fingerprint used to detect edited files on re-upload
OCR	Optical Character Recognition	Turning pictures of text (scans) into machine-readable text
DBOS	Database-Oriented Operating System	The durable-workflow library behind crash-resumable ingestion
SVG	Scalable Vector Graphics	The browser’s built-in vector drawing format
