Part 5 · Getting Reliable JSON Out of a Local LLM
Part of a series on building Gemma CogniVault. Previously: Crash-resumable ingestion with DBOS.
All abbreviations are fully explained in the appendix at the bottom of the page.
CogniVault’s Study Hub generates four kinds of structured artefacts from your documents: quizzes, multi-lesson workshops, flashcard decks, and mindmaps. All four need the model to return structured JSON, not prose. All four ride on Gemma 4 running locally via Ollama. And all four would fail far too often if I trusted the model to “just return JSON.”
Here’s the defensive pattern that brings that failure rate close to zero — and what to do about the cases that still get through.
The pattern
1. Retrieve → hybrid search restricted by user-selected scope
2. Prompt → strict schema-by-example with explicit count + shape rules
3. Generate → ollama.chat with format="json" (grammar-constrained)
4. Parse → json.loads, tolerant of object / array / fenced shapes,
with a trailing-comma repair pass
5. Validate → drop malformed items rather than fail the whole batch
6. Retry → the workshop outline retries once with a stronger prompt
7. Persist → SQLite (progress.db) so the user can come back later
Every generator in CogniVault follows it. The interesting moves are 2, 4, and 5.
Step 3: format="json" does real work
Ollama exposes a format="json" option that puts the model under a grammar constraint during sampling. The decoder won’t emit tokens that would make the output invalid JSON. It’s not perfect — schemas are bigger than “valid JSON,” and the model can still produce well-formed garbage — but it eliminates the entire class of “the model started writing prose before the closing brace” failures.
If your local-LLM stack supports a grammar option (Ollama, llama.cpp, vLLM, etc.), turn it on. It’s not free (sampling is slightly slower) but the failure-mode improvement is enormous. Without it, you’ll spend most of your error budget on truncated objects.
Step 2: schema-in-prompt that the model can actually obey
format="json" guarantees the shape of the output is JSON. It says nothing about whether the JSON matches your domain schema. That’s the prompt’s job.
The pattern that works for me: instead of dumping a formal JSON Schema and saying “obey this,” include a filled-in example that shows the model the exact shape, plus explicit counts. Here’s the heart of CogniVault’s real quiz template (it lives as an editable Markdown file in backend/prompts/quiz.md):
Output ONLY a single JSON object — no prose, no markdown fences,
no text outside the JSON.
NUMBER OF QUESTIONS: EXACTLY $num_questions. This is a hard requirement.
OUTPUT SCHEMA:
{
"questions": [
{
"type": one of [$types_csv],
"question": the question text (string, no leading numbering),
"options": array of strings (length 4 for mcq, length 2 for true_false),
"correct_index": integer index into options (0-based),
"explanation": 1-2 sentence explanation of the correct answer
},
... exactly $num_questions entries
]
}
A few choices that matter:
- Show the shape, don’t describe it. “Each item has a
typefield” gets ignored more often than the literal example. - Pin the count. “EXACTLY 10” — repeated, in capitals, as a hard requirement — is much more reliable than “around 10.”
- Index, don’t repeat. The correct answer is
correct_index, an integer pointing intooptions— not the answer text again. Repeated text invites paraphrase drift (“Paris” vs “Paris, France”), and then your grading comparison breaks. - One artefact per call. I tried generating a full workshop (outline + every lesson) in one call. The model’s quality degrades sharply as the response grows. Splitting into outline-first, lesson-on-demand is the two-pass strategy below.
Step 4: parse, tolerantly
Even with format="json", two parsing problems survive in practice.
The shape surprise. This one bit me in production: I’d assumed the model would return a bare JSON array of questions. With format="json", Gemma consistently returns an object — {"questions": [...]} — and for a while the parser only accepted the array. Result: a 502 on every quiz generation until I found it. The fix is a parser that meets the model where it is:
# Simplified from backend/services/quiz_generator.py
def extract_items(raw: str) -> list | None:
for candidate in (raw, extract_json_object(raw), extract_json_array(raw)):
if candidate is None:
continue
data = load_json_lenient(candidate)
if isinstance(data, list):
return data # bare array
if isinstance(data, dict):
items = data.get("questions") # the expected object shape
if isinstance(items, list):
return items
return None
Lexical glitches. Occasionally a trailing comma slips through. The repair is deliberately narrow — one regex pass, then give up:
def load_json_lenient(text: str):
try:
return json.loads(text)
except json.JSONDecodeError:
repaired = re.sub(r",(\s*[\]}])", r"\1", text) # strip trailing commas
try:
return json.loads(repaired)
except json.JSONDecodeError:
return None
I don’t try to balance brackets, complete truncated strings, or guess at missing fields. Either the output is fixable with a trailing-comma pass and some substring extraction, or it isn’t, and we move to step 5.
Step 5: drop malformed items, don’t fail the batch
This is the call that took me a while to make peace with.
When the model returns 10 quiz questions but #7 is missing its options field, the temptation is to error out and regenerate the whole batch. Don’t. Validate each item independently and drop the ones that fail.
# CogniVault does this with explicit field checks into a dataclass;
# pydantic works just as well.
questions = []
for raw_item in parsed_items:
q = validate_item(raw_item, allowed_types) # returns None if malformed
if q is not None:
questions.append(q)
The user gets 9 questions instead of 10. They don’t notice. Re-running the whole generation to fix question #7 takes 30 seconds and might introduce new failures in questions 1-6. The dropped-item approach is strictly better UX. (The model also sometimes overshoots the count — the validated list is simply trimmed back to what was asked for.)
Step 6: the outline retries once
Workshops are the exception that proves the rule. A workshop is a structured outline (title, summary, lesson list) plus each lesson’s content. The outline must parse — there’s no partial success for a table of contents — so a parse failure there triggers exactly one retry, with the prompt re-sent plus a stern reminder: “Your previous response was unparseable. Output ONLY a single valid JSON object.” If the second attempt fails too, the user gets a clear error suggesting a narrower scope.
One retry, not three. Three retries when the model is consistently confused is just wasted seconds and watts.
The lessons themselves, interestingly, are not JSON at all. A lesson body is prose — forcing it into a JSON string would buy nothing and cost escaping headaches. Lessons are generated as plain Markdown, then run through a small cleanup pass that strips chat-isms the model sometimes adds despite instructions (“I hope this helps!”, “Let me know if…”). Different output, different contract.
Two-pass: outline first, lessons on demand
Workshops use a two-pass generation pattern:
Pass 1 — generate outline: {"title": ..., "lessons": [{"title": ...}, ...]} (cheap, JSON)
Pass 2 — for each lesson: a full Markdown lesson body (on demand)
The outline is fast and lets the user see the shape of the workshop immediately. Each lesson is generated when the user opens it — meaning the user is reading lesson 1 while deciding whether they even want lesson 5. The total wall-clock time to “first useful content” is small even for a 10-lesson workshop.
This is the same architectural move the chat side makes with two-phase streaming: split a slow operation into a tiny fast part and a larger slow part, hand the user the fast part immediately.
What I learned so far putting those generators together
A few principles distilled from the four generators:
- Use the grammar option in your inference stack. Don’t try to coax JSON out of a free-form decoder.
- Pin every quantifier in the prompt. “Exactly 10,” “exactly 4 options,” “one or two sentences.” Vague counts = inconsistent output.
- Don’t assume the top-level shape. Grammar-constrained Gemma likes objects; your code might expect arrays. Accept both — the parser is cheaper than relying on the model to return the expected shape.
- Drop, don’t fail. Lossy success beats brittle perfection.
- One retry, never more. If two tries can’t produce valid output, the prompt is wrong, not the model.
- Split large generations. Outline + lessons. Skeleton + body. Two small calls beat one big one almost every time. And if a part of the output is naturally prose, let it be prose.
Local LLMs in 2026 are good enough that structured generation is genuinely usable for production-shaped features. They are not so good that you can skip the defensive scaffolding. The scaffolding above is maybe 80 lines of code total across all four generators, and it’s the difference between “demo-quality” and “I trust this enough to ship.”
Appendix: Abbreviations in this post
| Abbreviation | Full form | Meaning |
|---|---|---|
| JSON | JavaScript Object Notation | The structured text format the generators must produce |
| LLM | Large Language Model | A neural network trained on huge amounts of text that can read and generate language |
| AI | Artificial Intelligence | Software performing tasks that normally need human intelligence |
| MCQ | Multiple-Choice Question | One of the two quiz question types (the other is true/false) |
| UX | User Experience | Why 9 valid questions beat a regeneration error |
| SQLite | (SQL = Structured Query Language) | The single-file database where generated artefacts persist |
| DBOS | Database-Oriented Operating System | The durable-workflow library from the previous post |
| HTTP 502 | Bad Gateway (HyperText Transfer Protocol status code) | The error my array-only parser produced until I accepted Gemma’s object shape |
Next up: The mindmap renderer — what hand-rolling an SVG radial layout taught me, and why version two uses React Flow anyway.
