Part 5 · Getting Reliable JSON Out of a Local LLM

Part of a series on building Gemma CogniVault. Previously: Crash-resumable ingestion with DBOS.

All abbreviations are fully explained in the appendix at the bottom of the page.

CogniVault’s Study Hub generates four kinds of structured artefacts from your documents: quizzes, multi-lesson workshops, flashcard decks, and mindmaps. All four need the model to return structured JSON, not prose. All four ride on Gemma 4 running locally via Ollama. And all four would fail far too often if I trusted the model to “just return JSON.”

Here’s the defensive pattern that brings that failure rate close to zero — and what to do about the cases that still get through.

The pattern

1. Retrieve   →  hybrid search restricted by user-selected scope
2. Prompt     →  strict schema-by-example with explicit count + shape rules
3. Generate   →  ollama.chat with format="json"  (grammar-constrained)
4. Parse      →  json.loads, tolerant of object / array / fenced shapes,
                 with a trailing-comma repair pass
5. Validate   →  drop malformed items rather than fail the whole batch
6. Retry      →  the workshop outline retries once with a stronger prompt
7. Persist    →  SQLite (progress.db) so the user can come back later

Every generator in CogniVault follows it. The interesting moves are 2, 4, and 5.

Step 3: `format="json"` does real work

Ollama exposes a format="json" option that puts the model under a grammar constraint during sampling. The decoder won’t emit tokens that would make the output invalid JSON. It’s not perfect — schemas are bigger than “valid JSON,” and the model can still produce well-formed garbage — but it eliminates the entire class of “the model started writing prose before the closing brace” failures.

If your local-LLM stack supports a grammar option (Ollama, llama.cpp, vLLM, etc.), turn it on. It’s not free (sampling is slightly slower) but the failure-mode improvement is enormous. Without it, you’ll spend most of your error budget on truncated objects.

Step 2: schema-in-prompt that the model can actually obey

format="json" guarantees the shape of the output is JSON. It says nothing about whether the JSON matches your domain schema. That’s the prompt’s job.

The pattern that works for me: instead of dumping a formal JSON Schema and saying “obey this,” include a filled-in example that shows the model the exact shape, plus explicit counts. Here’s the heart of CogniVault’s real quiz template (it lives as an editable Markdown file in backend/prompts/quiz.md):

Output ONLY a single JSON object — no prose, no markdown fences,
no text outside the JSON.

NUMBER OF QUESTIONS: EXACTLY $num_questions. This is a hard requirement.

OUTPUT SCHEMA:
{
  "questions": [
    {
      "type": one of [$types_csv],
      "question": the question text (string, no leading numbering),
      "options": array of strings (length 4 for mcq, length 2 for true_false),
      "correct_index": integer index into options (0-based),
      "explanation": 1-2 sentence explanation of the correct answer
    },
    ... exactly $num_questions entries
  ]
}

A few choices that matter:

Show the shape, don’t describe it. “Each item has a type field” gets ignored more often than the literal example.
Pin the count. “EXACTLY 10” — repeated, in capitals, as a hard requirement — is much more reliable than “around 10.”
Index, don’t repeat. The correct answer is correct_index, an integer pointing into options — not the answer text again. Repeated text invites paraphrase drift (“Paris” vs “Paris, France”), and then your grading comparison breaks.
One artefact per call. I tried generating a full workshop (outline + every lesson) in one call. The model’s quality degrades sharply as the response grows. Splitting into outline-first, lesson-on-demand is the two-pass strategy below.

Step 4: parse, tolerantly

Even with format="json", two parsing problems survive in practice.

The shape surprise. This one bit me in production: I’d assumed the model would return a bare JSON array of questions. With format="json", Gemma consistently returns an object — {"questions": [...]} — and for a while the parser only accepted the array. Result: a 502 on every quiz generation until I found it. The fix is a parser that meets the model where it is:

# Simplified from backend/services/quiz_generator.py
def extract_items(raw: str) -> list | None:
    for candidate in (raw, extract_json_object(raw), extract_json_array(raw)):
        if candidate is None:
            continue
        data = load_json_lenient(candidate)
        if isinstance(data, list):
            return data                      # bare array
        if isinstance(data, dict):
            items = data.get("questions")    # the expected object shape
            if isinstance(items, list):
                return items
    return None

Lexical glitches. Occasionally a trailing comma slips through. The repair is deliberately narrow — one regex pass, then give up:

def load_json_lenient(text: str):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        repaired = re.sub(r",(\s*[\]}])", r"\1", text)   # strip trailing commas
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None

I don’t try to balance brackets, complete truncated strings, or guess at missing fields. Either the output is fixable with a trailing-comma pass and some substring extraction, or it isn’t, and we move to step 5.

Step 5: drop malformed items, don’t fail the batch

This is the call that took me a while to make peace with.

When the model returns 10 quiz questions but #7 is missing its options field, the temptation is to error out and regenerate the whole batch. Don’t. Validate each item independently and drop the ones that fail.

# CogniVault does this with explicit field checks into a dataclass;
# pydantic works just as well.
questions = []
for raw_item in parsed_items:
    q = validate_item(raw_item, allowed_types)   # returns None if malformed
    if q is not None:
        questions.append(q)

The user gets 9 questions instead of 10. They don’t notice. Re-running the whole generation to fix question #7 takes 30 seconds and might introduce new failures in questions 1-6. The dropped-item approach is strictly better UX. (The model also sometimes overshoots the count — the validated list is simply trimmed back to what was asked for.)

Step 6: the outline retries once

Workshops are the exception that proves the rule. A workshop is a structured outline (title, summary, lesson list) plus each lesson’s content. The outline must parse — there’s no partial success for a table of contents — so a parse failure there triggers exactly one retry, with the prompt re-sent plus a stern reminder: “Your previous response was unparseable. Output ONLY a single valid JSON object.” If the second attempt fails too, the user gets a clear error suggesting a narrower scope.

One retry, not three. Three retries when the model is consistently confused is just wasted seconds and watts.

The lessons themselves, interestingly, are not JSON at all. A lesson body is prose — forcing it into a JSON string would buy nothing and cost escaping headaches. Lessons are generated as plain Markdown, then run through a small cleanup pass that strips chat-isms the model sometimes adds despite instructions (“I hope this helps!”, “Let me know if…”). Different output, different contract.

Two-pass: outline first, lessons on demand

Workshops use a two-pass generation pattern:

Pass 1 — generate outline:    {"title": ..., "lessons": [{"title": ...}, ...]}   (cheap, JSON)
Pass 2 — for each lesson:     a full Markdown lesson body                        (on demand)

The outline is fast and lets the user see the shape of the workshop immediately. Each lesson is generated when the user opens it — meaning the user is reading lesson 1 while deciding whether they even want lesson 5. The total wall-clock time to “first useful content” is small even for a 10-lesson workshop.

This is the same architectural move the chat side makes with two-phase streaming: split a slow operation into a tiny fast part and a larger slow part, hand the user the fast part immediately.

What I learned so far putting those generators together

A few principles distilled from the four generators:

Use the grammar option in your inference stack. Don’t try to coax JSON out of a free-form decoder.
Pin every quantifier in the prompt. “Exactly 10,” “exactly 4 options,” “one or two sentences.” Vague counts = inconsistent output.
Don’t assume the top-level shape. Grammar-constrained Gemma likes objects; your code might expect arrays. Accept both — the parser is cheaper than relying on the model to return the expected shape.
Drop, don’t fail. Lossy success beats brittle perfection.
One retry, never more. If two tries can’t produce valid output, the prompt is wrong, not the model.
Split large generations. Outline + lessons. Skeleton + body. Two small calls beat one big one almost every time. And if a part of the output is naturally prose, let it be prose.

Local LLMs in 2026 are good enough that structured generation is genuinely usable for production-shaped features. They are not so good that you can skip the defensive scaffolding. The scaffolding above is maybe 80 lines of code total across all four generators, and it’s the difference between “demo-quality” and “I trust this enough to ship.”

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
JSON	JavaScript Object Notation	The structured text format the generators must produce
LLM	Large Language Model	A neural network trained on huge amounts of text that can read and generate language
AI	Artificial Intelligence	Software performing tasks that normally need human intelligence
MCQ	Multiple-Choice Question	One of the two quiz question types (the other is true/false)
UX	User Experience	Why 9 valid questions beat a regeneration error
SQLite	(SQL = Structured Query Language)	The single-file database where generated artefacts persist
DBOS	Database-Oriented Operating System	The durable-workflow library from the previous post
HTTP 502	Bad Gateway (HyperText Transfer Protocol status code)	The error my array-only parser produced until I accepted Gemma’s object shape

Next up: The mindmap renderer — what hand-rolling an SVG radial layout taught me, and why version two uses React Flow anyway.

No results found

Part 5 · Getting Reliable JSON Out of a Local LLM

The pattern

Step 3: `format="json"` does real work

Step 2: schema-in-prompt that the model can actually obey

Step 4: parse, tolerantly

Step 5: drop malformed items, don’t fail the batch

Step 6: the outline retries once

Two-pass: outline first, lessons on demand

What I learned so far putting those generators together

Appendix: Abbreviations in this post

Related

No results found

Part 5 · Getting Reliable JSON Out of a Local LLM

The pattern

Step 3: format="json" does real work

Step 2: schema-in-prompt that the model can actually obey

Step 4: parse, tolerantly

Step 5: drop malformed items, don’t fail the batch

Step 6: the outline retries once

Two-pass: outline first, lessons on demand

What I learned so far putting those generators together

Appendix: Abbreviations in this post

Related

Step 3: `format="json"` does real work