Part 8 · Testing a Local-AI App: 351 Tests, Zero Infrastructure

Part of a series on building Gemma CogniVault. Previously: Gamifying learning — badges, heatmap, and idle-gap sessions. All abbreviations are fully explained in the appendix at the bottom of the page.

CogniVault has 351 tests across 22 files (at the time of writing — the suite grows with the app). None of them need Ollama. None of them need Postgres. None of them need a real PDF, a microphone, or an internet connection. The whole suite runs in about three seconds on my laptop.

That’s not because there isn’t much to test — the surface is wide. It’s because the test suite is built around one principle: mock at the edge, real everywhere else. This post is about what “the edge” means in a local-AI app, and how to draw the line so the suite stays useful instead of decorative.

The 22 test files

File	What it covers
`test_api.py`	The HTTP endpoints (upload, ingest, RAG, history, KB browsing)
`test_tools.py`	Calculator, clock, KB search tool
`test_thinking.py`	Two-phase stream, thinking tokens, session isolation
`test_chat_attachments.py`	Multi-file attach, PDF/DOCX extraction, size limits
`test_chat_memory.py`	Session history budget, trimming, restart rebuild
`test_doc_scope_filter.py`	Per-request ContextVar isolation, search filtering
`test_doc_tools.py`	`list_documents`, `analyze_document`, `compare_documents`
`test_edit_regenerate.py`	History rewind, trim_history_to_turns validation
`test_structure_chunking.py`	Markdown header splits, CSV row batches, doc types
`test_ocr_fallback.py`	OCR trigger threshold, graceful degradation
`test_new_formats.py`	PPTX, XLSX, HTML extractors, extension routing
`test_docx_url.py`	DOCX ingestion and URL import (with the SSRF guard)
`test_reingest.py`	SHA-256 change detection, idempotency
`test_vector_db.py`	BM25, FAISS, RRF fusion, hybrid search
`test_audio.py`	Whisper transcription endpoint
`test_progress.py`	Sessions, daily aggregation, achievement criteria
`test_prompts.py`	The prompt-template loader and custom overrides
`test_vault_stats.py`	The Privacy Vault Audit numbers
`test_quiz.py` / `test_workshop.py` / `test_flashcards.py` / `test_mindmaps.py`	Per-mode parsing, endpoints, achievements

Everything that can be tested in isolation is tested in isolation. Everything that needs to be tested through the FastAPI layer is, but the only things mocked are the calls that cross the process boundary.

What gets mocked, what doesn’t

The single most important question in a project like this: where do you stub?

[ React frontend ]   ←─ not in scope for backend tests
       │
       ▼
[ FastAPI handlers ]  ←─ tested directly with TestClient
       │
       ▼
[ services/ ]         ←─ tested directly (vector_db, rag_agent, generators)
       │
       ├─► [ FAISS + BM25 ]    ←─ real, in-memory, fast
       ├─► [ SQLite ]          ←─ real, against a tmp_path file
       ├─► [ DBOS ]            ←─ patched (no launch, no Postgres)
       ├─► [ Ollama ]          ←─ patched at each service's import site
       └─► [ Whisper ]         ←─ stubbed (no 145 MB model load)

The rule of thumb: anything that crosses a process or network boundary, mock. Anything in-process, run for real.

FAISS and BM25 are real because they’re libraries we link into the test process. SQLite is real because it’s a file. DBOS is patched because launching it expects a Postgres connection, and that’s network. Ollama is patched because it’s HTTP. Whisper is stubbed because loading a 145 MB model in a unit test is silly.

That principle keeps the test suite fast (no I/O the OS can’t handle in milliseconds) and meaningful (the real code paths through retrieval, chunking, parsing, scope filtering all execute).

Mocking Ollama

Most CogniVault tests need some model output, but they don’t care what model produced it. Each service imports the ollama module directly, so the tests patch that reference at the service’s own import site:

# Real pattern from test_quiz.py
from unittest.mock import patch
from backend.services import quiz_generator

def test_quiz_parses_questions():
    fake = {"message": {"content": json.dumps({"questions": [VALID_MCQ] * 5})}}
    with patch.object(quiz_generator, "ollama") as mock_ollama:
        mock_ollama.chat.return_value = fake
        result = quiz_generator.generate_quiz(
            difficulty="beginner", num_questions=5, question_types=["mcq"],
        )
    assert len(result.questions) == 5

A streaming variant feeds chunk sequences instead of a single response, used by the RAG and thinking tests. The key property: one patch.object against the module the service actually uses. No deep mock hierarchies, no fragile string paths into third-party internals. Easy to read in a code review, easy to debug when a test fails.

Mocking DBOS

DBOS expects launch() to connect to Postgres. The shared client fixture in conftest.py simply patches the dbos instance before the app is exercised:

# Real pattern from conftest.py
@pytest.fixture()
def client():
    """A FastAPI TestClient with DBOS launch mocked out — no Postgres needed."""
    with patch("backend.services.ingest.dbos") as mock_dbos:
        mock_dbos.launch = MagicMock()
        from backend.main import app
        with TestClient(app) as c:
            yield c

The decorated workflow steps still execute as ordinary Python functions — we lose the durability semantics, but the tests aren’t testing durability, they’re testing the business logic inside the steps (hash detection, extraction, chunking). The durability layer has its own tests upstream, in DBOS’s own suite.

There’s a second isolation layer that runs on every test automatically: an autouse fixture points the docs folder, FAISS index, and metadata file at a per-test tmp_path via environment variables, so no test can ever touch real data on disk.

Real SQLite, with one override

Progress tracking, achievements, quiz storage, deck CRUD — all SQLite. The progress tracker exposes a single test seam: a module-level path override.

# Real pattern from test_quiz.py
@pytest.fixture(autouse=True)
def _isolate_progress_db(tmp_path, monkeypatch):
    monkeypatch.setattr(progress_tracker, "_db_path_override",
                        str(tmp_path / "progress_test.db"))

Every test gets a fresh database file; the schema auto-creates on first use. No connection pooling drama, no leaked state between tests, no in-memory :memory: gymnastics. Just a temp file per test.

This is the kind of test that catches bugs an SQL-level mock would never see — a missing index, a botched migration, a constraint that doesn’t fire. SQLite is fast enough on every machine I’ve ever owned that “use the real database” isn’t even a trade-off.

The TestClient pattern

For HTTP tests, FastAPI’s TestClient runs the app in-process. The upload, the validation, the chunking, the vector-store update, the response serialisation — every layer runs for real. Only the calls that would leave the process (the Ollama embedding call inside ingestion, the model call inside generation) are patched. That’s the right line: the test verifies the integration of those layers, but doesn’t depend on an external service.

The streaming endpoint tests use a slightly different style — they iterate the response body and parse each NDJSON line (one JSON envelope per line, as described in the streaming post) — but the principle is identical.

Coverage gaps I accept

Three things the test suite doesn’t cover:

The frontend. No React testing in this suite — that’s a separate concern. Most failures show up in API tests anyway, because the frontend is a thin client over a typed API.
Real Ollama prompt quality. Whether gemma4:e4b actually produces useful quiz questions is not a thing tests can answer. That’s evaluation, not testing. It belongs in a separate harness with a real model running.
Race conditions across DBOS workflow restarts. The resume path is exercised at the logic level, but the full state space of “what happens if Postgres goes away at this exact instant” is too large to enumerate.

These are conscious gaps. The test suite is for catching regressions in code I wrote; it’s not a replacement for evaluation, integration testing, or actual chaos engineering.

What the suite is actually for

Two things, in order:

Refactor confidence. When I rip out the agent loop and put a new one in, do the tests still pass? If yes, the API contracts I care about haven’t drifted.
PR review surface. Every PR runs the suite in CI. A green run is a precondition for merge. The suite is loud enough that a real regression makes the noise.

Notice what it isn’t for: proving the model works. It can’t. Tests can pin behaviour but they can’t pin quality. That’s a different muscle, and it belongs in a different harness.

What’s worth borrowing

If you’re building a local-AI app and your tests need Ollama running:

Patch the ollama module at each service’s import site with patch.object(service_module, "ollama") — one seam per service, no shims required.
Give your DB layer a path override and run against a tmp_path SQLite file.
Use an autouse fixture to redirect every on-disk artefact (docs folder, index files) to tmp_path, so no test can touch real data even by accident.
For each external service (model, audio, workflow engine), draw the seam at the process boundary. Test everything above it with real code.

The result is a suite where every test runs in any environment, finishes in milliseconds, and exercises the actual integration of every layer of code you wrote. 351 tests in about three seconds isn’t an optimisation, it’s a side-effect of mocking only at the edges.

Appendix: Abbreviations in this post

Abbreviation	Full form	Meaning
CI	Continuous Integration	Automatically running the test suite on every push/PR
PR	Pull Request	A proposed code change — merged only when the suite is green
API	Application Programming Interface	The HTTP surface the TestClient exercises in-process
HTTP	HyperText Transfer Protocol	The protocol the (in-process) endpoint tests speak
RAG	Retrieval-Augmented Generation	The retrieval-then-answer pipeline under test
KB	Knowledge Base	The indexed document collection
FAISS	Facebook AI Similarity Search	Real in tests — it’s an in-process library
BM25	Best Match 25	The keyword index — also real in tests
RRF	Reciprocal Rank Fusion	The rank-merging formula covered by `test_vector_db.py`
SQLite / SQL	(SQL = Structured Query Language)	The real, file-based database every progress test runs against
DBOS	Database-Oriented Operating System	The durable-workflow library — patched so no Postgres is needed
OCR	Optical Character Recognition	The scanned-PDF fallback with its own trigger-threshold tests
SSRF	Server-Side Request Forgery	The URL-import attack class covered in `test_docx_url.py`
NDJSON	Newline-Delimited JSON	The streaming format the endpoint tests parse line by line
SHA-256	Secure Hash Algorithm, 256-bit	The content fingerprint behind the re-ingest tests
CRUD	Create, Read, Update, Delete	The basic storage operations for decks, quizzes, and maps
PDF / DOCX / PPTX / XLSX / HTML	Portable Document Format / Word / PowerPoint / Excel / HyperText Markup Language	The extractor formats with dedicated tests

That’s the series. Eight posts on the parts of Gemma CogniVault I’m most proud of — and a handful I’d build differently. If any of it was useful to you, the code is open source at github.com/ndimoforaretas/local-gemma-rag, and the demo walkthrough is on YouTube.

Your data. Your hardware. Your AI. Your vault.

No results found