Lesson 039: Mock Tagger Pattern for Vision Pipeline Testing

Problem

The vision tagging pipeline uses Qwen2.5-VL (a 7B-parameter vision-language model) to classify image attributes. Running the real model requires a GPU, takes seconds per image, and produces non-deterministic outputs. The full pipeline — config loading, tagging, derived label computation, DB persistence, cluster labeling, voting block generation, bias analysis — needs to be testable in CI without any model infrastructure.

Why It Matters

Vision-language models are expensive and slow. If every test that touches image attributes has to load a 7B model, the test suite becomes unusable: each run burns minutes of GPU time, non-deterministic generation makes results flaky, and CI runners without GPUs can't execute the tests at all. But if you test everything except the model, you miss integration bugs between the model's output format and the downstream pipeline. The mock must therefore produce structurally valid outputs that exercise the same code paths as real model outputs.

What Happened

  1. Built VisionTagger as the real tagger: loads Qwen2.5-VL lazily on first use, sends a structured prompt listing all attribute codes with descriptions, parses JSON output, validates against the AttributeVocabulary, and returns {code: confidence} dicts.
  2. Built MockTagger as a parallel implementation with the same interface (tag_image, tag_batch). Instead of running a model, it hashes the filename with SHA-256 and maps hash bytes to deterministic confidence scores for each attribute code (see the sketch after this list).
  3. The hash-based approach means: (a) the same filename always produces the same attributes, making tests deterministic; (b) different filenames produce different attribute distributions, so tests can reason about which images have which labels; (c) the confidence scores spread across the full 0.0–1.0 range, exercising accepted/tentative/rejected classification logic.
  4. Both taggers produce the same output type (dict[str, float]), so all downstream code — compute_derived_labels, compute_accepted_base_labels, the DB loader, cluster labeling, voting block matching — works identically with either tagger.
  5. The acceptance test fixture (full_pipeline_db) bypasses the tagger entirely and inserts attributes directly into feature_image_attribute, because the test needs controlled attribute assignments (exactly images 1–25 have Earth+Moon) rather than hash-derived ones. This is a third strategy: direct DB seeding for scenario-specific tests.
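
A minimal sketch of the hash-based mock, assuming a flat list of attribute codes (the real pipeline would read these from the AttributeVocabulary, whose API is not shown here) and a one-byte-per-code mapping; the method names tag_image and tag_batch are the shared interface described above, everything else is illustrative.

    import hashlib

    # Hypothetical attribute codes; the real pipeline gets these from AttributeVocabulary.
    ATTRIBUTE_CODES = ["earth", "moon", "stars", "spacecraft"]


    class MockTagger:
        """Drop-in stand-in for VisionTagger: same interface, no model.

        Scores are derived from a SHA-256 hash of the filename, so the
        same filename always yields the same scores, different filenames
        yield different distributions, and values spread across the full
        0.0-1.0 range.
        """

        def __init__(self, codes: list[str] = ATTRIBUTE_CODES) -> None:
            self.codes = codes

        def tag_image(self, filename: str) -> dict[str, float]:
            digest = hashlib.sha256(filename.encode("utf-8")).digest()
            # Map one hash byte per attribute code onto [0.0, 1.0].
            return {
                code: digest[i % len(digest)] / 255.0
                for i, code in enumerate(self.codes)
            }

        def tag_batch(self, filenames: list[str]) -> list[dict[str, float]]:
            return [self.tag_image(name) for name in filenames]

Because the scores are a pure function of the filename, a test can call tag_image twice on the same name and assert identical results, and the spread of values exercises whatever accepted/tentative/rejected thresholds the downstream classification applies.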

Design Choice: Hash-Based Determinism Over Random or Fixture

Why hash the filename

Three alternatives were considered:

  1. Random confidence scores: simple, but non-deterministic; the same test could see different attributes on each run, reintroducing the flakiness the mock exists to remove.
  2. Fixture files mapping filenames to attributes: deterministic, but a maintenance burden; every new test image or vocabulary change means editing fixtures by hand.
  3. Hashing the filename (chosen): deterministic with zero maintenance; any filename yields stable, well-spread confidence scores with no lookup table to keep in sync.

Why three testing strategies coexist

  1. MockTagger — for unit tests of the tagging pipeline itself (prompt building, output parsing, vocabulary validation).
  2. Direct DB seeding — for acceptance tests that need specific attribute distributions (25 Earth+Moon images, 25 Earth-only, etc.) to verify bias detection at known thresholds (see the fixture sketch below).
  3. Real VisionTagger — for manual validation only, never in CI.

Each strategy serves a different testing need. Collapsing to a single strategy would either make tests slow (real model) or leave controlled scenarios untestable (mock only).
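
A minimal sketch of the third strategy, a full_pipeline_db-style fixture; the sqlite backing store, the column names (image_id, attribute_code, confidence), and the seeded confidence values are assumptions for illustration. The controlled distribution is the point: images 1-25 get Earth+Moon, images 26-50 get Earth only.

    import sqlite3

    import pytest


    @pytest.fixture
    def full_pipeline_db(tmp_path):
        """Seed feature_image_attribute directly, bypassing any tagger."""
        conn = sqlite3.connect(str(tmp_path / "test.db"))
        conn.execute(
            "CREATE TABLE feature_image_attribute ("
            "image_id INTEGER, attribute_code TEXT, confidence REAL)"
        )
        rows = []
        for image_id in range(1, 51):
            rows.append((image_id, "earth", 0.95))    # all 50 images show Earth
            if image_id <= 25:
                rows.append((image_id, "moon", 0.90))  # exactly images 1-25 also show Moon
        conn.executemany(
            "INSERT INTO feature_image_attribute VALUES (?, ?, ?)", rows
        )
        conn.commit()
        yield conn
        conn.close()

A test built on this fixture can assert bias-detection behavior at exactly the 25/25 split, something neither hash-derived mock attributes nor the real tagger can guarantee.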

Key Insights

  1. Mock at the interface boundary: because MockTagger and VisionTagger share the same methods and output type (dict[str, float]), every downstream consumer runs identically under test and in production.
  2. Hash-based determinism gives repeatable yet varied test data with zero fixture maintenance.
  3. A mock alone is not enough: scenario-specific acceptance tests still need direct DB seeding, because hash-derived attributes can't be pinned to exact distributions.

Applicability

This pattern applies whenever a pipeline includes an expensive, non-deterministic external component (ML model, API call, hardware sensor):

  1. The component is slow, costly, or unavailable in CI (a GPU-bound model, a metered API, physical hardware).
  2. Its outputs are non-deterministic, so assertions against them are inherently flaky.
  3. Downstream code depends only on the output's structure, so a structurally valid stand-in exercises the same code paths.

Does NOT apply when:

  1. The test's purpose is to evaluate the component itself (model accuracy, prompt quality); only the real component can answer that, which is why the real VisionTagger is kept for manual validation.
  2. The component is cheap and deterministic enough to run directly; mocking it would only add indirection.

Related Lessons