Lesson 040: Controlled Vocabulary as Schema Contract

Problem

The vision tagging pipeline needs a consistent set of image attributes shared across five components: the vision model prompt, the attribute parser/validator, the database schema, the voting block config, and the cluster labeling engine. If any component uses an attribute code the others don't recognize — or spells it differently — the pipeline silently drops data or produces incorrect joins.

Why It Matters

In a multi-stage pipeline where upstream output becomes downstream input, the attribute names are a de facto schema. Without a single source of truth, each stage invents its own list: the model prompt might say "earth_visible," the parser might expect "earth," and the voting config might reference "Earth." These mismatches don't raise errors. They cause silent data loss, because a confidence score keyed by "earth_visible" never matches a voting rule that checks for "earth," and the unmatched score is simply ignored.
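The failure mode can be reproduced in a few lines. This is a minimal sketch, not the pipeline's actual code: rule_matches and both dictionaries are hypothetical stand-ins for the prompt stage output and a voting rule, and the 0.80 cutoff is the accepted threshold described later in this lesson.

```python
# Hypothetical stage outputs; the codes mirror the mismatch described above.
model_scores = {"earth_visible": 0.93}   # code emitted by the prompt stage
voting_rule = {"all_of": ["earth"]}      # code expected by the voting stage


def rule_matches(rule, scores, threshold=0.80):
    """Return True when every code in all_of meets the threshold."""
    # Defaulting a missing code to 0.0 is what makes the mismatch silent:
    # an unknown code looks exactly like a low-confidence detection.
    return all(scores.get(code, 0.0) >= threshold for code in rule["all_of"])


matched = rule_matches(voting_rule, model_scores)

# Codes the model emitted that no rule will ever read.
unread_codes = set(model_scores) - set(voting_rule["all_of"])
```

Here matched is False even though the model was 93% confident, and nothing is raised or logged. Only an explicit comparison of the two code sets, as in unread_codes, surfaces the problem.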

What Happened

  1. Defined a YAML config (config/image_attributes.yaml) with two sections: base_attributes (22 codes assigned by the vision model) and derived_attributes (4 codes computed from base attribute combinations).
  2. Each base attribute has code, label, description, and type. The code is the canonical identifier; label is for display; description becomes part of the VLM prompt; type groups attributes semantically (celestial_body, environment, hardware, crew, etc.).
  3. The AttributeVocabulary dataclass loads this YAML and enforces invariants at load time: no duplicate codes, derived rules reference only existing base codes, thresholds are consistent.
  4. The VLM prompt is generated from vocab.base_attributes — each attribute's code and description are injected into the prompt template. This means the prompt and the parser always agree on which codes exist.
  5. Voting block configs (config/voting_blocks/*.yaml) reference attribute codes in all_of, any_of, none_of rules. The validate_voting_config function checks every referenced code against vocab.all_codes and rejects unknown references.
  6. Derived attributes use boolean rules: earth_and_moon requires all_of: [earth, moon]. The compute_derived_labels method evaluates these rules against base confidence scores, applying the accepted/tentative thresholds consistently.
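The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the names AttributeVocabulary, all_codes, base_attributes, and validate_voting_config come from this lesson, but BaseAttribute, derived_rules, build_prompt, the prompt wording, and every signature are hypothetical. The vocabulary is built directly in Python so the example needs no YAML parser, and "thresholds are consistent" is assumed to mean that tentative sits below accepted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseAttribute:
    code: str          # canonical identifier
    label: str         # display name
    description: str   # injected into the VLM prompt
    type: str          # semantic group, e.g. celestial_body


@dataclass
class AttributeVocabulary:
    base_attributes: list    # list of BaseAttribute
    derived_rules: dict      # derived code -> base codes required (all_of)
    accepted: float = 0.80
    tentative: float = 0.50

    def __post_init__(self):
        # Invariants are checked once, when the vocabulary is constructed.
        base_codes = [a.code for a in self.base_attributes]
        codes = base_codes + list(self.derived_rules)
        duplicates = sorted({c for c in codes if codes.count(c) > 1})
        if duplicates:
            raise ValueError(f"duplicate attribute codes: {duplicates}")
        for derived, required in self.derived_rules.items():
            missing = sorted(set(required) - set(base_codes))
            if missing:
                raise ValueError(f"{derived} references unknown base codes: {missing}")
        # Assumed meaning of "consistent": tentative strictly below accepted.
        if not 0.0 <= self.tentative < self.accepted <= 1.0:
            raise ValueError("thresholds must satisfy 0 <= tentative < accepted <= 1")

    @property
    def all_codes(self):
        return {a.code for a in self.base_attributes} | set(self.derived_rules)


def build_prompt(vocab):
    """Generate the attribute list for the prompt from the vocabulary."""
    lines = [f"- {a.code}: {a.description}" for a in vocab.base_attributes]
    return "Score each attribute between 0 and 1:\n" + "\n".join(lines)


def validate_voting_config(config, vocab):
    """Raise ValueError if a voting rule references a code not in the vocabulary."""
    referenced = set()
    for key in ("all_of", "any_of", "none_of"):
        referenced.update(config.get(key, []))
    unknown = sorted(referenced - vocab.all_codes)
    if unknown:
        raise ValueError(f"unknown attribute codes in voting config: {unknown}")


vocab = AttributeVocabulary(
    base_attributes=[
        BaseAttribute("earth", "Earth", "Earth is visible in the frame", "celestial_body"),
        BaseAttribute("moon", "Moon", "The Moon is visible in the frame", "celestial_body"),
    ],
    derived_rules={"earth_and_moon": ["earth", "moon"]},
)
prompt = build_prompt(vocab)
validate_voting_config({"all_of": ["earth"], "none_of": ["moon"]}, vocab)  # passes
```

Because the prompt text and the validator both read from the same vocab object, a misspelled code such as "Earth" in a voting block fails loudly at validation time instead of silently never matching.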

Design Choice: YAML Config Over Code Constants or DB Schema

Why YAML

Why derived attributes are rules, not model outputs

Derived attributes like earth_and_moon (requires both Earth and Moon to be accepted) could be asked of the VLM directly. But:

Threshold design

Two thresholds (accepted: 0.80, tentative: 0.50) create a three-tier classification: accepted, tentative, and everything below the tentative threshold. The accepted threshold is deliberately high: VLMs tend to report overconfident scores, so a 0.80 cutoff filters out weak associations while keeping genuine detections. The tentative band captures borderline cases that might be useful for analysis but shouldn't drive voting block membership.
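A minimal sketch of how the two thresholds could be applied to base scores and to an all_of derived rule. Several details here are assumptions rather than documented behavior: the name "rejected" for the lowest tier, the inclusive boundaries, treating a missing base score as 0.0, and grading a derived attribute by its weakest required base score. Only the threshold values and the compute_derived_labels name come from this lesson; its signature here is hypothetical.

```python
ACCEPTED = 0.80
TENTATIVE = 0.50


def classify(score):
    """Map a confidence score to one of three tiers (boundaries assumed inclusive)."""
    if score >= ACCEPTED:
        return "accepted"
    if score >= TENTATIVE:
        return "tentative"
    return "rejected"  # assumed name for the lowest tier


def compute_derived_labels(scores, derived_rules):
    """Grade each all_of rule by its weakest required base score.

    derived_rules maps a derived code to a non-empty list of base codes.
    A derived attribute is accepted only if every required base is accepted.
    """
    labels = {}
    for code, required in derived_rules.items():
        weakest = min(scores.get(base, 0.0) for base in required)
        labels[code] = classify(weakest)
    return labels


rules = {"earth_and_moon": ["earth", "moon"]}
strong = compute_derived_labels({"earth": 0.91, "moon": 0.85}, rules)
borderline = compute_derived_labels({"earth": 0.91, "moon": 0.62}, rules)
```

With these inputs, strong grades earth_and_moon as accepted, while borderline grades it as tentative because the moon score falls in the 0.50 to 0.80 band. Reusing classify for both base and derived attributes is one way to keep the thresholds applied consistently.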

Key Insights

Applicability

This pattern applies to any pipeline where labels or categories flow through multiple stages:

Does NOT apply when:

Related Lessons