Lesson 040: Controlled Vocabulary as Schema Contract
Problem
The vision tagging pipeline needs a consistent set of image attributes shared across five components: the vision model prompt, the attribute parser/validator, the database schema, the voting block config, and the cluster labeling engine. If any component uses an attribute code the others don't recognize — or spells it differently — the pipeline silently drops data or produces incorrect joins.
Why It Matters
In a multi-stage pipeline where upstream output becomes downstream input, the attribute names are a de facto schema. Without a single source of truth, each stage invents its own list: the model prompt might say "earth_visible," the parser expects "earth," and the voting config references "Earth." These mismatches don't cause errors — they cause silent data loss, because a confidence score for "earth_visible" simply won't match any voting rule that checks for "earth."
What Happened
- Defined a YAML config (`config/image_attributes.yaml`) with two sections: `base_attributes` (22 codes assigned by the vision model) and `derived_attributes` (4 codes computed from base attribute combinations).
- Each base attribute has `code`, `label`, `description`, and `type`. The `code` is the canonical identifier; `label` is for display; `description` becomes part of the VLM prompt; `type` groups attributes semantically (celestial_body, environment, hardware, crew, etc.).
- The `AttributeVocabulary` dataclass loads this YAML and enforces invariants at load time: no duplicate codes, derived rules reference only existing base codes, thresholds are consistent.
- The VLM prompt is generated from `vocab.base_attributes`: each attribute's code and description are injected into the prompt template. This means the prompt and the parser always agree on which codes exist.
- Voting block configs (`config/voting_blocks/*.yaml`) reference attribute codes in `all_of`, `any_of`, and `none_of` rules. The `validate_voting_config` function checks every referenced code against `vocab.all_codes` and rejects unknown references.
- Derived attributes use boolean rules: `earth_and_moon` requires `all_of: [earth, moon]`. The `compute_derived_labels` method evaluates these rules against base confidence scores, applying the accepted/tentative thresholds consistently.
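Concretely, the config described above might look like the following excerpt. The `earth`, `moon`, and `earth_and_moon` entries come from the lesson; the other field values shown are illustrative.

```yaml
# config/image_attributes.yaml -- illustrative excerpt, not the full 22-attribute list
thresholds:
  accepted: 0.80
  tentative: 0.50

base_attributes:
  - code: earth
    label: Earth
    description: The Earth is visible in the frame, in whole or in part.
    type: celestial_body
  - code: moon
    label: Moon
    description: The Moon is visible in the frame.
    type: celestial_body

derived_attributes:
  - code: earth_and_moon
    label: Earth and Moon
    rule:
      all_of: [earth, moon]
```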
Design Choice: YAML Config Over Code Constants or DB Schema
Why YAML
- Readable by non-developers. A domain expert reviewing which attributes the system tracks can read the YAML without understanding Python.
- Single file, single diff. Adding an attribute means editing one YAML file. The change propagates to the prompt, parser, validator, and downstream consumers automatically through the `AttributeVocabulary` class.
- Validation at load time. The `load_attribute_vocabulary` function catches duplicate codes, invalid derived rules, and missing references before any pipeline stage runs. Compare to code constants scattered across modules, where a typo in one file might not surface until runtime.
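A minimal sketch of the load-time invariant checks. The class name and the three invariants come from the lesson; the exact field shapes and error messages are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AttributeVocabulary:
    base_attributes: list      # each item: {"code", "label", "description", "type"}
    derived_attributes: list   # each item: {"code", "rule": {all_of/any_of/none_of}}
    thresholds: dict           # e.g. {"accepted": 0.80, "tentative": 0.50}

    def __post_init__(self):
        codes = [a["code"] for a in self.base_attributes]
        # Invariant 1: no duplicate base codes.
        dupes = {c for c in codes if codes.count(c) > 1}
        if dupes:
            raise ValueError(f"duplicate attribute codes: {sorted(dupes)}")
        base = set(codes)
        # Invariant 2: derived rules reference only existing base codes.
        for d in self.derived_attributes:
            for key in ("all_of", "any_of", "none_of"):
                unknown = set(d["rule"].get(key, [])) - base
                if unknown:
                    raise ValueError(f"{d['code']}: unknown codes {sorted(unknown)}")
        # Invariant 3: thresholds are consistently ordered.
        if not 0 < self.thresholds["tentative"] < self.thresholds["accepted"] <= 1:
            raise ValueError("require 0 < tentative < accepted <= 1")

    @property
    def all_codes(self):
        return {a["code"] for a in self.base_attributes} | {
            d["code"] for d in self.derived_attributes
        }
```

Because the checks run in `__post_init__`, no pipeline stage can ever hold a vocabulary object that violates them.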
Why derived attributes are rules, not model outputs
Derived attributes like `earth_and_moon` (requires both `earth` and `moon` to be accepted) could be asked of the VLM directly. But:
- The VLM might disagree with its own base labels (says "earth_and_moon" is absent but "earth" and "moon" are both present).
- Computing derived labels from base confidences is deterministic and auditable.
- Adding new derived attributes doesn't require re-running the model on all images.
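The deterministic evaluation described above can be sketched as a pure function. The `all_of`/`any_of`/`none_of` rule keys come from the lesson; treating "present" as "score meets the accepted threshold" is an assumption consistent with the threshold section below.

```python
def compute_derived_labels(rules, base_scores, accepted=0.80):
    """Evaluate boolean derived-attribute rules against base confidence scores.

    rules:       {"earth_and_moon": {"all_of": ["earth", "moon"]}}
    base_scores: {"earth": 0.92, "moon": 0.85, ...}
    A base code counts as present when its score meets the accepted threshold.
    """
    present = {code for code, score in base_scores.items() if score >= accepted}
    derived = {}
    for code, rule in rules.items():
        ok = True
        if "all_of" in rule:
            ok = ok and set(rule["all_of"]) <= present
        if "any_of" in rule:
            ok = ok and bool(set(rule["any_of"]) & present)
        if "none_of" in rule:
            ok = ok and not (set(rule["none_of"]) & present)
        derived[code] = ok
    return derived
```

Because the function depends only on stored scores and rules, derived labels can be recomputed or added retroactively without touching the model.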
Threshold design
Two thresholds (accepted: 0.80, tentative: 0.50) create a three-tier classification. The accepted threshold is deliberately high — VLMs are overconfident, so a 0.80 cutoff filters out weak associations while keeping genuine detections. The tentative band captures borderline cases that might be useful for analysis but shouldn't drive voting block membership.
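The three-tier scheme reduces to a small pure function. The threshold values are from the lesson; the tier name "rejected" for the bottom band is an assumption.

```python
def classify(score, accepted=0.80, tentative=0.50):
    """Map a raw VLM confidence score to one of three tiers."""
    if score >= accepted:
        return "accepted"    # strong enough to drive voting block membership
    if score >= tentative:
        return "tentative"   # borderline: keep for analysis, exclude from voting
    return "rejected"        # treat as absent
```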
Key Insights
- A controlled vocabulary is a schema contract between pipeline stages. It prevents the most common integration bug in multi-stage pipelines: mismatched identifiers between producer and consumer.
- Generate prompts from the same config that validates outputs. If the VLM prompt lists attributes from the vocabulary, and the parser validates against the same vocabulary, the prompt and parser can never disagree on which codes exist.
- Validate references across config files at load time, not at use time. The voting block validator checks attribute codes against the vocabulary before any votes are generated. A typo in a block config is caught immediately, not after an hour of synthetic vote generation.
- Derived attributes should be deterministic functions of base attributes. Computing `earth_and_moon` from base scores means the derivation is auditable, reproducible, and doesn't require re-running the model. It also means derived attributes can be added retroactively to already-tagged images.
- Type fields enable semantic grouping without coupling. The `type` field on base attributes (celestial_body, crew, hardware, etc.) enables grouping in UIs and reports without any component depending on specific type values. New types can be added without code changes.
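The cross-file reference check from the third insight can be sketched as follows. The function name `validate_voting_config` and the rule keys come from the lesson; the signature and error format are assumptions.

```python
def validate_voting_config(block, all_codes):
    """Reject any voting block rule that references an unknown attribute code.

    Intended to run at config-load time, before any votes are generated.
    block:     one parsed voting block config, e.g. {"name": ..., "all_of": [...]}
    all_codes: the vocabulary's set of base + derived codes (vocab.all_codes)
    """
    for key in ("all_of", "any_of", "none_of"):
        unknown = set(block.get(key, [])) - set(all_codes)
        if unknown:
            raise ValueError(
                f"voting block {block.get('name', '?')!r} references "
                f"unknown attribute codes: {sorted(unknown)}"
            )
```

A typo like `eart` fails the load step in milliseconds instead of surfacing as silently empty voting results an hour later.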
Applicability
This pattern applies to any pipeline where labels or categories flow through multiple stages:
- NLP entity types shared between extraction, validation, and aggregation
- Taxonomy-based classification in content management systems
- Feature names in ML feature stores shared between extraction and model training
Does NOT apply when:
- The vocabulary is truly open-ended (free-text tags, user-generated labels)
- There's a single stage that both produces and consumes the labels
- The vocabulary changes faster than the pipeline can redeploy
Related Lessons
- Lesson 039: Mock Tagger for Vision Pipeline Testing. The mock tagger uses `vocab.base_attributes` to generate deterministic scores, so the vocabulary contract is load-bearing for testing too.
- Lesson 027: Migration Ordering and Apply-on-Use. The DB schema must match the vocabulary; migration 009 creates columns that align with the vocabulary's structure.