Lesson 052: Incremental Feature Extraction Over Full Re-runs
The Lesson
When adding new features to an existing collection, delete-and-rewrite only the new columns rather than re-processing everything. The key enabler is tagging each row with its source (model version, label source, attribute code) so that surgical deletes and inserts are possible without touching existing data.
Context
A CLIP zero-shot tagger had classified 12,217 images against 29 attributes, producing 354,293 rows in feature_image_attribute. The user wanted to add 8 new attributes (hand, porthole, full_earth, etc.) without re-running the full 5-minute tagging pass and without losing the existing 29 attributes' scores.
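The table layout this pattern relies on can be sketched as a denormalized one-row-per-(image, attribute) schema with provenance columns. The column names other than `attribute_code` and `label_source` are illustrative assumptions, not the project's actual schema:

```python
# Sketch of the assumed feature table: one row per (image, attribute),
# with provenance columns that make surgical deletes possible later.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE feature_image_attribute (
        image_id       TEXT NOT NULL,
        attribute_code TEXT NOT NULL,   -- which feature this row carries
        confidence     REAL NOT NULL,
        label_source   TEXT NOT NULL,   -- 'clip_zero_shot' or 'derived_rule'
        PRIMARY KEY (image_id, attribute_code, label_source)
    )
    """
)
```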
What Happened
The original `tag_all_images()` function did a blanket delete: `DELETE FROM feature_image_attribute WHERE label_source = 'clip_zero_shot'`, followed by a full re-tag of all images against all attributes. Simple but wasteful — the 29 existing attributes' scores hadn't changed.

Built `tag_new_attributes(conn, vocab, attribute_codes)`, which:

- Accepts a list of specific attribute codes to tag
- Deletes only rows matching those codes: `DELETE WHERE attribute_code = ? AND label_source = 'clip_zero_shot'`
- Builds CLIP prompts only for the specified codes
- Iterates all images but writes only the subset of attribute scores
- Identifies derived attributes whose rules reference the new codes and recomputes only those
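The delete-then-insert core of that function can be sketched as follows. This is a simplified sketch against an assumed sqlite3 schema: the real function takes `(conn, vocab, attribute_codes)` and runs CLIP inference internally, whereas here a scoring callback and an explicit image list stand in for those pieces:

```python
# Incremental tagger sketch (hypothetical schema; `score_image` is a stand-in
# for the real CLIP zero-shot scoring step).
import sqlite3

def tag_new_attributes(conn, score_image, image_ids, attribute_codes):
    """Delete-and-rewrite only the rows for `attribute_codes`."""
    # Surgical delete: touch only the codes being (re)tagged.
    conn.executemany(
        "DELETE FROM feature_image_attribute "
        "WHERE attribute_code = ? AND label_source = 'clip_zero_shot'",
        [(code,) for code in attribute_codes],
    )
    # Score every image, but only against the requested codes.
    for image_id in image_ids:
        scores = score_image(image_id, attribute_codes)  # {code: confidence}
        conn.executemany(
            "INSERT INTO feature_image_attribute "
            "(image_id, attribute_code, confidence, label_source) "
            "VALUES (?, ?, ?, 'clip_zero_shot')",
            [(image_id, code, conf) for code, conf in scores.items()],
        )
    conn.commit()
```

Because the delete is keyed on both `attribute_code` and `label_source`, rows for the 29 existing attributes are never touched.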
Added auto-detection mode: when `attribute_codes=None`, queries the database for which vocabulary codes have zero rows in `feature_image_attribute` and tags only those. This makes the workflow: add entries to YAML → run tagger → new columns appear.

The incremental run for 8 new attributes took 5 minutes — the same wall-clock time as a full run, because the bottleneck is CLIP inference over 12,217 images, not the number of attributes (all prompts are processed in one forward pass). But the incremental approach preserved the existing 354,293 rows untouched — no risk of data loss, no unnecessary database churn.
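The auto-detection query amounts to a set difference between the vocabulary and the codes already present. A minimal sketch, with a hypothetical helper name and `vocab_codes` standing in for the codes parsed from the YAML vocabulary:

```python
# Auto-detection sketch: which vocabulary codes have zero feature rows?
import sqlite3

def detect_untagged_codes(conn, vocab_codes):
    tagged = {
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT attribute_code FROM feature_image_attribute"
        )
    }
    # Preserve vocabulary order so the tagging pass is deterministic.
    return [code for code in vocab_codes if code not in tagged]
```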
For derived attributes, the function loads existing base confidence scores from the database and merges them with the newly computed scores before evaluating derivation rules. This correctly handles cases like `earth_through_porthole = all_of: [earth, porthole]`, where `earth` scores already exist and only `porthole` is new.
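That merge step can be sketched for an `all_of` rule as follows. The function name, schema, and the 0.5 decision threshold are assumptions for illustration, not the project's actual API:

```python
# Merge sketch for derived attributes: combine stored base scores with
# freshly computed ones before evaluating an all_of derivation rule.
import sqlite3

def evaluate_all_of(conn, image_id, rule_codes, new_scores, threshold=0.5):
    merged = dict(new_scores)  # freshly computed codes take precedence
    placeholders = ",".join("?" for _ in rule_codes)
    for code, conf in conn.execute(
        f"SELECT attribute_code, confidence FROM feature_image_attribute "
        f"WHERE image_id = ? AND attribute_code IN ({placeholders})",
        [image_id, *rule_codes],
    ):
        merged.setdefault(code, conf)  # fall back to the stored base score
    return all(merged.get(code, 0.0) >= threshold for code in rule_codes)
```

Evaluating the rule against only `new_scores` would silently drop the existing `earth` score and mislabel every derived attribute that mixes old and new bases.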
Key Insights
- Tag every row with its provenance. The `label_source` ('clip_zero_shot', 'derived_rule') and `attribute_code` columns enable surgical deletes. Without them, the only option is "delete everything and re-insert" — the batch-processing anti-pattern.
- Auto-detection eliminates manual bookkeeping. Querying "which codes are in the vocabulary but have zero rows?" means the user doesn't need to remember which attributes were already tagged. The system figures out what's missing.
- Incremental ≠ faster per-image. CLIP processes all prompts in one forward pass, so 8 prompts and 37 prompts have nearly the same per-image cost. The value of incremental is preserving existing data, not speed. If the per-attribute cost were linear (e.g., a model that processes one attribute at a time), incremental would also be faster.
- Derived attributes need merge logic. A derived attribute like `earth_through_porthole` depends on both `earth` (existing) and `porthole` (new). The incremental tagger must load existing base scores and merge with new scores before evaluating derivation rules. Skipping this step would produce incorrect derived labels.
- The vocabulary file is the single source of truth for "what should exist." Adding an attribute to `image_attributes.yaml` and running the tagger with `--incremental` guarantees the attribute gets tagged. Removing an attribute from the YAML doesn't auto-delete existing rows — that's intentional, preserving historical data.
Applicability
Incremental feature extraction applies when:
- Features are stored in a denormalized table with a type/source column
- New feature types can be added without invalidating existing ones
- The extraction cost is per-image (not per-feature), making full re-runs wasteful of I/O even if not of compute
- The feature schema is defined in configuration, not code
Does NOT apply when:
- Features are interdependent (changing one feature's model invalidates all others)
- The extraction is cheap enough that full re-runs are simpler to reason about
- The feature store uses a columnar schema where each feature is a separate column (ALTER TABLE ADD COLUMN is the "incremental" path)
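The columnar contrast in the last point can be made concrete with a small sketch (illustrative table and column names, not the project's schema): in a wide-table layout, "incremental" is just `ALTER TABLE ADD COLUMN`, and the delete-and-rewrite pattern above is unnecessary.

```python
# Columnar alternative: each feature is a column, so adding a feature
# never touches existing data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image_features (image_id TEXT PRIMARY KEY, earth REAL)")
# Adding a new feature = adding a column; existing columns are untouched.
conn.execute("ALTER TABLE image_features ADD COLUMN porthole REAL")
cols = [row[1] for row in conn.execute("PRAGMA table_info(image_features)")]
```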
Related Lessons
- Lesson 047: CLIP Zero-Shot as a Database Column Factory — the system that produces the features incrementally extracted here
- Lesson 040: Controlled Vocabulary as Schema Contract — the vocabulary drives both what to extract and what's already present
- Lesson 051: Sigmoid Calibration for Domain-Specific CLIP — the confidence calibration applied to both full and incremental tagging runs