Lesson 052: Incremental Feature Extraction Over Full Re-runs

The Lesson

When adding new features to an existing collection, delete and rewrite only the rows for the new features rather than re-processing everything. The key enabler is tagging each row with its source (model version, label source, attribute code) so that surgical deletes and inserts are possible without touching existing data.
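The pattern can be sketched with SQLite. This is a hedged sketch: the `feature_image_attribute` table and the `attribute_code`/`label_source` columns come from the lesson; the image IDs, scores, and the `'manual'` source are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature_image_attribute (
        image_id INTEGER,
        attribute_code TEXT,   -- which attribute this score belongs to
        label_source TEXT,     -- which pipeline/model produced it
        confidence REAL
    )
""")
conn.executemany(
    "INSERT INTO feature_image_attribute VALUES (?, ?, ?, ?)",
    [(1, "earth", "clip_zero_shot", 0.9),
     (1, "hand", "clip_zero_shot", 0.2),
     (1, "earth", "manual", 1.0)],
)

# Surgical delete: only rows for the codes being re-tagged, and only
# from this tagger's label source. Other sources and codes survive.
new_codes = ["hand"]
conn.executemany(
    "DELETE FROM feature_image_attribute "
    "WHERE attribute_code = ? AND label_source = 'clip_zero_shot'",
    [(c,) for c in new_codes],
)
remaining = conn.execute(
    "SELECT COUNT(*) FROM feature_image_attribute"
).fetchone()[0]
print(remaining)  # 2 — the manual row and the other attribute are untouched
```

Because every row carries its provenance, the delete can be scoped precisely instead of wiping the whole label source.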

Context

A CLIP zero-shot tagger had classified 12,217 images against 29 attributes, producing 354,293 rows in feature_image_attribute. The user wanted to add 8 new attributes (hand, porthole, full_earth, etc.) without re-running the full 5-minute tagging pass and without losing the existing 29 attributes' scores.

What Happened

  1. The original tag_all_images() function did a blanket delete: DELETE FROM feature_image_attribute WHERE label_source = 'clip_zero_shot' followed by a full re-tag of all images against all attributes. Simple but wasteful — the 29 existing attributes' scores hadn't changed.

  2. Built tag_new_attributes(conn, vocab, attribute_codes) that:

    • Accepts a list of specific attribute codes to tag
    • Deletes only rows matching those codes: DELETE FROM feature_image_attribute WHERE attribute_code = ? AND label_source = 'clip_zero_shot'
    • Builds CLIP prompts only for the specified codes
    • Iterates all images but writes only the subset of attribute scores
    • Identifies derived attributes whose rules reference the new codes and recomputes only those
  3. Added auto-detection mode: when attribute_codes=None, queries the database for which vocabulary codes have zero rows in feature_image_attribute and tags only those. This makes the workflow: add entries to the YAML vocabulary → run the tagger → the new attributes appear.

  4. The incremental run for 8 new attributes took 5 minutes — the same wall-clock time as a full run, because the bottleneck is CLIP inference over 12,217 images, not the number of attributes (all prompts are processed in one forward pass). But the incremental approach preserved the existing 354,293 rows untouched — no risk of data loss, no unnecessary database churn.

  5. For derived attributes, the function loads existing base confidence scores from the database and merges them with the newly computed scores before evaluating derivation rules. This correctly handles cases like earth_through_porthole = all_of: [earth, porthole] where earth scores already exist and only porthole is new.
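The steps above can be sketched as a minimal tag_new_attributes with the auto-detection mode. Assumptions: the table schema follows the lesson; score_images is a hypothetical stand-in for the CLIP forward pass; vocab is simplified to a list of codes.

```python
import sqlite3

def score_images(codes):
    # Stand-in for CLIP inference over all images: prompts are built
    # only for the requested codes, and only those scores are yielded.
    for image_id in (1, 2):
        for code in codes:
            yield image_id, code, 0.5

def tag_new_attributes(conn, vocab, attribute_codes=None):
    if attribute_codes is None:
        # Auto-detect: vocabulary codes with zero stored rows are "new".
        tagged = {row[0] for row in conn.execute(
            "SELECT DISTINCT attribute_code FROM feature_image_attribute"
            " WHERE label_source = 'clip_zero_shot'")}
        attribute_codes = [c for c in vocab if c not in tagged]

    # Surgical delete: only rows for the codes being (re)tagged.
    conn.executemany(
        "DELETE FROM feature_image_attribute"
        " WHERE attribute_code = ? AND label_source = 'clip_zero_shot'",
        [(c,) for c in attribute_codes])

    # Iterate all images but write only the subset of attribute scores.
    conn.executemany(
        "INSERT INTO feature_image_attribute VALUES (?, ?, 'clip_zero_shot', ?)",
        [(i, c, s) for i, c, s in score_images(attribute_codes)])
    return attribute_codes

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature_image_attribute"
             " (image_id INTEGER, attribute_code TEXT,"
             "  label_source TEXT, confidence REAL)")
conn.execute("INSERT INTO feature_image_attribute"
             " VALUES (1, 'earth', 'clip_zero_shot', 0.9)")

vocab = ["earth", "hand", "porthole"]
new_codes = tag_new_attributes(conn, vocab)
print(new_codes)  # ['hand', 'porthole'] — 'earth' already has rows
```

With attribute_codes=None the function is idempotent over already-tagged codes, which is what makes the "edit YAML, run tagger" workflow safe to repeat.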
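The score-merging step for derived attributes can be illustrated as follows. Hedged: merge_scores and eval_all_of are hypothetical names, and scoring an all_of rule as the minimum of its base confidences is an assumption the lesson does not specify.

```python
def merge_scores(existing, new):
    # Newly computed scores extend (and on collision override) the
    # base scores already stored in the database.
    merged = dict(existing)
    merged.update(new)
    return merged

def eval_all_of(scores, base_codes):
    # Assumed semantics: an all_of derivation is only as confident
    # as its weakest base attribute.
    return min(scores[c] for c in base_codes)

existing = {"earth": 0.92}   # loaded from the database (unchanged)
new = {"porthole": 0.81}     # just computed for the new attribute
scores = merge_scores(existing, new)
derived = eval_all_of(scores, ["earth", "porthole"])
print(round(derived, 2))  # 0.81
```

The merge is what lets earth_through_porthole be recomputed without re-scoring earth: the stored base score participates in the rule exactly as if it had been computed in this run.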

Key Insights

  • Row-level provenance (model version, label source, attribute code) is what makes surgical deletes and inserts possible; without it, the only safe operation is delete-everything-and-rewrite.
  • Incremental is not always faster in wall-clock time — here the CLIP forward pass over 12,217 images dominates either way — but it removes data-loss risk and database churn.
  • Auto-detecting untagged codes turns feature addition into a declarative workflow: edit the vocabulary, run the tagger, and only the missing attributes are computed.

Applicability

Incremental feature extraction applies when:

  • New features are scored independently of existing ones, so stored values remain valid as attributes are added
  • Every row carries enough provenance (attribute code, label source) to target deletes and inserts precisely
  • Preserving existing data and avoiding unnecessary database churn matters, even if wall-clock time does not improve

Does NOT apply when:

  • The model, prompts, or preprocessing behind the existing features have changed — their stored scores are stale and a full re-run is required
  • Scores are normalized across the full attribute set, so adding an attribute would change existing values

Related Lessons