Lesson 047: CLIP Zero-Shot as a Database Column Factory

The Lesson

A single CLIP model, used for zero-shot classification against descriptive text prompts, functions as a general-purpose column generator for structured databases. Each new prompt produces a new confidence column — no training, no fine-tuning, no labeled data. The cost of adding a column is one forward pass over the image collection, and the only tuning knobs are the prompt template and the logit-to-confidence mapping.

Context

A 12,217-image space photography collection needed structured content labels for calendar selection: which images show Earth, the Moon, spacecraft, crew, etc. The project already had CLIP embeddings (512-dim, openai/clip-vit-base-patch32) computed for clustering. The question was how to turn these embeddings into queryable, filterable database columns without training a custom classifier for each attribute.

What Happened

  1. Tested CLIP zero-shot classification against 5 labels on a single image. The model correctly identified "moon" with 0.959 confidence using softmax over all labels. But softmax across N labels produces a probability distribution that sums to 1 — adding more labels dilutes each score, making thresholds unstable.

  2. Switched to independent per-attribute scoring: compute CLIP logits (cosine similarity × temperature) for all prompts at once, then apply a sigmoid transform to convert each logit to an independent [0,1] score. This decouples attributes: adding "hand" doesn't change the score for "moon." (See the scoring sketch after this list.)

  3. Discovered that CLIP logits are domain-specific. For Artemis space photos, logits ranged 16–32 with mean ~24. A naive sigmoid centered at 0 would classify everything as "accepted." Analyzed the logit distribution across 50 images and 29 attributes to find the right sigmoid center (25.5) and scale (1.5).

  4. Wrapped each attribute's description in a prompt template ("a photo of {description}") to leverage CLIP's training distribution. The description comes from the YAML vocabulary file, so adding a new attribute means adding one YAML entry and re-running the tagger.

  5. Tagged all 12,217 images against 29 attributes in 5 minutes on CPU (40 img/s, batch size 64). Added 8 more attributes incrementally — same speed, no need to re-tag existing attributes.

  6. The result: scoring 37 base attributes across the 12,217 images yielded 451,829 confidence rows in feature_image_attribute (just shy of the full 37 × 12,217 = 452,029 grid), each queryable by code, threshold, and source. Derived attributes (earth_and_moon, spacewalk, etc.) are computed from base scores using boolean rules, sketched after the scoring example below.
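
A minimal sketch of the scoring path from steps 2-4, assuming the Hugging Face transformers API for openai/clip-vit-base-patch32; the attribute entries and image path are illustrative, and CENTER/SCALE are the calibrated values from step 3.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
CENTER, SCALE = 25.5, 1.5  # calibrated on this collection (step 3)

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

# Descriptions come from the YAML vocabulary; the template matches
# CLIP's training distribution (step 4).
attributes = {
    "earth": "Earth visible in the image",
    "moon": "the Moon visible in the image",
    "spacecraft": "a spacecraft or space station",  # illustrative entries
}
prompts = [f"a photo of {desc}" for desc in attributes.values()]

image = Image.open("example.jpg")  # hypothetical image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # cosine similarity x temperature

# Softmax would couple the scores, since they sum to 1 across the prompt
# set (step 1); a calibrated sigmoid scores each attribute independently
# (steps 2-3).
scores = torch.sigmoid((logits - CENTER) / SCALE)
for code, score in zip(attributes, scores.tolist()):
    print(f"{code}: {score:.3f}")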
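
Derived attributes (step 6) then reduce to boolean rules over the accepted base codes; the 0.5 threshold and the ingredient codes for spacewalk below are assumptions, not the project's recorded rule set.

THRESHOLD = 0.5  # assumed acceptance threshold

def derived_attributes(base_scores: dict[str, float]) -> dict[str, bool]:
    # A base attribute is "accepted" when its sigmoid score clears the threshold.
    accepted = {code for code, s in base_scores.items() if s >= THRESHOLD}
    return {
        "earth_and_moon": {"earth", "moon"} <= accepted,
        # Hypothetical rule: an astronaut visible outside the spacecraft.
        "spacewalk": {"astronaut", "spacecraft_exterior"} <= accepted,
    }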

Key Insights

  - Softmax scores are relative: they sum to 1 across the label set, so every new label shifts every existing score. Independent sigmoid scores stay stable as the vocabulary grows.
  - Raw CLIP logits are uncalibrated and domain-specific; the sigmoid center and scale must be fit to the collection's actual logit distribution, not assumed.
  - Once the prompt template and calibration are fixed, a new database column costs one YAML entry and one pass over the collection: no labels, no training.

Examples

Adding a new attribute

Before (YAML):

base_attributes:
  - code: earth
    label: Earth
    description: Earth visible in the image
    type: celestial_body

After (add one entry):

  - code: porthole
    label: Porthole
    description: Spacecraft window or porthole frame visible
    type: environment

Run incremental tagger → 688 images accepted for "porthole" → queryable immediately.
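
What "queryable immediately" looks like, as a minimal sqlite3 sketch; the lesson names the feature_image_attribute table, but the column names, database file, and threshold here are assumptions.

import sqlite3

con = sqlite3.connect("catalog.db")  # hypothetical database file

# Output of the CLIP tagger for the new code (illustrative values).
new_scores = [("img_000123", 0.91), ("img_004567", 0.62)]

# The incremental tagger only inserts rows for codes not yet present,
# so existing attributes are never re-scored.
existing = {row[0] for row in
            con.execute("SELECT DISTINCT code FROM feature_image_attribute")}
if "porthole" not in existing:
    con.executemany(
        "INSERT INTO feature_image_attribute (image_id, code, confidence, source) "
        "VALUES (?, 'porthole', ?, 'clip-zero-shot')",
        new_scores)
    con.commit()

# The new column filters like any other.
hits = con.execute(
    "SELECT image_id FROM feature_image_attribute "
    "WHERE code = 'porthole' AND confidence >= 0.5").fetchall()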

Logit calibration

Attribute          min    mean   max    std
earth              22.8   30.0   31.7   2.6    ← strong signal
mission_control    16.6   18.7   20.6   0.9    ← weak signal
photograph         22.6   24.8   26.0   0.9    ← ambiguous

Sigmoid at center=25.5, scale=1.5 maps these means to roughly 0.95 (earth), 0.01 (mission_control), and 0.39 (photograph): strong signals saturate toward 1, weak signals collapse toward 0, and ambiguous attributes land in the middle where the acceptance threshold can separate them.
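
A sketch of the calibration analysis from step 3, assuming the raw logits from the 50-image sample are pooled into a NumPy array; centering on the pooled distribution and halving its spread is one plausible heuristic, not necessarily the project's exact procedure.

import numpy as np

def calibrate(sample_logits: np.ndarray) -> tuple[float, float]:
    """Pick a sigmoid center/scale from a sample of raw CLIP logits.

    Assumed heuristic: center where strong attributes map above 0.5
    and weak ones below, scale so scores spread across [0, 1].
    """
    center = float(np.median(sample_logits))  # the project settled on 25.5
    scale = float(np.std(sample_logits) / 2)  # the project settled on 1.5
    return center, scale

def to_confidence(logit: float, center: float = 25.5, scale: float = 1.5) -> float:
    # The logit-to-confidence mapping applied to every column.
    return float(1.0 / (1.0 + np.exp(-(logit - center) / scale)))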

Applicability

This approach works when:

  - Attributes can be expressed as short natural-language descriptions that fit the "a photo of {description}" template.
  - Approximate, thresholdable confidence scores are good enough for filtering and selection.
  - Attributes arrive incrementally: each new one costs a vocabulary entry and one pass over the collection, not a training run.

Does NOT apply when:

  - Attributes require counting objects, reading text in the image, or fine spatial reasoning, which zero-shot CLIP handles poorly.
  - Exact, audited labels are required; these are calibrated confidences, not ground truth.

Related Lessons