Lesson 047: CLIP Zero-Shot as a Database Column Factory
The Lesson
A single CLIP model, used for zero-shot classification against descriptive text prompts, functions as a general-purpose column generator for structured databases. Each new prompt produces a new confidence column — no training, no fine-tuning, no labeled data. The cost of adding a column is one forward pass over the image collection, and the only tuning knobs are the prompt template and the logit-to-confidence mapping.
Context
A 12,217-image space photography collection needed structured content labels for calendar selection: which images show Earth, the Moon, spacecraft, crew, etc. The project already had CLIP embeddings (512-dim, openai/clip-vit-base-patch32) computed for clustering. The question was how to turn these embeddings into queryable, filterable database columns without training a custom classifier for each attribute.
What Happened
Tested CLIP zero-shot classification against 5 labels on a single image. The model correctly identified "moon" with 0.959 confidence using softmax over all labels. But softmax across N labels produces a probability distribution that sums to 1 — adding more labels dilutes each score, making thresholds unstable.
Switched to independent per-attribute scoring: compute CLIP logits (cosine similarity × temperature) for all prompts at once, then apply a sigmoid transform to convert each logit to an independent [0,1] score. This decouples attributes — adding "hand" doesn't change the score for "moon."
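A minimal sketch of this scoring step, assuming precomputed L2-normalized CLIP image and prompt embeddings; the array names, helper function, and default values are illustrative rather than the project's actual code:

```python
import numpy as np

def attribute_scores(image_embs: np.ndarray,      # (N, 512) L2-normalized image embeddings
                     text_embs: np.ndarray,       # (K, 512) L2-normalized prompt embeddings
                     logit_scale: float = 100.0,  # CLIP's learned temperature (~100 for ViT-B/32)
                     center: float = 25.5,        # empirically calibrated for this collection
                     scale: float = 1.5) -> np.ndarray:
    """Return an (N, K) matrix of independent [0, 1] confidence scores."""
    logits = logit_scale * image_embs @ text_embs.T           # cosine similarity x temperature
    return 1.0 / (1.0 + np.exp(-(logits - center) / scale))   # per-attribute sigmoid

# Unlike softmax over all K prompts, adding a new prompt adds a column without
# changing the scores already computed for the other attributes.
```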
Discovered that CLIP logits are domain-specific. For Artemis space photos, logits ranged 16–32 with mean ~24. A naive sigmoid centered at 0 would classify everything as "accepted." Analyzed the logit distribution across 50 images and 29 attributes to find the right sigmoid center (25.5) and scale (1.5).
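A sketch of that calibration pass, assuming a sample logit matrix (e.g. 50 images × 29 attributes); the function and variable names are illustrative:

```python
import numpy as np

def summarize_logits(logits: np.ndarray, codes: list[str]) -> None:
    """Print per-attribute logit statistics to guide the sigmoid center and scale."""
    for j, code in enumerate(codes):
        col = logits[:, j]
        print(f"{code:20s} min={col.min():5.1f} mean={col.mean():5.1f} "
              f"max={col.max():5.1f} std={col.std():4.1f}")

# Pick the center between the "clearly present" and "clearly absent" clusters
# (25.5 for this collection) and a scale that keeps the transition steep (1.5 here).
```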
Wrapped each attribute's description in a prompt template ("a photo of {description}") to leverage CLIP's training distribution. The description comes from the YAML vocabulary file, so adding a new attribute means adding one YAML entry and re-running the tagger.
Tagged all 12,217 images against 29 attributes in 5 minutes on CPU (40 img/s, batch size 64). Added 8 more attributes incrementally — same speed, no need to re-tag existing attributes.
The result: 37 base attributes × 12,217 images yielded 451,829 confidence scores in feature_image_attribute, each queryable by code, threshold, and source. Derived attributes (earth_and_moon, spacewalk, etc.) are computed from base scores using boolean rules.
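A hedged sketch of one such boolean rule; the threshold value and function name are assumptions, not the project's actual rule set:

```python
ACCEPT = 0.5  # assumed acceptance threshold for a base attribute score

def earth_and_moon(scores: dict[str, float]) -> bool:
    """Derived attribute: true when both base attributes clear the threshold."""
    return scores["earth"] >= ACCEPT and scores["moon"] >= ACCEPT
```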
Key Insights
- Zero-shot is free labeling at scale. Traditional classifiers need labeled training data per category. CLIP's text-image alignment means any English phrase becomes a classifier. The "training data" is the prompt description — iterate on the description, not on a dataset.
- Independent sigmoid beats shared softmax. Softmax forces attributes to compete: raising one lowers others. Sigmoid treats each attribute independently, which matches reality — an image can show both Earth and the Moon. The tradeoff is that you need domain-calibrated sigmoid parameters instead of getting normalized probabilities for free.
- The prompt template matters more than the model. "a photo of Earth visible in the image" outperforms bare "Earth" because CLIP was trained on image-caption pairs, not image-keyword pairs. The template bridges the gap between database column names and CLIP's training distribution.
- Logit distributions are domain-specific. Space photography has a different logit landscape than ImageNet. The sigmoid center (25.5 for Artemis) must be empirically calibrated by examining the logit histogram on a sample of your actual images. A universal threshold doesn't exist.
- The vocabulary file IS the model specification. Adding hand: "Human hand visible in the image" to a YAML file and running the tagger produces a new database column with confidence scores. No code changes, no retraining, no deployment. This makes the vocabulary iteratively refinable by domain experts who don't write code.
Examples
Adding a new attribute
Before (YAML):
```yaml
base_attributes:
  - code: earth
    label: Earth
    description: Earth visible in the image
    type: celestial_body
```
After (add one entry):
```yaml
  - code: porthole
    label: Porthole
    description: Spacecraft window or porthole frame visible
    type: environment
```
Run incremental tagger → 688 images accepted for "porthole" → queryable immediately.
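A sketch of how an incremental pass might skip already-tagged attributes; the set operations and names are illustrative, not the project's actual interface:

```python
def codes_needing_tagging(vocab_codes: set[str], tagged_codes: set[str]) -> set[str]:
    """Incremental mode: only attributes with no existing rows get scored."""
    return vocab_codes - tagged_codes

# Adding "porthole" to the YAML re-runs scoring for that one column;
# the existing columns in feature_image_attribute are left untouched.
```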
Logit calibration
| Attribute | min | mean | max | std | Signal |
|---|---|---|---|---|---|
| earth | 22.8 | 30.0 | 31.7 | 2.6 | strong |
| mission_control | 16.6 | 18.7 | 20.6 | 0.9 | weak |
| photograph | 22.6 | 24.8 | 26.0 | 0.9 | ambiguous |
Sigmoid at center=25.5, scale=1.5 maps these to:
- earth: 0.95+ (accepted for most images)
- mission_control: <0.01 (rejected for all)
- photograph: ~0.50 (tentative — CLIP is unsure)
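A quick numeric check of this mapping at the mean logits above (illustrative, plain math.exp):

```python
import math

def conf(logit: float, center: float = 25.5, scale: float = 1.5) -> float:
    return 1.0 / (1.0 + math.exp(-(logit - center) / scale))

print(round(conf(30.0), 3))  # earth mean           -> 0.953, well above the acceptance band
print(round(conf(18.7), 3))  # mission_control mean -> 0.011, firmly rejected
print(round(conf(24.8), 3))  # photograph mean      -> 0.385; at the max logit 26.0 the
                             # score rises to ~0.58, so the sample hovers in the middle band
```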
Applicability
This approach works when:
- You need structured labels for a large image collection and can't afford per-category training
- The categories are describable in natural language (CLIP understands English, not domain codes)
- Approximate labels with confidence scores are acceptable (CLIP is not 100% accurate)
- The collection has a coherent visual domain (allowing sigmoid calibration)
Does NOT apply when:
- Fine-grained distinctions are needed that CLIP can't resolve (e.g., species-level bird classification)
- The labels require domain expertise that CLIP lacks (e.g., medical diagnosis from imaging)
- Exact precision/recall targets must be met (CLIP's zero-shot accuracy varies by category)
Related Lessons
- Lesson 040: Controlled Vocabulary as Schema Contract — the YAML vocabulary that drives CLIP tagging also validates downstream consumers
- Lesson 039: Mock Tagger for Vision Pipeline Testing — hash-based mock tagger exercises the same confidence-threshold logic without CLIP
- Lesson 046: Lazy Imports for Deployment Compatibility — CLIP's numpy/torch dependencies must be lazy-imported for CI and static builds
- Lesson 051: Sigmoid Calibration for Domain-Specific CLIP — the sigmoid transform that converts CLIP logits to queryable confidence scores
- Lesson 052: Incremental Feature Extraction — the incremental mode that adds new vocabulary columns without re-tagging existing ones