Lesson 051: Sigmoid Calibration for Domain-Specific CLIP Scores

The Lesson

CLIP logits have domain-specific distributions. Converting them to meaningful [0,1] confidence scores requires a sigmoid transform calibrated to the actual logit range in your image collection. A universal threshold doesn't work — the sigmoid center and scale must be tuned empirically by examining logit histograms on a representative sample.

Context

CLIP zero-shot classification was being used to tag 12,217 space photographs with 29 content attributes (earth, moon, deep_space, spacecraft, etc.). CLIP outputs logits: cosine similarities between the image and text embeddings, scaled by a learned temperature (~100). These logits need to be mapped to [0,1] confidence scores that drive a threshold-based classification: accepted (≥ 0.80), tentative (0.50–0.79), rejected (< 0.50).
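
For reference, a minimal sketch of where the raw logits come from and how a calibrated score maps to the three buckets, assuming the Hugging Face transformers CLIP API; the checkpoint name and prompts are placeholders, not the production configuration.

```python
# Minimal sketch: raw CLIP logits and the three score buckets.
# Assumes the Hugging Face transformers CLIP API; the checkpoint and
# prompts below are placeholders, not the production configuration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
prompts = ["a photo of earth", "a photo of the moon"]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarity between image and text embeddings, multiplied by the
# learned temperature (logit_scale, roughly 100), one value per prompt.
logits = out.logits_per_image[0]


def bucket(score: float) -> str:
    """Threshold-based classification on a calibrated [0, 1] score."""
    if score >= 0.80:
        return "accepted"
    if score >= 0.50:
        return "tentative"
    return "rejected"
```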

What Happened

  1. First attempt used softmax across all 29 attributes. This gave normalized probabilities summing to 1, but adding more attributes diluted every score. An image that was clearly "earth" scored 0.456 because the probability mass was spread across 29 categories. Thresholds were unstable — they'd need recalibration every time an attribute was added.

  2. Second attempt used per-attribute binary classification: for each attribute, compare the attribute prompt against a generic negative ("a photo of something else entirely"). This gave independent scores but was 29× slower (one forward pass per attribute per batch) and the generic negative was too vague — CLIP gave high scores to almost everything.

  3. Third attempt: compute all 29 logits in a single forward pass, then apply a sigmoid transform to each logit independently. This gives independent [0,1] scores without the softmax competition problem and without the 29× speed penalty (a code sketch of this approach follows the list).

  4. But sigmoid with default center=0 classified everything as accepted, because CLIP logits for Artemis photos ranged 16–32, all far above zero.

  5. Ran a calibration experiment: 50 diverse images, all 29 attributes, extracted raw logits. Found the distribution:

    • Strong matches (earth, atmosphere): logits 28–32
    • Moderate matches (spacecraft, rocket): logits 24–28
    • Weak matches (mission_control, diagram): logits 16–21
    • Per-attribute std: 0.9–3.1
  6. Chose sigmoid center=25.5 and scale=1.5 based on the distribution:

    • Logit > 27 → score > 0.84 → "accepted"
    • Logit 24–27 → score 0.24–0.84 → "tentative"
    • Logit < 24 → score < 0.24 → "rejected"
  7. Validated by inspecting per-cluster attribute counts. Earth-from-orbit images correctly got earth, atmosphere, surface as accepted. Dark space images got deep_space. Crew photos got astronaut. The thresholds separated genuine content from CLIP noise.
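
The sketch below puts steps 3 through 6 together: the calibration pass that dumps per-attribute logit statistics, and the final single-pass scoring with the chosen center and scale. It assumes the Hugging Face transformers CLIP API, a placeholder checkpoint and attribute subset, and that the transform is score = sigmoid((logit - center) / scale); the helper names are illustrative rather than the production code.

```python
# Sketch of the calibration experiment (step 5) and the final single-pass
# sigmoid scoring (steps 3 and 6). Checkpoint, attribute subset, prompt
# template, and helper names are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-large-patch14"                 # placeholder checkpoint
ATTRIBUTES = ["earth", "atmosphere", "deep_space",
              "spacecraft", "mission_control", "diagram"]     # subset of the 29
PROMPTS = [f"a photo of {a.replace('_', ' ')}" for a in ATTRIBUTES]

SIGMOID_CENTER = 25.5   # picked from the 50-image calibration sample
SIGMOID_SCALE = 1.5

model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


def raw_logits(image: Image.Image) -> torch.Tensor:
    """All attribute logits for one image in a single forward pass (step 3)."""
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model(**inputs).logits_per_image[0]            # shape: (num_attributes,)


def calibration_stats(sample_paths: list[str]) -> None:
    """Step 5: per-attribute logit min/mean/max/std over a sample, to choose
    center and scale by inspecting the distribution."""
    logits = torch.stack([raw_logits(Image.open(p)) for p in sample_paths])
    for i, attr in enumerate(ATTRIBUTES):
        col = logits[:, i]
        print(f"{attr:16s} min={col.min().item():5.1f} mean={col.mean().item():5.1f} "
              f"max={col.max().item():5.1f} std={col.std().item():4.1f}")


def calibrated_scores(image: Image.Image) -> dict[str, float]:
    """Steps 3 and 6: one forward pass, then an independent sigmoid per logit,
    so adding attributes never dilutes existing scores."""
    scores = torch.sigmoid((raw_logits(image) - SIGMOID_CENTER) / SIGMOID_SCALE)
    return dict(zip(ATTRIBUTES, scores.tolist()))
```

Keeping the transform per-logit rather than a softmax across attributes is what makes each score independent of how many attributes are in the list.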

Key Insights

Examples

Logit distribution by attribute type

Attribute           min    mean   max    Decision at center=25.5
─────────────────────────────────────────────────────────────────
earth               22.8   30.0   31.7   Mostly accepted
atmosphere          21.5   29.3   31.4   Mostly accepted
deep_space          22.8   27.0   30.4   Mix of accepted/tentative
spacecraft          21.1   26.7   28.7   Mix of accepted/tentative
mission_control     16.6   18.7   20.6   Always rejected
diagram             17.3   19.0   23.3   Always rejected

Sigmoid comparison

center=22, scale=3:  Too permissive — spacecraft accepted for every image
center=25.5, scale=1.5:  Clean separation — spacecraft accepted only for hardware images
center=28, scale=1.5:  Too strict — even clear earth images are only tentative
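
As a quick numeric check, the snippet below applies the three calibrations to the mean logits from the table above, again assuming score = sigmoid((logit - center) / scale).

```python
# Quick numeric check of the three calibrations against the mean logits
# from the table above, assuming score = sigmoid((logit - center) / scale).
import math


def score(logit: float, center: float, scale: float) -> float:
    return 1.0 / (1.0 + math.exp(-(logit - center) / scale))


mean_logits = {"earth": 30.0, "spacecraft": 26.7, "mission_control": 18.7}
calibrations = {"c=22,   s=3  ": (22.0, 3.0),
                "c=25.5, s=1.5": (25.5, 1.5),
                "c=28,   s=1.5": (28.0, 1.5)}

for label, (c, s) in calibrations.items():
    row = "  ".join(f"{attr}={score(lg, c, s):.2f}" for attr, lg in mean_logits.items())
    print(f"{label}  {row}")
# c=22,   s=3    earth=0.94  spacecraft=0.83  mission_control=0.25
# c=25.5, s=1.5  earth=0.95  spacecraft=0.69  mission_control=0.01
# c=28,   s=1.5  earth=0.79  spacecraft=0.30  mission_control=0.00
```

Only the middle calibration accepts earth (≥ 0.80), leaves spacecraft tentative at the sample mean, and rejects mission_control; center=22 already accepts spacecraft at the sample mean, and center=28 drops even a mean earth image below the accepted threshold.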

Applicability

Sigmoid calibration applies when:

Does NOT apply when:

Related Lessons