Lesson 045: Embedding-Based Deduplication for Image Collection Curation

The Lesson

When working with a large image collection from an automated source, assume near-duplicates dominate the pool until proven otherwise. Embedding cosine similarity with connected-component grouping reduces a collection to its unique members in minutes, but the threshold choice dramatically affects the result — and the right threshold depends on what "identical" means for your use case.

Context

A calendar image selection project sources 12,217 mission photographs from an automated camera system. Many sequential frames capture nearly the same scene with minor differences in exposure, timing, or spacecraft orientation. Visual clustering (k-means on CLIP embeddings, k=25) groups images by broad visual similarity, but within clusters, hundreds of images may be functionally interchangeable — differing by a fraction of a degree of rotation or a slight change in lighting. This redundancy inflates cluster sizes, distorts preference scoring (identical images split votes), and makes visual review impractical.
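A minimal sketch of the clustering step described above. The random array is a stand-in for the real (12217, 512) CLIP embedding matrix, and scipy's kmeans2 is an assumed implementation choice — the project's actual k-means code is not shown in this lesson:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Stand-in for the real (12217, 512) CLIP embedding matrix
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))

# L2-normalize rows so Euclidean k-means approximates cosine-based grouping
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# k=25 broad visual clusters, as in the project
centroids, labels = kmeans2(embeddings, 25, minit="++", seed=0)
print(labels.shape, centroids.shape)
```

Each image then carries a cluster label, and everything that follows (dedup, review) operates within or across those clusters.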

What Happened

  1. Noticed that cluster 2 contained 270 images where approximately 250 were visually indistinguishable in thumbnails. Two specific images (ART002-E-26972 and ART002-E-26935) had a CLIP cosine similarity of 0.990 — nearly identical in the embedding space.

  2. Tested different similarity thresholds against cluster 2 to calibrate:

    • 0.99 threshold: 3,966 duplicate pairs
    • 0.98 threshold: 8,655 duplicate pairs
    • 0.97 threshold: 11,906 duplicate pairs
    • 0.95 threshold: 26,908 duplicate pairs
  3. Chose 0.98 as the operating threshold. At 0.99, too many visually identical images survived (different exposures of the same scene often land between 0.98 and 0.99). At 0.97, images that are merely similar — same subject but noticeably different composition — start grouping together, which loses real diversity.

  4. Implemented deduplication using existing CLIP embeddings (512-dim, already computed for clustering):

    • L2-normalize all embeddings to unit vectors
    • Compute pairwise cosine similarity in batches of 1000 rows against all 12,217 columns (~560 MB peak memory)
    • Build a sparse adjacency matrix from above-threshold pairs
    • Find connected components using scipy — this correctly handles transitive chains (if A~B and B~C are above threshold, A, B, and C are grouped together even when A~C falls below it)
    • For each group, pick the "master" image with the highest voter preference score, tie-breaking on brightness
  5. Results were dramatic: 450 duplicate groups found, 10,054 images suppressed (82% of the collection), 2,163 unique images retained. The largest single group contained 6,810 members — one dominant visual pattern (similar orbital views) that the automated camera captured thousands of times.

  6. Made suppression reversible: an is_suppressed boolean flag on the image dimension table, with restore_all() to undo an entire dedup run. Downstream queries (browse, scoring, optimization, cluster spotlights) filter on this flag. Original data is preserved — suppression is a view filter, not a deletion.

  7. After dedup, individual clusters shrank from hundreds of images to 10-160 unique representatives. Cluster 2 went from 270 to 10 images. This made visual review practical — a human can meaningfully compare 10 images but not 270.
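The five implementation steps can be sketched end-to-end. This is a toy reproduction with random stand-in embeddings, a few injected near-duplicates, and an invented preference score — not the project's actual code:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

THRESHOLD = 0.98
rng = np.random.default_rng(0)

# Toy stand-in: 40 distinct "scenes" plus 5 near-duplicate frames of scene 0
base = rng.normal(size=(40, 512))
dupes = base[0] + 0.001 * rng.normal(size=(5, 512))
emb = np.vstack([base, dupes])
n = emb.shape[0]

# Step 1: L2-normalize so the dot product equals cosine similarity
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Step 2: batched pairwise similarity; keep above-threshold pairs (i < j)
rows, cols = [], []
batch = 16
for start in range(0, n, batch):
    sims = emb[start:start + batch] @ emb.T          # (batch, n)
    bi, bj = np.nonzero(sims >= THRESHOLD)
    for i, j in zip(bi + start, bj):
        if i < j:                                    # skip self and mirrored pairs
            rows.append(i)
            cols.append(j)

# Steps 3-4: sparse adjacency matrix + connected components (transitive
# chains: A~B and B~C group A, B, C even if A~C is below threshold)
adj = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
n_groups, group = connected_components(adj, directed=False)

# Step 5: keep one master per group -- here the member with the highest
# (illustrative, random) preference score; the rest are suppressed
score = rng.random(n)
masters = {g: int(np.argmax(np.where(group == g, score, -1.0)))
           for g in range(n_groups)}
suppressed = [i for i in range(n) if i != masters[group[i]]]
print(n_groups, len(suppressed))
```

On this toy data the five noisy copies collapse into scene 0's group, and everything else remains a singleton; the tie-break on brightness from the original step is omitted for brevity.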

Key Insights

Examples

Before dedup

Cluster     Images  Practical browsing?
Cluster 2   270     No — 250+ look identical
Cluster 12  1,810   No — overwhelmingly redundant
Cluster 1   1,955   No — scrolling through pages of the same scene

After dedup (threshold 0.98)

Cluster     Images  Practical browsing?
Cluster 2   10      Yes — each image is visually distinct
Cluster 12  ~80     Yes — meaningful variety visible
Cluster 1   ~120    Yes — reviewable in a few minutes

Threshold sensitivity

Threshold  Pairs    Groups  Suppressed  Retained
0.99       varies   fewer   fewer       more unique images (conservative)
0.98       1.69M    450     10,054      2,163 (chosen)
0.97       more     more    more        fewer (aggressive)
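A threshold sweep like the one above can be reproduced on synthetic data. The scene count, frames per scene, and noise range below are invented to make the thresholds bite at different points; the pair counts are illustrative and will not match the real collection's numbers:

```python
import numpy as np

# Toy data: 10 "scenes", 20 frames each, with varying per-frame noise so
# that different thresholds capture different numbers of duplicate pairs
rng = np.random.default_rng(1)
frames = []
for scene in rng.normal(size=(10, 512)):
    for _ in range(20):
        frames.append(scene + rng.uniform(0.0, 0.5) * rng.normal(size=512))
emb = np.array(frames)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

sims = emb @ emb.T
pair_sims = sims[np.triu_indices(len(emb), k=1)]  # unique pairs, no diagonal

for t in (0.99, 0.98, 0.97, 0.95):
    print(f"{t}: {int((pair_sims >= t).sum())} pairs")
```

Lowering the threshold can only add pairs, so the counts grow monotonically — the calibration question is where that growth starts merging genuinely different compositions.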

Applicability

This approach works for any collection where automated capture produces redundant entries:

Does NOT apply when:

Related Lessons