Lesson 050: Connected Components for Transitive Deduplication

Lesson 050: Connected Components for Transitive Deduplication

The Lesson

When deduplicating by pairwise similarity, use graph connected components to group items — not naive pair-based merging. Pairwise similarity is not transitive in theory (AB and BC doesn't guarantee A~C), but for near-duplicates in practice, transitivity holds and connected components correctly group chains that pair-based approaches would split.

Context

A 12,217-image collection needed deduplication before calendar selection. CLIP embeddings (512-dim) were already computed. The goal: find groups of "functionally identical" images (cosine similarity ≥ 0.98), pick one master per group, and suppress the rest. The challenge was grouping — many duplicate chains exist where AB at 0.99 and BC at 0.99 but A~C at only 0.97.

What Happened

  1. Computed pairwise cosine similarity in batches: 1000 rows × 12,217 columns per batch, thresholded at 0.98. Found 1,691,086 above-threshold pairs in 1.1 seconds using numpy batch matrix multiply on normalized vectors.

  2. First considered naive pairwise grouping: for each pair (A,B), merge their groups. This is union-find, which is correct — but implementing it from scratch is error-prone. scipy provides connected_components on sparse graphs, which solves the same problem with a single function call.

  3. Built a sparse adjacency matrix from the 1.69M pairs using scipy.sparse.csr_matrix. Made it symmetric (added both i→j and j→i edges). Called scipy.sparse.csgraph.connected_components(adj, directed=False).

  4. Result: 2,163 connected components, of which 450 had 2+ members (duplicate groups). The largest group had 6,810 members — a massive chain of near-identical orbital photographs.

  5. For each multi-member group, selected the master image by highest voter preference score (tie-break on brightness), computed each member's similarity to the master, and stored the grouping in database tables.

  6. The 6,810-member mega-group demonstrates why connected components matter: no single pair in that group necessarily has similarity ≥ 0.98 to every other member. The group is connected through chains of pairwise similarity, where each link in the chain is above threshold but end-to-end members may be well below. This is correct for the use case — the images are all "variations on a theme" even if the extremes are visually distinct.

Key Insights

Examples

Transitive chain

Image A ←0.99→ Image B ←0.99→ Image C
   ↑                                ↑
   └───────────0.97─────────────────┘

Pair-based at 0.98:  {A,B} and {B,C} — B in both groups (broken)
Connected components: {A,B,C} — one group, one master (correct)

Scale characteristics

Images:     12,217
Pairs > 0.98: 1,691,086
Components:   2,163 (450 with 2+ members)
Largest:      6,810 members
Compute time: 1.1 seconds (similarity) + 0.3 seconds (components)

Applicability

Connected components for deduplication work when:

Does NOT apply when:

Related Lessons