Lesson 048: Greedy Max-Min Diversity Selection

Lesson 048: Greedy Max-Min Diversity Selection

The Lesson

To select k items that maximally represent the diversity within a group, iteratively pick the item most distant from all already-selected items. This greedy max-min approach is O(n×k), produces near-optimal diversity in practice, and avoids the NP-hard max-dispersion problem entirely.

Context

A cluster of 270 space photographs needed to be summarized with 1 representative image plus 5 diverse images showing the range within the cluster. The representative (closest to centroid) captures "what the cluster looks like," but the diverse set needs to show "how different things in this cluster can be." Randomly sampling 5 images would likely pick 5 similar ones from the dense core. Picking the 5 most distant from the centroid would cluster around the periphery without covering different directions of variation.

What Happened

  1. The cluster had CLIP embeddings (512-dim) for all 270 members and a precomputed distance-to-centroid for each image. The representative was trivial: pick the image with minimum centroid distance.

  2. For diversity, needed to maximize the minimum pairwise distance among selected images — the "max-min dispersion" problem. The exact solution is NP-hard, but the greedy approximation has a known 1/2 approximation guarantee and runs in O(n×k) time.

  3. Implemented the greedy algorithm:

    • Start with the representative in the selected set
    • For each remaining candidate, compute its minimum Euclidean distance to any already-selected image
    • Pick the candidate with the largest minimum distance
    • Repeat k times
  4. Applied across all 25 clusters to generate the "spotlights" view. The algorithm ran in milliseconds per cluster, even for the largest cluster (1,411 images). Total time for all 25 clusters: under 2 seconds including embedding loads from DuckDB.

  5. The diverse selections visually confirmed the algorithm's effectiveness: the 5 images covered different visual themes (e.g., in a "moon surface" cluster: close-up craters, wide-angle terrain, horizon shots, backlit silhouettes, and surface with Earth in background).

Key Insights

Applicability

This algorithm applies to any problem requiring diverse subset selection from a metric space:

Does NOT apply when:

Related Lessons