Lesson: Choosing k for k-Means Clustering in a Constrained Selection Problem

Problem

We needed to cluster 12,217 Artemis II mission photos into visually distinct groups. The clusters serve a specific downstream purpose: ensuring the final 13-image calendar selection has visual diversity. The choice of k (number of clusters) directly affects whether the optimization can produce a good calendar.

Why It Matters

k is the most important hyperparameter in k-means and one of the most commonly hand-waved. Too small and the clusters are too broad — visually different images land in the same group, defeating the diversity guarantee. Too large and many clusters have too few images to offer meaningful choice, and the optimization becomes over-constrained. For a selection problem with a fixed output size (13 images), k needs to be chosen with the output in mind, not just the input.

What Happened

  1. Needed to partition 12,217 Artemis II mission photos into visually distinct groups using CLIP embeddings. The clustering output feeds a downstream optimization: select 13 images for a calendar with guaranteed visual diversity.
  2. Considered using statistical methods (elbow, silhouette, gap statistic) to determine k. Recognized that these methods optimize for data structure — the "natural" number of clusters in the embedding space — which is a different question from what the task needs.
  3. Reasoned from the output constraint: the calendar needs 13 images. With a hard diversity constraint of "no two from the same cluster," k must be ≥ 13. With k=13, every cluster must contribute exactly one image — zero slack for the optimizer.
  4. Applied the heuristic k ≈ √(n/2) ≈ 78 as an upper bound, but pulled well below it for interpretability. Settled on k=25: roughly 2x the output size, giving the optimizer room to skip weak clusters while staying in the 15-30 range where clusters remain human-interpretable.
  5. Ran k-means with k=25 on CLIP visual embeddings. Observed cluster sizes ranging from 77 to 1,411 (median ~400). The skewed distribution confirmed that the embedding space has a few dominant visual themes and many distinctive smaller groupings — informative for selection.
  6. Later validated the choice during multimodal clustering (c494f89): the same k=25 with different feature weights still produced interpretable, actionable groups. The downstream optimization (Phase 4) uses these clusters for both hard constraints (max 2 per cluster) and soft diversity scoring.
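The clustering step above can be sketched in a few lines of scikit-learn. This is a minimal stand-in, not the project's actual pipeline: random vectors take the place of the real 12,217 CLIP embeddings, so the resulting size distribution is illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real (12217, d) CLIP embedding matrix.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64)).astype(np.float32)

# k=25: ~2x the 13-image output, inside the 15-30 interpretability range.
km = KMeans(n_clusters=25, n_init=10, random_state=0).fit(embeddings)

# Per-cluster counts, the quantity examined in the size-distribution section.
sizes = np.bincount(km.labels_, minlength=25)
print(sizes.min(), int(np.median(sizes)), sizes.max())
```

With the real embeddings, this is where the 77 / ~400 / 1,411 spread reported below comes from.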

The Decision: k=25

Reasoning from the output constraint

The calendar needs 13 images. If we enforce "no two images from the same cluster" as a hard diversity constraint, then k must be ≥ 13. With k=13 we'd be forced to pick exactly one image per cluster — no slack, no room for preference or trade-offs.

With k=25, roughly half the clusters contribute an image to the calendar. This gives the optimizer freedom to skip clusters whose best images are weak, avoid clusters that are too similar to each other, and still guarantee that the 13 selections span at least 13 distinct visual neighborhoods.
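The feasibility reasoning reduces to a one-line bound: with at most c images allowed per cluster, no selection can exceed the sum of min(size, c) over clusters. A tiny sketch (cluster sizes here are illustrative, not the real distribution):

```python
def selectable(cluster_sizes, per_cluster_cap):
    """Maximum number of images any selection can contain when at most
    `per_cluster_cap` images may come from a single cluster."""
    return sum(min(size, per_cluster_cap) for size in cluster_sizes)

# "No two from the same cluster" is feasible for a 13-image calendar
# iff this bound is >= 13.
assert selectable([100] * 25, per_cluster_cap=1) == 25  # k=25: 12 clusters of slack
assert selectable([100] * 13, per_cluster_cap=1) == 13  # k=13: zero slack
```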

Reasoning from the data size

Common heuristics for k-means:

| Heuristic | Formula | k for n=12,217 |
|---|---|---|
| Square root | k ≈ √n | 110 |
| Half square root | k ≈ √(n/2) | 78 |
| Rule of thumb for interpretability | k = 20-30 for moderate datasets | 20-30 |

√(n/2) ≈ 78 is an upper bound for maximum statistical granularity. For our use case (interpretable groups for human selection), we want well below this ceiling. k=25 sits in the interpretability range.
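The heuristic values in the table are easy to reproduce (truncating to match the table's whole-number entries):

```python
import math

n = 12_217
k_sqrt = int(math.sqrt(n))      # square-root heuristic
k_half = int(math.sqrt(n / 2))  # half-square-root heuristic
print(k_sqrt, k_half)  # -> 110 78
```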

Actual cluster size distribution (k=25, visual)

Largest:  cluster 12 — 1,411 images (11.5%)
Median:   ~400 images
Smallest: cluster 22 — 77 images (0.6%)

This spread is informative: a few large clusters capture the mission's dominant visual themes, while the long tail of smaller clusters marks rarer, distinctive imagery. That is exactly the structure a diversity-driven selection can exploit.

What k=25 does NOT guarantee

It does not guarantee balanced cluster sizes (the observed range is 77 to 1,411), and it does not guarantee that every cluster contains a calendar-worthy image. The slack above k=13 exists precisely so the optimizer can skip weak clusters.

How k Interacts with the Calendar Objective

The calendar optimization (Phase 4) will balance multiple objectives, chiefly per-image preference (how good each image is) and visual diversity (how different the selections are from one another).

Clustering feeds the diversity objective. With k=25, the hard constraint (max 2 images per cluster) is comfortably satisfiable, and the soft diversity score can reward spreading the 13 selections across many distinct clusters.
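A greedy sketch of how the diversity constraint consumes the k=25 clusters. The function, scores, and cap shown here are hypothetical stand-ins; the real Phase 4 optimizer may be more sophisticated than a greedy pass:

```python
import numpy as np

def select_calendar(scores, clusters, n_select=13, per_cluster_cap=2):
    """Greedy selection: take images in descending score order,
    skipping any image whose cluster has already hit the cap."""
    picked, counts = [], {}
    for idx in np.argsort(scores)[::-1]:
        c = int(clusters[idx])
        if counts.get(c, 0) < per_cluster_cap:
            picked.append(int(idx))
            counts[c] = counts.get(c, 0) + 1
            if len(picked) == n_select:
                break
    return picked

rng = np.random.default_rng(0)
scores = rng.random(200)                   # stand-in preference scores
clusters = rng.integers(0, 25, size=200)   # stand-in k=25 cluster labels
picked = select_calendar(scores, clusters)
```

With 25 clusters and a cap of 2, the 13 slots can always be filled while spanning at least 7 distinct clusters, and typically far more.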

What Would Go Wrong with Different k Values

| k | Problem |
|---|---|
| 5 | Clusters too broad — each cluster mixes visually different images. The "diversity" constraint is nearly meaningless. |
| 10 | Barely more clusters than calendar slots (13). Almost no room for the optimizer to trade diversity against preference. |
| 13 | Exactly one image per cluster required. No flexibility — the calendar is mechanically determined by which image is "best" in each cluster. |
| 50 | Many clusters with < 100 images. Some clusters may be so visually similar that distinguishing them is meaningless at calendar resolution. |
| 100 | Clusters become statistically noisy, with many near-empty. Interpretability collapses — you can't look at 100 clusters and understand the visual space. |

Broader Lesson

When k-means is used for a downstream decision (not just exploration), choose k based on the decision constraints:

  1. How many distinct groups does the decision need? (For calendar: ≥ 13)
  2. How much slack does the decision-maker need? (For optimization: ~2x the number of selections)
  3. How many groups can a human reviewer interpret? (Rule of thumb: 15-30)

The intersection of these three constraints often gives a narrow range. For Artemis: ≥ 13, ~26, ≤ 30 → k=25.

Formal methods (elbow, silhouette, gap statistic) tell you where the data has structure. But for constrained selection problems, the task determines k more than the data does.