Lesson 013: Elo Rating for Image Comparison

Problem

The Artemis pairwise voting mode shows two images side by side and asks "which is better?" This produces binary outcomes (winner / loser) for specific pairs, not absolute ratings. We need to convert these relative comparisons into a single continuous strength score per image that can be combined with other preference signals.

Why It Matters

Pairwise comparisons are the strongest form of preference data — voters make a direct choice between two specific alternatives, eliminating the ambiguity of batch voting (where "not selected" could mean "not seen" or "seen but beaten by 5 stronger options"). However, pairwise data only covers a tiny fraction of possible image pairs, so the conversion to scores must be robust to sparsity.

What Happened

  1. Had 2,000 pairwise comparison votes (synthetic) across 12,217 images. Each vote is a binary outcome: image A beats image B. Needed to convert these relative comparisons into absolute strength scores.
  2. Evaluated Bradley-Terry-Luce (BTL) first as the theoretically superior model. Discovered that the comparison graph was overwhelmingly disconnected — at most ~4,000 images appear in any comparison, and the remaining 8,000+ are isolated nodes. BTL's MLE requires graph connectivity; it couldn't converge.
  3. Chose standard Elo as the pragmatic alternative. Elo processes comparisons sequentially, assigning a default rating (1500) to first-seen images. It degrades gracefully with disconnection — scores are meaningful within connected components, just not perfectly calibrated across them.
  4. Set K=32 as a moderate K-factor. Higher K (64) would converge faster but amplify noise from synthetic vote patterns. Lower K (16) would be more stable but would need more data to differentiate images.
  5. Processed 2,000 comparisons in database insertion order (mirroring vote submission order). Produced Elo ratings for 394 distinct images. The remaining 11,823 images have NULL Elo scores — honest missing data, not zero.
  6. Elo scores enter the composite formula as a 15% quantile-ranked adjustment to the Beta posterior backbone. The modest weight reflects sparse coverage — only 3.2% of images have Elo data.
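The sequential processing in steps 3–5 can be sketched as follows. This is a minimal illustration, not the production pipeline; the function and variable names are hypothetical, but the constants (K=32, default 1500) match the values above.

```python
from collections import defaultdict

K = 32            # moderate K-factor (step 4)
DEFAULT = 1500.0  # default rating assigned to first-seen images

def expected(r_a, r_b):
    """Standard Elo expected score: predicted probability that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def run_elo(comparisons):
    """Process (winner_id, loser_id) pairs sequentially, in insertion order.

    Images that never appear in any comparison are simply absent from the
    result (they keep NULL scores downstream), rather than getting a zero.
    """
    ratings = defaultdict(lambda: DEFAULT)
    for winner, loser in comparisons:
        e_w = expected(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - e_w)  # winner's actual score is 1
        ratings[loser] -= K * (1.0 - e_w)   # loser's actual score is 0
    return dict(ratings)
```

With two fresh images both at 1500, the expected score is 0.5, so a single comparison moves the winner up by K/2 = 16 points to 1516 and the loser down to 1484.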

Design Choice: Standard Elo with K=32

We use the classic Elo rating system with K-factor 32 and starting rating 1500.

Key terms

  1. Rating (R): a single scalar strength score per image; every image starts at the default of 1500.
  2. K-factor (K): the maximum rating change from one comparison; we use K=32.
  3. Expected score (E_A): the predicted probability that A beats B, E_A = 1 / (1 + 10^((R_B - R_A) / 400)).
  4. Update rule: after each comparison, R_A becomes R_A + K * (S_A - E_A), where S_A is 1 for a win and 0 for a loss.

Why this matters for Elo vs. BTL

Elo processes comparisons sequentially and assigns a default rating (1500) to first-seen images. This means images in disconnected components get comparable starting ratings and diverge based on local win/loss patterns — the scores are meaningful within a component but not perfectly calibrated across components.

Bradley-Terry-Luce (see lesson 014) tries to solve this by fitting all parameters simultaneously, but it mathematically requires connectivity. Elo's sequential approach is more forgiving of disconnection.
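A toy illustration of this forgiveness (a self-contained sketch with made-up image IDs, not production code): two disconnected pairs each diverge around the 1500 anchor, so within-component ordering is meaningful even though the components share no comparison.

```python
def elo_ratings(comparisons, k=32, default=1500.0):
    """Minimal sequential Elo over (winner, loser) pairs; unseen IDs start at `default`."""
    ratings = {}
    for w, l in comparisons:
        rw, rl = ratings.get(w, default), ratings.get(l, default)
        e_w = 1.0 / (1.0 + 10.0 ** ((rl - rw) / 400.0))
        ratings[w] = rw + k * (1.0 - e_w)
        ratings[l] = rl - k * (1.0 - e_w)
    return ratings

# Two disconnected components: {A, B} and {C, D}.
r = elo_ratings([("A", "B"), ("A", "B"), ("C", "D")])
# Within each component, winners sit above 1500 and losers below.
# A's lead over C reflects only local win counts, not any direct
# cross-component comparison -- the components are not calibrated
# against each other, which is the caveat stated above.
```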

Alternatives Considered

  1. Win rate: Raw wins / (wins + losses). Simple but doesn't account for opponent strength.
  2. Glicko-2: Elo variant that tracks rating uncertainty. More principled but adds complexity we don't need with synthetic data.
  3. TrueSkill: Microsoft's Bayesian rating system. Designed for multiplayer games; overkill for pairwise comparisons.
  4. Bradley-Terry-Luce: Maximum likelihood estimation. Deferred due to disconnected graph (see lesson 014).

What Was Learned

Elo is the right default for pairwise comparison data — it's simple, well-understood, and degrades gracefully with sparse data. The main caveat is that Elo ratings are sensitive to the order in which comparisons are processed. With 2,000 comparisons processed in database insertion order (which mirrors vote submission order), this ordering effect is minimal. K=32 was chosen as a balance between responsiveness and stability — with more data, K=16 would be more appropriate to reduce noise.
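The order sensitivity can be demonstrated directly. In this self-contained sketch (illustrative IDs, not real data), the same comparisons processed forward and in reverse yield slightly different final ratings, though the induced ranking survives.

```python
def elo(pairs, k=32, start=1500.0):
    """Minimal sequential Elo over (winner, loser) pairs."""
    r = {}
    for w, l in pairs:
        rw, rl = r.get(w, start), r.get(l, start)
        e = 1.0 / (1.0 + 10.0 ** ((rl - rw) / 400.0))
        r[w], r[l] = rw + k * (1.0 - e), rl - k * (1.0 - e)
    return r

pairs = [("A", "B"), ("B", "C"), ("A", "C")]
forward = elo(pairs)
reverse = elo(pairs[::-1])
# Same comparisons, different processing order -> slightly different
# final numbers, but the ranking A > B > C holds in both runs.
```

The discrepancy stays small here because each image is seen only a few times; with 2,000 comparisons spread over 394 images, per-image exposure is similarly low, which is why the ordering effect is described as minimal.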