Lesson 030: Reliability Delta as Noise Measurement

Problem

We know that 20% of synthetic voters are intentionally noisy (10% position-biased, 10% random). We compute Krippendorff's alpha on all voters and get a moderate value (~0.52). But how much of the low agreement is caused by these noisy voters vs. genuine preference diversity among neutral voters? We need to isolate the noise contribution.

Why It Matters

Agreement metrics on a mixed population conflate two things: genuine preference diversity (some people like dramatic images, others like crew photos) and noise (random voters and position-biased voters who aren't responding to image quality at all). If you remove the noise and alpha barely changes, the low agreement reflects real diversity — which is fine. If removing the noise dramatically increases alpha, the noise is masking consensus — which means the preference scores are less reliable than they appear.

What Happened

  1. Computed Krippendorff's alpha on all 100 synthetic voters' batch ballot data. Got a moderate alpha reflecting the mix of signal and noise.
  2. Recomputed alpha on the same data but excluding voters with synthetic_profile_code IN ('position_biased', 'random'). This removes 20% of voters.
  3. Compared the two: alpha_delta = alpha_clean - alpha_all. A positive delta means the excluded voters were lowering agreement (injecting noise). The magnitude tells you how much.
  4. Reused the existing krippendorff_alpha_nominal function from models/reliability.py for both computations. No new algorithm, just a different filter on the input matrix (see the sketch after this list).
  5. The key design decision was which voters to exclude. We exclude by profile code (available in dim_voter.synthetic_profile_code) rather than by behavior detection, because this is validation — we're checking whether the system can distinguish signal from noise, so we use the ground truth labels.
  6. In a real-data scenario (no profile codes), the same technique applies but the exclusion set comes from behavioral detection: voters flagged by the position-bias test or low-agreement scores. The validation step here proves the concept; the production step would close the loop.
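
A minimal sketch of steps 1-4, under some assumptions that are not taken from the codebase: batch ballots live in a long-format DataFrame with voter_id, item_id, and choice columns; dim_voter is available as a DataFrame with voter_id and synthetic_profile_code; and krippendorff_alpha_nominal accepts a voters-by-items matrix. The column names and the helper's exact signature are illustrative.

```python
# Sketch of the reliability-delta computation described above.
# Assumptions (not the project's actual schema): ballots is a long-format
# DataFrame with voter_id / item_id / choice columns, voters mirrors dim_voter
# with voter_id and synthetic_profile_code, and krippendorff_alpha_nominal
# accepts a voters-by-items matrix with NaN for missing ratings.
import pandas as pd

from models.reliability import krippendorff_alpha_nominal

NOISY_PROFILES = ("position_biased", "random")


def reliability_delta(ballots: pd.DataFrame, voters: pd.DataFrame) -> dict:
    """Alpha on all voters vs. alpha with known-noisy voters excluded."""
    # Voters x items matrix of nominal choices; unrated cells become NaN.
    matrix_all = ballots.pivot(index="voter_id", columns="item_id", values="choice")

    # Ground-truth exclusion by synthetic profile code (validation mode).
    clean_ids = voters.loc[
        ~voters["synthetic_profile_code"].isin(NOISY_PROFILES), "voter_id"
    ]
    matrix_clean = matrix_all.loc[matrix_all.index.isin(clean_ids)]

    alpha_all = krippendorff_alpha_nominal(matrix_all)
    alpha_clean = krippendorff_alpha_nominal(matrix_clean)
    return {
        "alpha_all": alpha_all,
        "alpha_clean": alpha_clean,
        # Positive delta: the excluded voters were lowering agreement.
        "alpha_delta": alpha_clean - alpha_all,
    }
```

In the real-data variant described in step 6, clean_ids would come from behavioral flags (position-bias test failures, low agreement scores) rather than synthetic_profile_code; the delta computation itself is unchanged.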

Design Choice: Delta Over Absolute

Why report the delta, not just the clean alpha?

Alpha alone is hard to interpret: is 0.52 good or bad? The answer depends on the domain. But a delta of +0.08 (from 0.52 to 0.60) is interpretable: roughly a sixth of the observed disagreement (0.08 out of a disagreement mass of 1 - 0.52 = 0.48, about 17%) was injected by the noisy voters. This is actionable: you can decide whether that level of noise is acceptable or whether you need to down-weight suspected noisy voters in the preference scoring.
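
To make the "share of disagreement" reading explicit, here is a small helper (hypothetical, not part of models/reliability.py) that expresses the delta against the disagreement mass:

```python
def noise_share(alpha_all: float, alpha_clean: float) -> float:
    """Fraction of observed disagreement attributable to the excluded voters.

    Alpha is 1 - D_o / D_e, so (1 - alpha) is the disagreement mass. Expressing
    the delta against that mass gives the share of disagreement that disappears
    with the noisy voters (assuming D_e is roughly unchanged by the exclusion).
    """
    return (alpha_clean - alpha_all) / (1.0 - alpha_all)


# The example from the text: 0.52 -> 0.60 gives 0.08 / 0.48 ≈ 0.17,
# i.e. roughly a sixth of the disagreement came from the noisy voters.
print(noise_share(0.52, 0.60))
```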

Why not build noise detection into the scoring pipeline?

The validation module detects noise and reports it. The scoring pipeline could incorporate noise-adjusted weights (downweight suspected noisy voters). We deliberately kept these separate: validation proves the noise exists and measures its impact; the scoring pipeline is where you'd act on it. Mixing detection and correction in one step makes it impossible to know whether the correction helped.
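
If you did later decide to act on the measurement inside the scoring pipeline, a down-weighting step might look like the sketch below. This is purely illustrative: the current pipeline deliberately does not do this, and the flagged-voter set and the 0.25 weight are arbitrary assumptions.

```python
import pandas as pd


def voter_weights(
    voters: pd.DataFrame, flagged_ids: set[str], noisy_weight: float = 0.25
) -> pd.Series:
    """Per-voter weight: 1.0 by default, reduced for suspected noisy voters."""
    weights = pd.Series(1.0, index=pd.Index(voters["voter_id"], name="voter_id"))
    weights[weights.index.isin(flagged_ids)] = noisy_weight
    return weights
```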

Key Insights

Applicability

This pattern applies to any system measuring agreement in a heterogeneous population:

Does NOT apply when:

Related Lessons