# Lesson 041: Utility Function Design for Synthetic Voting Bias
## Problem
Synthetic vote generation needs to produce votes that exhibit detectable attribute-based bias while remaining statistically plausible. A biased voter block that always votes for images with specific attributes produces trivially detectable (and unrealistic) bias. A block with too much noise produces undetectable bias. The utility function must allow precise control over bias strength relative to noise.
## Why It Matters
The bias detection pipeline can only be validated if the synthetic data contains known, tunable biases. If the utility function is too simple (binary: prefer or not), the resulting bias is either overwhelming or absent. If it's too complex (multi-factor interactions, non-linear preferences), the relationship between config parameters and observable bias becomes opaque, making it impossible to write meaningful acceptance tests.
## What Happened
- Designed a three-component utility function: `utility = base_appeal + preference_weight * match_score + randomness_weight * noise`.
- `base_appeal` is deterministic per image (SHA-256 hash of image SK + seed → float in [0, 1]). This gives every image a fixed "inherent quality" that all voters agree on, preventing purely random outcomes.
- `match_score` measures how well an image matches a block's attribute rules. It uses `_compute_attribute_match`: +1 per satisfied `all_of` rule, +1 if any `any_of` rule matches, a -0.5 penalty per violated `none_of` rule, normalized by the max possible score.
- `noise` is `rng.gauss(0, 0.3)` — Gaussian noise to simulate voter disagreement and attention variability.
- The two weights (`preference_weight` and `randomness_weight`) control the balance. In the test config, biased blocks use `preference_weight=2.2–2.6` with `randomness_weight=0.30–0.40`, while the neutral control uses `preference_weight=0.0` with `randomness_weight=1.0`.
- Initial testing showed that `preference_weight` needed to be well above 1.0 (2.0+) to produce detectable lift, because the match score is normalized to [0, 1] and the noise standard deviation of 0.3 can easily mask a small preference bump.
- The `none_of` penalty is asymmetric (-0.5 instead of -1.0) because aversion is typically weaker than preference in real voting behavior — voters gravitate toward what they like more strongly than they avoid what they don't.
## Design Choice: Additive Utility Over Multiplicative or Threshold-Based

### Why additive
Three alternatives:
- Multiplicative: `base_appeal * (1 + preference_weight * match)`. Problem: images with low base appeal stay low regardless of preference, which means biased blocks still select images similar to neutral blocks' picks — the bias is muted.
- Threshold-based: "if match > 0.5, always select." Problem: produces binary bias that is trivially detectable and unrealistic. Real voters have preferences, not rules.
- Additive: each component contributes independently. A strong match can overcome low base appeal, and noise can occasionally override preference. The weights directly control how much each component matters.
### Why separate weights instead of a single bias strength
A single "bias strength" parameter would conflate two independent dimensions: how much the voter cares about attributes (preference) and how noisy their decisions are (randomness). Separating them allows the test config to express "strong preference but high noise" (partially detectable bias) separately from "weak preference but low noise" (subtle but consistent bias). This matters for testing detection sensitivity.
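A config sketch of the two independent dimensions. Block names and field layout are hypothetical; the weight values are the ones quoted from the test config above.

```yaml
# Hypothetical block config; field names are illustrative.
blocks:
  - name: earth_biased
    preference_weight: 2.4   # strong attribute preference (test range: 2.2-2.6)
    randomness_weight: 0.35  # moderate noise (test range: 0.30-0.40)
  - name: neutral_control
    preference_weight: 0.0   # no attribute preference
    randomness_weight: 1.0   # pure noise
```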
### Why Gaussian noise instead of uniform

Gaussian noise has tails — occasionally a voter makes an extreme choice. Uniform noise is bounded, which makes the utility function's behavior more predictable but less realistic. The sigma of 0.3 was chosen empirically: small enough that `preference_weight=2.5` dominates most decisions, large enough that the top-5 selection from a batch of 50 occasionally includes non-preferred images.
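The top-5-of-50 selection mechanics can be sketched as below. The batch composition (half the images perfectly matching the block's rules), the seed, and the simplified 0/1 match scores are illustrative assumptions, not from a real run:

```python
import random

def select_top_k(rng, n_images=50, n_preferred=25, k=5,
                 preference_weight=2.5, randomness_weight=0.35):
    """Score a batch and return the indices of the top-k utilities.
    Images 0..n_preferred-1 perfectly match the block's rules (match=1)."""
    utilities = []
    for i in range(n_images):
        base = rng.random()                      # stand-in for the hashed base appeal
        match = 1.0 if i < n_preferred else 0.0
        noise = rng.gauss(0, 0.3)
        utilities.append(base + preference_weight * match + randomness_weight * noise)
    return sorted(range(n_images), key=lambda i: utilities[i], reverse=True)[:k]

picks = select_top_k(random.Random(7))
```

With `preference_weight=2.5` the preferred images dominate the top 5; partial match scores and the Gaussian tails are what let non-preferred images slip in occasionally.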
## Key Insights
- Additive utility with separate weights gives orthogonal control. `preference_weight` controls signal strength; `randomness_weight` controls noise amplitude. A test author can reason about each independently: "doubling `preference_weight` roughly doubles the lift" and "halving `randomness_weight` roughly halves the noise spread."
- Base appeal prevents degenerate selections. Without a shared quality signal, neutral voters select uniformly at random and biased voters select only from their preferred pool. The base appeal creates a realistic quality gradient that all voters partially agree on, so biased voters select mostly from their preferred pool but with some overlap with neutral voters' picks.
- Match score normalization to [0, 1] makes weights interpretable. A `preference_weight` of 2.0 means "add up to 2.0 utility for a perfect match." Without normalization, the weight's meaning would depend on how many `all_of` rules a block has.
- The asymmetric `none_of` penalty reflects real behavior. Setting `none_of` to -0.5 instead of -1.0 means aversion is weaker than attraction. This matches empirical voting: people who prefer Earth images might slightly avoid deep-space images, but the avoidance is milder than the preference.
- Empirical tuning of the preference-to-noise ratio is unavoidable. Theory can guide the shape of the utility function, but the specific weight values that produce detectable-but-not-obvious bias depend on the batch size (50 shown, 5 selected), the number of voters per block, and the attribute distribution across images. The config YAML makes this tuning explicit and reproducible.
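The tunable-lift claim can be sanity-checked with a small simulation. The batch parameters and the simplified 0/1 match scores are illustrative assumptions; a real harness would drive the production generator instead:

```python
import random

def preferred_selection_rate(preference_weight, randomness_weight,
                             n_batches=200, n_images=50, n_preferred=10,
                             k=5, seed=0):
    """Fraction of top-k slots that go to preferred (match=1) images."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_batches):
        utils = [rng.random()
                 + preference_weight * (1.0 if i < n_preferred else 0.0)
                 + randomness_weight * rng.gauss(0, 0.3)
                 for i in range(n_images)]
        top = sorted(range(n_images), key=lambda i: utils[i], reverse=True)[:k]
        hits += sum(1 for i in top if i < n_preferred)
    return hits / (n_batches * k)

neutral = preferred_selection_rate(0.0, 1.0)   # baseline, ~ n_preferred / n_images
biased = preferred_selection_rate(2.5, 0.35)   # strong, detectable lift
```

The gap between the two rates is the planted lift that the detection pipeline is expected to recover.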
## Applicability
This utility function pattern applies to any synthetic data generation that needs controlled, tunable signal-to-noise ratios:
- A/B test simulators where treatment effect size must be parameterized
- Recommendation system evaluation with planted relevance signals
- Fraud detection testing with known fraudulent patterns at configurable rates
Does NOT apply when:
- You need to model strategic voting (utility functions assume sincere preferences)
- The voting mechanism is pairwise or ranked rather than batch selection (different noise model needed)
- Real voter behavior data is available and should be replayed rather than simulated
## Related Lessons
- Lesson 011: Synthetic Data Before Real Data — the broader principle; this lesson covers the specific utility function design within synthetic generation
- Lesson 028: Chi-Squared for Bias Detection — the detection side of the same coin; the utility function plants bias, chi-squared tests detect it
- Lesson 040: Controlled Vocabulary as Schema Contract — the match score depends on attribute codes being consistent between the vocabulary, the config, and the DB