Lesson: Building with Synthetic Data Before Real Data Arrives
Problem
The Artemis project needed voter preference data to build its statistical models and calendar optimizer. But real vote data from ArtemisTimeline.com wasn't yet available — the vote export hadn't been requested, and the site's API only exposes aggregate leaderboards, not raw ballots.
Rather than wait, the project generated synthetic vote data with known properties to build and validate the pipeline end-to-end.
Why It Matters
Data pipelines that are built without data tend to be fragile. The developer makes assumptions about data shapes, volumes, and distributions that don't survive contact with real data. Synthetic data lets you build and test the entire pipeline — from ingestion through modeling to output — while the real data is still being sourced.
What Happened
- Needed voter preference data to build statistical models and a calendar optimizer, but real vote data from ArtemisTimeline.com wasn't yet available (see Problem above).
- Wrote a design document (docs/synthetic_vote_pdr.md) specifying synthetic voter profiles with intentional biases: 60% neutral, 20% visual-drama seekers, 10% position-biased, 10% random. The biases were chosen to test whether the pipeline could detect known patterns.
- Implemented a seed-based generator producing three vote types matching the real voting modes: batch ballots (pick 5 from 50), pairwise comparisons (head-to-head), and category rankings (top 3). Each type preserves the natural grain — raw votes, not collapsed scores.
- Generated ground truth via deterministic hash-based latent quality scores per image. Timeline images get a +0.15 boost, category showcases get +0.25. Stored in synthetic_image_truth — kept separate from model inputs, usable only for post-hoc evaluation (a sketch of the hashing idea follows this list).
- Ran the full pipeline end-to-end: 100 voters, 500 ballots, 2,000 pairwise votes, 250 category rankings. Schema validation passed, all fact tables populated, and the downstream models (Phase 3) ran successfully against the synthetic data.
- The synthetic data became permanent scaffolding — Phase 3 (Elo, Beta-Binomial, Borda) and Phase 4 (calendar optimization) were both developed and validated against it. When real data arrives, the pipeline switches seamlessly because the schema is identical.
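To make the hash-based ground truth concrete, here is a minimal sketch of how a deterministic latent quality score per image could be derived. The function name, signature, and the 0-to-1 scale are illustrative assumptions; only the hash-based determinism and the +0.15 / +0.25 boosts come from the notes above.

```python
import hashlib

def latent_quality(image_id: str, is_timeline: bool = False, is_showcase: bool = False) -> float:
    """Deterministic 'true quality' derived from a hash of the image id.

    The same image id always maps to the same base score, so the ground truth
    is reproducible without storing any random state. Boosts follow the design
    doc: +0.15 for timeline images, +0.25 for category showcases.
    """
    digest = hashlib.sha256(image_id.encode("utf-8")).hexdigest()
    base = int(digest[:8], 16) / 0xFFFFFFFF   # first 32 bits mapped to [0, 1]
    bonus = (0.15 if is_timeline else 0.0) + (0.25 if is_showcase else 0.0)
    return min(base + bonus, 1.0)
```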
What Was Built
Synthetic voter profiles (4 bias types)
BIAS_PROFILES = {
"earth_lover": {"earth_weight": 2.0, "space_weight": 0.5},
"tech_focused": {"hardware_weight": 2.0, "scenery_weight": 0.5},
"balanced": {"all_weights": 1.0},
"contrarian": {"popular_penalty": 0.5, "unpopular_bonus": 1.5},
}
100 synthetic voters with explicit bias profiles. The biases are intentional and known — this lets the bias detection tests verify that the pipeline can detect each type.
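As an illustration of how voters might be constructed from these profiles, the sketch below assigns each synthetic voter a known profile. Field names other than voter_sk and is_synthetic (which appear elsewhere in this lesson) are assumptions, not the project's actual schema.

```python
import random

def make_voters(n: int, seed: int = 42) -> list:
    """Create n synthetic voters, each tagged with one known bias profile."""
    rng = random.Random(seed)
    profile_names = list(BIAS_PROFILES)
    voters = []
    for voter_sk in range(1, n + 1):
        profile = rng.choice(profile_names)
        voters.append({
            "voter_sk": voter_sk,
            "bias_profile": profile,            # known label, usable by bias-detection tests
            "weights": BIAS_PROFILES[profile],
            "is_synthetic": True,               # mirrors the dim_voter.is_synthetic flag
        })
    return voters
```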
Three vote types matching the real voting modes
| Vote type | Real-world source | Synthetic implementation |
|---|---|---|
| Batch ballots | "Pick 5 from 50" | Random batch of 50 images, top 5 by voter's weighted score |
| Pairwise votes | Head-to-head Elo | Random pairs, winner determined by voter's preference |
| Category rankings | "Top 3 in category" | Category-filtered, top 3 by voter's weighted score |
Each type has its own fact table (fact_batch_ballot, fact_pairwise_vote, fact_category_ranking) preserving the natural grain of the vote — not collapsed into a single score.
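A sketch of how one of these vote types could be generated at its natural grain, here the batch ballot. The score_fn argument stands in for whatever weighted scoring the generator actually applies (latent quality times the voter's bias weights); it and the other names are assumptions.

```python
import random

def cast_batch_ballot(voter: dict, image_ids: list, rng: random.Random, score_fn) -> list:
    """Simulate one 'pick 5 from 50' ballot at its natural grain.

    A random batch of 50 images is offered; the 5 highest-scoring images under
    the voter's weighted preference are selected. Each returned image would
    become one row in fact_batch_ballot rather than a collapsed score.
    """
    batch = rng.sample(image_ids, 50)
    ranked = sorted(batch, key=lambda img: score_fn(voter, img), reverse=True)
    return ranked[:5]
```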
Ground truth for validation
# synthetic_image_truth: known "true quality" for each image
# Generated from a latent quality model with noise
true_quality = base_quality + category_bonus + random_noise
Because the synthetic data has a known ground truth, the statistical models (Phase 3) can be validated against it: does the Elo system converge to the true ranking? Does the Bayesian model recover the ground truth scores? Do the bias profiles produce detectable patterns?
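One way such a check could look, as a sketch rather than the project's actual test code, is a rank correlation between model output and the stored truth. Here scipy is an assumed dependency, and both inputs are assumed to be dicts keyed by image id.

```python
from scipy.stats import spearmanr

def ranking_recovery(model_scores: dict, true_quality: dict) -> float:
    """Spearman rank correlation between model scores and known ground truth.

    A value near 1.0 means the model (e.g. Elo) recovered the synthetic
    'true' ordering; a value near 0 means it did not.
    """
    images = sorted(true_quality)
    estimated = [model_scores[img] for img in images]
    truth = [true_quality[img] for img in images]
    rho, _ = spearmanr(estimated, truth)
    return rho
```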
What This Enabled
- End-to-end pipeline testing — Every pipeline step from extraction through clustering runs on real-shaped data, even though the votes are synthetic
- Schema validation — The vote fact tables, voter dimensions, and session tables were designed and tested before real data existed
- Model development — Phase 3 (statistical modeling) can be developed and tested using synthetic data with known ground truth
- Bias detection development — The intentional bias profiles provide a test harness for the bias detection system
- Performance benchmarking — 500 batch ballots × 5 selections each = 2,500 ballot-image records — enough to profile query performance without real data
Design Decisions
Natural grain preservation
The synthetic generator creates votes at the same grain as the real voting interface:
- Batch ballots: session → batch of 50 → 5 selections (not: per-image scores)
- Pairwise: session → pair → winner (not: win/loss counts)
- Rankings: session → category → ordered list (not: rank scores)
This forces the downstream models to aggregate from raw votes, exactly as they'll need to with real data.
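For example, turning raw pairwise rows into win counts happens downstream, not in the fact table. A minimal sketch, in which the winner_image_id column name is an assumption:

```python
from collections import Counter

def pairwise_win_counts(fact_pairwise_vote: list) -> Counter:
    """Aggregate raw head-to-head rows into per-image win counts downstream.

    The fact table keeps one row per vote; models derive whatever summary
    they need, exactly as they will have to with real data.
    """
    return Counter(row["winner_image_id"] for row in fact_pairwise_vote)
```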
Surrogate voter keys
Voters are identified by voter_sk (surrogate integer) and source_voter_id (salted hash of a synthetic ID). This matches the privacy-preserving design for real voters — the pipeline never stores identifiable voter information.
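A sketch of what the salted hashing could look like. The hash algorithm and salt handling are assumptions; the lesson only specifies that source_voter_id is a salted hash.

```python
import hashlib

def hash_voter_id(raw_voter_id: str, salt: str) -> str:
    """One-way, salted hash used as source_voter_id.

    Raw identifiers never enter the warehouse; the salt lives outside the
    pipeline's data stores, so hashes cannot be trivially reversed by lookup.
    """
    return hashlib.sha256((salt + raw_voter_id).encode("utf-8")).hexdigest()
```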
Seed-based reproducibility
artemis-pipeline generate-votes --seed 42 --voters 100 --ballots 500
Same seed = same synthetic data. This makes tests deterministic and lets developers reproduce issues.
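The underlying pattern is to thread a single seeded RNG through every generator so identical arguments always produce identical output. A toy illustration of that property, not the CLI's actual internals:

```python
import random

def generate_votes(seed: int, n_voters: int, n_ballots: int) -> list:
    """Toy illustration of seed-based reproducibility: every random draw
    comes from one Random(seed) instance."""
    rng = random.Random(seed)
    return [(rng.randrange(n_voters), rng.random()) for _ in range(n_ballots)]

# Same seed, same arguments: byte-for-byte identical synthetic data.
assert generate_votes(42, 100, 500) == generate_votes(42, 100, 500)
```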
What Synthetic Data Cannot Validate
- Real data quality issues: Missing fields, encoding errors, malformed JSON, duplicate records. Real data always has surprises.
- Distribution assumptions: Synthetic voters follow programmed distributions. Real voters have unpredictable behavior — herding effects, time-of-day biases, mobile-vs-desktop differences.
- Scale-specific issues: 100 voters with 500 ballots is not 10,000 voters with 50,000 ballots. Performance and statistical properties change with scale.
- Data freshness and staleness: The synthetic data is static. Real data arrives in batches, may be delayed, and may be partially updated.
Broader Lesson
When real data isn't available, generate synthetic data that:
- Matches the real schema exactly — same tables, same columns, same types, same grain
- Has known ground truth — so models can be validated against something
- Includes intentional anomalies — so validation checks can be tested
- Is reproducible — seed-based generation for deterministic tests
- Is clearly labeled — the synthetic_image_truth table and dim_voter.is_synthetic flag make it impossible to confuse synthetic and real data
The synthetic data should be treated as scaffolding, not as a permanent fixture. When real data arrives, the pipeline should switch seamlessly — if it can't, that reveals assumptions that need fixing.
Related Lessons
- Lesson 039: Mock Tagger for Vision Pipeline Testing — extends the synthetic-first principle to vision model mocking with hash-based deterministic attributes
- Lesson 041: Utility Function for Synthetic Voting Bias — the utility function design for the next generation of synthetic vote generation with attribute-based bias
- Lesson 044: Acceptance Tests as Executable Specifications — the full-pipeline test fixture is a self-contained synthetic environment that validates end-to-end behavior