Content Quality Auditing at Scale

The Lesson

When you have hundreds or thousands of content items authored by different sources at different times, quality varies wildly unless you define measurable thresholds and audit systematically. The audit itself is more valuable than the fixes it produces: it turns "the hints feel thin" into "22 of 33 files have H2 averages below 100 characters."

Context

A certification quiz application had 33 exam files containing 1,650 questions, each with three progressive hints: Brief Hint (H1), Complete Explanation (H2), and Deep Knowledge (H3). Hints were authored by multiple contributors and processes over several months. Some had rich, paragraph-length explanations; others were fragments of two or three words.

What Happened

  1. A reference file (data/aws/aif-c01.xml) was identified as the gold standard, with average hint lengths of H1=110, H2=380, and H3=294 characters
  2. Minimum thresholds were defined: H1 >= 80 chars, H2 >= 250 chars, H3 >= 200 chars
  3. Every file was audited programmatically against these thresholds
  4. Files were classified into tiers: Invalid Schema (4), Critical (6), Needs Work (12), Reference Quality (11)
  5. Enrichment was done tier-by-tier over multiple commits, with each file re-audited after enrichment
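
The audit and tiering steps above can be sketched roughly as follows. This is a minimal illustration, not the tooling that was actually used: the function name audit_file, the input shape (a dict mapping "H1"/"H2"/"H3" to lists of hint strings already extracted from a file), and the tier cutoffs are all assumptions. In particular, the source does not say how Critical was separated from Needs Work, so the "any average below half its threshold" rule is hypothetical. Only the three minimum thresholds come from step 2.

```python
from statistics import mean

# Minimum average lengths in characters, taken from step 2.
THRESHOLDS = {"H1": 80, "H2": 250, "H3": 200}


def audit_file(hints):
    """Return (tier, averages) for one exam file.

    `hints` is a hypothetical pre-extracted structure: it maps
    "H1"/"H2"/"H3" to the list of hint strings found in the file.
    """
    # Assumption: a file with a missing or empty hint level is treated
    # as failing the schema check, since it cannot be measured.
    if any(not hints.get(level) for level in THRESHOLDS):
        return "Invalid Schema", {}

    # Average character length per hint level, ignoring outer whitespace.
    averages = {
        level: mean(len(text.strip()) for text in hints[level])
        for level in THRESHOLDS
    }

    # Assumed cutoff: any average under half its threshold is Critical.
    if any(averages[lv] < THRESHOLDS[lv] / 2 for lv in THRESHOLDS):
        return "Critical", averages
    # Any average under its threshold, but not critically so.
    if any(averages[lv] < THRESHOLDS[lv] for lv in THRESHOLDS):
        return "Needs Work", averages
    return "Reference Quality", averages
```

Running the same function again after each enrichment commit gives the re-audit described in step 5: a file is done once it reports "Reference Quality".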

Key Insights

Related Lessons