Bulk Metadata Enrichment Scripts
The Lesson
When hundreds of data records need the same type of update (adding titles, categories, tags, or enriched descriptions), writing a dedicated Python script that reads a manifest and patches the data files is orders of magnitude faster and more reliable than manual editing. The script is disposable, but the pattern is reusable.
Context
The certification quiz had 10 new exam files (7 of them in the first batch) where each question needed five fields: a title, a scenario, category references, a difficulty level, and tags. With 50 questions per file, the first batch alone meant 350 questions × 5 fields = 1,750 individual edits. Manual editing was impractical.
What Happened
- A `patch_metadata.py` script was written to read a JSON manifest specifying per-question metadata
- Manifest files were created for each exam (e.g., `scripts/manifests/aip-c01.json`) with question IDs and their metadata
- The script parsed each XML file, located questions by ID, and inserted or updated the specified metadata elements
- The process was applied to all 10 files across 3 commits, each covering a batch of 2-3 exams
- Post-patch validation confirmed all files remained valid XML and loaded correctly
Key Insights
- Separate the data from the transformation. The manifest (JSON) describes what to change. The script (Python) describes how to change it. This separation means the same script works for any exam file — only the manifest changes.
- Manifests are reviewable. A JSON file listing question IDs and their categories/tags can be reviewed by a domain expert who doesn't read Python. The transformation logic is separate from the content decisions.
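A manifest entry might look like the following. The field names are hypothetical, illustrating the shape a reviewer would see rather than the project's actual `scripts/manifests/*.json` schema:

```python
import json

# Hypothetical manifest: question IDs mapped to the metadata to apply.
# Field names are illustrative, not the project's actual schema.
manifest = {
    "q-017": {
        "title": "Choosing a streaming ingestion service",
        "category": "data-ingestion",
        "difficulty": "medium",
        "tags": ["kinesis", "streaming"],
    }
}

# Serialize with stable key order so diffs stay small and reviewable.
print(json.dumps(manifest, indent=2, sort_keys=True))
```

A domain expert can review and edit this JSON directly; the Python that applies it never needs to change.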
- XML manipulation in Python requires care. Python's `xml.etree.ElementTree` doesn't preserve formatting, comments, or processing instructions. Libraries like `lxml` preserve more, but even they can reformat whitespace. The script should validate output against the schema, not against the original formatting.
- Batch sizes should match review capacity. Doing all 10 files in one commit makes review impossible. Batching 2-3 files per commit keeps diffs reviewable while maintaining efficiency.
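The comment-loss caveat is easy to demonstrate with the standard library. The snippet below round-trips a small document (generic XML, not the real exam schema) and shows why validation should compare structure rather than raw text:

```python
import xml.etree.ElementTree as ET

source = "<exam><!-- reviewer note --><question id='q1'/></exam>"

# Parsing and re-serializing with ElementTree silently drops the comment:
round_tripped = ET.tostring(ET.fromstring(source), encoding="unicode")
assert "reviewer note" not in round_tripped

# So post-patch validation should compare structure, not raw bytes:
def structure(xml_text):
    """Flatten a document into (tag, attributes, text) tuples."""
    return [(e.tag, dict(e.attrib), (e.text or "").strip())
            for e in ET.fromstring(xml_text).iter()]

assert structure(source) == structure(round_tripped)
```

A schema validation pass (e.g., with an XSD validator) gives the same guarantee more rigorously, but a structural diff like this is often enough for a spot check.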
- The script is a one-time tool but the pattern recurs. `patch_metadata.py` was used once for this specific enrichment, but the pattern (manifest + transformation script + validation) was reused for hint enrichment, answer randomization, and schema conversion.
Related Lessons
- Content Quality Auditing at Scale — the audit that identified which files needed enrichment, driving the bulk script's work queue
- Answer Position Bias — randomization used the same pattern: script + validation + batch execution across all files
- Schema Variant Consolidation — another bulk transformation script that converted incompatible schemas to canonical form