XML Entity Encoding Pitfalls

The Lesson

XML entity encoding bugs (a raw Q&A where Q&amp;A is required) are the most common class of data corruption in XML content pipelines. They are invisible in many editors, they pass casual visual inspection, and they cause parse failures that surface only as "the file won't load", with no useful error message. Any pipeline that produces or transforms XML content must have automated validation.

Context

The certification quiz application stored exam questions in XML files. Over time, multiple bug-fix commits addressed entity encoding issues in those files.

What Happened

  1. Six Azure exam XML files were added to the project. Two of them (AZ-400, AZ-500) contained unescaped & characters in question text: phrases like "Q&A" instead of "Q&amp;A".
  2. The XML parser failed silently in the browser: DOMParser returned a parsererror document instead of the exam, and the quiz showed a generic "failed to load" message with no indication that an & on line 247 was the cause.
  3. A fix commit corrected the two files manually. No automated check was added at this point.
  4. Months later, the same bug recurred in the Databricks exam files: questions 8 and 42 had an unescaped Q&A in their scenarios. Another manual fix commit followed.
  5. A validate-xml.js script was then written to parse all XML files and report errors, catching this class of bug before it reached users.
  6. The eventual migration from XML to JSON eliminated this entire problem class — JSON strings handle & natively with no escaping required.
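
The kind of check described in step 5 can be approximated without a full XML parser. The following is a minimal sketch, not the project's actual validate-xml.js: the function name findBareAmpersands and the sample document are hypothetical, and the scan only looks for an & that does not start one of the five predefined entities or a numeric character reference. Unlike the generic "failed to load" message, it reports the line and column of each offender.

```javascript
// Hypothetical helper, not the project's validate-xml.js.
// Matches an "&" that does NOT begin a predefined entity (&amp; &lt; &gt;
// &quot; &apos;) or a numeric character reference (&#38; or &#x26;).
const BARE_AMPERSAND = /&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)/g;

function findBareAmpersands(xmlText) {
  const problems = [];
  xmlText.split(/\r?\n/).forEach((lineText, index) => {
    for (const match of lineText.matchAll(BARE_AMPERSAND)) {
      problems.push({
        line: index + 1,         // 1-based, as editors display it
        column: match.index + 1, // 1-based position of the "&"
        text: lineText.trim(),
      });
    }
  });
  return problems;
}

// Illustrative input: line 2 is broken, line 3 is correctly escaped.
const sample = [
  "<exam>",
  "  <scenario>Your team uses a Q&A forum.</scenario>",
  "  <scenario>Your team uses a Q&amp;A forum.</scenario>",
  "</exam>",
].join("\n");

for (const p of findBareAmpersands(sample)) {
  console.log(`line ${p.line}, column ${p.column}: ${p.text}`);
}
```

This scan is deliberately narrow. It would also flag an & inside a CDATA section or a comment, where a bare ampersand is legal, and it knows nothing about entities declared in a DTD, so it complements a real parse of each file rather than replacing it.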

Key Insights

Examples

Broken XML (causes parse failure):

<scenario>Your team uses a Q&A forum for knowledge sharing.</scenario>

Fixed XML:

<scenario>Your team uses a Q&amp;A forum for knowledge sharing.</scenario>

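The only difference in the source is & vs &amp;. A parser reads &amp; back as a plain &, so the fixed version displays as "Q&A" to the user. The broken version is not well-formed XML, and the parser rejects the entire file.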
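
When XML is generated from raw strings rather than edited by hand, the same fix can be applied at write time. The following is a minimal sketch under that assumption; escapeXmlText is a hypothetical helper name, and it covers element text only (attribute values would also need their quote characters escaped).

```javascript
// Hypothetical helper for escaping element text before it is written to XML.
function escapeXmlText(value) {
  return String(value)
    .replace(/&/g, "&amp;") // must run first, or it would re-escape the "&" in "&lt;"
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const scenario = "Your team uses a Q&A forum for knowledge sharing.";
console.log(`<scenario>${escapeXmlText(scenario)}</scenario>`);
// prints: <scenario>Your team uses a Q&amp;A forum for knowledge sharing.</scenario>
```

The helper should only ever see raw text. Running it over content that is already escaped would turn Q&amp;A into Q&amp;amp;A, which parses without error but displays incorrectly.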

Related Lessons