Idempotent Pipeline Design

Idempotent Pipeline Design

The Lesson

Data pipelines fail — downloads timeout, parsers hit unexpected formats, database connections drop. Idempotency (running the same operation twice produces the same result as running it once) must be designed into every layer: delete-before-insert for facts, check-before-insert for dimensions, and grain uniqueness constraints as the final safety net.

Context

The JobClass pipeline downloads, parses, validates, and loads federal labor data from multiple sources into a DuckDB warehouse. Each run processes multiple datasets, any of which can fail independently. A partial failure followed by a restart must not create duplicate rows, corrupt dimension keys, or lose data that was successfully loaded before the failure.

What Happened

  1. Fact tables were loaded with delete-before-insert at release grain. The OEWS loader deletes all existing rows for a given source_release_id before inserting new ones. If the pipeline crashes after the delete but before the insert, the next run deletes nothing (already gone) and inserts fresh. If it crashes after both steps, the next run deletes the partial data and re-inserts completely. Either way, no duplicates:

    -- 1. Delete any existing rows for this release
    DELETE FROM stage__bls__oews_national WHERE source_release_id = ?
    
    -- 2. Insert the new rows
    INSERT INTO stage__bls__oews_national (...) VALUES (...)
    
  2. Dimension tables were loaded with check-before-insert on business key. Dimensions grow over time (new geographies, new occupations) but existing rows are never duplicated or modified. The loader checks for the business key before inserting:

    INSERT INTO dim_geography (geo_type, geo_code, geo_name, ...)
    SELECT ?, ?, ?, ...
    WHERE NOT EXISTS (
        SELECT 1 FROM dim_geography WHERE geo_type = ? AND geo_code = ?
    )
    
  3. Grain uniqueness constraints were added as the final safety net. The fact table's unique constraint on its grain columns rejects duplicates at the database level, catching any loader bug that the delete-before-insert pattern missed:

    UNIQUE (reference_period, geography_key, industry_key, ownership_code,
            occupation_key, source_dataset)
    
  4. Idempotence tests were added to every test suite. Each test loads data, records the row count, loads again with the same input, and verifies the count hasn't changed. This catches regressions in idempotency logic that unit tests of individual SQL statements would miss.

Key Insights

Related Lessons