Lesson: Debugging with Surrogate Key Ranges

Lesson: Debugging with Surrogate Key Ranges

Problem

When investigating why multimodal clustering produced zero results, the breakthrough came from a simple query:

SELECT min(image_sk), max(image_sk) FROM feature_image_embedding;
-- min=13775, max=25991

SELECT min(image_sk), max(image_sk) FROM feature_description_embedding;
-- min=13256, max=13774

The ranges are perfectly adjacent and non-overlapping. This immediately revealed that the two embedding tables represent entirely different image populations — before any join logic was even examined.

Why It Matters

Surrogate keys (auto-incrementing integers assigned during data loading) carry implicit information about when and how data was loaded. Two populations with non-overlapping key ranges were loaded at different times or from different sources. This is not something the key was designed to tell you, but it's enormously useful for debugging.

The Technique

When a join returns fewer rows than expected (or zero), check the key ranges of both sides:

-- Step 1: Key ranges
SELECT 'table_a' as src, min(key), max(key), count(DISTINCT key) FROM table_a
UNION ALL
SELECT 'table_b' as src, min(key), max(key), count(DISTINCT key) FROM table_b;

Possible findings:

Step 2: Characterize the populations

Once you know the ranges, describe what's different about the entities in each range:

SELECT
    CASE WHEN image_sk BETWEEN 13256 AND 13774 THEN 'text_range'
         WHEN image_sk BETWEEN 13775 AND 25991 THEN 'clip_range' END AS population,
    count(*) as n,
    count(title) as has_title,
    sum(CASE WHEN thumb_downloaded THEN 1 ELSE 0 END) as has_thumb
FROM dim_image
GROUP BY 1;

This query immediately showed: text-range images have titles but no thumbnails; clip-range images have thumbnails but no titles. The join's empty result was a data fact, not a code bug.

When This Technique Applies

What Surrogate Keys Tell You (Implicitly)

Observation Likely Cause
Non-overlapping ranges Different source datasets or load batches
Gaps in sequence Rows deleted or filtered during load
One range much larger One source is much larger than the other
Interleaved ranges Data loaded in alternating batches or from a merged source
Identical ranges Same population, different feature extractions

Limitations

The Broader Point

When debugging data pipeline issues, look at the data before looking at the code. The shapes, ranges, and distributions of your tables often make the problem obvious without reading a single line of pipeline logic. Queries like min/max/count, GROUP BY on flags, and INTERSECT/EXCEPT on key columns are the fastest diagnostic tools available.