Lesson: Per-Record Overhead That Doesn't Matter at 10 Rows Kills You at 12,000

Problem

The original thumbnail downloader worked flawlessly on 5 images during development. When scaled to 12,217 images, it was unacceptably slow — not because of network latency, but because of per-image overhead that was invisible at small scale.

Why It Matters

This is the most common performance trap in data pipelines: code that's correct and fast at development scale becomes a bottleneck at production scale. The per-record overhead is often in "bookkeeping" operations that seem cheap individually but multiply catastrophically.

What Happened

  1. Wrote the thumbnail downloader and tested it on 5 images during development. Total runtime: ~2-3 seconds. No performance concerns. The code was clean, correct, and well-structured with proper audit trails.
  2. Attempted to run it on the full dataset of 12,217 images. The projected runtime was 32-62 minutes, nearly all of it overhead. Killed the process and investigated.
  3. Profiled a single iteration. Found 155-305ms of per-image overhead — connection setup (50-200ms), four DB operations (4-12ms), and rate-limit sleep (100ms) — against only 20-50ms of useful work (downloading a 20KB file). The overhead-to-work ratio was 3-15x.
  4. The overhead was invisible at 5 images because the total overhead (0.75-1.5 sec) was indistinguishable from the useful work time (0.1-0.25 sec). At 12,217 images, the 2,443x multiplier turned seconds into tens of minutes.
  5. Eliminated overhead in three categories: shared HTTP client (saved 10-40 min of TLS handshakes), batch DB writes (saved ~2 min of round-trips and eliminated 48,000+ run_manifest rows), and concurrent workers replaced explicit sleep (saved 20 min).
  6. After fixes, the download completed in 2.7 minutes for 7,798 images. The profiling pattern (measure overhead separately from useful work, compute overhead_ratio) became a standard pre-scaling check for subsequent pipeline stages.

The Overhead Inventory

For each of the 12,217 thumbnails, the original downloader performed:

Operation                          Time per call   Total for 12,217   Purpose
New httpx.Client (TLS handshake)   50-200ms        10-40 min          HTTP connection
generate_run_id() (UUID)           <0.1ms          negligible         Audit trail
create_run_record() INSERT         1-3ms           12-36 sec          Audit trail
update_run_status() UPDATE         1-3ms           12-36 sec          Audit trail
dim_image UPDATE                   1-3ms           12-36 sec          State tracking
run_manifest UPDATE (checksum)     1-3ms           12-36 sec          Audit trail
time.sleep(0.1)                    100ms           20 min             Rate limiting

Total per-image overhead: ~155-305ms (excluding network I/O)
Total for 12,217 images: ~32-62 min of pure overhead

The actual network download of a 20KB thumbnail takes ~20-50ms. The overhead is 3-15x the useful work.

What Made It Invisible at Small Scale

At 5 images (the initial test):

  - TLS handshakes: 5 x 50-200ms = 0.25-1 sec
  - DB bookkeeping (four operations per image): 5 x 4-12ms = 20-60ms
  - Rate-limit sleep: 5 x 100ms = 0.5 sec
  - Useful work (downloads): 5 x 20-50ms = 0.1-0.25 sec

The total overhead (0.75-1.5 sec) was indistinguishable from the useful work (0.1-0.25 sec); everything rounded to "instant."

At 12,217 images, every one of these costs is multiplied by 2,443. The half-second of TLS overhead becomes 10-40 minutes. The "free" run_manifest records become up to 36 seconds of DB writes.
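
The 2,443x multiplier is plain arithmetic. A quick sanity check, using the per-image ranges measured in the profile:

# Per-image ranges from the single-iteration profile above.
overhead_s = (0.155, 0.305)  # setup + bookkeeping + sleep, per image
work_s = (0.020, 0.050)      # downloading one 20KB thumbnail

for n in (5, 12_217):
    lo, hi = n * overhead_s[0], n * overhead_s[1]
    wlo, whi = n * work_s[0], n * work_s[1]
    print(f"{n:>6} images: overhead {lo:.1f}-{hi:.1f}s, work {wlo:.1f}-{whi:.1f}s")

# 5 images:      overhead ~0.8-1.5s vs work ~0.1-0.25s  -> both feel instant
# 12,217 images: overhead ~32-62min vs work ~4-10min    -> overhead dominates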

The Fixes (and Their Multipliers)

1. Shared HTTP client (biggest win)

import httpx

# Before: 12,217 TLS handshakes
for guid in guids:
    with httpx.Client() as client:  # new connection (and TLS handshake) each time
        client.get(url)

# After: 1 TLS handshake, then connection reuse from the client's pool
with httpx.Client(http2=True) as client:
    for guid in guids:
        client.get(url)  # reuses the existing connection

Savings: 10-40 minutes
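
One detail worth knowing: httpx.Client keeps a connection pool, so subsequent requests to the same host skip both TCP and TLS setup; http2=True requires the optional extra (pip install httpx[http2]).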

2. Batch DB writes (second biggest win)

# Before: 4 DB operations per image = 48,868 operations
for guid in guids:
    create_run_record(conn, ...)
    # ... download ...
    update_run_status(conn, ...)
    conn.execute("UPDATE dim_image SET ...")
    conn.execute("UPDATE run_manifest SET ...")

# After: 1 run record + 1 batch UPDATE = ~7 operations
create_run_record(conn, ...)  # one record for entire batch
# ... all downloads ...
conn.execute("UPDATE dim_image SET ... WHERE source_image_id IN (...)", [all_guids])
update_run_status(conn, ...)  # one update

Savings: ~2 minutes of DB I/O, plus eliminates 48,000+ run_manifest rows
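
One practical detail the snippet above elides is how the IN (...) clause gets its placeholders. A minimal sketch, assuming a sqlite3-style DB-API connection; the status column and the mark_downloaded helper are illustrative, not the project's actual schema:

import sqlite3

def mark_downloaded(conn: sqlite3.Connection, guids: list[str]) -> None:
    # One "?" per GUID, so the whole batch is a single statement.
    placeholders = ",".join("?" * len(guids))
    conn.execute(
        f"UPDATE dim_image SET status = 'downloaded' "  # 'status' is illustrative
        f"WHERE source_image_id IN ({placeholders})",
        guids,
    )
    conn.commit()  # one commit for the batch, not one per image

Note that SQLite caps host parameters per statement (999 on older builds, 32,766 since 3.32.0), so a 12,217-GUID batch may need chunking.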

3. Eliminate sleep (obvious but requires thought)

# Before: 0.1s sleep per image
time.sleep(RATE_LIMIT_R2_CDN)  # 0.1s * 12,217 = 20 min of sleeping

# After: concurrent workers provide implicit rate limiting
# 3 workers with connection pool = ~3 concurrent requests
# No explicit sleep needed

Savings: 20 minutes of sleeping
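
A minimal sketch of the worker pattern, assuming a ThreadPoolExecutor over a shared client; the download_one helper and the URL are illustrative, not the actual code:

from concurrent.futures import ThreadPoolExecutor

import httpx

def download_one(client: httpx.Client, guid: str) -> bytes:
    # httpx.Client is thread-safe; all workers share one connection pool.
    resp = client.get(f"https://cdn.example.com/thumbs/{guid}.jpg")  # illustrative URL
    resp.raise_for_status()
    return resp.content

with httpx.Client(http2=True) as client:
    with ThreadPoolExecutor(max_workers=3) as pool:  # ~3 concurrent requests
        thumbnails = list(pool.map(lambda g: download_one(client, g), guids))

The max_workers cap is what replaces the explicit sleep: at most three requests are in flight at once, which throttles load on the CDN without any idle time.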

The Pattern: Audit Trail Overhead

The run_manifest records are a recurring pattern. The manifest design was correct for tracking pipeline runs (a metadata collection run, a load batch), where each run is a discrete, meaningful event. But applying the same per-event tracking to 12,217 identical thumbnail downloads produces a table with 12,217 rows that all say the same thing: "downloaded a thumbnail from R2."
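
As a sketch of the batch-level alternative: the run record itself carries aggregate counts for the whole download, so the audit trail stays intact without per-file rows. Table and column names below are illustrative, not the project's actual schema.

# One summary row per batch instead of 12,217 per-file manifest rows.
# n_attempted / n_ok / n_failed are illustrative column names.
conn.execute(
    "UPDATE run_record SET status = ?, n_attempted = ?, n_ok = ?, n_failed = ? "
    "WHERE run_id = ?",
    ("complete", len(guids), ok_count, len(guids) - ok_count, run_id),
)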

When per-record tracking is worth the cost

  - Each run is a discrete, meaningful event (a metadata collection run, a load batch) that someone may need to audit individually.
  - Failures are record-specific and need per-record diagnosis or replay.

When to batch

  - Thousands of records go through an identical operation, so a per-record row would say the same thing every time.
  - A single run record with aggregate counts (attempted, succeeded, failed) answers every question the audit trail needs to answer.

Broader Lesson

Before scaling a pipeline from development to production, profile the per-record overhead separately from the per-record useful work:

import time

# Instrument one iteration: time setup, useful work, and bookkeeping
# separately (setup/do_work/bookkeep stand in for the stage's own functions).
t0 = time.perf_counter(); setup();    t_setup = time.perf_counter() - t0
t0 = time.perf_counter(); do_work();  t_work = time.perf_counter() - t0
t0 = time.perf_counter(); bookkeep(); t_book = time.perf_counter() - t0

total_records = 12_217  # however many records production will see

overhead_ratio = (t_setup + t_book) / t_work
projected_overhead = (t_setup + t_book) * total_records
print(f"Overhead ratio: {overhead_ratio:.1f}x")
print(f"Projected overhead at {total_records} records: {projected_overhead/60:.0f} min")

If the overhead ratio is > 1x, you'll spend more time on bookkeeping than on useful work. That's the signal to batch.