Lesson 020: DuckDB executemany Hangs — Use PyArrow Bulk Insert

The Lesson

DuckDB's executemany with parameterized INSERT statements can hang indefinitely at scale (10K+ rows). Replacing it with a PyArrow table and INSERT INTO ... SELECT * FROM tbl completes the same work in under a second. When DuckDB is your warehouse, bulk writes should go through its columnar ingestion path, not its row-at-a-time parameter binding.

Context

A data warehouse built on DuckDB (embedded, single-file) needed to write 12,217 preference scores to a mart table. The write function used conn.executemany("INSERT ...", rows) — the standard Python DB-API pattern for batch inserts. During development with small test datasets (<100 rows), this worked fine. At full scale, the function hung with no error, no timeout, and no progress — the process consumed CPU but never completed.

What Happened

  1. Built a scoring pipeline that computed composite preference scores for 12,217 images. The write step used conn.executemany() with a parameterized INSERT and a list of 12,217 row tuples.
  2. At full scale, the function hung after computing all scores. The process stayed alive (consuming CPU) but produced no output and never returned. Killed and retried three times — same behavior every time.
  3. Suspected DuckDB lock contention (single-writer constraint) and killed all competing Python processes. Removed stale WAL files. Still hung.
  4. Added step-by-step logging around the pipeline (see the timing sketch after this list). Confirmed the hang occurred specifically at conn.executemany() — all upstream computation completed in <1 second.
  5. Replaced executemany with PyArrow bulk insert: build a pa.table() from the row data, then conn.execute("INSERT INTO mart_table (columns) SELECT * FROM tbl"). DuckDB natively reads PyArrow tables through its columnar scan path.
  6. The PyArrow approach completed in <1 second for all 12,217 rows. The fix was applied to all mart write functions in the project.
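
As a concrete illustration of step 4, here is a minimal timing sketch. It is not taken from the project's code; the timed() helper and the compute_scores(), images, conn, and rows names are hypothetical. The point is simply to bracket each pipeline stage with wall-clock logging so the hanging call identifies itself:

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print start/finish lines with a wall-clock duration around one pipeline step.
    print(f"[{label}] start", flush=True)
    t0 = time.perf_counter()
    yield
    print(f"[{label}] done in {time.perf_counter() - t0:.2f}s", flush=True)

with timed("compute scores"):
    rows = compute_scores(images)  # hypothetical upstream step; completed in <1 second

with timed("write mart_scores"):  # this block printed "start" and never finished
    conn.executemany(
        "INSERT INTO mart_scores (run_id, image_sk, score) VALUES (?, ?, ?)",
        rows,
    )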

Key Insights

  - executemany binds parameters row by row; at 10K+ rows this path can hang silently instead of failing with an error.
  - Bulk writes into DuckDB should go through its columnar ingestion path: build a PyArrow table and INSERT INTO ... SELECT from it.
  - Small test datasets (<100 rows) masked the problem completely; the hang only appeared at production scale.

Examples

# BEFORE: hangs at 12K+ rows
conn.executemany(
    "INSERT INTO mart_scores (run_id, image_sk, score) VALUES (?, ?, ?)",
    rows,  # 12,217 tuples
)

# AFTER: completes in <1 second
import pyarrow as pa

tbl = pa.table({  # noqa: F841 — referenced by DuckDB SQL
    "run_id": [r[0] for r in rows],
    "image_sk": [r[1] for r in rows],
    "score": [r[2] for r in rows],
})
conn.execute("INSERT INTO mart_scores (run_id, image_sk, score) SELECT * FROM tbl")

Applicability

This applies to any DuckDB write path with 1K+ rows. For smaller datasets (<100 rows), executemany works fine and is simpler. The PyArrow pattern is also useful for any database that supports columnar ingestion (e.g., ClickHouse, Snowflake via Arrow Flight).

Does NOT apply to PostgreSQL/MySQL, where client drivers provide well-optimized paths for parameterized batch inserts, or to streaming scenarios where rows arrive one at a time.
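
If a codebase has to serve both regimes, one option is a single write helper that switches on batch size. A rough sketch, assuming the mart_scores column layout from the examples above and a made-up 1,000-row cutoff:

import pyarrow as pa

ARROW_THRESHOLD = 1_000  # hypothetical cutoff; tune for your workload


def write_scores(conn, rows):
    """Write (run_id, image_sk, score) tuples to mart_scores."""
    if len(rows) < ARROW_THRESHOLD:
        # Small batches: plain DB-API executemany is simple and fast enough.
        conn.executemany(
            "INSERT INTO mart_scores (run_id, image_sk, score) VALUES (?, ?, ?)",
            rows,
        )
        return
    # Large batches: go through DuckDB's columnar ingestion via PyArrow.
    tbl = pa.table({  # noqa: F841  (referenced by the DuckDB SQL below)
        "run_id": [r[0] for r in rows],
        "image_sk": [r[1] for r in rows],
        "score": [r[2] for r in rows],
    })
    conn.execute(
        "INSERT INTO mart_scores (run_id, image_sk, score) SELECT * FROM tbl"
    )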

Related Lessons