PPR Engine Shootout – 11 Queries

Full national dataset · Stress queries target worst-case engine performance · 5 engines
🏆 Zig 0.464s · 🥈 Polars 1.025s · 🥉 DuckDB 1.472s · 783,755 transactions · 92 MB · 2010 – May 2026

📋 What This Benchmark Tests

This benchmark compares DuckDB, Polars, Pandas, Dask, and Zig against Ireland's Property Price Register: 783K residential property transactions, 92 MB, spanning 2010 to May 2026 (fresh from the weekly cron job). Each engine runs 11 identical analytical queries under a 4-core constraint on a Proxmox LXC container (16 GiB RAM, 512 MB swap).

Three categories of stress: standard aggregations (Q1–Q6: GROUP BY, window functions, string ops), compute-intensive (Q7–Q8: self-joins, YoY shifts via lag), and IO/hash-table intensive (Q9–Q11: 5D GROUP BY, quantile distributions, dense ranking). The set is designed to push each engine to its architectural limits: Python-boundary overhead, distributed shuffle costs, and in-memory vs mmap tradeoffs.
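As a rough illustration of how each engine is measured, the harness follows the pattern below (the bench helper and the per-engine callables are hypothetical names; the real script also enforces the 4-core constraint and tracks peak RSS separately):

```python
import time
import psutil

def bench(name, ingest_fn, query_fns):
    """Time ingest and the 11 queries for one engine, then report RSS."""
    proc = psutil.Process()

    t0 = time.perf_counter()
    data = ingest_fn()                      # load the 92 MB PPR CSV
    ingest_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for q in query_fns:                     # Q1..Q11, identical logic per engine
        q(data)
    query_s = time.perf_counter() - t0

    rss_gib = proc.memory_info().rss / 2**30
    print(f"{name:<8} ingest={ingest_s:.3f}s queries={query_s:.3f}s "
          f"total={ingest_s + query_s:.3f}s rss={rss_gib:.2f} GiB")
```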

Key takeaways: Zig dominates (0.46s), compiled to a native binary with zero overhead. Polars leads the Python engines (1.03s). DuckDB is 3× slower than Zig but 8× more memory efficient than Pandas. Dask is 24× slower than Zig: at 92 MB, distributed compute is pure overhead. The memory watchdog prevented yesterday's swap-crash from recurring.

783,755 transactions · 92 MB CSV data · 11 queries · 700K repeat-sale pairs · 2010–2026 date range (fresh weekly cron) · 16 GiB RAM (LXC) · 512 MB swap

11-Query Benchmark · Full national PPR (dashboard cron, fresh through May 2026)

| Rank | Engine | Ingest | Queries | Total | × Zig | Peak RSS |
|------|--------|--------|---------|-------|-------|----------|
| 🥇 | Zig | 0.305s | 0.158s | 0.464s | 1.0× | 0.8 GiB |
| 🥈 | Polars | 0.096s | 0.929s | 1.025s | 2.2× | 0.5 GiB |
| 🥉 | DuckDB | 0.265s | 1.207s | 1.472s | 3.2× | 0.1 GiB |
| 4 | Pandas | 1.373s | 3.177s | 4.550s | 9.8× | 1.0 GiB |
| 5 | Dask | 1.085s | 10.082s | 11.168s | 24.1× | 0.8 GiB |

[Charts: Total Time by Engine (11 Queries) · Ingest Time · Query Execution Time]

Query Breakdown – Stress Profile

| # | Type | Description | Stress Target |
|---|------|-------------|---------------|
| Q1–Q5 | Standard | Avg by county, monthly volume, top-10 median, price histogram, county×year pivot | Baseline aggregation |
| Q6 | String + agg | Address normalization + duplicate detection | String ops, Python boundary |
| Q7 | Self-join | Price turnaround: first vs last sale per property | CTE + self-join on 690K rows |
| Q8 | Window | Year-over-year price change per county | Window shift/lag over partitions |
| Q9 | 5D GROUP BY | County × year × quarter × bucket × size class | Hash table with ~75K+ groups |
| Q10 | Quantile | Percentile distribution (P10–P90) per county/year | Ordered aggregation × 5 quantiles |
| Q11 | Dense rank | Top-5 and bottom-5 per county per year | Window sort per partition |
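To make the stress profile concrete, Q9's rough shape in the DuckDB dialect is sketched below; the column names (date_of_sale, price, property_size_description) and the €100k bucketing are assumptions, not the exact production query:

```python
import duckdb

# Five grouping dimensions -> ~75K+ hash-table groups.
q9 = duckdb.sql("""
    SELECT county,
           year(date_of_sale)                         AS yr,
           quarter(date_of_sale)                      AS qtr,
           CAST(price / 100000 AS INTEGER)            AS price_bucket,   -- assumed €100k buckets
           COALESCE(property_size_description, 'n/a') AS size_class,
           count(*)                                   AS n,
           round(avg(price))                          AS avg_price
    FROM read_csv_auto('ppr.csv')
    GROUP BY ALL
""").df()
```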

⚡ Key Findings

Zig dominates at 0.464s, 2.2× faster than Polars and 3.2× faster than DuckDB. All 11 queries run in 0.158s of pure computation. Zig compiles to a single binary with zero runtime overhead: no query planner, no garbage collector, just hash maps and arrays. Ingest is slower than Polars (0.305s vs 0.096s) because the CSV parsing is hand-written; a faster parser would push Zig below 0.3s total.

Polars leads the Python engines at 1.025s: Rust-native CSV ingestion (0.096s) keeps it ahead of DuckDB. The query-time gap (0.929s vs 1.207s) is narrow; Polars wins on faster parsing.
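The ingest step itself is a single Rust-native call; a minimal sketch (file path and date parsing assumed) looks like:

```python
import polars as pl

# Parallel CSV parse happens entirely in Rust; Python only holds the handle.
df = pl.read_csv("ppr.csv", try_parse_dates=True)
```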

DuckDB dominates memory efficiency: 0.1 GiB RSS vs Polars' 0.5 GiB. DuckDB processes everything in a single in-process SQL engine with no intermediate copies. Pandas peaks at 1.0 GiB due to .copy() calls and intermediate DataFrames.
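A minimal sketch of that in-process pattern (paths and column names assumed):

```python
import duckdb

con = duckdb.connect()  # in-process engine: no server, no client/server copies
con.execute("CREATE TABLE ppr AS SELECT * FROM read_csv_auto('ppr.csv')")

# Each query runs entirely inside the engine; only the small result set
# crosses the Python boundary.
rows = con.execute(
    "SELECT county, round(avg(price)) AS avg_price "
    "FROM ppr GROUP BY county ORDER BY avg_price DESC"
).fetchall()
```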

Dask is 24× slower than Zig: 11.2s with the threaded scheduler. Even with processes=False, Dask's shuffle path triggers on every GROUP BY, so data is re-partitioned even within a single process. At 92 MB, Dask's distributed overhead is ~9s of pure tax. On a 10 GB+ dataset this would amortize; at this scale it is the wrong tool.
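For context, the single-machine Dask setup under test is roughly the sketch below (block size and column names are assumptions); even in one process, each group-by is split per partition and re-combined, and the heavier ones trigger a shuffle:

```python
import dask
import dask.dataframe as dd

dask.config.set(scheduler="threads")              # processes=False: threaded scheduler
ddf = dd.read_csv("ppr.csv", blocksize="16MB")    # a handful of partitions for 92 MB

# The per-partition split + combine (and shuffles on heavier group-bys)
# is where most of the ~9s of overhead goes at this data size.
avg_by_county = ddf.groupby("county")["price"].mean().compute()
```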

πŸ›‘οΈ Memory Safety β€” Lessons Learned (the hard way)

Yesterday's crash: running Dask with 4 workers and no memory cap on a 16 GiB LXC with 512 MB swap → OOM kill. Lost ~5 in-progress expanded queries that were never saved to disk.

Fixes applied to the benchmark script (a minimal sketch of the key guards follows the list):

• RLIMIT_AS set to 4× physical RAM (not 1×: DuckDB uses mmap for virtual address space, and a tight cap causes spurious "memory allocation of N bytes failed" crashes even with 14 GiB free)

• psutil watchdog checks RSS after every engine and aborts at 85% (MemoryError before swap)

• Dask: 2 workers, memory_limit='4GiB' (kills a worker before it OOMs the machine)

• gc.collect() between engines to prevent memory accumulation

• Backup first: always cp script.py script.py.bak.$(date +%Y%m%d_%H%M%S) before running. The expanded ~11-query version was lost when the machine swap-crashed.
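A minimal sketch of the first two guards (address-space cap plus RSS watchdog, with gc.collect() folded in); the 85% threshold mirrors the bullets above, while the helper name and exact checks are assumptions:

```python
import gc
import resource
import psutil

PHYS = psutil.virtual_memory().total

# Cap virtual address space at 4x physical RAM: loose enough for DuckDB's
# mmap'd address space, tight enough to stop a truly runaway allocation.
resource.setrlimit(resource.RLIMIT_AS, (4 * PHYS, 4 * PHYS))

def watchdog(threshold=0.85):
    """Called after each engine: fail fast before the box starts swapping."""
    gc.collect()                                   # drop the previous engine's frames
    used = psutil.virtual_memory().percent / 100
    if used > threshold:
        raise MemoryError(f"watchdog tripped at {used:.0%} of RAM")
```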

📊 Query Stress Analysis

Q9 (5D GROUP BY): DuckDB's hash table handles this natively. Polars' grouped aggregation is vectorized. Pandas creates a single wide table. Dask shuffles across workers.

Q10 (Percentiles): DuckDB's QUANTILE_CONT is fastest (the Q7-style CTE approach worked cleanly). Polars' .quantile() per group is slightly slower. Pandas falls back to a lambda per group, which is slower again. Dask has to materialize to Pandas for the quantiles.
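For reference, DuckDB can return all five percentiles per group from a single ordered aggregate (column names and the exact percentile set are assumptions):

```python
import duckdb

pct = duckdb.sql("""
    SELECT county,
           year(date_of_sale) AS yr,
           quantile_cont(price, [0.10, 0.25, 0.50, 0.75, 0.90]) AS pctiles
    FROM read_csv_auto('ppr.csv')
    GROUP BY ALL
""").df()
```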

Q11 (Dense rank): all engines handle ranking well. DuckDB: DENSE_RANK() OVER(). Polars: .rank("dense").over(). Dask: .rank() triggers a shuffle. This query stresses the sort/partition path.
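A sketch of Q11's top-5 ranking in the two fastest engines (column names such as date_of_sale and address are assumptions):

```python
import duckdb
import polars as pl

# DuckDB: dense rank within each county/year partition, keep the top 5.
top5_sql = duckdb.sql("""
    SELECT * FROM (
        SELECT county, year(date_of_sale) AS yr, address, price,
               dense_rank() OVER (
                   PARTITION BY county, year(date_of_sale)
                   ORDER BY price DESC
               ) AS rnk
        FROM read_csv_auto('ppr.csv')
    ) t
    WHERE rnk <= 5
""").df()

# Polars: the same ranking via .rank("dense").over(...).
df = pl.read_csv("ppr.csv", try_parse_dates=True).with_columns(
    pl.col("date_of_sale").dt.year().alias("yr")
)
top5_pl = (
    df.with_columns(
        pl.col("price").rank("dense", descending=True).over(["county", "yr"]).alias("rnk")
    )
    .filter(pl.col("rnk") <= 5)
)
```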