
DuckDB vs. Polars: Performance & Memory on Massive Parquet Data

20.1.2026 | 14 min read

Update 02.02.26 – Following helpful input from the Polars team on LinkedIn, we extended the benchmark with a Polars configuration that forces async I/O. Details are in the article.

Our previous benchmark compared DuckDB, Polars, and Pandas on a 9 GB CSV dataset. That's a laptop-scale problem. What happens when analytical workloads reach hundreds of gigabytes or terabytes—scales where memory constraints become critical and file layout decisions matter?

This article stress-tests DuckDB and Polars on datasets up to 2 TB, deliberately exploring adversarial scenarios: very large Parquet files (140 GB each), collections of small files (72 × 2 GB), and combinations of both. Rather than identifying a universal winner, we examine how execution models respond to extreme memory pressure and file organization choices.

By holding the logical query pattern constant and varying only dataset size and physical layout, we isolate the effects of file organization and scale on performance. This enables a focused comparison of execution strategies under heavy, scan-intensive workloads where memory management becomes the critical bottleneck.

Methodology

Benchmark Setup

We use the TPC-H lineitem table, an industry-standard benchmark representing order line items with 16 columns including dates, quantities, prices, and categorical fields. This provides a reproducible baseline with realistic schema complexity and data distributions. Individual Parquet files range in size from approximately 2 GB to 140 GB.

All queries follow the same logical pattern: a Parquet scan with projection to three columns, selective filters, and a final row count. The workload is scan-heavy and filter-and-aggregate oriented. It does not evaluate join-heavy or shuffle-intensive queries. Both systems use their default multithreaded execution settings. Filter selectivity was kept constant across all experiments to ensure comparability.

Experimental Environment

All tests were performed on a 2021 MacBook Pro (M1 Max, 32 GB RAM) using Python. We developed a custom benchmarking tool to ensure consistent and reproducible command-line execution, allowing users to specify both the engine and the operation under test. Results were visualized using matplotlib, and key statistical metrics were reported.

Memory Measurement

Initial experiments measured memory usage as the difference in resident set size (RSS) between the start and end of each query. This approach misled us because peak memory usage often occurs mid-execution, temporary buffers may be freed before completion, and operating system caching can distort final RSS measurements.

In this benchmark, memory usage is defined as the maximum RSS observed during query execution, measured relative to the starting RSS. This captures the peak incremental working set required by the query rather than the total process footprint.

We sampled process memory at short intervals of 0.05 seconds and recorded the highest observed value. Each test executed in a fresh process to ensure a consistent baseline. While operating system page caching could not be fully eliminated, this setup reduces persistent in-process caching effects. Every configuration ran ten times, and reported results represent average values.
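The sampling loop can be sketched roughly as follows. This is a simplified single-process version: the actual tool launched each run in a fresh process, and `psutil` is our assumed sampling library here:

```python
import threading
import time

import psutil


def peak_incremental_rss(fn, interval=0.05):
    """Run fn() while sampling this process's RSS every `interval` seconds;
    return the peak RSS observed minus the baseline at start (bytes)."""
    proc = psutil.Process()
    baseline = proc.memory_info().rss
    peak = baseline
    stop = threading.Event()

    def sampler():
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, proc.memory_info().rss)
            time.sleep(interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    try:
        fn()
    finally:
        stop.set()
        t.join()
    # One final reading in case the peak occurred after the last sample
    peak = max(peak, proc.memory_info().rss)
    return peak - baseline
```

Because the sampler only sees snapshots every 50 ms, very short-lived allocation spikes can slip between samples; for scan-heavy queries lasting seconds to minutes, this granularity is more than sufficient.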

Memory Measurement Nuances: The mmap Effect

Polars uses memory-mapped I/O (mmap) by default for Parquet files on local storage. With mmap, the operating system loads file pages into memory on-demand (lazy loading) and can reclaim them if other processes need memory. This appears as high memory usage in process monitoring tools but differs from permanently allocated memory that cannot be reclaimed.

Our memory measurements capture resident set size (RSS), which includes mmap-backed pages. This represents memory occupied by the process but not necessarily owned exclusively—the OS can evict these pages if needed. For benchmarking purposes, we report peak RSS as this reflects real memory pressure on the system, but we note when mmap significantly contributes to observed usage.

Dataset Generation

The dataset used in this benchmark was generated using DuckDB’s tpch extension, which implements both the data generator and the reference queries for the TPC-H benchmark. The extension is shipped by default with most DuckDB builds and supports easy data generation via scale factors. From the generated dataset, the lineitem table was extracted and exported as a Parquet file using DuckDB.

Test Design

Three experiments isolate different scaling dimensions:

  1. Single-table scaling (2 GB → 140 GB): Tests how engines handle increasing data volume in a single file. This isolates memory management and scanning strategies without file-count effects.

  2. Many small files (up to 72 × 2 GB): Tests file I/O overhead and parallel scan coordination. Real-world data lakes often contain thousands of partitioned files; this scenario evaluates how engines handle file metadata and concurrent reads.

  3. Few very large files (multiple 140 GB files → 2 TB): Combines both challenges—extreme file sizes and multi-file coordination. This represents worst-case scenarios for memory-constrained environments.

In the latter two tests, UNION ALL ensures all input data is processed without deduplication and prevents metadata-only row-count optimizations.

Context: Building on Previous Benchmarks

Our previous article compared these engines on a 9 GB CSV dataset. Key findings:

  • DuckDB: ~300 MB peak memory, fastest execution
  • Polars (lazy + streaming): ~0.5 GB peak memory, competitive speed
  • Pandas: ~10 GB peak memory, slowest execution

Those results suggested Polars could match DuckDB's efficiency at modest scales. This stress test examines whether that parity holds at 10-100× larger scales and with different file formats (Parquet vs CSV).

Results

Across all experiments, the logical query remained identical while the physical layout and total dataset size varied from a few gigabytes to nearly 2 TB. This isolates the effects of file size, file count, and execution strategy on runtime and incremental peak memory.

1. Single-Table Scaling

When a single Parquet file grows from approximately 2 GB to 140 GB, Polars and DuckDB exhibit very similar execution times, with DuckDB maintaining a slight advantage of roughly one second at the largest scale.

polars_duckdb_normal_time.png

DuckDB shows a slow and steady increase in memory usage as data volume grows, peaking at around 1.3 GB. In contrast, Polars' memory usage increases much more aggressively, surpassing 1 GB at roughly 8 GB of data and reaching approximately 17 GB for the ~140 GB dataset.

polars_duckdb_normal_memory.png

This 13× memory difference matters critically in memory-constrained environments. On a 32 GB machine processing the 140 GB dataset, DuckDB leaves ~30 GB free for the OS and other processes, while Polars consumes over half of available RAM.

This result surprised us, particularly given Polars' strong performance in our previous 9 GB benchmark. We investigated further with help from the Polars team and gained additional insight into Polars' internal mechanisms.

Polars' high observed memory consumption stems primarily from mmap page loading. On systems with less available RAM, fewer pages would be loaded, and the apparent memory usage would be lower. Importantly, this memory is not permanently owned by the engine; if the operating system requires it, these pages can be reclaimed.

In typical scenarios, this behavior is entirely reasonable. However, for our specific benchmarking setup, Polars can be configured to behave more similarly to DuckDB. By setting the environment variable POLARS_FORCE_ASYNC=1, Polars switches its default DynByteSource from mmap to ObjectStore for Parquet and IPC readers. This causes Polars to eagerly read bytes into memory instead of relying on lazy page faults via mmap. These settings are optimized for object storage and have not yet been fully optimized for local storage, though the performance impact is expected to be limited.

Adding Polars with forced async to the benchmark shows that this configuration is noticeably slower than both DuckDB and default Polars, requiring nearly three additional seconds for the largest workload.

polars_polars-async_duckdb_normal_time.png

This slower execution time trades performance for improved memory behavior. With forced async enabled, Polars’ memory usage remains comparatively low and grows only modestly with larger workloads—roughly 350 MB for the smallest dataset and around 750 MB for the largest. This contrasts sharply with the dramatic increase observed under the default Polars configuration.

polars_polars-async_duckdb_normal_memory.png

2. Many Small Files

Combining up to 72 Parquet files of approximately 2 GB each results in similar execution times to the single-file case, with DuckDB showing a slightly more pronounced advantage. Interestingly, there is no substantial execution-time difference between default Polars and Polars with forced async in this case.

polars_polars-async_duckdb_stress-small_time.png

DuckDB’s memory usage remains consistently below 70 MB, increasing only marginally as data volume grows. Polars, by contrast, shows a sharp increase in memory consumption—from about 400 MB at 2 GB to roughly 4.3 GB at 40 GB—after which usage largely plateaus. Similar to execution time, both Polars configurations behave more similarly in this test than in the others, scaling in a comparable manner and crossing over at around 100 GB of data, where the default Polars setup becomes slightly more memory-efficient than the async variant.

polars_polars-async_duckdb_stress-small_memory.png

3. Few Very Large Files

Execution-time trends in this scenario are similar to those observed in the second test. For the ~2 TB dataset, Polars completes the workload in approximately one minute, while DuckDB finishes in roughly 45 seconds. Polars with forced async again requires significantly more time, approaching 100 seconds for the largest workload.

polars_polars-async_duckdb_stress-big_time.png

Polars’ memory usage scales from about 1.7 GB to 23 GB across the tested volumes. DuckDB, in contrast, increases from approximately 500 MB to 2.4 GB, maintaining a significantly lower peak memory footprint throughout. With forced async enabled, Polars stays closer to DuckDB, showing a difference of roughly 200 MB for the smallest workload and around 5 GB for the largest. For the largest dataset, forced async reduces Polars’ peak memory usage by up to 2.8×.

polars_polars-async_duckdb_stress-big_memory.png

4. File Layout Impact: Why Partitioning Matters

Comparing identical data volumes across different file layouts reveals a critical finding: partitioning dramatically reduces memory usage for both engines. We compare the single-table scenario with the many-small-files scenario—both represent similar total data volumes but differ in file layout.

Aside from Polars with forced async, no configuration shows a significant execution-time improvement from partitioning data into smaller files.

polars_duckdb_overlay_time.png

Both engines benefit substantially in terms of memory usage. With partitioned workloads, DuckDB reduces its peak memory consumption by up to 8×, while Polars achieves reductions of up to 4×. Polars with forced async performs noticeably better on single large files—even approaching DuckDB's memory efficiency—than when processing many smaller files.

polars_duckdb_overlay_memory.png

Practical Implications

For the same ~140 GB of data:

  • DuckDB: 1.3 GB peak (single file) → 160 MB peak (72 small files) — 8× reduction
  • Polars: 17 GB peak (single file) → 4.3 GB peak (72 small files) — 4× reduction

File organization is not just a performance concern—it's a memory safety issue. Data engineering teams working with large datasets should consider partitioning strategies even when distributed processing isn't required. A monolithic 140 GB Parquet file might work fine in DuckDB but could cause OOM failures with Polars on typical developer laptops.

Discussion

Memory Behavior

Across all experimental configurations, DuckDB’s incremental peak memory usage remained below approximately 2.5 GB, even when processing nearly 2 TB of Parquet data. In contrast, Polars exhibited substantially higher peak memory consumption on very large input files, reaching around 20 GB in some cases. When the same data was partitioned into many smaller files, the incremental peak memory usage of both engines decreased significantly.

These results align with DuckDB’s documentation, which recommends Parquet file sizes between 100 MB and 10 GB. Interestingly, Polars with forced async not only failed to benefit from partitioning in this setup, but actually lost its competitive edge relative to DuckDB.

Polars and Large Files

While Polars achieves competitive execution times, its peak memory usage increases rapidly for large files. The same workload becomes significantly more memory-efficient when we split it into many smaller files, suggesting that Polars' bottleneck lies in row-group and page-level buffering combined with multithreaded scanning and decompression.

In Arrow-style Parquet reading, decompression occurs at the page and column-chunk level, and buffers can be sizable. When we partition workloads into smaller files, fewer row groups are processed concurrently, reducing the worst-case sum of decode, mask, and materialization buffers. Switching to forced async can mitigate this behavior by trading execution speed for reduced memory pressure.

DuckDB’s architecture, by contrast, enforces strict memory limits through its buffer manager. Although DuckDB also decodes Parquet pages, its execution model is designed to keep the working set bounded and to stream or evict data aggressively, preventing large transient memory spikes.

Peak vs. Final Memory Usage

Polars allocates substantial transient memory during query execution (primarily for scanning, decompression, and intermediate buffers) but releases most of it afterward, resulting in a post-execution footprint close to baseline; this behavior is largely attributable to mmap. DuckDB maintains a tighter bound on memory usage throughout execution. In memory-constrained environments, peak usage matters more than final usage, as it determines the risk of out-of-memory failures during execution.

Polars with forced async behaves more like DuckDB in this respect, as it relies on traditional read/write I/O, which is better suited for streaming workloads.

Row Group Size

Row group size plays a critical role in DuckDB's performance. DuckDB parallelizes Parquet scans at the row-group level; consequently, files with a single large row group can only be processed by a single thread. DuckDB recommends row groups of 100 K–1 M rows and warns that sizes below 5,000 rows can degrade performance by 5–10×.

Polars' default row group size is approximately 250 K rows. To account for this, we explicitly generated Parquet files with a row group size of 300,000 rows.

Using optimized row groups significantly improved execution time for both engines:

Impact of Row Group Size Optimization (2 TB dataset):

| Configuration                | Polars      | DuckDB      |
| ---------------------------- | ----------- | ----------- |
| Small row groups (default)   | 170 s       | 70 s        |
| Optimized row groups (300 K) | 60 s        | 44 s        |
| Improvement                  | 2.8× faster | 1.6× faster |

DuckDB's memory usage also dropped substantially, while Polars' memory usage remained relatively stable. This demonstrates that even with engine-agnostic formats like Parquet, physical layout decisions implicitly favor certain execution engines. Row group sizing is not just a tuning parameter—it's a first-class performance determinant.

Memory Usage Does Not Equal "Memory Usage"

Because Polars relies heavily on mmap by default, it is not entirely accurate to attribute observed memory usage solely to the engine itself—the operating system accounts for much of it. From a user's perspective, however, this distinction is largely academic: if a process occupies a given amount of memory and that becomes a system constraint, it makes little difference where the memory is technically allocated.

Polars' use of mmap is not a flaw, nor an argument against the engine. Both Polars and mmap are industry-standard tools used for good reason. Our benchmark simply exposes this design choice as a disadvantage in a very specific scenario. As always, benchmarks represent only one piece of a much larger puzzle and should be interpreted with care.

Conclusion

DuckDB and Polars both deliver strong performance on terabyte-scale analytical workloads, but they offer different configurations with distinct tradeoffs between memory efficiency and throughput.

DuckDB prioritizes memory efficiency. Its buffer manager enforces strict limits, keeping peak memory below 2.5 GB even when processing nearly 2 TB of data. This makes DuckDB the safer choice for memory-constrained environments, unpredictable workloads, and production ETL pipelines where out-of-memory failures are unacceptable.

Polars (default) prioritizes throughput. Its mmap strategy leverages OS page caching for fast reads but trades memory footprint for speed. On the 140 GB single-file test, default Polars consumed 17 GB peak memory but maintained competitive execution times. When data is well-partitioned into smaller files, memory usage drops to 4.3 GB—still higher than DuckDB but acceptable when memory headroom is available. This configuration makes Polars ideal for DataFrame-oriented workflows on well-partitioned datasets.

Polars with forced async offers a third option. By setting POLARS_FORCE_ASYNC=1, Polars switches from mmap to traditional I/O, achieving memory usage comparable to DuckDB (750 MB for 140 GB files). However, this trades speed for memory efficiency—the 2 TB workload takes 100 seconds versus 60 seconds for default Polars and 44 seconds for DuckDB. Counterintuitively, forced async performs better on single large files than on many small files, making it suitable for scenarios where large monolithic files cannot be repartitioned and memory constraints are tight.

The configuration choice depends on your constraints:

| Configuration         | Peak Memory (140 GB file)    | Execution Speed | Best For                                                         |
| --------------------- | ---------------------------- | --------------- | ---------------------------------------------------------------- |
| DuckDB                | 1.3 GB                       | Fastest         | Memory-constrained (16–32 GB), production stability              |
| Polars (default)      | 17 GB → 4.3 GB (partitioned) | Fast            | Well-partitioned data                                            |
| Polars (forced async) | 750 MB                       | Slowest         | Memory-constrained with large files that cannot be repartitioned |

One caveat must be kept in mind: the memory Polars uses in the default configuration depends on how much free memory the system has. If plenty of memory is available, as in our case, mmap will use it; if memory is scarce, mmap will use significantly less. The forced-async configuration showed that the engine itself needs only around 750 MB; the excess is consumed by mmap and therefore varies with the free memory available.

The most surprising finding: file layout matters more than engine choice for memory usage. Partitioning the same 140 GB dataset into 72 smaller files reduced DuckDB's memory by 8× and default Polars' by 4×. Data engineers should treat Parquet file organization as a first-class architectural decision, not just a storage detail.

Although Parquet is an open, engine-agnostic format, physical layout decisions—file size, row group size, compression—meaningfully affect execution behavior. The abstraction is leakier than commonly assumed. In practice, storage and compute remain tightly coupled.

If you are interested in leveraging DuckDB's capabilities in the cloud, consider enrolling in our on-demand Hands-on Workshop: Introduction to MotherDuck for a complete practical walkthrough.
