Update 10.12.25 – After helpful insights from Polars engineer Thijs Nieuwdorp following the initial posting of this article, we refactored our use of Polars' .count(), which counts non-null values in every column, replacing it with the more efficient .select(pl.len()).
Efficient processing of large, structured datasets is central to modern data analysis. Pandas has long been Python’s default DataFrame library, valued for its flexibility, rich ecosystem, and intuitive API. As datasets grow beyond memory and performance demands rise, newer tools like Polars and DuckDB have gained traction. While Polars and Pandas are DataFrame libraries and DuckDB is an embedded SQL analytics engine, all three aim to make large-scale data work faster and simpler—through parallel execution, lazy computation, and out-of-core processing.
This article compares Pandas, Polars, and DuckDB across performance, memory usage, scalability, ergonomics, and interoperability. We’ll highlight when a DataFrame-first workflow shines, when SQL-first tooling is better, and how these tools complement each other in real-world pipelines.
Background
Pandas
Pandas remains the most widely adopted DataFrame library, offering a mature API and seamless integration with the Python scientific stack. It shines in prototyping, data cleaning, and exploratory analysis. By default, pandas executes operations eagerly and largely single-threaded, and typical workflows assume data fits in memory, so very large files (e.g., tens of GB) can cause memory pressure or crashes unless you chunk or downsample. Recent releases (pandas 2.x) add copy-on-write and optional Arrow-backed dtypes that improve memory efficiency and interoperability, but pandas is not an out-of-core or parallel analytics engine.
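As a rough illustration of that chunking escape hatch (not used in our benchmarks, where we stick to the idiomatic eager path), a filter-and-count can be expressed with pandas' chunked CSV reader; the file path and column name here are placeholders:

```python
import pandas as pd

def count_purchases_chunked(path: str, chunksize: int = 1_000_000) -> int:
    """Count 'purchase' rows without materializing the whole CSV.

    Only one chunk lives in memory at a time, which bounds peak usage
    at the cost of a less direct programming model than one read_csv call.
    """
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += int((chunk["event_type"] == "purchase").sum())
    return total
```

Peak memory then scales with the chunk size rather than the file size, which is why we treat this as an advanced pattern rather than pandas' default workflow.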
Polars
Polars is a modern, columnar DataFrame library written in Rust with Python bindings, leveraging Apache Arrow for efficient memory representation. It supports multithreaded execution by default and offers both eager and lazy APIs. The lazy engine enables query optimization (projection and predicate pushdown) and, when used with streaming execution, can process datasets larger than RAM for many workloads. Not all operations stream, but for medium to large datasets the combination of parallelism, lazy optimization, and streaming often yields substantial speed and memory benefits. While its ecosystem is still growing and some pandas-specific features aren’t mirrored, Polars’ rapid development and optional GPU acceleration make it a compelling choice for high-performance data processing.
DuckDB
DuckDB is a modern, in-process OLAP SQL engine written in C++ and designed for high-performance analytical queries. It uses a vectorized, pipelined execution engine with a cost-based optimizer, supports multithreaded execution, and can process datasets larger than RAM via streaming scans and automatic spilling to disk for operations like sorts and aggregates. DuckDB can execute SQL directly over CSV files as well as in-memory DataFrames (pandas, Polars) and Arrow tables, enabling a seamless SQL-first workflow. It excels at complex joins, aggregations, and group-bys. DuckDB’s performance, scalability, and interoperability make it a powerful building block for analytics pipelines.
Methodology
Benchmarking Principles
Benchmarking is more than running the same operation on different systems. To ensure fair and reproducible results, we make all scripts and environments available for review and adhere to benchmarking guidelines established by experts in the field, such as DuckDB's own Hannes Mühleisen.
We benchmark each tool using its idiomatic, built-in workflow for large, structured data—without special tuning or external extensions. DuckDB is used as a SQL-first engine, querying files directly. Polars is used via its lazy scan API with streaming enabled, which is the documented approach for processing large files efficiently. Pandas is used with its standard eager DataFrame construction (read_csv) because it has no integrated out-of-core or lazy engine. This approach reflects how practitioners naturally solve the task in each tool while avoiding configuration knobs (thread counts, PRAGMAs, alternative parsers, GPU backends).
Test Setup
We benchmarked core OLAP operations—filtering and counting—using a real-world ecommerce dataset (CSV, 9 GB, 67 million rows, 9 columns). All tests were performed on a 2021 MacBook Pro (M1 Max, 32 GB RAM) using Python. Our benchmarking tool ensures consistent and reproducible command-line execution, allowing users to specify the tool, operation, benchmark mode (cold or hot), and number of runs. Results are visualized with matplotlib, and we report key statistical metrics: mean, standard deviation, and coefficient of variation.
Memory usage is measured by recording the memory consumed immediately before and after each function call; the difference represents the memory used by the function. Hot runs leverage OS page cache and library buffers. Cold runs execute in isolated processes with randomized file access; macOS page cache is not force-flushed, so results represent “colder” rather than fully uncached scenarios. For both modes, each operation is repeated 10 times.
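Our tooling records process-level memory (RSS); the sketch below illustrates the same before/after idea using the standard library's tracemalloc, which tracks Python-level allocations only, so its numbers will not match RSS-based figures:

```python
import tracemalloc

def measure(func, *args, **kwargs):
    """Run func and report the net and peak Python-level allocation in bytes."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = func(*args, **kwargs)
    after, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    # The difference before/after is the memory retained by the call;
    # peak captures transient spikes during execution.
    return result, after - before, peak

data, delta, peak = measure(lambda n: list(range(n)), 100_000)
print(f"net={delta} B, peak={peak} B")
```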
We publish the exact scripts and environment details so readers can rerun the idiomatic paths as described in our repository. No special flags or extensions are required beyond installing the standard packages; results should be stable across similar hardware and OS configurations.
Results
Cold Runs
Cold benchmark results highlight DuckDB’s memory efficiency. Polars nearly matches DuckDB's execution time and memory usage, while Pandas lags significantly behind both tools in both metrics.
Pandas shows high overhead because it parses the CSV and materializes a full DataFrame. In cold runs, read_csv alone adds roughly 13 GB of memory on top of the 9 GB CSV, and filtering for purchase events adds about 1.4 GB more.
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    142.2 MiB    142.2 MiB           1   @profile
     8                                         def filtering_counting():
     9  13112.7 MiB  12970.5 MiB           1       df = pd.read_csv(dataset_path)
    10  14491.0 MiB   1378.3 MiB           1       purchases = df[df["event_type"] == "purchase"]
    11  14491.1 MiB      0.1 MiB           1       print("Count:", len(purchases))
```
In eager mode, Polars materializes a DataFrame from the entire dataset, which can lead to substantial memory usage. With a lazy CSV scan, by contrast, Polars avoids loading the full dataset into memory and processes only the rows required for the specific operation, yielding clear memory savings. Additionally, Polars' lazy engine supports streaming execution, which further reduces memory consumption by processing the data in smaller, manageable batches.
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    142.8 MiB    142.8 MiB           1   @profile
     8                                         def filtering_counting():
     9    143.2 MiB      0.4 MiB           1       lf = pl.scan_csv(dataset_path)
    10   1681.7 MiB   1538.5 MiB           1       result = lf.filter(pl.col("event_type") == "purchase").count().collect(streaming=True)
    11   1681.9 MiB      0.2 MiB           1       print(result)
```
Initially, we thought this would be the maximum for Polars. But after we published this article on LinkedIn, Thijs Nieuwdorp, Developer Relations Engineer at Polars, reached out and pointed out an oversight:
I dove into it a little deeper and noticed that the DuckDB COUNT(*) doesn't translate well to our .count(). The latter counts the number of non-null elements in every column, causing us to scan the entire file. Instead, you could replace the .count() with .select(pl.len()) and get the same result as DuckDB, which should be a single column with one value, the length of the DataFrame.
We were happy to implement this suggestion, which significantly reduced Polars' memory usage by about 1 GB, as Thijs correctly observed:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    142.1 MiB    142.1 MiB           1   @profile
     8                                         def filtering_counting():
     9    142.5 MiB      0.4 MiB           1       lf = pl.scan_csv(dataset_path)
    10    592.7 MiB    450.3 MiB           1       result = lf.filter(pl.col("event_type") == "purchase").select(pl.len()).collect(streaming=True)
    11    593.0 MiB      0.2 MiB           1       print(result)
```
DuckDB queries the CSV directly, using only ~300 MB for the entire run and returning results quickly thanks to predicate pushdown, late materialization, and vectorized, pipelined execution.
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    142.2 MiB    142.2 MiB           1   @profile
     8                                         def filtering_counting():
     9    428.3 MiB    286.1 MiB           1       duckdb.sql(f"SELECT COUNT(*) AS purchase_count FROM read_csv_auto('{dataset_path}') WHERE event_type = 'purchase'").show()
```
Hot Runs
In hot benchmarks, DuckDB’s advantage shrinks. Pandas reduces memory usage after initial runs and frequently alternates between freeing and consuming memory—yet still remains higher than the competition. Polars' memory usage also varies, but it benefits from a lower base level and frequent memory release.
Polars is thus able to close the gap with DuckDB.
None of the tools achieve significant time savings during hot runs. Over 10 hot runs, Pandas uses about 10 GB less memory than in its cold runs, yet it still consumes substantially more than Polars and DuckDB do even in their cold runs. Polars saves more memory in hot runs, relative to its cold runs, than DuckDB does, which lets it match DuckDB's efficiency.
Discussion
These benchmarks underscore how architecture and execution models drive the behavior of Pandas, Polars, and DuckDB on large datasets. Results depend on data format, schema (string vs. numeric), filter selectivity, and storage characteristics; our findings reflect a 9 GB CSV on a local SSD.
Pandas is consistently slower and less predictable in memory because it eagerly materializes a full DataFrame from CSV and executes largely single-threaded. While many operations are array-level vectorized via NumPy, pandas lacks a query optimizer and a database-style vectorized/pipelined engine, and it does not offer integrated lazy or out-of-core execution. For big files, this translates into higher peak memory, more Python-object overhead (especially for strings), and limited parallelism. Advanced patterns (e.g., read_csv with chunksize/iterators) can reduce memory but change the programming model and fall outside our idiomatic comparison.
Polars, built in Rust on Arrow, leverages multithreading and a columnar expression engine to outperform pandas on speed. Its lazy API with streaming mode can avoid full materialization for many queries by pushing filters/projections into scans. Streaming is not universal: operations that require cross-batch context or materialize intermediates may still consume several gigabytes of RAM. In our filter-and-count workload and environment, we observed peak memory around ~0.5 GB with lazy+streaming; actual figures vary with schema and selectivity. Overall, Polars substantially improves memory efficiency and runtime versus pandas and even matches DuckDB during hot runs for large, on-disk analytics.
DuckDB delivers consistently low memory usage and strong performance by combining a cost-based optimizer with vectorized, pipelined execution and late materialization. It pushes predicates and projections into file scans and streams data without fully materializing tables, keeping peak RSS small and stable. Running in-process with C++ data structures avoids Python-object overhead, and multicore utilization is automatic, making DuckDB well-suited for fast ad-hoc analytics over large CSV files.
Why is Pandas so slow?
DuckDB and Polars process only the necessary columns, execute in cache-friendly vectors, and leverage parallel pipelines; pandas eagerly builds full DataFrames and lacks a query optimizer or a database-style execution engine. Despite array-level vectorization via NumPy, pandas’ typical usage incurs more computation, more memory traffic, and limited parallelism at scale. Without integrated lazy/out-of-core execution, pandas’ idiomatic path remains memory-bound. While chunked reading can help, it changes the workflow and isn’t directly comparable to SQL/lazy pipelines.
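One way to approximate projection pushdown manually in pandas is to ask the CSV parser for only the column the query needs; the sketch below generates a tiny CSV so it is self-contained, and the column and file names are illustrative:

```python
import pandas as pd

# Write a tiny CSV so the example is self-contained.
with open("events_small.csv", "w") as f:
    f.write("event_type,price\nview,0.0\npurchase,9.99\npurchase,4.50\n")

# usecols tells the parser to materialize a single column, mimicking the
# automatic projection pushdown that DuckDB and Polars perform in scans.
df = pd.read_csv("events_small.csv", usecols=["event_type"])
purchase_count = int((df["event_type"] == "purchase").sum())
print(purchase_count)
```

Even so, pandas still parses the file eagerly and row-by-row into memory; the pruning only shrinks what is kept, not how the scan executes.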
Conclusion
Pandas, Polars, and DuckDB each excel in different parts of modern analytics. For small to medium datasets and rich library integration, pandas remains a productive default. For larger or performance-sensitive workloads, Polars and DuckDB deliver substantial gains via parallel execution and out-of-core/lazy pipelines. In our CSV-based benchmarks, DuckDB provided the most consistent performance and the lowest peak memory by querying files directly without fully materializing tables. Thanks to the suggestion from the Polars team, Polars reduced its memory usage by about 1 GB and now trails DuckDB by only a couple hundred megabytes. This highlights how small implementation details can have a significant impact on real-world performance and memory efficiency.
This exchange of knowledge between tool developers and practitioners is vital for the progress of the data ecosystem. Open dialogue—such as the feedback we received from Thijs Nieuwdorp at Polars—not only helps correct misunderstandings and improve benchmarking accuracy, but also accelerates the adoption of best practices across the community. By sharing insights and collaborating openly, we ensure that both tools and users evolve together, leading to more robust, efficient, and user-friendly solutions. Such interactions highlight the importance of transparency, humility, and continuous learning in the rapidly changing landscape of data analytics.
DuckDB is strongest for SQL-first, on-disk analytics; Polars shines for fast DataFrame transformations (especially with lazy + streaming); pandas remains ideal for interactive munging on moderate-sized, in-memory data. The most effective strategy is blended: use each tool where its architecture aligns with the problem. Results may vary with data format (CSV vs Parquet), schema (string vs numeric), filter selectivity, and hardware.
If you are interested in leveraging DuckDB's demonstrated power into the cloud, enroll in our on-demand Hands-on Workshop: Introduction to MotherDuck for a complete practical walkthrough!
Blog author
Niklas Niggemann
Working Student Data & AI
Do you still have questions? Just send me a message.