
Zero-ETL with MotherDuck: A Technical Deep Dive

6.10.2025 | 5 minutes reading time

MotherDuck, the cloud-native service built on DuckDB, fundamentally transforms how organizations interact with data stored in cloud blob storage. By eliminating the traditional ETL/ELT pipeline, MotherDuck enables direct SQL analytics on Parquet, JSON, and CSV files (among others) located in Amazon S3, Azure Blob Storage, or Google Cloud Storage without requiring data movement or preprocessing. This approach represents a fundamental shift from conventional data warehouse architectures where data must be ingested, transformed, and stored before analysis can begin.

The zero-ETL approach allows organizations to maintain their data in its original format and location while performing analytical queries. This eliminates data duplication, reduces storage costs, and removes the latency associated with traditional ETL processes. Most importantly, it enables data teams to query data immediately without waiting for batch jobs or complex pipeline orchestrations.

How MotherDuck Accesses Cloud Data

Direct File Access

When you execute a query against cloud storage, MotherDuck establishes direct connections to the storage service and reads only the necessary data segments. Consider this query:

SELECT customer_id, SUM(order_total) as revenue
FROM read_parquet('s3://analytics-bucket/orders/*.parquet')
WHERE order_date >= '2025-01-01'
GROUP BY customer_id;

MotherDuck doesn't copy these files to a staging area or intermediate storage. Instead, it leverages DuckDB's table functions that natively understand cloud storage protocols, enabling direct reads from AWS S3, Azure Blob Storage, or Google Cloud Storage endpoints. The system maintains persistent HTTP connections to these storage services, reusing them across multiple requests to minimize connection overhead and optimize for both latency and throughput. Prior authentication is required through IAM roles, access keys, or other cloud-native security mechanisms.
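
For example, S3 credentials can be registered up front with DuckDB's secrets manager. The following is a minimal sketch assuming key-based access; the secret name, key values, and region are placeholders, and the available options differ per cloud provider:

CREATE SECRET s3_analytics (
    TYPE S3,
    KEY_ID 'AKIA...',        -- placeholder access key ID
    SECRET '...',            -- placeholder secret access key
    REGION 'eu-central-1'    -- placeholder bucket region
);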

Format-Native Processing

For Parquet files, MotherDuck takes advantage of the columnar format's structure to minimize data transfer. When querying a 10GB Parquet file containing 50 columns but selecting only three of them, MotherDuck reads just those column chunks directly from storage via HTTP range requests instead of downloading the entire file and discarding unused columns afterward, which drastically reduces the transferred data volume.

For JSON files, MotherDuck employs stream parsing strategies that allow direct querying of nested structures:

SELECT
    json_extract_string(data, '$.user.email') as email,
    CAST(json_extract(data, '$.purchase.amount') AS DECIMAL(10,2)) as amount
FROM read_json('s3://logs-bucket/events/*.json')
WHERE json_extract_string(data, '$.event_type') = 'purchase';

The JSON reader detects column names and types automatically and converts the parsed values directly into DuckDB's vectorized representation.
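
When automatic detection is not desired, the JSON reader also accepts an explicit schema. The following is a minimal sketch assuming newline-delimited event files; the fields event_type and user_id are hypothetical:

SELECT event_type, user_id
FROM read_json('s3://logs-bucket/events/*.json',
               format = 'newline_delimited',          -- one JSON object per line
               columns = {event_type: 'VARCHAR',      -- explicit schema skips type inference
                          user_id: 'BIGINT'});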

Query Optimization Techniques

MotherDuck uses Predicate and Projection Pushdown to further enhance query performance. Both techniques will be illustrated with examples below.

Predicate Pushdown

Predicate pushdown represents one of the most impactful optimizations in MotherDuck's execution engine. Instead of reading all data and filtering afterward, the system pushes filtering operations as close to the storage layer as possible. Consider this scenario with Parquet files:

SELECT COUNT(*) as premium_sales
FROM read_parquet('s3://sales-data/year=2024/month=*/day=*/*.parquet')
WHERE sale_amount > 1000
    AND product_category = 'Premium';

MotherDuck applies multiple levels of filtering. First, partition elimination uses the directory structure to skip entire paths that don't match the query predicates. Second, row group elimination leverages Parquet metadata, specifically min/max statistics and Bloom filters when available, to skip row groups where the maximum sale_amount is below 1000. Third, late materialization defers reading non-essential columns until after filtering is complete.
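
The row group statistics driving this elimination can be inspected directly with DuckDB's parquet_metadata table function. The following is a minimal sketch against a single, hypothetical file from the partition layout above; the exact column names of the metadata output may vary slightly between DuckDB versions:

-- show per-row-group min/max statistics for the filtered column
SELECT row_group_id, path_in_schema, stats_min, stats_max
FROM parquet_metadata('s3://sales-data/year=2024/month=01/day=01/part-0.parquet')
WHERE path_in_schema = 'sale_amount';

Row groups whose stats_max is at most 1000 can be skipped without reading any of their data pages.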

This optimization can drastically reduce the transferred data volume and is even more effective with sorted data or when querying recent partitions in time-series datasets.

Projection Pushdown

Projection pushdown ensures that only required columns are transferred from storage. MotherDuck's query planner identifies all columns needed for the entire query execution, including those required for filtering, joining, aggregation, and final projection:

SELECT customer_name, order_date, total_amount, shipping_address, payment_method
FROM read_parquet('s3://orders-archive/2024/*.parquet')
WHERE order_status = 'completed'
    AND total_amount > 100;

The optimizer determines that it needs to read six columns: the five in the SELECT clause plus order_status for filtering. It configures the Parquet reader to request only these specific column chunks via HTTP range requests and ignores the remaining columns entirely, which decreases both memory usage and network transfer.
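
This column selection can be observed in the query plan: prefixing the statement with EXPLAIN prints the physical plan, and the Parquet scan operator lists only the projected columns together with the pushed-down filters (the exact rendering depends on the DuckDB/MotherDuck version):

EXPLAIN
SELECT customer_name, order_date, total_amount, shipping_address, payment_method
FROM read_parquet('s3://orders-archive/2024/*.parquet')
WHERE order_status = 'completed'
    AND total_amount > 100;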

Hybrid Execution: Intelligent Query Routing

MotherDuck's hybrid architecture automatically determines optimal query execution location based on data locality and computational requirements. When connected to MotherDuck from a local DuckDB instance, the system routes operations intelligently:

SELECT
    l.product_id,
    l.local_price,
    AVG(c.cloud_price) as avg_cloud_price
FROM read_csv('local_prices.csv') l
JOIN read_parquet('s3://pricing-data/historical/*.parquet') c
  ON l.product_id = c.product_id
GROUP BY l.product_id, l.local_price;

The execution engine performs S3 reads in MotherDuck's cloud environment. Local CSV operations execute on the client machine. The join strategy depends on relative data sizes: small local tables might be pushed to the cloud for processing, while aggregated cloud results might be pulled locally for final joining. This prevents unnecessary movement of large datasets while maintaining query performance.
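
When a small local table is joined against cloud data repeatedly, it can be worth persisting it in MotherDuck once so that subsequent joins run entirely in the cloud. The following is a minimal sketch assuming an attached MotherDuck database named my_db; the database and table names are placeholders:

-- upload the small local table into MotherDuck once
CREATE TABLE my_db.main.local_prices AS
SELECT * FROM read_csv('local_prices.csv');

-- later joins reference the cloud copy instead of re-uploading the local file
SELECT l.product_id, l.local_price, AVG(c.cloud_price) AS avg_cloud_price
FROM my_db.main.local_prices l
JOIN read_parquet('s3://pricing-data/historical/*.parquet') c
  ON l.product_id = c.product_id
GROUP BY l.product_id, l.local_price;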

Multi-Format Query Federation

MotherDuck can join data across different formats and locations within a single query. The system can combine Parquet, JSON, and CSV sources seamlessly:

SELECT
    p.customer_id,
    p.purchase_amount,
    json_extract_string(s.session_data, '$.duration') as session_duration
FROM read_parquet('s3://purchases/2025/*.parquet') p
JOIN read_json('s3://sessions/2025/*.json') s
  ON p.session_id = json_extract_string(s.session_data, '$.session_id')
WHERE p.purchase_date >= '2025-01-01';

Each format is processed using its optimal access pattern while joins are executed using the most efficient strategy based on data size and distribution.
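
Such federated queries can also be wrapped in a view so downstream consumers see a single relational interface. The following is a minimal sketch reusing the sources from the query above; the view name purchase_sessions is a placeholder:

-- expose the federated join under a stable name
CREATE VIEW purchase_sessions AS
SELECT
    p.customer_id,
    p.purchase_amount,
    json_extract_string(s.session_data, '$.duration') AS session_duration
FROM read_parquet('s3://purchases/2025/*.parquet') p
JOIN read_json('s3://sessions/2025/*.json') s
  ON p.session_id = json_extract_string(s.session_data, '$.session_id');

-- consumers query the view like any table
SELECT customer_id, COUNT(*) AS purchases
FROM purchase_sessions
GROUP BY customer_id;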

Limitations and Considerations

While MotherDuck's zero-ETL approach offers substantial benefits, certain scenarios require careful consideration. Extremely complex transformations involving multiple stages of windowing and aggregation may perform better with materialized intermediate results. Real-time streaming use cases requiring very low latency still benefit from dedicated streaming infrastructure. Regulatory requirements sometimes mandate data residency or transformation audit trails that zero-ETL approaches cannot provide.

File format considerations also matter. While Parquet offers the best optimization potential with its columnar format and rich metadata, JSON files lack these optimizations and may result in higher scan volumes. Organizations should also consider network bandwidth costs when frequently querying large datasets across regions, and evaluate whether frequently accessed, heavily transformed datasets might benefit from selective materialization, while keeping the zero-ETL approach for occasional ad-hoc queries.
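
Selective materialization can be as simple as persisting a heavy aggregation as a MotherDuck table on a schedule and pointing frequent queries at it. The following is a minimal sketch assuming an attached MotherDuck database my_db; the table name daily_revenue is a placeholder:

-- materialize the frequently used aggregate once
CREATE OR REPLACE TABLE my_db.main.daily_revenue AS
SELECT order_date, SUM(order_total) AS revenue
FROM read_parquet('s3://analytics-bucket/orders/*.parquet')
GROUP BY order_date;

-- ad-hoc queries read the small table instead of rescanning the raw files
SELECT * FROM my_db.main.daily_revenue
WHERE order_date >= '2025-01-01';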

Conclusion

MotherDuck's zero-ETL capabilities represent an interesting alternative for cloud data analytics architecture. By treating cloud storage as a queryable data source with elegant pushdown optimizations, it eliminates the traditional boundaries between data storage and data processing. The combination of predicate pushdown, projection pushdown, and hybrid execution creates a system where large datasets become as accessible as local databases.

For organizations evaluating their data architecture, MotherDuck offers an alternative to traditional ETL pipelines. The ability to query data directly where it resides, in its native format, with full SQL capabilities and automatic optimizations, simplifies the entire analytics stack. This is an opportunity to rethink data architecture and make analytics more immediate, more flexible, and more accessible across the organization. As data volumes continue to grow and real-time insights become increasingly critical, the zero-ETL approach positions organizations to handle future scale without the complexity of traditional data pipeline architectures.
