
Zero-ETL with MotherDuck: A Technical Deep Dive

6.10.2025 | 5 minutes reading time

MotherDuck, the cloud-native service built on DuckDB, fundamentally transforms how organizations interact with data stored in cloud blob storage. By eliminating the traditional ETL/ELT pipeline, MotherDuck enables direct SQL analytics on Parquet, JSON, and CSV files (among others) located in Amazon S3, Azure Blob Storage, or Google Cloud Storage without requiring data movement or preprocessing. This approach represents a fundamental shift from conventional data warehouse architectures where data must be ingested, transformed, and stored before analysis can begin.

The zero-ETL approach allows organizations to maintain their data in its original format and location while performing analytical queries. This eliminates data duplication, reduces storage costs, and removes the latency associated with traditional ETL processes. Most importantly, it enables data teams to query data immediately without waiting for batch jobs or complex pipeline orchestrations.

How MotherDuck Accesses Cloud Data

Direct File Access

When you execute a query against cloud storage, MotherDuck establishes direct connections to the storage service and reads only the necessary data segments. Consider this query:

SELECT customer_id, SUM(order_total) as revenue
FROM read_parquet('s3://analytics-bucket/orders/*.parquet')
WHERE order_date >= '2025-01-01'
GROUP BY customer_id;

MotherDuck doesn't copy these files to a staging area or intermediate storage. Instead, it leverages DuckDB's table functions that natively understand cloud storage protocols, enabling direct reads from AWS S3, Azure Blob Storage, or Google Cloud Storage endpoints. The system maintains persistent HTTP connections to these storage services, reusing them across multiple requests to minimize connection overhead and optimize for both latency and throughput. Prior authentication is required through IAM roles, access keys, or other cloud-native security mechanisms.
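
For example, S3 credentials can be registered up front with DuckDB's secrets manager. The following is a minimal sketch assuming key-based access; the secret name, key values, and region are placeholders, and the available options differ per cloud provider:

CREATE SECRET s3_analytics (
    TYPE S3,
    KEY_ID 'AKIA...',        -- placeholder access key ID
    SECRET '...',            -- placeholder secret access key
    REGION 'eu-central-1'    -- placeholder bucket region
);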

Format-Native Processing

For Parquet files, MotherDuck takes advantage of the columnar format's structure to minimize data transfer. When querying a 10GB Parquet file containing 50 columns but selecting only three of them, MotherDuck reads just those column chunks directly from storage via HTTP range requests instead of downloading the entire file and discarding unused columns afterward, which drastically reduces the transferred data volume.

For JSON files, MotherDuck employs stream parsing strategies that allow direct querying of nested structures:

SELECT
    json_extract_string(data, '$.user.email') as email,
    CAST(json_extract(data, '$.purchase.amount') AS DECIMAL(10,2)) as amount
FROM read_json('s3://logs-bucket/events/*.json')
WHERE json_extract_string(data, '$.event_type') = 'purchase';

The JSON reader detects column names and types automatically and converts the parsed values directly into DuckDB's vectorized representation.
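
When automatic detection is not desired, the JSON reader also accepts an explicit schema. The following is a minimal sketch assuming newline-delimited event files; the fields event_type and user_id are hypothetical:

SELECT event_type, user_id
FROM read_json('s3://logs-bucket/events/*.json',
               format = 'newline_delimited',          -- one JSON object per line
               columns = {event_type: 'VARCHAR',      -- explicit schema skips type inference
                          user_id: 'BIGINT'});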

Query Optimization Techniques

MotherDuck uses Predicate and Projection Pushdown to further enhance query performance. Both techniques will be illustrated with examples below.

Predicate Pushdown

Predicate pushdown represents one of the most impactful optimizations in MotherDuck's execution engine. Instead of reading all data and filtering afterward, the system pushes filtering operations as close to the storage layer as possible. Consider this scenario with Parquet files:

SELECT COUNT(*) as premium_sales
FROM read_parquet('s3://sales-data/year=2024/month=*/day=*/*.parquet')
WHERE sale_amount > 1000
    AND product_category = 'Premium';

MotherDuck applies multiple levels of filtering. First, partition elimination uses the directory structure to skip entire paths that don't match the query predicates. Second, row group elimination leverages Parquet metadata, specifically min/max statistics and Bloom filters when available, to skip row groups where the maximum sale_amount is below 1000. Third, late materialization defers reading non-essential columns until after filtering is complete.
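
The row group statistics driving this elimination can be inspected directly with DuckDB's parquet_metadata table function. The following is a minimal sketch against a single, hypothetical file from the partition layout above; the exact column names of the metadata output may vary slightly between DuckDB versions:

-- show per-row-group min/max statistics for the filtered column
SELECT row_group_id, path_in_schema, stats_min, stats_max
FROM parquet_metadata('s3://sales-data/year=2024/month=01/day=01/part-0.parquet')
WHERE path_in_schema = 'sale_amount';

Row groups whose stats_max is at most 1000 can be skipped without reading any of their data pages.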

This optimization can drastically reduce the transferred data volume and is even more effective with sorted data or when querying recent partitions in time-series datasets.

Projection Pushdown

Projection pushdown ensures that only required columns are transferred from storage. MotherDuck's query planner identifies all columns needed for the entire query execution, including those required for filtering, joining, aggregation, and final projection:

SELECT customer_name, order_date, total_amount, shipping_address, payment_method
FROM read_parquet('s3://orders-archive/2024/*.parquet')
WHERE order_status = 'completed'
    AND total_amount > 100;

The optimizer determines that it needs to read six columns: the five in the SELECT clause plus order_status for filtering. It configures the Parquet reader to request only these specific column chunks via HTTP range requests and ignores the remaining columns entirely, which decreases both memory usage and network transfer.
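
This column selection can be observed in the query plan: prefixing the statement with EXPLAIN prints the physical plan, and the Parquet scan operator lists only the projected columns together with the pushed-down filters (the exact rendering depends on the DuckDB/MotherDuck version):

EXPLAIN
SELECT customer_name, order_date, total_amount, shipping_address, payment_method
FROM read_parquet('s3://orders-archive/2024/*.parquet')
WHERE order_status = 'completed'
    AND total_amount > 100;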

Hybrid Execution: Intelligent Query Routing

MotherDuck's hybrid architecture automatically determines optimal query execution location based on data locality and computational requirements. When connected to MotherDuck from a local DuckDB instance, the system routes operations intelligently:

SELECT
    l.product_id,
    l.local_price,
    AVG(c.cloud_price) as avg_cloud_price
FROM read_csv('local_prices.csv') l
JOIN read_parquet('s3://pricing-data/historical/*.parquet') c
  ON l.product_id = c.product_id
GROUP BY l.product_id, l.local_price;

The execution engine performs S3 reads in MotherDuck's cloud environment. Local CSV operations execute on the client machine. The join strategy depends on relative data sizes: small local tables might be pushed to the cloud for processing, while aggregated cloud results might be pulled locally for final joining. This prevents unnecessary movement of large datasets while maintaining query performance.
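
When a small local table is joined against cloud data repeatedly, it can be worth persisting it in MotherDuck once so that subsequent joins run entirely in the cloud. The following is a minimal sketch assuming an attached MotherDuck database named my_db; the database and table names are placeholders:

-- upload the small local table into MotherDuck once
CREATE TABLE my_db.main.local_prices AS
SELECT * FROM read_csv('local_prices.csv');

-- later joins reference the cloud copy instead of re-uploading the local file
SELECT l.product_id, l.local_price, AVG(c.cloud_price) AS avg_cloud_price
FROM my_db.main.local_prices l
JOIN read_parquet('s3://pricing-data/historical/*.parquet') c
  ON l.product_id = c.product_id
GROUP BY l.product_id, l.local_price;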

Multi-Format Query Federation

MotherDuck can join data across different formats and locations within a single query. The system can combine Parquet, JSON, and CSV sources seamlessly:

SELECT
    p.customer_id,
    p.purchase_amount,
    json_extract_string(s.session_data, '$.duration') as session_duration
FROM read_parquet('s3://purchases/2025/*.parquet') p
JOIN read_json('s3://sessions/2025/*.json') s
  ON p.session_id = json_extract_string(s.session_data, '$.session_id')
WHERE p.purchase_date >= '2025-01-01';

Each format is processed using its optimal access pattern while joins are executed using the most efficient strategy based on data size and distribution.
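
Such federated queries can also be wrapped in a view so downstream consumers see a single relational interface. The following is a minimal sketch reusing the sources from the query above; the view name purchase_sessions is a placeholder:

-- expose the federated join under a stable name
CREATE VIEW purchase_sessions AS
SELECT
    p.customer_id,
    p.purchase_amount,
    json_extract_string(s.session_data, '$.duration') AS session_duration
FROM read_parquet('s3://purchases/2025/*.parquet') p
JOIN read_json('s3://sessions/2025/*.json') s
  ON p.session_id = json_extract_string(s.session_data, '$.session_id');

-- consumers query the view like any table
SELECT customer_id, COUNT(*) AS purchases
FROM purchase_sessions
GROUP BY customer_id;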

Limitations and Considerations

While MotherDuck's zero-ETL approach offers substantial benefits, certain scenarios require careful consideration. Extremely complex transformations involving multiple stages of windowing and aggregation may perform better with materialized intermediate results. Real-time streaming use cases requiring very low latency still benefit from dedicated streaming infrastructure. Regulatory requirements sometimes mandate data residency or transformation audit trails that zero-ETL approaches cannot provide.

File format considerations also matter. While Parquet offers the best optimization potential with its columnar format and rich metadata, JSON files lack these optimizations and may result in higher scan volumes. Organizations should also consider network bandwidth costs when frequently querying large datasets across regions, and evaluate whether frequently accessed, heavily transformed datasets might benefit from selective materialization, while keeping the zero-ETL approach for occasional ad-hoc queries.
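
Selective materialization can be as simple as persisting a heavy aggregation as a MotherDuck table on a schedule and pointing frequent queries at it. The following is a minimal sketch assuming an attached MotherDuck database my_db; the table name daily_revenue is a placeholder:

-- materialize the frequently used aggregate once
CREATE OR REPLACE TABLE my_db.main.daily_revenue AS
SELECT order_date, SUM(order_total) AS revenue
FROM read_parquet('s3://analytics-bucket/orders/*.parquet')
GROUP BY order_date;

-- ad-hoc queries read the small table instead of rescanning the raw files
SELECT * FROM my_db.main.daily_revenue
WHERE order_date >= '2025-01-01';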

Conclusion

MotherDuck's zero-ETL capabilities represent an interesting alternative for cloud data analytics architecture. By treating cloud storage as a queryable data source with elegant pushdown optimizations, it eliminates the traditional boundaries between data storage and data processing. The combination of predicate pushdown, projection pushdown, and hybrid execution creates a system where large datasets become as accessible as local databases.

For organizations evaluating their data architecture, MotherDuck offers an alternative to traditional ETL pipelines. The ability to query data directly where it resides, in its native format, with full SQL capabilities and automatic optimizations, simplifies the entire analytics stack. This is an opportunity to rethink data architecture and make analytics more immediate, more flexible, and more accessible across the organization. As data volumes continue to grow and real-time insights become increasingly critical, the zero-ETL approach positions organizations to handle future scale without the complexity of traditional data pipeline architectures.
