Your First Data Analysis with MotherDuck and DuckDB: From CSV to Insights in 5 Minutes

30.9.2025 | 8 minutes reading time

In this post, we'll explore how MotherDuck, powered by DuckDB, revolutionizes the way you interact with your data, particularly when dealing with CSV files. You'll learn how to quickly parse and filter even large datasets directly from your local machine, using familiar SQL syntax, and without the need for a dedicated server or extensive configuration. Get ready to transform your raw CSV data into valuable insights in a matter of minutes!

CSV (Comma Separated Values) files remain a ubiquitous format for exchanging data, despite the emergence of newer, more structured formats like JSON and Parquet. Their simplicity and widespread compatibility ensure their continued relevance in various data workflows. However, working with large CSV files can sometimes be cumbersome, especially when you need to quickly extract insights without the overhead of complex setups.

From Gigabytes to Insights: Parsing CSVs with DuckDB

Setup

If you haven't set up the DuckDB CLI yet, head over to DuckDB Installation and select your preferred option to install it.

About the dataset

We are using the well-known New York City Taxi dataset, specifically the Yellow Taxi Trip data from 2023, which is available here. This dataset contains approximately 38.3 million (38,310,226) records, and the CSV file size is 3.78 GB. That is too large to manage effectively with tools like Microsoft Excel (a worksheet holds at most 1,048,576 rows), so we leverage the power of DuckDB and MotherDuck instead.

[!NOTE]
In MotherDuck, there is a sample_data database that contains the taxi data for December 2022, which is about 10% of the data volume we use in this blog post. You can use that dataset instead if your upload bandwidth is limited.

Loading the dataset

DuckDB supports two persistence modes, in-memory and persistent. To load the dataset into a permanent DuckDB database, begin by creating the nyc_taxi.duckdb database. Execute the following command:

duckdb nyc_taxi.duckdb

Once the database is created, load the dataset by executing the subsequent command within the DuckDB CLI.

.timer on
CREATE TABLE nyc_yellow_taxi_trips AS FROM '2023_Yellow_Taxi_Trip_Data_20250903.csv';

Depending on your machine, the dataset should load in mere seconds. For instance, on a Mac M3, the entire 3.78 GB dataset, containing over 38.3 million records, loads in approximately 3.5 seconds, highlighting DuckDB's impressive performance capabilities.

Run Time (s): real 3.540 user 38.143785 sys 1.770184

Our first query - calculate the trip duration for each trip

Once the data is loaded, a process that happens incredibly quickly, we can run our first query. We'll start by calculating the trip duration, which is the difference between tpep_dropoff_datetime and tpep_pickup_datetime, using the commands below.

.timer on
SELECT *, (tpep_dropoff_datetime - tpep_pickup_datetime) AS trip_duration
FROM nyc_yellow_taxi_trips;

You should see timing output similar to the following:

Run Time (s): real 1.304 user 1.268312 sys 2.229287

As you can see, trip_duration is calculated in roughly 1.3 seconds for all 38.3 million records, which is pretty awesome.
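To make the arithmetic concrete, here is a minimal Python sketch of what the query computes for a single row: subtracting the pickup timestamp from the drop-off timestamp yields a duration. The two timestamps are invented sample values rather than rows from the dataset, and `trip_duration` is a hypothetical helper used only for illustration.

```python
from datetime import datetime, timedelta


def trip_duration(pickup: datetime, dropoff: datetime) -> timedelta:
    """Mirror tpep_dropoff_datetime - tpep_pickup_datetime for one trip."""
    return dropoff - pickup


# Invented sample timestamps (not taken from the dataset).
pickup = datetime(2023, 1, 1, 0, 32, 10)
dropoff = datetime(2023, 1, 1, 0, 40, 36)

duration = trip_duration(pickup, dropoff)
print(duration)  # 0:08:26
```

DuckDB performs the same subtraction for every row of the table, which is why the result column can be treated as a duration in later queries.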

Enriching the data - store the trip duration as a new column

To facilitate further queries involving trip duration, we will store these values in a new column. This can be achieved by executing a simple command that creates a new column called trip_duration.

ALTER TABLE nyc_yellow_taxi_trips
ADD COLUMN trip_duration INTERVAL;
UPDATE nyc_yellow_taxi_trips
SET trip_duration = tpep_dropoff_datetime - tpep_pickup_datetime;

Now let's use the newly created column to calculate the average trip duration. Execute the following command; the result again comes back in milliseconds.

.timer on
SELECT AVG(trip_duration) FROM nyc_yellow_taxi_trips;

Moving to MotherDuck

We've explored DuckDB locally using its CLI, witnessing its remarkable speed with large CSV datasets. Now, let's transition to MotherDuck to share your data with your team, enabling them to leverage the combined power of DuckDB and MotherDuck.

[!NOTE]
For this, you need a MotherDuck account, which you can easily create on https://motherduck.com.

To load the CSV data into MotherDuck, open a new DuckDB CLI session by executing the following command:

duckdb

Afterwards run the following commands in the DuckDB CLI shell.

[!NOTE]
If you haven't set the environment variable motherduck_token as described in the MotherDuck documentation, a browser window will open where you need to confirm access to MotherDuck from your DuckDB CLI shell.

ATTACH 'md:';
CREATE OR REPLACE DATABASE nyc_yellow_taxi_trips;
USE nyc_yellow_taxi_trips;
CREATE OR REPLACE TABLE nyc_yellow_taxi_trips AS FROM '2023_Yellow_Taxi_Trip_Data_20250903.csv';

This generates a new database nyc_yellow_taxi_trips, and populates a table of the same name with the loaded data. The duration of this step can range from two to three minutes, contingent on your internet upload speed.

[!NOTE]
While this post focuses on a CSV file stored locally, it's worth noting that DuckDB and MotherDuck also support importing CSV data from various object storage systems, including AWS S3, Azure Blob Storage, and Google Cloud Storage.

Let's advance to more complex queries using MotherDuck. We'll be executing these queries within the MotherDuck UI.

Average trip duration depending on the number of passengers

To begin, navigate to https://app.motherduck.com and create a new notebook titled nyc_yellow_taxi_trips. In the initial cell, insert the SQL statement shown in the block below and execute it by clicking the small triangle button.

SELECT passenger_count, AVG(tpep_dropoff_datetime - tpep_pickup_datetime) AS avg_trip_duration
FROM nyc_yellow_taxi_trips
GROUP BY passenger_count
ORDER BY passenger_count;
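For readers who want to see the grouping logic spelled out, the following Python sketch mimics the query on a handful of invented rows: it collects the durations per passenger_count and averages each group. The sample values and the helper name `avg_duration_by_passengers` are assumptions for illustration only. The `None` key stands in for rows where the column is missing, which SQL's GROUP BY would also place in a group of their own.

```python
from collections import defaultdict
from datetime import timedelta

# Invented sample rows: (passenger_count, trip_duration).
trips = [
    (1, timedelta(minutes=10)),
    (1, timedelta(minutes=20)),
    (2, timedelta(minutes=30)),
    (None, timedelta(minutes=12)),  # a missing passenger_count forms its own group
]


def avg_duration_by_passengers(rows):
    """Group rows by passenger_count and average the durations per group."""
    groups = defaultdict(list)
    for passenger_count, duration in rows:
        groups[passenger_count].append(duration)
    return {
        key: sum(durations, timedelta()) / len(durations)
        for key, durations in groups.items()
    }


result = avg_duration_by_passengers(trips)
for key, value in result.items():
    print(key, value)
```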

Does the amount of tip depend on the trip distance?

For a taxi driver it would be interesting to know if there is a correlation between the amount of the tip and the length of the trip. For this we can execute the following query:

SELECT
    CASE WHEN trip_distance BETWEEN 0 AND 4 THEN 'short'
         WHEN trip_distance BETWEEN 4 AND 9 THEN 'medium'
         WHEN trip_distance > 9 THEN 'long'
    END AS trip_length,
    AVG(fare_amount) AS fare,
    AVG(tip_amount) AS tip
FROM nyc_yellow_taxi_trips
GROUP BY trip_length
ORDER BY tip DESC;

trip_length	fare	tip
long	60.66741257482049	9.713919128310547
medium	29.65654459738106	5.005007720056084
short	12.775505292438215	2.5128599942576764

As you can see from the results, the longer the trip the higher the amount of tip for the taxi driver.
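The bucket boundaries in the CASE expression deserve a closer look: BETWEEN is inclusive on both ends and the branches are evaluated in order, so a distance of exactly 4 lands in 'short', and values that match no branch (for example a negative or missing distance) yield NULL. The Python sketch below mimics that logic; `trip_length` is a hypothetical helper, and the probe values are invented.

```python
from typing import Optional


def trip_length(trip_distance: Optional[float]) -> Optional[str]:
    """Mimic the CASE expression: branches are tested in order, BETWEEN is inclusive."""
    if trip_distance is None:
        return None  # NULL matches no branch
    if 0 <= trip_distance <= 4:
        return "short"
    if 4 <= trip_distance <= 9:
        return "medium"  # exactly 4 never gets here; the first branch wins
    if trip_distance > 9:
        return "long"
    return None  # e.g. negative distances match no branch


# Invented probe values around the bucket boundaries.
for distance in (0.0, 4.0, 4.1, 9.0, 9.5, -1.0):
    print(distance, trip_length(distance))
```

If such unmatched rows exist in the data, they would show up as an additional group with a NULL trip_length.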

Find the pickup locations which have the longest trip distance

As we have seen, it is worth looking for longer trips, as the average tip is almost 4 times higher for long trips than for short trips.
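The "almost 4 times" figure can be checked with a few lines of Python using the averages from the result table above. The sketch also derives the average tip as a share of the average fare per bucket; note that this is a ratio of averages, not an average of per-trip ratios, so it is only a rough indicator.

```python
# Averages copied from the result table above: (fare, tip).
results = {
    "long": (60.66741257482049, 9.713919128310547),
    "medium": (29.65654459738106, 5.005007720056084),
    "short": (12.775505292438215, 2.5128599942576764),
}

# Ratio of the average tip on long trips to the average tip on short trips.
long_vs_short = results["long"][1] / results["short"][1]
print(f"long/short tip ratio: {long_vs_short:.2f}")

# Average tip expressed as a share of the average fare per bucket.
tip_share = {name: tip / fare for name, (fare, tip) in results.items()}
for name, share in tip_share.items():
    print(f"{name}: {share:.1%}")
```

In absolute terms the tip grows with distance, while relative to the fare it is slightly lower for long trips than for short ones.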

Therefore, let's find the 20 most interesting pickup locations: those with the highest average tip, shown together with their average trip distance. We only consider pickup locations with more than 1,000 trips.

For this, execute the following SQL statement:

SELECT
    PULocationID,
    COUNT(*) AS trip_count,
    AVG(trip_distance) AS avg_trip_distance,
    AVG(total_amount) AS avg_total_amount,
    AVG(tip_amount) AS avg_tip_amount
FROM
    nyc_yellow_taxi_trips
GROUP BY
    PULocationID
HAVING
    trip_count > 1000
ORDER BY
    avg_tip_amount DESC
LIMIT 20;

Within milliseconds, after scanning over 38.3 million records, we have the answer. With this information, taxi drivers can try to optimize their income by strategically positioning themselves in the vicinity of the most profitable pickup locations.
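The role of the HAVING threshold is easier to see on a small example. The Python sketch below reproduces the group, filter, sort, and limit steps of the query on invented (location, tip) pairs; the helper `top_pickup_locations`, the location IDs 10, 20, and 30, and the tiny threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def top_pickup_locations(rows, min_trips=1000, limit=20):
    """rows: iterable of (pu_location_id, tip_amount) pairs."""
    tips = defaultdict(list)
    for location_id, tip in rows:
        tips[location_id].append(tip)
    stats = [
        (location_id, len(values), sum(values) / len(values))
        for location_id, values in tips.items()
        if len(values) > min_trips  # HAVING trip_count > 1000
    ]
    stats.sort(key=lambda item: item[2], reverse=True)  # ORDER BY avg tip DESC
    return stats[:limit]  # LIMIT 20


# Invented sample data with a tiny threshold so the effect is visible.
sample = [(10, 2.0), (10, 4.0), (10, 3.0), (20, 50.0), (30, 1.0), (30, 1.0), (30, 4.0)]
top = top_pickup_locations(sample, min_trips=2, limit=2)
print(top)  # [(10, 3, 3.0), (30, 3, 2.0)]
```

Without the threshold, location 20 with its single 50.0 tip would lead the ranking, which is exactly the kind of outlier the minimum trip count is meant to exclude.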

Are airport trips more profitable?

The last example we want to look at is if trips from the airport are more profitable than other trips.

We want to find out if the average tip is higher for trips from the airport compared to trips starting somewhere else. To simplify the example, we just compare the average tip without considering trip duration or other factors. Note that the drop-off location may also affect the amount of the tip. The PULocationID values for the three commercial airports are 1 (Newark), 132 (JFK), and 138 (LaGuardia).

SELECT
    CASE
        WHEN PULocationID IN (1, 132, 138) THEN 'airport'
        ELSE 'not_airport'
    END AS pickup_type,
    AVG(tip_amount) AS avg_tip_amount,
    COUNT(*) AS trip_count
FROM
    nyc_yellow_taxi_trips
GROUP BY
    pickup_type;

Again, the result is returned in milliseconds, confirming that tips for airport trips are nearly three times higher.
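As a rough illustration of the classification step, the following Python sketch labels each trip by its pickup location and averages the tips per label, in the same way the CASE and GROUP BY do. The airport IDs come from the post; the other location IDs, the tip values, and both helper names are invented for the example and say nothing about the real data.

```python
AIRPORT_LOCATION_IDS = {1, 132, 138}  # airport PULocationIDs named in the post


def pickup_type(pu_location_id: int) -> str:
    """Mimic the CASE expression of the query."""
    return "airport" if pu_location_id in AIRPORT_LOCATION_IDS else "not_airport"


def avg_tip_by_pickup_type(rows):
    """rows: iterable of (pu_location_id, tip_amount); returns {type: (avg_tip, count)}."""
    totals = {}
    for location_id, tip in rows:
        key = pickup_type(location_id)
        total, count = totals.get(key, (0.0, 0))
        totals[key] = (total + tip, count + 1)
    return {key: (total / count, count) for key, (total, count) in totals.items()}


# Invented sample rows; 48 and 79 are arbitrary non-airport IDs.
sample = [(132, 9.0), (138, 6.0), (1, 12.0), (48, 2.5), (79, 3.5)]
summary = avg_tip_by_pickup_type(sample)
print(summary)
```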

Conclusion

DuckDB and MotherDuck revolutionize data analysis by offering unparalleled speed and efficiency, especially when dealing with large datasets. The traditional hurdles associated with big data — such as the time-consuming loading processes and the need for complex infrastructure — are virtually eliminated.

The DuckDB Advantage: Speed and Simplicity

With DuckDB, loading gigabytes of data is no longer a multi-hour ordeal but a matter of seconds. This incredible speed extends to query execution, which is extremely fast, allowing for real-time insights that were previously unattainable. This marks a significant departure from conventional CSV tools like Microsoft Excel, which buckle under the weight of large datasets, often becoming unresponsive or taking an eternity to process.

Traditional data warehouse solutions, while powerful, often demand a significant upfront investment in time and resources. They typically require intricate data pipeline mechanisms to ingest data, followed by the provisioning and scaling of compute resources before even the first query can be executed. This elaborate setup often translates to days or even weeks of preparation before any meaningful analysis can begin.

MotherDuck: The Cloud-Native Companion

MotherDuck complements DuckDB by extending its capabilities to the cloud, offering a serverless experience that further simplifies data analysis. This combination means that users can leverage the power of DuckDB's in-process OLAP engine with the scalability and accessibility of a cloud platform, without the operational overhead.

The "First Query" Race

The true power of DuckDB and MotherDuck becomes evident in what can be called the first query race. In many scenarios, by the time other solutions are still being configured or are just beginning the data loading process, users of DuckDB and MotherDuck have already completed their analysis, extracted critical insights, and are ready to make data-driven decisions. This agility provides a significant competitive advantage, enabling faster iterations and more responsive business strategies.

In essence, DuckDB and MotherDuck are not just tools; they represent a paradigm shift in how data analysis is approached, making it more accessible, faster, and significantly less cumbersome for anyone looking to transform raw data into actionable intelligence.

To see these concepts in action, enroll in our on-demand Hands-on Workshop: Introduction to MotherDuck for a complete practical walkthrough.

Blog author

Christian Galsterer
