Popular searches
//

Your First Data Analysis with MotherDuck and DuckDB: From CSV to Insights in 5 Minutes

30.9.2025 | 8 minutes reading time

Your First Data Analysis with MotherDuck and DuckDB: From CSV to Insights in 5 Minutes

In this post, we'll explore how MotherDuck, powered by DuckDB, revolutionizes the way you interact with your data, particularly when dealing with CSV files. You'll learn how to quickly parse and filter even large datasets directly from your local machine, using familiar SQL syntax, and without the need for a dedicated server or extensive configuration. Get ready to transform your raw CSV data into valuable insights in a matter of minutes!

CSV (Comma Separated Values) files remain a ubiquitous format for exchanging data, despite the emergence of newer, more structured formats like JSON and Parquet. Their simplicity and widespread compatibility ensure their continued relevance in various data workflows. However, working with large CSV files can sometimes be cumbersome, especially when you need to quickly extract insights without the overhead of complex setups.

From Gigabytes to Insights: Parsing CSVs with DuckDB

Setup

If you haven't set up the DuckDB CLI head over to DuckDB Installation and select your preferred option to install the DuckDB CLI.

About the dataset

We are using the well-known New York City Taxi dataset, specifically the Yellow Taxi Trip data from 2023 which is available here. This dataset contains approximately 38.3 million (38,310,226) records, and the CSV file size is 3.78 GB. This size is typically too large to manage effectively with tools like Microsoft Excel and we leverage the power of DuckDB and MotherDuck

[!NOTE]
In MotherDuck there is a sample_data database which contains the taxi data for December 2022 which is about 10% of the data volume we use in this blog post. You can also use this dataset if your upload link is limited.

Loading the dataset

DuckDB supports two persistence modes, in-memory and persistent. To load the dataset into a permanent DuckDB database, begin by creating the nyc_taxi.duckdb database. Execute the following command:

1duckdb nyc_taxi.duckdb

Once the database is created, load the dataset by executing the subsequent command within the DuckDB CLI.

1.timer on
2CREATE TABLE nyc_yellow_taxi_trips AS FROM '2023_Yellow_Taxi_Trip_Data_20250903.csv';

Depending on your machine, the dataset should load in mere seconds. For instance, on a Mac M3 the entire 3.78GB dataset, containing over 38.3 million records, loads in approximately 3.5 seconds, highlighting DuckDB's impressive performance capabilities.

1Run Time (s): real 3.540 user 38.143785 sys 1.770184

Our first query - calculate the trip duration for each trip

Once the data is loaded, a process that happens incredibly quickly, we can run our first query. We'll start by calculating the trip duration, which is the difference between tpep_dropoff_datetime and tpep_pickup_datetime, using the commands below.

1.timer on
2SELECT *, (tpep_dropoff_datetime - tpep_pickup_datetime) AS trip_duration 
3FROM nyc_yellow_taxi_trips;

You should see the following output

1Run Time (s): real 1.304 user 1.268312 sys 2.229287

As you can see the trip_duration is calculated in just a second or all 38.3 million records which is pretty awesome.

Enriching the data - store the trip duration as a new column

To facilitate further queries involving trip duration, we will store these values in a new column. This can be achieved by executing a simple command that creates a new column called trip_duration.

1ALTER TABLE nyc_yellow_taxi_trips
2ADD COLUMN trip_duration INTERVAL;
3UPDATE nyc_yellow_taxi_trips
4SET trip_duration = tpep_dropoff_datetime - tpep_pickup_datetime;

Now let's use the newly created column to calculate the average trip duration. For this execute the following command and you get the result again in milliseconds.

1.timer on
2SELECT AVG(trip_duration) FROM nyc_yellow_taxi_trips;

Moving to MotherDuck

We've explored DuckDB locally using its CLI, witnessing its remarkable speed with large CSV datasets. Now, let's transition to MotherDuck to share your data with your team, enabling them to leverage the combined power of DuckDB and MotherDuck.

[!NOTE]
For this you need a MotherDuck account which you can easily create on https://motherduck.com

To load the CSV data into MotherDuck, open a new DuckDB CLI session execute the following command:

1duckdb

Afterwards run the following commands in the DuckDB CLI shell.

[!NOTE]
If you haven't set the environment variable motherduck_token as documented in the MotherDuck documentation a browser window will open where you need to confirm the access to MotherDuck from your DuckDB CLI shell.

1ATTACH 'md:';
2CREATE OR REPLACE DATABASE nyc_yellow_taxi_trips;
3use nyc_yellow_taxi_trips;
4create or replace table nyc_yellow_taxi_trips as from '2023_Yellow_Taxi_Trip_Data_20250903.csv';

This generates a new database nyc_yellow_taxi_trips, and populates a table of the same name with the loaded data. The duration of this step can range from two to three minutes, contingent on your internet upload speed.

[!NOTE]
While this post focuses on CSV file stored on a local storage, it's worth noting that DuckDB/MotherDuck also supports importing CSV data from various object storage systems, including AWS S3, Azure Blob Storage, and Google Cloud Storage.

Let's advance to more complex queries using MotherDuck. We'll be executing these queries within the MotherDuck UI.

Average trip duration depending the number of passengers

To begin, navigate to https://app.motherduck.com and create a new notebook titled nyc_yellow_taxi_trips. In the initial cell, insert the SQL statement shown in the block below and execute it by clicking the small triangle button.

1SELECT passenger_count, AVG(tpep_dropoff_datetime - tpep_pickup_datetime) AS avg_trip_duration 
2FROM nyc_yellow_taxi_trips 
3GROUP BY passenger_count
4ORDER BY passenger_count;

Does the amount of tip depend on the trip distance?

For a taxi driver it would be interesting to know if there is a correlation between the amount of the tip and the length of the trip. For this we can execute the following query:

1SELECT 
2 CASE WHEN trip_distance BETWEEN 0 AND 4 THEN 'short'
3      WHEN trip_distance BETWEEN 4 AND 9 THEN 'medium'
4      WHEN trip_distance > 9 THEN 'long' 
5      END AS trip_length,
6 AVG(fare_amount) AS fare, 
7 AVG(tip_amount) AS tip
8 FROM nyc_yellow_taxi_trips
9 GROUP BY trip_length
10 ORDER BY tip DESC;
trip_lengthfaretip
long60.667412574820499.713919128310547
medium29.656544597381065.005007720056084
short12.7755052924382152.5128599942576764

As you can see from the results, the longer the trip the higher the amount of tip for the taxi driver.

Find the pickup locations which have the longest trip distance

As we have seen it is beneficial to look for longer trips as the amount of the tip is almost 4 times higher for longer trips compared to shorter trips.

Therefore let's find out which are the 20 most interesting pickup locations which have on average the longest trips and on average the highest tip. Only consider pickup locations with more than 1000 trips.

For this execute the following sql statement:

1SELECT
2    PULocationID,
3    COUNT(*) AS trip_count,
4    AVG(trip_distance) AS avg_trip_distance,
5    AVG(total_amount) AS avg_total_amount,
6    AVG(tip_amount) AS avg_tip_amount
7FROM
8    nyc_yellow_taxi_trips
9GROUP BY
10    PULocationID
11HAVING
12    trip_count > 1000
13ORDER BY
14    avg_tip_amount DESC
15LIMIT 20;

In just milliseconds calculating over 38.3 million records we have the information and with this information a taxi driver can try to optimize his income by strategically positioning himself in the vicinity of the most profitable pickup locations.

Are airport trips more profitable?

The last example we want to look at is if trips from the airport are more profitable than other trips.

We want to find out if the average tip is higher for trips from the airport compared to trips starting somewhere else. To simplify the example, we just take the average price without considering trip duration or other factors. Note that drop-off location may actually affect the amount of tip. The PULocationID for the three commercial airports are 1, 132, 138.

1SELECT
2    CASE
3        WHEN PULocationID IN (1, 132, 138) THEN 'airport'
4        ELSE 'not_airport'
5    END AS pickup_type,
6    AVG(tip_amount) AS avg_tip_amount,
7    COUNT(*) AS trip_count
8FROM
9    nyc_yellow_taxi_trips
10GROUP BY
11    pickup_type;

Again, the result is returned in milliseconds, confirming that tips for airport trips are nearly three times higher.

Conclusion

DuckDB and MotherDuck revolutionize data analysis by offering unparalleled speed and efficiency, especially when dealing with large datasets. The traditional hurdles associated with big data — such as the time-consuming loading processes and the need for complex infrastructure — are virtually eliminated.

The DuckDB Advantage: Speed and Simplicity

With DuckDB, loading gigabytes of data is no longer a multi-hour ordeal but a matter of seconds. This incredible speed extends to query execution, which is extremely fast, allowing for real-time insights that were previously unattainable. This marks a significant departure from conventional CSV tools like Microsoft Excel, which buckle under the weight of large datasets, often becoming unresponsive or taking an eternity to process.

Traditional data warehouse solutions, while powerful, often demand a significant upfront investment in time and resources. They typically require intricate data pipeline mechanisms to ingest data, followed by the provisioning and scaling of compute resources before even the first query can be executed. This elaborate setup often translates to days or even weeks of preparation before any meaningful analysis can begin.

MotherDuck: The Cloud-Native Companion

MotherDuck complements DuckDB by extending its capabilities to the cloud, offering a serverless experience that further simplifies data analysis. This combination means that users can leverage the power of DuckDB's in-process OLAP engine with the scalability and accessibility of a cloud platform, without the operational overhead.

The "First Query" Race

The true power of DuckDB and MotherDuck becomes evident in what can be called the first query race. In many scenarios, by the time other solutions are still being configured or are just beginning the data loading process, users of DuckDB and MotherDuck have already completed their analysis, extracted critical insights, and are ready to make data-driven decisions. This agility provides a significant competitive advantage, enabling faster iterations and more responsive business strategies.

In essence, DuckDB and MotherDuck are not just tools; they represent a paradigm shift in how data analysis is approached, making it more accessible, faster, and significantly less cumbersome for anyone looking to transform raw data into actionable intelligence.

share post

//

More articles in this subject area

Discover exciting further topics and let the codecentric world inspire you.