DuckDB has rapidly emerged as a popular in-process analytics database. Dagster is a modern data orchestration framework that makes it easy to build and manage data pipelines. Combining the two allows data engineers to build pipelines where data is stored in DuckDB tables or queried via DuckDB's SQL engine, all within a Dagster workflow. This guide walks you through setting up a new Dagster project from scratch with DuckDB integration. It demonstrates practical Dagster assets: one that inserts a pandas DataFrame into a DuckDB database, one that reads data from a CSV file using DuckDB's SQL capabilities, and a downstream asset that builds on the first. The goal is to provide a clear, working example that you can use as a foundation for your own projects.
Basic Setup and Dependencies
Create a new Dagster project
I am using Poetry to manage my dependencies. Install the following packages to be able to follow the next steps: poetry add dagster dagster-duckdb dagster-webserver pandas
This leaves us with the project structure below after scaffolding the project with dagster project scaffold --name dagster_duckdb_demo. I also added a folder called data; it will become necessary in the next section.
```
└── dagster_duckdb_demo
    ├── dagster_duckdb_demo
    │   ├── __init__.py
    │   ├── assets.py
    │   └── definitions.py
    ├── dagster_duckdb_demo_tests
    │   ├── __init__.py
    │   └── test_assets.py
    ├── data
    ├── poetry.lock
    ├── pyproject.toml
    ├── README.md
    ├── setup.cfg
    └── setup.py
```
You should also have DuckDB installed on your machine in order to inspect the database with the CLI.
Define a DuckDB resource
To use DuckDB in Dagster, a resource must be added. This is done in the definitions.py file, where you simply add a DuckDBResource from the dagster_duckdb package and point it at the database. I will be using a database file in the data folder that was created in the previous step; the database file is created automatically if it does not exist.
```python
# definitions.py

from dagster import Definitions, load_assets_from_modules
from dagster_duckdb import DuckDBResource

from dagster_duckdb_demo import assets

all_assets = load_assets_from_modules([assets])

defs = Definitions(
    assets=all_assets,
    resources={
        "duckdb": DuckDBResource(database="data/data.duckdb")
    }
)
```
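If you prefer not to hard-code the path, the same field can be populated from an environment variable with Dagster's EnvVar. A minimal sketch, assuming an environment variable named DUCKDB_DATABASE (the name is arbitrary and chosen for this example):

```python
# Sketch: read the database path from the environment instead of
# hard-coding it. DUCKDB_DATABASE is an arbitrary variable name.
import dagster as dg
from dagster_duckdb import DuckDBResource

duckdb_resource = DuckDBResource(database=dg.EnvVar("DUCKDB_DATABASE"))
```

This keeps the local file for development while allowing a different path per deployment.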
Creating Assets
Dagster encourages modeling pipelines as assets. Here is how you can write simple read and write assets on top of the DuckDB resource.
Writing a table from a pandas DataFrame
This example loads a pandas DataFrame into DuckDB for later processing. It is mainly meant to show how a Python DataFrame can be referenced directly inside SQL, which is what fills the table in DuckDB.
```python
# dagster_duckdb_demo/assets/pandas_import.py

import dagster as dg
import pandas as pd
from dagster_duckdb import DuckDBResource


def source_data() -> pd.DataFrame:
    return pd.DataFrame({
        "user_id": [1, 2, 3],
        "event": ["login", "purchase", "logout"],
        "timestamp": pd.to_datetime(["2025-05-01", "2025-05-02", "2025-05-03"])
    })


@dg.asset(
    kinds={"python", "pandas", "duckdb"}
)
def pandas_import(context: dg.AssetExecutionContext, duckdb: DuckDBResource) -> dg.MaterializeResult:
    example_df = source_data()

    with duckdb.get_connection() as conn:
        # example_df is resolved from the local Python scope by DuckDB
        conn.execute("CREATE OR REPLACE TABLE events AS SELECT * FROM example_df")
        preview_df = conn.execute("SELECT * FROM events LIMIT 5").fetch_df()
        count = conn.execute("SELECT COUNT(1) FROM events").fetchone()[0]

    return dg.MaterializeResult(
        metadata={
            "row_count": dg.MetadataValue.int(count),
            "preview": dg.MetadataValue.md(preview_df.to_markdown(index=False))
        }
    )
```
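The CREATE OR REPLACE TABLE statement works because DuckDB's Python client performs a replacement scan: if example_df is not a known table, it looks for a DataFrame with that name in the surrounding Python scope. A minimal standalone sketch of this mechanism, independent of Dagster:

```python
# Standalone demonstration of DuckDB's replacement scan: a local pandas
# DataFrame can be referenced by its variable name directly in SQL.
import duckdb
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "event": ["login", "purchase", "login"]})
print(duckdb.sql("SELECT event, COUNT(*) AS n FROM df GROUP BY event").df())
```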
Importing Data from CSV
To show an alternative way of importing data, an example CSV file is loaded into the same DuckDB database with the following script. The file is referenced directly by its path, so it does not need to be opened or parsed beforehand. A URL pointing to a CSV file would work as well.
```python
# dagster_duckdb_demo/assets/csv_import.py

import dagster as dg
from dagster_duckdb import DuckDBResource


@dg.asset(
    kinds={"python", "duckdb"}
)
def csv_import(duckdb: DuckDBResource) -> dg.MaterializeResult:
    with duckdb.get_connection() as conn:
        # DuckDB infers the CSV format and schema from the file itself
        conn.execute("""
            CREATE OR REPLACE TABLE example_data AS
            SELECT *
            FROM 'data/example.csv'
        """)

        row_count = conn.execute("""
            SELECT COUNT(1)
            FROM example_data
        """).fetchone()[0]

    return dg.MaterializeResult(
        metadata={
            "row_count": dg.MetadataValue.int(row_count)
        }
    )
```
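Here DuckDB infers delimiter, header, and column types from the file itself. If the inference guesses wrong, the same import can be written with an explicit read_csv call and options; a sketch, assuming a comma-separated file with a header row:

```python
# Sketch: explicit read_csv with options instead of relying on inference.
# header and delim are standard read_csv parameters; adjust to your file.
import duckdb

con = duckdb.connect("data/data.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE example_data AS
    SELECT *
    FROM read_csv('data/example.csv', header = true, delim = ',')
""")
```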
Downstream Assets
To show the usage of downstream assets in Dagster with DuckDB, an additional script is added. Pay attention to the dependency between the login_events and the pandas_import asset, declared via the deps argument.
```python
# dagster_duckdb_demo/assets/login_events.py

import dagster as dg
from dagster_duckdb import DuckDBResource


@dg.asset(
    kinds={"python", "duckdb"},
    description="Select login events from events table for further processing",
    deps=["pandas_import"]
)
def login_events(duckdb: DuckDBResource) -> dg.MaterializeResult:
    with duckdb.get_connection() as conn:
        conn.execute("""
            CREATE OR REPLACE TABLE login_events AS
            SELECT *
            FROM events
            WHERE event = 'login'
        """)
        login_event_count = conn.execute("""
            SELECT COUNT(1)
            FROM login_events
        """).fetchone()[0]

    return dg.MaterializeResult(
        metadata={
            "login_event_count": dg.MetadataValue.int(login_event_count)
        }
    )
```
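Since this asset relies on the events table actually containing data, it is also a natural place for a Dagster asset check. A minimal sketch of such a check, not part of the original project, which would additionally need to be registered via the asset_checks argument of Definitions:

```python
# Sketch of an asset check that verifies the login_events table is not
# empty after materialization. Hypothetical addition, not in the project.
import dagster as dg
from dagster_duckdb import DuckDBResource


@dg.asset_check(asset="login_events")
def login_events_not_empty(duckdb: DuckDBResource) -> dg.AssetCheckResult:
    with duckdb.get_connection() as conn:
        count = conn.execute("SELECT COUNT(1) FROM login_events").fetchone()[0]
    return dg.AssetCheckResult(passed=count > 0, metadata={"row_count": count})
```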
Updating the Definitions and Materializing
The updated definitions.py file now contains the assets from the previous steps.
```python
# dagster_duckdb_demo/definitions.py

import dagster as dg
from dagster_duckdb import DuckDBResource

from .assets import csv_import, login_events, pandas_import

all_assets = dg.load_assets_from_modules([csv_import, login_events, pandas_import])

defs = dg.Definitions(
    assets=all_assets,
    resources={
        "duckdb": DuckDBResource(database="data/data.duckdb")
    }
)
```
Launching the Dagster webserver via poetry run dagster dev will now show the dependency graph, with login_events downstream of pandas_import.
Materializing these assets should show the configured metadata after a successful run of each asset. Inspecting the DuckDB file via the CLI confirms that the events, example_data, and login_events tables were created.
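If you do not have the DuckDB CLI at hand, the same inspection works from Python; a short sketch:

```python
# Inspect the materialized tables with the duckdb Python client.
import duckdb

con = duckdb.connect("data/data.duckdb", read_only=True)
print(con.execute("SHOW TABLES").fetchall())
print(con.execute("SELECT * FROM login_events").df())
```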
Conclusion
As we have now seen, integrating DuckDB into the Dagster workflow is quite simple. Once the resource is defined, it can easily be used by every asset and offers a capable API to persist DataFrames or CSV files directly for further processing. As a next step, dbt could be added for data transformations, or the local DuckDB instance could be swapped for a connection to MotherDuck to leverage its cloud data warehouse features.
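Swapping in MotherDuck is, in principle, a one-line change to the resource definition. A sketch, assuming a MOTHERDUCK_TOKEN environment variable is set and a database named my_db exists in your MotherDuck account:

```python
# Sketch: point the resource at MotherDuck instead of a local file.
# Assumes MOTHERDUCK_TOKEN is set and the my_db database exists.
from dagster_duckdb import DuckDBResource

duckdb_resource = DuckDBResource(database="md:my_db")
```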