
Great Expectations: Validating datasets in machine learning pipelines

17.2.2020 | 6 minutes of reading time

Typically, your favorite machine learning model doesn’t care whether or not your input dataset is correct in terms of content and technical structure. However, particularly for machine learning algorithms, the all-encompassing truth “garbage in, garbage out” holds, and hence it is strongly advised to validate datasets before feeding them into a machine learning algorithm.

Generally, validating datasets is a tedious task, since we have to write a plethora of checks to ensure that the dataset contains all required columns and that the columns contain only expected values. Having written many dataset tests by hand, I was quite happy to stumble upon the Python library great_expectations, which is a promising tool to validate datasets in a painless way.
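For illustration, a few hand-written checks in plain pandas might look like the following sketch (the file name, thresholds, and the exact checks are hypothetical):

import pandas as pd

df = pd.read_csv('some_dataset.csv')  # hypothetical dataset

# hand-written sanity checks tend to pile up quickly
assert 'LOAN' in df.columns, "column 'LOAN' is missing"
assert df['LOAN'].dtype == 'int64', "column 'LOAN' has an unexpected dtype"
assert df['LOAN'].min() >= 0, "column 'LOAN' contains negative values"
assert len(df) > 1000, "dataset is suspiciously small"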

In this blogpost, I want to introduce great_expectations and share some of my thoughts about why I think this tool is helpful in the toolset of every data person.

The problem – why validate datasets?

From a high-level point of view there are (at least) two kinds of problems occurring while engineering a dataset. First, there are more or less obvious technical errors such as missing rows or columns and wrong datatypes. Second, even when the actual data pipelines are solid and the datasets are put together in a technically correct way, there are often issues with degeneration of data over time. Here, too, we have obvious changes, e.g. additional categories in a categorical column. However, many changes in the data often go undetected. For example:

  • the values of a binary column might be approximately evenly distributed between 0 and 1 at the beginning, and the distribution could become skewed over time.
  • the mean value and standard deviation of sensor data emitted by a physical sensor could drift over time.

Obvious changes in the data or mistakes made while engineering the dataset typically lead to errors in the machine learning pipeline and hence are addressed as soon as they occur. The silent changes, however, are more subtle, and they potentially impair the performance of the machine learning model, as visualized in the following picture.

For this reason, data monitoring and validation of datasets are crucial when operating machine learning systems.

In the following, we will look at a small example to introduce great_expectations as a tool for dataset validation.

Small example

In our example, we use the public domain hmeq dataset from Kaggle. The context of the dataset is the automation of the decision-making process for the approval of lines of credit. However, in this blogpost we are not interested in the machine learning aspect of the problem. Instead, our goal is to use this dataset to demonstrate some of the ideas behind the great_expectations library.

In this small example, we will take a short look at:

  • Basic table expectations
  • Expectations for categorical data
  • Expectations for numeric data
  • Saving expectations and validating other datasets

Preliminaries

The recommended way to follow the small example is to create a fresh Python 3.8 environment and install great_expectations and jupyter via

pip install great_expectations
pip install jupyter

Then, we start a Jupyter notebook and import the library with

import great_expectations as ge

Because great_expectations wraps the popular pandas Python library, we can use pandas functionality to import datasets. Hence, we may use

df = ge.read_csv('hmeq.csv')

to read the dataset. In our example, we want to simulate a situation where we generate expectations for a dataset and then apply these expectations to validate, for example, a newer version of the dataset. For this reason, we execute

df = df.sample(frac=1).reset_index(drop=True)
split = int(len(df)/2)
df1 = df[:split]
df2 = df[split:]

to shuffle the dataset and split it into two subsets. Now, we can create expectations using df1 and validate the dataset df2.
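As a side note, since great_expectations wraps pandas, df1 and df2 can still be used like regular DataFrames, e.g. for a quick first look at the data:

# the usual pandas API remains available alongside the expect_* methods
print(df1.shape)
print(df1['JOB'].value_counts())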

Basic table expectations

We can formulate expectations for the table as a whole with great_expectations. For example, we can use

min_table_length = 2500
max_table_length = 3500
df1.expect_table_row_count_to_be_between(min_table_length, max_table_length)

if we have an idea how many rows our dataset should have. Typically, we require specific feature columns in our dataset for our machine learning algorithm. We can create expectations for columns to exist via

feature_columns = ['LOAN', 'VALUE', 'JOB', 'YOJ', 'CLNO', 'DEBTINC']
for col in feature_columns:
    df1.expect_column_to_exist(col)
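Each expect_* call is, by the way, evaluated against df1 immediately and returns a result; in the 0.8.x release used here this is, to my understanding, a plain dict with a 'success' flag:

result = df1.expect_column_to_exist('LOAN')
print(result['success'])  # True, since 'LOAN' is present in df1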

Table expectations provide simple sanity checks for the dataset. great_expectations manages all expectations in a JSON file. We can print all established expectations with

df1.get_expectation_suite()

So far, the JSON file should look something like this:

{'data_asset_name': None,
 'expectation_suite_name': 'default',
 'meta': {'great_expectations.__version__': '0.8.7'},
 'expectations': [{'expectation_type': 'expect_table_row_count_to_be_between',
   'kwargs': {'min_value': 2500, 'max_value': 3500}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'LOAN'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'VALUE'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'JOB'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'YOJ'}},
  {'expectation_type': 'expect_column_to_exist', 'kwargs': {'column': 'CLNO'}},
  {'expectation_type': 'expect_column_to_exist',
   'kwargs': {'column': 'DEBTINC'}}],
 'data_asset_type': 'Dataset'}

Expectations for categorical data

Besides checking the whole dataframe, we can also address specific columns. As an example of categorical data, we use the column 'JOB'. First, we employ

df1.expect_column_values_to_be_of_type('JOB', 'object')

to expect the correct dtype, which is typically 'object' for categorical data. Next, we can create an expectation for the expected values in the column with

expected_jobs = ['Other', 'ProfExe', 'Office', 'Mgr', 'Self', 'Sales']
df1.expect_column_values_to_be_in_set('JOB', expected_jobs)

A very nice feature of great_expectations is that we can create expectations concerning the distribution of the column values. For this purpose, we start by creating a categorical partition of the data.

expected_job_partition = ge.dataset.util.categorical_partition_data(df1.JOB)

Then, we can use

df1.expect_column_chisquare_test_p_value_to_be_greater_than('JOB', expected_job_partition)

to prepare a Chi-squared test for comparing categorical distributions.
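Under the hood, the partition object is just a plain dict describing the distribution of 'JOB' observed in df1, which serves as the reference for the chi-squared test; as far as I understand the 0.8.x API, a custom significance level can also be passed via the keyword p. A quick look at the partition (values are illustrative):

# roughly {'values': ['Other', 'ProfExe', ...], 'weights': [0.42, 0.21, ...]}
print(expected_job_partition)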

Expectations for numeric data

As an example of numeric data, we use the column 'LOAN'. Again, we start with

df1.expect_column_values_to_be_of_type('LOAN', 'int64')

to prepare a check for the correct dtype. In addition, we can use expectations such as

df1.expect_column_mean_to_be_between('LOAN', 10000, 20000)
df1.expect_column_max_to_be_between('LOAN', 50000, 100000)
df1.expect_column_min_to_be_between('LOAN', 1000, 5000)

to ensure that the min, max, and mean of our data lie within the expected ranges. Moreover, we can create a continuous partition of the data with

expected_loan_partition = ge.dataset.util.continuous_partition_data(df1.LOAN)

and use

df1.expect_column_bootstrapped_ks_test_p_value_to_be_greater_than('LOAN', expected_loan_partition)

to prepare a bootstrapped Kolmogorov-Smirnov test for comparing continuous distributions.

Save expectations and validate other datasets

So far, we have defined multiple expectations regarding the dataset df1. In practice, we would require additional expectations concerning other columns of our dataset. For the purpose of our (small) example, we stop here. We can save the JSON file containing our expectations via

df1.save_expectation_suite('some_expectations.json')

In our workflow, we can (and usually should) place the file some_expectations.json under version control. Now, we can use the expectations to validate other datasets.

df2.validate(expectation_suite='some_expectations.json', only_return_failures=True)

In this case, we do not expect to encounter any errors because we randomly split the dataset into two subsets. However, we can see the validation come into play, for example, by dropping a column

df2_missing = df2.drop(columns=['LOAN'])
df2_missing.validate(expectation_suite='some_expectations.json', only_return_failures=True)

or by setting a loan value which is too small

df2_min_low = df2.copy()
# use an existing row label; after the split, df2 holds the second half of the index
df2_min_low.at[df2_min_low.index[0], 'LOAN'] = 10
df2_min_low['LOAN'] = df2_min_low['LOAN'].astype('int64')
df2_min_low.validate(expectation_suite='some_expectations.json', only_return_failures=True)
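In an automated pipeline, one would typically not just inspect the printed output but act on the returned validation result; to my understanding, in the 0.8.x release validate() returns a dict whose 'success' flag can be used to stop the pipeline, roughly like this:

result = df2_min_low.validate(
    expectation_suite='some_expectations.json',
    only_return_failures=True,
)
if not result['success']:
    # stop the pipeline before the model ever sees the broken data
    raise ValueError('Dataset validation failed: {}'.format(result['results']))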

Conclusion

In the example, we only covered a small subset of the available features of great_expectations. The tool offers more functionality such as

  • more built-in expectations and even custom expectations
  • ways to integrate into data pipelines, e.g. with support for Spark
  • web-based data profiling and exploration
  • Slack notifications for failed validations

which I have not used outside of small tests.

In my opinion, great_expectations appears to be a useful addition to the toolkit of every data scientist and data engineer. It has a low barrier to entry, since it can essentially be reduced to an additional JSON file living in the code repository, but it has the potential to significantly simplify dataset validation and, in particular, the debugging of data pipelines.

At the moment, I am not a great fan of the initialization via great_expectations init and the resulting folder structure in the project directory. However, I have not used great_expectations under real-world conditions yet, and there may be advantages to this setup that I simply do not see.

Overall, great_expectations appears to integrate nicely into many machine learning pipelines, and I cannot wait to test the tool extensively in future projects. If you have any experience with great_expectations, feel free to share it in the comments.
