Evaluating machine learning models: Establishing quality gates

7.12.2021 | 8 minutes reading time

The quality or usefulness of machine learning models can be evaluated using test data and metrics. But to what extent, and how: manually or automatically, once or regularly? The first models that come out of a proof of concept can certainly still be evaluated and compared by hand. Once the number of models grows into the dozens or even hundreds, depending on the use case, and the models also have to be retrained constantly, a manual procedure no longer scales.
The preparation of the data, the training of the models and their evaluation can be automated in the form of machine learning pipelines. However, a qualified evaluation of the data or an evaluation of the predictive quality of a model has its pitfalls – especially in automated form.

Example of a simple machine learning pipeline

In classic software development, unit tests, integration tests, end-to-end tests and so on have become established within CI/CD pipelines for quality assurance. Creating machine learning models likewise requires measures that ensure the quality of a model. These should accompany the process from the preparation of the raw data to the delivery of a model, and they can ideally be integrated into the pipeline as well.

Quality gate

The term ‘quality gate’ originates from project management and describes the introduction of ‘checkpoints’ into the project process. The aim is to divide the project duration into shorter, manageable sections in order to be able to track progress. In this context, verifiable success criteria are to be defined in advance for each gate.
For example, a gate contains a list of goals or quality requirements for the resulting artifacts at the end of a project phase. These must be met before the next stage in the process flow is started. However, checking the status can also lead to the project being aborted – because it becomes transparent that essential characteristics have not been achieved or will very likely not be achieved at all. Stopping a project at an early stage may therefore also save time, resources and money.

Quality gates in the project process

In the machine learning context, quality gates can be integrated as dedicated checks between the individual steps of an automated pipeline. Thus, they can track and monitor the quality of the artifacts in the process. Typical artifacts are the raw and processed data as well as the trained models.
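The idea of a check between two pipeline steps can be sketched in a few lines of Python. This is only a minimal illustration, not tied to any particular pipeline framework: the names `QualityGateError`, `run_pipeline`, `drop_missing` and `scale`, as well as the thresholds in the gates, are hypothetical.

```python
from typing import Any, Callable, List, Tuple


class QualityGateError(Exception):
    """Raised when an artifact fails a gate; aborts the pipeline run."""


# A step transforms an artifact; a gate returns True if the artifact may pass.
Step = Callable[[Any], Any]
Gate = Callable[[Any], bool]


def run_pipeline(artifact: Any, stages: List[Tuple[str, Step, Gate]]) -> Any:
    """Run each step, then check its gate before moving on to the next step."""
    for name, step, gate in stages:
        artifact = step(artifact)
        if not gate(artifact):
            raise QualityGateError(f"gate after step '{name}' failed")
    return artifact


# Toy steps (hypothetical): drop missing values, then scale by the maximum.
def drop_missing(rows):
    return [r for r in rows if r is not None]


def scale(rows):
    top = max(rows)
    return [r / top for r in rows]


stages = [
    # Gate 1: enough records left after cleaning (assumed minimum of 3).
    ("prepare", drop_missing, lambda rows: len(rows) >= 3),
    # Gate 2: all scaled values lie in the expected range.
    ("scale", scale, lambda rows: all(0.0 <= r <= 1.0 for r in rows)),
]

result = run_pipeline([2, None, 4, 8], stages)  # passes both gates
```

Calling `run_pipeline([2, None], stages)` instead raises `QualityGateError` after the first step, so the later steps never run.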

Data

The quality of the data is the basis for training machine learning models successfully at all. The first quality gates are therefore dedicated to the data and ensure that the models are trained on a meaningful data basis. First, the raw data can be checked. It must correspond to the previously assumed statistics and characteristics; otherwise, corrections to the setup of the model and the training may become necessary.
In general, the preparation of the raw data is highly individual and may be costly. Further checkpoints, which are not shown here, may therefore be needed between the preparation steps.
In addition, the distribution of the data in the resulting data sets can be tested. A fair and representative distribution of the data as well as a sufficient number of records in each set should be guaranteed. Otherwise, the evaluations on possible validation and test sets are only of limited value [1].
An insufficient data basis tends not to produce the high-quality models we aim for. If these gates are not passed, terminating the pipeline early, right at its beginning, can save time, resources and money, especially if the gates can be introduced with manageable effort.
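As a rough sketch, a data gate for the split data sets might check the size of the test set and compare the class shares of training and test set. The function names and both thresholds (`min_test_size`, `max_share_gap`) are assumptions chosen for illustration; suitable values depend on the use case.

```python
from collections import Counter
from typing import Dict, List, Sequence


def class_shares(labels: Sequence[str]) -> Dict[str, float]:
    """Relative frequency of each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def data_gate(
    train_labels: Sequence[str],
    test_labels: Sequence[str],
    min_test_size: int = 100,     # assumed minimum number of test records
    max_share_gap: float = 0.05,  # assumed tolerance per class share
) -> List[str]:
    """Return a list of violations; an empty list means the gate is passed."""
    problems = []
    if len(test_labels) < min_test_size:
        problems.append(f"test set has only {len(test_labels)} records")
    train, test = class_shares(train_labels), class_shares(test_labels)
    # A class missing from one of the sets counts as a share of 0.
    for label in sorted(set(train) | set(test)):
        gap = abs(train.get(label, 0.0) - test.get(label, 0.0))
        if gap > max_share_gap:
            problems.append(f"class '{label}' share differs by {gap:.2f}")
    return problems


train_labels = ["a"] * 80 + ["b"] * 20
balanced_test = ["a"] * 8 + ["b"] * 2   # same class shares as training
skewed_test = ["a"] * 5 + ["b"] * 5     # class shares differ clearly
```

Returning the list of violations rather than a bare boolean makes it easier to log why a pipeline run was stopped.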

Examples of data quality gates

Metrics

To evaluate the quality of a model’s predictions in a pipeline, a set of automatically evaluable criteria is mandatory. Ideally, numerical metrics are used for this purpose. They can be computed from the model’s predictions on the held-out test set.
The suitable metrics and threshold values depend, of course, on the particular application. For a classification, for example, a good trade-off between coverage and precision has to be found, and it has to be defined from which prediction certainty onward a class counts as recognized [2]. But once adequate metrics have been selected and the results or threshold values to be achieved have been defined, these criteria can be checked automatically in an initial quality gate for the model.
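A minimal sketch of such a metric gate for a binary classifier could look like this. It reads "coverage" as recall, which is an assumption, and the labels, scores, certainty cut-off and minimum values are made up for illustration.

```python
from typing import Dict, Sequence


def precision_recall(
    y_true: Sequence[int], scores: Sequence[float], certainty: float
) -> Dict[str, float]:
    """Binary precision and recall when scores >= certainty count as positive."""
    y_pred = [1 if s >= certainty else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def metrics_gate(metrics: Dict[str, float], minimums: Dict[str, float]) -> bool:
    """Pass only if every required metric reaches its minimum value."""
    return all(metrics.get(name, 0.0) >= floor for name, floor in minimums.items())


# Made-up test labels and model scores.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.6]
minimums = {"precision": 0.7, "recall": 0.7}  # assumed requirements

metrics = precision_recall(y_true, scores, certainty=0.5)
passed = metrics_gate(metrics, minimums)

# A stricter certainty raises precision but lowers recall.
strict = precision_recall(y_true, scores, certainty=0.75)
strict_passed = metrics_gate(strict, minimums)
```

With these toy values the gate passes at a certainty of 0.5 (precision and recall both 0.75) and fails at 0.75, where precision rises to 1.0 but recall drops to 0.5.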
However, the question arises of course: Is it worth all the effort to introduce complicated machine learning models? Or aren’t there simpler alternatives to meet the requirements?

Baselines

Rather rudimentary solutions can usually be implemented cheaply and quickly using simpler models. Such baseline models can be built from heuristics, simple statistics or even randomly generated values, and the candidate model has to beat them in a test. Comparing against baseline models gives an idea of what performance is easy to achieve. To do this, predictions of the baseline and of a candidate model are generated on the same test set and compared using the selected metrics. This shows how large the difference to a simpler solution actually is.
The more complex model should show a certain performance advantage to justify its use. If it is clear at which difference the ML approach delivers real added value, this difference can serve as a criterion, and another gate can be integrated into the process. If the baseline is not beaten, the data or the training has to be optimized further, or the chosen approach even has to be reconsidered.
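The comparison with a baseline can be sketched as follows. The baseline here is a simple majority-class heuristic, accuracy merely stands in for whatever metric has been selected, and the required gain `min_gain` as well as all data are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(train_labels: Sequence[str], n: int) -> List[str]:
    """Heuristic baseline: always predict the most frequent training class."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


def baseline_gate(candidate_score: float, baseline_score: float,
                  min_gain: float = 0.05) -> bool:
    """Pass only if the candidate beats the baseline by at least min_gain."""
    return candidate_score - baseline_score >= min_gain


train_labels = ["ok"] * 7 + ["defect"] * 3
y_test = ["ok", "ok", "defect", "ok", "defect"]
candidate_pred = ["ok", "ok", "defect", "defect", "defect"]  # made-up predictions

# Both models are scored on the same test set.
baseline_acc = accuracy(y_test, majority_baseline(train_labels, len(y_test)))
candidate_acc = accuracy(y_test, candidate_pred)
passed = baseline_gate(candidate_acc, baseline_acc)
```

On this toy data the baseline reaches an accuracy of 0.6 and the candidate 0.8, so the gate passes with the default `min_gain` but would fail if a gain of 0.25 were required.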

Examples for Model Quality Gates within a Machine Learning Pipeline

A/B tests

However, the actual performance of the model will probably only become apparent in a real-life situation, for example with the help of A/B tests. The A/B test (also called split test) is a test method for evaluating two variants of a system, in which the original version is tested against a slightly modified version [3]. If the infrastructure supports A/B tests, they can be run and provide helpful insights.
For example, correlations between the previously tested offline metrics and the results of the online metrics of the A/B tests might exist. This makes it possible to assess to what extent the offline metrics can predict the performance of the models at all, or which ones should most likely be considered.
What can be done and assessed manually in advance is an A/B test between a first trained model and the baseline.
In addition, if further model approaches are to be tested, the last best model can of course also be compared with a new potentially better model. Thus A/B testing can also be integrated as a quality gate before deployment at the end of the pipeline. However, A/B tests might be time-consuming and therefore not always practical in an automated way.
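If the online metric is a success rate (for example, the share of accepted recommendations), one possible way to turn an A/B test into a gate is a two-proportion z-test. The article does not prescribe a particular statistical test, so this is only a sketch under assumptions: the function names, the significance level `alpha` and the counts in the usage note are illustrative, and the test assumes large samples and a pooled rate that is neither 0 nor 1.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two success rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)


def ab_gate(successes_a: int, n_a: int, successes_b: int, n_b: int,
            alpha: float = 0.05) -> bool:
    """Pass only if variant B is better than A and the gap is significant."""
    better = successes_b / n_b > successes_a / n_a
    p_value = two_proportion_z_test(successes_a, n_a, successes_b, n_b)
    return better and p_value < alpha
```

For instance, with A as the current model and B as the new candidate, `ab_gate(100, 1000, 150, 1000)` passes, while `ab_gate(100, 1000, 105, 1000)` does not, because the small difference could easily be due to chance.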

Production

One difference, compared to the project management approach, however, is that the typical ML process runs in cycles. Ultimately, measuring the quality of a model in the various gates within a machine learning pipeline is just a bet on the future.
Once the model is in production, it is subject to some degeneration, which can vary in degree and speed depending on the domain. This means that models become obsolete over time and may need to be adapted to new circumstances or data. This can be achieved, for example, by retraining the models from scratch or training them further, which starts the process all over again.
Quality gates should therefore also be used while the models are in operation, in order to determine whether a model still delivers, or can continue to deliver, adequate results in a constantly changing environment.
In the case of ‘data drift’, the data itself has changed, for example the distribution or value ranges of the features. Alternatively, the patterns the model has learned are no longer valid, which is known as concept drift [4]. For example, seasonal effects or, as happened during the pandemic, unexpected events can influence the prediction quality of models [5].
Therefore, even when the model is running, a further gate can regularly check new data and verify that it still matches the assumptions.
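A very simple version of such a check for a single numeric feature might compare new data with a reference sample from training time. This sketch only looks at data drift (mean shift and values outside the known range), not at concept drift; the function name and both thresholds are assumptions, and the reference sample must not be constant, since its standard deviation is used as the scale.

```python
import statistics
from typing import Sequence


def drift_gate(reference: Sequence[float], current: Sequence[float],
               max_mean_shift: float = 0.5,       # in reference std devs (assumed)
               max_out_of_range: float = 0.01) -> bool:  # allowed share (assumed)
    """Return True if the current feature values still match the reference."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    low, high = min(reference), max(reference)

    # Shift of the mean, measured in standard deviations of the reference.
    mean_shift = abs(statistics.mean(current) - ref_mean) / ref_std
    # Share of new values outside the value range seen in the reference.
    outside = sum(1 for x in current if x < low or x > high) / len(current)
    return mean_shift <= max_mean_shift and outside <= max_out_of_range


reference = [10, 12, 11, 13, 9, 10, 12, 11]  # made-up training-time values
stable = drift_gate(reference, [11, 10, 12, 11, 10, 12])
drifted = drift_gate(reference, [15, 16, 14, 17, 15, 16])
```

Here `stable` is `True`, while `drifted` is `False` because the new values lie above the reference range and their mean has moved by several standard deviations.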

Machine Learning quality gates chain including monitoring of current model quality

How often a model has to be retrained and which changes have to be made depends on the application and the characteristics of the new conditions. Whether retraining is worthwhile at all also has to be estimated, since the cost of training new models can be considerable.
This trade-off between cost and benefit, or even the criticality of the model, is also a gate. That gate only opens for the next step, the new training, if the model is ‘bad’ enough.

Conclusion

As exemplified here, quality gates can be integrated at various points in a pipeline to ensure the quality of the respective artifacts of a sub-process. For this purpose, success criteria that must be met in order to pass a gate must be defined in advance and can be checked automatically.
Stopping a machine learning pipeline early can save time, resources and money. After all, if a gate fails, it is foreseeable that in the end a sufficient prediction quality of the model can no longer be guaranteed.
Finally, models must also be monitored in production to detect potential model degeneration at an early stage and respond accordingly.
In addition, a number of other gates and tests can be considered to ensure that the ecosystem around the machine learning approach works in practice. This is beyond the scope of this article but should not go unmentioned [6].

References

[1] codecentric Blog – Evaluating machine learning models: The issue with test data sets

[2] codecentric Blog – Evaluating machine learning models: How to tackle metrics

[3] Wikipedia – AB-Test

[4] Wikipedia – Concept drift

[5] Fortune – A.I. algorithms had to change when COVID-19 changed consumer behavior

[6] Google, Inc. – The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction

Blog author

Berthold Schulte

Consultant Data & AI
