Evaluating machine learning models: The issue with test data sets

25.3.2020 | 6 minutes reading time

Machine learning can be used successfully and practically in a corporate environment. A concrete, manageable use case, and thus a focused application of machine learning models, can generate real added value. This added value naturally depends on the use case and on the performance of the trained models. It is therefore important to clarify how well a model can actually support the challenge at hand. In this article I would like to explain how a measured performance can be interpreted, especially depending on how much test data is or should be available.

Test data scope

By using a retained, representative test set and various metrics, scores can be calculated and the models can be evaluated and compared. The test set itself is never used in training. Any validation sets needed to optimize the model are generated only from the remaining data.

Once meaningful metrics have been found for the respective use case and target values to be achieved have been defined, the question arises as to what extent the values achieved can actually be trusted. After all, these values can only be based on a reduced amount of example data. How much test data is necessary for a meaningful evaluation depends on the score to be achieved and the desired confidence in the evaluation.
However, collecting and, in the case of supervised learning, labeling the data often requires manual steps and may be a cost factor that should not be underestimated. A good trade-off must be found between confidence in the assessment and the expected costs of collecting and preparing the test data.

Use case

For further explanation, I will use the example from the article "Evaluating machine learning models: How to tackle metrics".

“A manufacturer of drinking glasses wants to identify and sort out defective glasses in his production. A model for image classification is to be trained and used for this purpose. The database consists of images of intact and defective glasses.” [1]

The number of pictures of defective glasses is very limited here, so that, with some effort, pictures of about 500 intact and 500 defective glasses will be available for training and testing of the models only after some time – especially because defective glasses occur rather rarely in production. From these 1000 pictures a representative test set is then retained before training.
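Retaining a representative test set can be sketched as follows. This is only a minimal illustration using the standard library: the function name `stratified_holdout`, the label strings and the 10% test fraction are assumptions made for this example and are not taken from the original project.

```python
import random

def stratified_holdout(labels, test_fraction=0.1, seed=42):
    """Return (train_indices, test_indices) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random choice within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical labels for the example: 500 intact and 500 defective glasses.
labels = ["intact"] * 500 + ["defective"] * 500
train_idx, test_idx = stratified_holdout(labels, test_fraction=0.1)
print(len(train_idx), len(test_idx))  # 900 100
```

Splitting per class keeps the 50/50 ratio of intact and defective glasses in the test set, which is one simple way to keep it representative.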

How much test data is required for a meaningful evaluation of a machine learning model? And what does "meaningful" mean in this context? Is 10% to 20% of the data set, in this case 100 to 200 images, sufficient?
In this example, accuracy is selected as the metric. After a little training and optimization, the model achieves 80% correct classifications, measured on 100 test images.
To estimate how trustworthy this result actually is, you can use standard tools of statistics.

Confidence interval

Whether an image has been classified correctly by the model can be seen as an experiment with two possible results: success or failure. Testing a model is thus a series of such independent experiments, so that the binomial distribution, or its approximation by the normal distribution [2], can be used to estimate the result.
The extent to which one can "trust" the determined value can be shown with the help of a confidence interval.
The benefit of a confidence interval is that it quantifies the uncertainty of a sample, for example a test run on 100 images, and of the resulting estimate. It is an estimate because the test data represent only a small part of the possible data or population, so the model has been tested on only a small share of the data that may ever occur.

“The confidence interval indicates the range which, with infinite repetition of a random experiment, includes the true location of the parameter with a certain probability (the confidence level).” [3]

The interval is given by a lower and an upper limit and rests on the idea that the test run is repeated many times on different, independent test sets of the same size. At a confidence level of 95%, for example, on average 95% of these imaginary test runs yield limits that include the true score of the model.
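The idea of the repeated test runs can be tried out in a small simulation. The following sketch assumes a model whose true accuracy is known to be 80% (which is never the case in practice) and uses the normal-approximation interval from reference [4]. The names `wald_interval` and `coverage` are chosen for this illustration only.

```python
import math
import random

def wald_interval(successes, n, z=1.96):
    """Normal-approximation interval for a measured accuracy (z = 1.96 for 95%)."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def coverage(true_accuracy=0.8, n=100, runs=10_000, seed=0):
    """Share of simulated test runs whose interval contains the true accuracy."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # One test run: n independent classifications, each correct with true_accuracy.
        successes = sum(rng.random() < true_accuracy for _ in range(n))
        lower, upper = wald_interval(successes, n)
        if lower <= true_accuracy <= upper:
            hits += 1
    return hits / runs

print(f"share of intervals covering the true accuracy: {coverage():.3f}")
```

The printed share should be close to 95%. With only 100 images per run it tends to fall slightly below that, because the normal distribution is only an approximation of the binomial distribution.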

Calculating intervals

The limit values can be calculated as follows, for example using the normal approximation [4]:

$$p \pm z \sqrt{\frac{p\,(1-p)}{n}}$$

Here p is the measured score (as a fraction between 0 and 1), n is the number of test data, and z is a constant that can be read from the standard normal distribution table for the desired "confidence" (confidence level). Common values are, for example:

level (two-sided)	80%	90%	95%	98%	99%
z	1.28	1.64	1.96	2.33	2.58
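Instead of reading z from a table, it can also be computed. A small sketch with Python's standard library, assuming a two-sided interval: a confidence level c then corresponds to the quantile 0.5 + c/2 of the standard normal distribution. The helper name `z_value` is my own choice.

```python
from statistics import NormalDist

def z_value(confidence_level):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2)

for level in (0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: z = {z_value(level):.2f}")
```

For the 95% level this yields z = 1.96, the value used in the calculations of this article.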

At the 95% confidence level (z = 1.96), with 100 test samples and a measured model score of 80%, the interval is 72% to 88%. This range is quite large and probably not accurate enough for some applications.
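The numbers can be reproduced with a few lines of Python. This is a sketch of the normal-approximation interval only. The function name `accuracy_interval` and the clipping to the range 0 to 1 are my own additions.

```python
import math

def accuracy_interval(score, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n samples."""
    margin = z * math.sqrt(score * (1 - score) / n)
    # Clip to the valid range of an accuracy.
    return max(0.0, score - margin), min(1.0, score + margin)

lower, upper = accuracy_interval(0.80, 100)
print(f"{lower:.0%} to {upper:.0%}")  # 72% to 88%
```

The margin here is 1.96 * sqrt(0.8 * 0.2 / 100) = 0.0784, so roughly 8 percentage points in each direction.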

The crux

But even when the test data is doubled to 200 samples, the resulting interval of 74% to 86% is not much smaller. The same calculation can be repeated for accuracy scores of 80%, 90%, 95% and 99% at the 95% confidence level and for test set sizes of 100, 200, 1000 and 10000. On a basis of 10000 samples, the range is about ±1%, which could be acceptable for a score of 80%.
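The margins for these combinations can be tabulated with a short script. Again this is only a sketch based on the normal approximation, with a helper name (`margin`) of my own. For very high scores on small test sets, for example 99% on 100 images, the approximation becomes unreliable, so those cells should be read with caution.

```python
import math

def margin(score, n, z=1.96):
    """Half-width of the normal-approximation interval."""
    return z * math.sqrt(score * (1 - score) / n)

scores = (0.80, 0.90, 0.95, 0.99)
sizes = (100, 200, 1000, 10000)

# One row per score, one column per test set size.
print("score  " + "".join(f"n={n:<8}" for n in sizes))
for score in scores:
    cells = "".join(f"±{margin(score, n):<9.1%}" for n in sizes)
    print(f"{score:<7.0%}{cells}")
```

Because n sits under the square root, the margin only halves when the test set becomes four times as large.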

However, for an accuracy of 85% determined on the basis of 100 test samples, the interval would be 78% to 92%. It therefore also covers a score of 80%. This suggests that it may be worthwhile to train on less data and to enlarge the test set instead. After all, if a poorer score is obtained, for example by training on less data, the limits of its confidence interval may still include the originally better score.
Furthermore, chasing the last fraction of a percent of improvement, determined on the basis of a small test set, may not be a worthwhile effort. It may even happen that, after the test data is increased, a model that did not appear to be so good before performs better than the model that was originally preferred because of an insignificantly higher score.
This means that a statement about the performance of a model, and its differentiation from other models, is only possible to a limited extent on the basis of a small number of test data.
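Whether two measured scores can be told apart at all can be checked roughly by comparing their intervals. The following sketch uses overlapping intervals as a simple heuristic, not as a formal significance test. The function names are assumptions made for this example.

```python
import math

def accuracy_interval(score, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n samples."""
    margin = z * math.sqrt(score * (1 - score) / n)
    return score - margin, score + margin

def intervals_overlap(a, b):
    """True if the two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Two models with 80% and 85% accuracy, each measured on 100 test images.
print(intervals_overlap(accuracy_interval(0.80, 100),
                        accuracy_interval(0.85, 100)))      # True

# The same scores measured on 10000 test images.
print(intervals_overlap(accuracy_interval(0.80, 10000),
                        accuracy_interval(0.85, 10000)))    # False
```

On 100 images the two models cannot be clearly separated. On 10000 images the same difference of five percentage points is clearly visible.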
In general, the larger the sample from which the estimate was drawn, the more precise the estimate and the narrower the confidence interval.
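The relationship can also be turned around to estimate how many test samples are needed for a desired margin. Solving the normal-approximation margin z * sqrt(p(1-p)/n) for n gives n = z² p(1-p) / margin². The sketch below assumes that the expected score is roughly known in advance. The function name `required_test_size` is chosen for this illustration.

```python
import math

def required_test_size(expected_score, margin, z=1.96):
    """Smallest n for which the normal-approximation margin does not exceed `margin`."""
    return math.ceil(z**2 * expected_score * (1 - expected_score) / margin**2)

# Expected accuracy of 80% at the 95% confidence level.
for target_margin in (0.08, 0.05, 0.02, 0.01):
    print(f"±{target_margin:.0%}: {required_test_size(0.80, target_margin)} test samples")
```

For an expected score of 80%, a margin of ±1% requires more than 6000 test samples, which shows how quickly the labeling effort grows with the desired precision.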

Conclusion

Ultimately, a model should be evaluated with a sense of proportion, and the size of the test set should be taken into account in the evaluation. Particularly when results do not differ significantly, selecting a model based solely on these evaluations may not be reliable. A field test in practice, for example A/B testing several models that cannot be clearly distinguished, can support the decision.

References:

[1] codecentric blog, Evaluating machine learning models: How to tackle metrics
[2] Wikipedia, De Moivre–Laplace theorem
[3] Wikipedia, Confidence interval (German version)
[4] Wikipedia, Binomial proportion confidence interval

Blog author

Berthold Schulte

Consultant Data & AI
