Once a model has been trained, it can be evaluated in various ways, with procedures and metrics of varying complexity and meaningfulness. However, the number of possible criteria for evaluating machine learning models can be quite confusing to someone who is just starting out in the field.
For example, it depends on whether the learning is unsupervised or supervised. In the case of supervised learning it also depends on whether we are dealing with regression or classification, on the underlying use case, and so on – to name just a few criteria. I would like to start with supervised learning and classification. In this article I will introduce seven common metrics and methods for evaluating machine learning models, using one example throughout the post. A detailed discussion would be beyond the scope of this introduction, so I will only touch briefly on the mathematics underlying the metrics.
A manufacturer of drinking glasses wants to identify and sort out defective glasses in its production. A model for image classification is to be trained and used for this purpose. The data consists of images of intact and defective glasses. In this binary classification, intact glasses are represented by 0 and defective glasses by 1.
In the following test data (y_true), for example, eight defective and two intact glasses are present, and seven of the defective glasses are correctly classified as defective by the model (y_pred).
# Test data
y_true: [0, 1, 1, 1, 0, 1, 1, 1, 1, 1]
# Forecast/prediction of the model
y_pred: [0, 0, 1, 1, 0, 1, 1, 1, 1, 1]
Accuracy is probably the most easily understood metric: it compares the number of correct classifications to the total number of classifications to be made.
In the following example, only one value was predicted wrongly, so an accuracy of 90% is achieved.
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
Accuracy: 90%
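As a minimal sketch (plain Python, no ML library assumed), accuracy for the example above can be computed like this:

```python
# Example data from above
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 1, 1, 1, 1, 1]

# Accuracy: share of correct classifications among all classifications
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9
```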
If the classes in the data are extremely unevenly distributed, looking at accuracy alone can unfortunately lead to wrong conclusions. For example, if one class occurs nine times more often in the data than the other, simply always predicting the more frequent class is enough to achieve 90% accuracy – even though the model may not be able to predict the other class at all. With this metric and class distribution, it is impossible to tell whether the model can distinguish defective glasses from intact ones, since it appears to always predict a defective glass, as in the following example.
y_true: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Accuracy: 90%
Beyond such pitfalls, which metrics are suitable for evaluating a machine learning model naturally also depends on the use case.
First, a binary classification can distinguish four cases:
- The class was predicted – which is correct (hit, true positive)
- The class was not predicted – which is correct (reject, true negative)
- The class was predicted – which is wrong (false alarm, false positive)
- The class was not predicted – which is wrong (miss, false negative)
However, different designations are common for these four cases. Since hit, correct reject, false alarm and miss seem most descriptive to me, I will use them here.
Trade-Off between coverage and precision
If, for example, it is important to recognize a class in the data as comprehensively as possible, the coverage (recall, hit rate, true positive rate) can be used as a quality criterion. Recall is the number of hits in relation to the sum of hits and misses, i.e. all defective glasses that should have been detected:

Recall = hit / (hit + miss)
If, as above, the model simply always predicts a defective glass, i.e. only one class, the coverage is of course still hit = 9 and miss = 0 and thus Recall = 9 / (9 + 0) = 1.
In contrast, precision describes the ratio of the number of hits to the sum of hits and false alarms, i.e. all glasses supposedly detected as defective:

Precision = hit / (hit + false alarm)
In this case, predicting only one class would of course increase the number of false alarms to 1 and thus result in lower precision: Precision = 9 / (9 + 1) = 0.9.
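Counting the cases and computing both metrics for this one-class prediction can be sketched as follows (plain Python, names like `false_alarm` are my own):

```python
y_true = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # model only ever predicts "defective"

hit         = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
miss        = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
false_alarm = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

recall = hit / (hit + miss)            # 9 / (9 + 0) = 1.0
precision = hit / (hit + false_alarm)  # 9 / (9 + 1) = 0.9
```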
In short, recall describes how completely the defective glasses were found, and precision how many of the supposed hits were actually correct. Achieving both complete coverage and high precision is desirable. In practice, however, this will not always be possible, and the focus will not always be on both. With the F-measure, the two metrics do not have to be checked and weighed against each other separately.
F-measure / F1 score
The F-measure is the harmonic mean of recall and precision and combines the two metrics into one value:

F1 = 2 * Precision * Recall / (Precision + Recall)
Using the harmonic mean makes the metric somewhat more 'sensitive' to the smaller of the two values than the arithmetic mean would be.
For example, a recall of 1.0 and a precision of 0.1 result in an arithmetic mean of 0.55 – which intuitively does not do justice to the small precision value. The harmonic mean, however, works out to 0.18 and points to a worse model.
The F1 score is derived from a more general variant, the F-beta score:

F_beta = (1 + beta²) * Precision * Recall / (beta² * Precision + Recall)

With the F1 score, *beta* is set to 1, so recall and precision are weighted equally. If the application requires it, they can also be weighted differently: a beta value between 0 and 1 weights precision more strongly, a value greater than 1 weights recall more strongly.
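The effect of beta can be seen in a small sketch (plain Python; the function name `f_beta` is my own):

```python
def f_beta(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall:
    # beta < 1 weights precision more strongly, beta > 1 weights recall
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

f1     = f_beta(0.1, 1.0)       # harmonic mean of 0.1 and 1.0 -> ~0.18
f_half = f_beta(0.1, 1.0, 0.5)  # precision-heavy -> pulled towards 0.1
f_two  = f_beta(0.1, 1.0, 2.0)  # recall-heavy -> pulled towards 1.0
```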
However, even this metric can be distorted by an unequal class distribution in the data. In addition, the correct reject case is not considered in the F-measure. A method would be desirable that considers all of the above cases of a binary classification, is robust against unbalanced class distributions, and is easy to represent.
Comparing the model's predictions with the correct values in the form of a contingency table provides a more detailed insight into the model's performance. For each case, the number of its occurrences in the result is counted and entered into the table. The rows describe the predictions and the columns the correct values. In the binary case this table is also called a Confusion Matrix.
The four cases can be found in this matrix as follows:

                actual: defective (1)   actual: intact (0)
predicted: 1    hit                     false alarm
predicted: 0    miss                    correct reject
In the following example, the classes are not equally distributed but are in a ratio of 7:3.
y_true: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred: [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
hit = 6, false alarm = 2, miss = 1, correct reject = 1
The model correctly detected six defective glasses (hit) and correctly classified one intact glass (correct reject). However, two intact glasses were detected as defective (false alarm) and one defective glass was not detected (miss). The aim is to achieve the highest possible values on the diagonal from top left to bottom right – i.e. high values only for hit and correct reject.
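These counts can be reproduced with a short sketch (plain Python; the dictionary `counts` is my own naming):

```python
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]

# Tally each of the four cases of the binary classification
counts = {"hit": 0, "false alarm": 0, "miss": 0, "correct reject": 0}
for t, p in zip(y_true, y_pred):
    if t == 1 and p == 1:
        counts["hit"] += 1
    elif t == 0 and p == 1:
        counts["false alarm"] += 1
    elif t == 1 and p == 0:
        counts["miss"] += 1
    else:
        counts["correct reject"] += 1

print(counts)  # {'hit': 6, 'false alarm': 2, 'miss': 1, 'correct reject': 1}
```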
The matrix considers all cases, but it is not always easy to keep track of, and an intuitive statement about the model's performance is not possible at a glance. Nevertheless, one gets an impression of how the results relate to each other. For example, the values that go into recall and precision can be read directly from the matrix.
Matthews Correlation Coefficient (MCC)
A metric that summarizes all cases and is considered suitable for data sets with unbalanced class distributions is the Matthews Correlation Coefficient [4, 5, 6]. It can be read off and calculated from the Confusion Matrix as follows:

MCC = (hit * correct reject − false alarm * miss) / sqrt((hit + false alarm) * (hit + miss) * (correct reject + false alarm) * (correct reject + miss))
The metric's value range is between -1 and 1. A value of 1 is desirable, 0 corresponds to random prediction, and negative values indicate a contradictory assessment by the model. However, the MCC is not defined if one of the four sums in the denominator is 0.
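As a sketch, the MCC can be calculated directly from the four counts; returning 0 when the denominator is 0 is just one possible convention, not prescribed by the metric itself:

```python
from math import sqrt

def matthews_cc(hit, false_alarm, miss, correct_reject):
    num = hit * correct_reject - false_alarm * miss
    den = sqrt((hit + false_alarm) * (hit + miss)
               * (correct_reject + false_alarm) * (correct_reject + miss))
    # MCC is undefined for den == 0; fall back to 0 by convention here
    return num / den if den != 0 else 0.0

matthews_cc(9, 1, 0, 0)  # one-class prediction -> 0.0 (undefined case)
matthews_cc(8, 1, 0, 1)  # -> 8 / 12 = 0.67
```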
The values for Accuracy, Precision and Recall in the above example of the Confusion Matrix are:
Accuracy: 0.70
Precision: 0.75
Recall: 0.86
If we compared this model to another on the basis of accuracy alone, this could already lead to its exclusion at this point: a model with better accuracy but worse recall and precision might seem preferable, without necessarily being the best model for this application. In this case, recall and precision would be more interesting, since they address the requirements more precisely and thus allow the model to be better assessed and compared.
This means that blind trust in accuracy can, depending on the application, lead to unfavorable conclusions about a model's performance. This phenomenon is commonly known as the Accuracy Paradox.
In addition, depending on the class distribution in the data, some metrics may not reflect changes in the predictions at all, or may even move in different directions. The following examples give a small impression of how different the assessments can be.
y_true: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Accuracy: 0.90
F1 score: 0.95 (0.9474)
Matthews CC: 0.00
With values >= 0.90, F1 and accuracy already suggest quite a good model. However, the Matthews Correlation Coefficient spoils this picture and points to randomly correct predictions, since the intact glasses are extremely underrepresented and the single intact glass was even classified wrongly.
In the following case, intact glasses occur more often in the data, allowing a better evaluation.
y_true: [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred: [1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
Accuracy: 0.90
F1 score: 0.94 (0.9412)
Matthews CC: 0.67 (0.6667)
The Matthews Correlation Coefficient now points to a better model with a value of 0.67, since one of the two intact glasses was recognized correctly. Accuracy and F1 react not at all or only slightly to the change and work out to the same or quite similar values as in the previous example. But are these decisions still made by chance? After all, the two intact glasses are classified correctly in one case and wrongly in the other.
In the following example, two of three intact glasses are predicted correctly and one wrongly, so that the model's performance can be assessed even better. The Matthews Correlation Coefficient acknowledges this with an increase from 0.67 to 0.76, while F1 even drops slightly from 0.94 to 0.93.
y_true: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred: [1, 0, 0, 1, 1, 1, 1, 1, 1, 1]
Accuracy: 0.90
F1 score: 0.93 (0.9333)
Matthews CC: 0.76 (0.7638)
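The diverging behavior of the three metrics across these examples can be reproduced with a small sketch (plain Python; the helper `metrics` is my own):

```python
from math import sqrt

def metrics(y_true, y_pred):
    hit  = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fa   = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    miss = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    cr   = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (hit + cr) / len(y_true)
    f1 = 2 * hit / (2 * hit + fa + miss)
    den = sqrt((hit + fa) * (hit + miss) * (cr + fa) * (cr + miss))
    mcc = (hit * cr - fa * miss) / den if den != 0 else 0.0
    return acc, f1, mcc

examples = [
    ([0, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
    ([0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [1, 0, 1, 1, 1, 1, 1, 1, 1, 1]),
    ([0, 0, 0, 1, 1, 1, 1, 1, 1, 1], [1, 0, 0, 1, 1, 1, 1, 1, 1, 1]),
]
results = [metrics(y_t, y_p) for y_t, y_p in examples]
# Accuracy stays at 0.90, F1 falls slightly, MCC rises: 0.00 -> 0.67 -> 0.76
```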
It is therefore advisable, especially with unbalanced class distributions, to consult the values of several metrics when comparing models, and to determine in advance which metrics are relevant for the use case.
However, what is the minimum score we can be satisfied with for this model? This, of course, depends on the use case and its requirements, and is often difficult to assess if there are no reference values such as the reliability of a human decision maker.
Further graphical presentation of the performance of a model can be instructive and help achieve a good trade-off.
Trade-Off between hit and false alarm rate
While the glass manufacturer's sales department strives for the highest possible throughput in production, the quality and legal departments will be more eager to minimize the number of defective glasses going on sale. The sales department therefore wants to avoid waste consisting of glasses unnecessarily classified as defective (false alarms). Quality assurance, on the other hand, insists on recognizing all defective glasses (hits) – even if this means that a few intact glasses cannot be sold.
Receiver Operating Characteristic
The Receiver Operating Characteristic curve (ROC curve) plots the proportion of objects correctly classified as positive, i.e. the hit rate, against the proportion of objects falsely classified as positive, i.e. the false alarm rate. The false alarm rate is the ratio of the number of glasses falsely detected as defective to the total number of actually intact glasses:

False alarm rate = false alarm / (false alarm + correct reject)
This means that the proportion of glasses correctly identified as defective is compared with the proportion of glasses falsely identified as defective.
In addition, it is also a matter of optimizing a threshold value. Depending on the method used, a model's predictions will not always be exactly 0 or 1. Rather, the values lie between 0 and 1 and thus describe a probability of belonging to a class, which then has to be interpreted: at what value do you want to accept that a class has been recognized?
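Applying such a threshold to probabilistic predictions is a one-liner; a minimal sketch with an assumed threshold of 0.5:

```python
# Hypothetical class probabilities predicted by a model
scores = [0.22, 0.24, 0.40, 0.23, 0.30, 0.42, 0.70, 0.50, 0.60, 0.80]

threshold = 0.5  # accept a score >= 0.5 as "defective"
y_pred = [1 if s >= threshold else 0 for s in scores]
print(y_pred)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```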
For each threshold value, the hit rate is plotted on the vertical axis and the false alarm rate on the horizontal axis. The diagonal from bottom left to top right describes the random baseline. Curves that run well above this line, towards the upper left corner, represent a good evaluation: there, the hit rate is as high as possible while the false alarm rate is very low. The result is the typical curve (black in the diagram) on which you can decide which trade-off to accept, or – if several models are plotted – which model to use with which threshold value, or whether to continue training.
With the model in the following example (orange curve in the diagram), a hit rate of 0.86 can be achieved if a false alarm rate of 0.33 is accepted and a threshold value of 0.3 is applied. The thresholds themselves are not shown in the curve but are listed in the following table.
y_true:           [0    0    0    1    1    1    1    1    1    1   ]
y_pred:           [0.22 0.24 0.40 0.23 0.30 0.42 0.70 0.50 0.60 0.80]
Hit rate:         [0.00 0.14 0.71 0.71 0.86 0.86 1.00 1.00]
False alarm rate: [0.00 0.00 0.00 0.33 0.33 0.67 0.67 1.00]
Thresholds:       [1.80 0.80 0.42 0.40 0.30 0.24 0.23 0.22]
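The hit and false alarm rates in the table can be recalculated per threshold; a minimal sketch (plain Python, the helper `rates` is my own naming):

```python
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
scores = [0.22, 0.24, 0.40, 0.23, 0.30, 0.42, 0.70, 0.50, 0.60, 0.80]

def rates(threshold):
    positives = sum(y_true)            # actually defective glasses
    negatives = len(y_true) - positives  # actually intact glasses
    hit = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    false_alarm = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return hit / positives, false_alarm / negatives

rates(0.42)  # hit rate 5/7 ≈ 0.71, false alarm rate 0.00
rates(0.30)  # hit rate 6/7 ≈ 0.86, false alarm rate 1/3 ≈ 0.33
```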
If you insist on a false alarm rate of 0, a hit rate of 0.71 is possible; for this, the threshold value must be set to at least 0.42. In practice, of course, there are more data records and correspondingly much finer gradations from which a threshold value can be selected. In addition, the area under the curve (AUC) can be interpreted as a value between 0 and 1. Here, 1 once again represents the best value and 0.5 represents chance – in this case also the worst value. The AUC values for the curves are listed in the legend of the diagram.
Evaluating machine learning models – Conclusion
In general, you can try to train a model that achieves the best possible values for all metrics. In practice, however, the effort required to achieve the last per mille of improvement often does not justify the actual benefit. A pragmatic, application-oriented approach is more likely to lead to success and provides the necessary leeway for a meaningful trade-off. After all, the test results and the resulting metric values only allow the models to be compared with each other; how good a model really is will only be shown in practice. An explanation of how the model came to its decision has not been given either, but in some areas it may well be necessary.
But how much test data is actually necessary for a meaningful evaluation?
Read in Evaluating machine learning models: The issue with test data sets what influence the size of the test data set can have on the comparability of models.
[1] Wikipedia, Evaluation of a binary classifier
[2] Wikipedia, Confusion matrix
[3] Yutaka Sasaki, 2007, The truth of the F-measure
[4] Boughorbel S, Jarray F, El-Anbari M, 2017, Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 12(6): e0177678. doi.org/10.1371/journal.pone.0177678
[5] Davide Chicco, 2017, Ten quick tips for machine learning in computational biology. doi:10.1186/s13040-017-0155-3
[6] Wikipedia, Matthews correlation coefficient
[7] Wikipedia, Accuracy paradox
[8] Tom Fawcett, 2005, An introduction to ROC analysis. doi.org/10.1016/j.patrec.2005.10.010
[9] Wikipedia, Receiver operating characteristic
[10] Shirin Elsinghorst, Explainability of Machine Learning Methods, The Softwerker Vol. 13