The Machinery behind Machine Learning – A Benchmark for Linear Regression

14.1.2016 | 10 minutes reading time

This is the third post in the “Machinery behind Machine Learning” series. After all the “academic” discussions it is time to show meaningful results for some of the most prominent Machine Learning algorithms, something that quite a few interested readers have asked for. Today we present results for Linear Regression as the prototype of a Regression method. Follow-up posts will cover Logistic Regression as the prototype of a Classification method and a Collaborative Filtering / Matrix Factorization algorithm as the prototype of a Recommender System. It is worth noting that these prototypes for Regression, Classification and Recommendation rely on the same underlying Optimization framework, despite their rather different interpretation and usage in the field of Machine Learning.

To be a bit more precise: in all three applications the training part is driven by an algorithm for Unconstrained Optimization, in our case Gradient Descent. As we want to emphasize the tuning options and measure the impact of the stepsize in particular, we compare the standard Armijo rule, the Armijo rule with widening and the exact stepsize. In later articles, after we have learned something about Conjugate Gradient and maybe some other advanced methods, we are going to repeat this benchmark, but for now let’s focus on the performance of Gradient Descent.

Linear Regression

Linear Regression is a conceptually simple approach for modeling a relationship between a set of numerical features – represented by the independent variables \(x_1,…,x_n\) – and a given numerical variable \(y\), the dependent variable. When we assume that we have \(m\) different data points or vectors \(x^{(j)}\) and values or real numbers \(y_j\), the model takes the following form:

\(y_j \approx c_0 + c_1 x^{(j)}_1 + … + c_n x^{(j)}_n\)

with \(c_i,\ i=0,…,n\), being some real-valued parameters. Using the matrix-vector notation from Linear Algebra we derive a more compact formulation. We put the parameter values \(c_i\) in the parameter vector \(c\) of length \(n+1\), collect the row vectors \(x^{(j)}:=[1,x^{(j)}_1,…,x^{(j)}_n]\) into an \(m\times (n+1)\)-matrix \(X\) (the leading \(1\) in each vector belongs to the coefficient \(c_0\)) and end up with something close to a linear system of equations:

\(Xc\approx y\).

The interpretation is that we want to find a parameter vector \(c\) that satisfies the linear system of equations as well as possible. We are thus looking for the best approximate solution, because an exact solution does not exist in general if \(m>n+1\), which is assumed to be the case here.
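
The construction of \(X\) and the product \(Xc\) can be sketched in a few lines. This is an illustrative Python sketch using only the standard library, not the R code from the repository; the helper names `design_matrix` and `predict` are made up for this example.

```python
def design_matrix(points):
    """Prepend a constant 1 to every data point so that c[0] acts as the offset c_0."""
    return [[1.0] + [float(v) for v in p] for p in points]


def predict(X, c):
    """Return the vector Xc, i.e. the model prediction for every row of X."""
    return [sum(x_i * c_i for x_i, c_i in zip(row, c)) for row in X]


# Four data points with n = 1 feature each, so X is a 4 x 2 matrix (m > n + 1).
X = design_matrix([[0.0], [1.0], [2.0], [3.0]])
c = [1.0, 2.0]  # c_0 = 1 (offset), c_1 = 2 (slope)
print(predict(X, c))
```

Each row of `X` starts with the constant `1.0`, which is what lets the offset \(c_0\) be treated like any other coefficient.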

The objective function for Linear Regression

One mathematical translation of this goal is to minimize the residual or error measured by the (squared) Euclidean norm:

\(\min_c f(c):=\|Xc - y\|^2\).

The Euclidean norm is the usual notion of distance, so nothing spectacular here. We square the expression to get rid of the square root that hides within the norm; this is simpler and better from a computational point of view. Of course it does influence the concrete value of the error, but it does not change the solution or optimal parameter vector \(c^*\) that we are looking for.

Now we have arrived at an unconstrained minimization problem: the process of minimizing the error theoretically involves all possible values for the parameters \(c_i\), and there are no restrictions. It’s time to show what we have learned so far. First, let’s unfold the compact expression to see what exactly is measured by the objective function \(f\) defined above:

\(f(c)=\|Xc - y\|^2 = \sum_{j=1}^{m} \left( c^Tx^{(j)}-y_j \right)^2\).

In the implementation we have included an optional scaling factor of \(1/(2m)\) that normalizes the value of the objective function with respect to the number of data points. Its presence or absence does not change the solution; it is an implementation detail that we have omitted here. The important ingredients are the summands \(\left( c^Tx^{(j)}-y_j \right)^2\) that quantify the pointwise deviation of the model from the input data.
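
As a minimal sketch of the objective function and its gradient \(2X^T(Xc-y)\), the following Python code sums the squared pointwise deviations and optionally applies the \(1/(2m)\) scaling. The function names and the `scaled` flag are assumptions made for this illustration; the repository's R code may be organized differently.

```python
def residual(X, c, y):
    """Pointwise deviations c^T x^(j) - y_j for all data points."""
    return [sum(x_i * c_i for x_i, c_i in zip(row, c)) - y_j
            for row, y_j in zip(X, y)]


def objective(X, c, y, scaled=False):
    """Sum of squared deviations, optionally multiplied by 1/(2m)."""
    r = residual(X, c, y)
    value = sum(r_j * r_j for r_j in r)
    return value / (2 * len(y)) if scaled else value


def gradient(X, c, y, scaled=False):
    """Gradient 2 X^T (Xc - y), with the same optional scaling as the objective."""
    r = residual(X, c, y)
    g = [2.0 * sum(row[i] * r_j for row, r_j in zip(X, r)) for i in range(len(c))]
    return [g_i / (2 * len(y)) for g_i in g] if scaled else g


# Three points that lie exactly on the line y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
print(objective(X, [0.0, 0.0], y), gradient(X, [0.0, 0.0], y))
print(objective(X, [1.0, 2.0], y))  # the exact fit has zero error
```

The scaling only multiplies the objective and the gradient by the same positive constant, which is why the minimizer stays the same.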

Visualizing Linear Regression

Effectively we sum up the squared prediction errors for every data point, as is illustrated in the following plot. This example has been generated using the mtcars dataset that comes with R. The point-wise (squared) error is the (squared) length of the vertical line between the true data point (black) and the predicted point (blue) on the regression line.

The variables for the optimization algorithm are the coefficients \(c_i,\ i=0,…,n\), and the objective function \(f(c)\) is nothing but a polynomial of degree \(2\) in these variables. In fact there is a structural similarity to some of the simple test functions from part 2 of this blog series, but now let us look at the concrete test case. As usual you can find all information and the R code in the github repository; the code is self-contained.
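
To make the “polynomial of degree 2” statement concrete, the squared norm can be expanded as follows. This is a standard identity written in the notation of this post (without the optional scaling factor), not something taken from the repository.

```latex
\begin{aligned}
f(c) &= \|Xc - y\|^2 = (Xc - y)^T (Xc - y) \\
     &= c^T X^T X c \;-\; 2\, y^T X c \;+\; y^T y .
\end{aligned}
```

The first term is quadratic in \(c\), the second is linear and the third is constant. Since \(X^TX\) is positive semidefinite, \(f\) is convex.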

The algorithmic setup

As we want to check the scaling behaviour in higher dimensions I have decided to create artificial data for this test. The setup is as follows:

The first column of the \(m\times (n+1)\)-matrix \(X\) contains only \(1\)s, the remaining \(n\) columns contain random numbers uniformly distributed in the unit interval \([0,1]\).
We define an auxiliary vector \(\hat{c}\) of length \(n+1\) by \(\hat{c} := [1,2,3,…,n+1]\).
We define a random vector \(z\) of length \(m\) – the number of data points – containing random numbers uniformly distributed in the unit interval, scaled to unit length. \(z\) is used as noise generator.
We define a weight \(\varepsilon\) – a real number that allows us to scale the noise vector \(z\).
Finally, we define the vector \(y\) by \(y:= X\hat{c} + \varepsilon z\).
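
The steps above can be sketched as follows. This is a hedged Python illustration using only the standard library; the function name `make_problem` and the `seed` parameter are assumptions for this example, and the random number generator differs from the one used by the R code in the repository.

```python
import math
import random


def make_problem(m, n, eps, seed=0):
    """Create artificial regression data (X, y) and the auxiliary vector c_hat."""
    rng = random.Random(seed)
    # First column is all ones, the remaining n columns are uniform in [0, 1).
    X = [[1.0] + [rng.random() for _ in range(n)] for _ in range(m)]
    # Auxiliary vector c_hat = [1, 2, ..., n + 1].
    c_hat = [float(i) for i in range(1, n + 2)]
    # Noise vector z with uniform entries, scaled to unit Euclidean length.
    z = [rng.random() for _ in range(m)]
    z_norm = math.sqrt(sum(v * v for v in z))
    z = [v / z_norm for v in z]
    # y = X c_hat + eps * z
    y = [sum(a * b for a, b in zip(row, c_hat)) + eps * z_j
         for row, z_j in zip(X, z)]
    return X, y, c_hat


X, y, c_hat = make_problem(m=50, n=3, eps=0.1, seed=1)
f_hat = sum((sum(a * b for a, b in zip(row, c_hat)) - y_j) ** 2
            for row, y_j in zip(X, y))
print(f_hat)  # close to eps**2, up to rounding
```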

The vector \(\hat{c}\) is by construction an approximate solution of \(y\approx Xc\) as long as the noise \(\varepsilon z\) is small. This setup might look somewhat complicated but it gives us a nice parametrization of the interesting things. The main parameters are

The number of data points: \(m\).
The number of parameters of the linear model: \(n+1\).
The amount of noise: \(\varepsilon\). Its square \(\varepsilon^2\) is a trivial upper bound for the optimal value of the objective function.

The last parameter allows for a quick sanity check: \(\varepsilon^2\) is always a trivial upper bound for the minimal value of \(f\), because we already know the candidate solution \(\hat{c}\) with the property
\(f(\hat{c})=\varepsilon^2\).
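
This property follows directly from the construction of \(y\) and from \(\|z\|=1\). A short derivation for the unscaled objective:

```latex
\begin{aligned}
f(\hat{c}) &= \|X\hat{c} - y\|^2
            = \|X\hat{c} - (X\hat{c} + \varepsilon z)\|^2 \\
           &= \|{-\varepsilon z}\|^2
            = \varepsilon^2 \|z\|^2
            = \varepsilon^2 ,
\qquad\text{hence}\qquad
f(c^*) \le f(\hat{c}) = \varepsilon^2 .
\end{aligned}
```

With the optional scaling factor \(1/(2m)\) the bound becomes \(\varepsilon^2/(2m)\) instead.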

Furthermore it makes the scenario somewhat realistic. Typically you want to reconstruct the unknown solution – modeled by \(\hat{c}\) – but what you get from any algorithm almost always is a perturbed solution \(c^*\) that still contains some of the noise. Using this parametrization you can get a feeling for how much of the added noise actually is present in the solution, which might lead to a deeper understanding of Linear Regression.

Benchmarking Linear Regression

We compare the performance of the standard Armijo rule, the Armijo rule with widening and the exact stepsize for several choices of \(m\), \(n\) and \(\varepsilon\). At the end we also include a single comparison for some choices of a fixed stepsize, which indicates what you can expect from this choice and where it can fail. In all cases, the algorithm terminates if the norm of the gradient is below \(10^{-6}\) or if the number of iterations exceeds \(100,000\).
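
To illustrate the two termination criteria together with a standard (backtracking) Armijo rule, here is a minimal Python sketch. It is not the benchmark code: the function name and the parameters `t0`, `beta` and `sigma` are illustrative assumptions, and the tiny test problem is made up for this example.

```python
import math


def armijo_gradient_descent(f, grad, c0, tol=1e-6, max_iter=100_000,
                            t0=1.0, beta=0.5, sigma=1e-4):
    """Gradient Descent with a backtracking Armijo stepsize.

    t0, beta and sigma are illustrative defaults, not the benchmark's settings.
    Returns the final iterate and the number of iterations performed.
    """
    c = list(c0)
    for k in range(max_iter):
        g = grad(c)
        g_norm_sq = sum(g_i * g_i for g_i in g)
        if math.sqrt(g_norm_sq) < tol:  # criterion 1: small gradient norm
            return c, k
        f_c = f(c)
        t = t0
        # Shrink t until the Armijo (sufficient decrease) condition holds.
        while f([c_i - t * g_i for c_i, g_i in zip(c, g)]) > f_c - sigma * t * g_norm_sq:
            t *= beta
        c = [c_i - t * g_i for c_i, g_i in zip(c, g)]
    return c, max_iter  # criterion 2: iteration limit reached


# Tiny least-squares problem whose exact solution is c = (1, 2).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]


def residual(c):
    return [sum(a * b for a, b in zip(row, c)) - y_j for row, y_j in zip(X, y)]


def f(c):
    return sum(r_j * r_j for r_j in residual(c))


def grad(c):
    r = residual(c)
    return [2.0 * sum(row[i] * r_j for row, r_j in zip(X, r)) for i in range(len(c))]


c_star, iterations = armijo_gradient_descent(f, grad, [0.0, 0.0])
print(c_star, iterations)
```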

Impact of data complexity

The first test is about the simplest possible model, one-dimensional Linear Regression, i.e. the number of model parameters is \(2\). Don’t be confused by this: there is always one parameter for the “offset” \(c_0\), such that the \(n\)-dimensional model depends on \(n+1\) parameters. The optimization algorithm is initialized with the parameter vector \(c=(c_0,c_1)=(0,0)\). We vary the number of data points from \(5\) to \(10,000\) in order to observe the scaling with respect to the amount of data.

That’s more or less the expected behaviour. The number \(m\) of data points can grow but the underlying optimization problem still has the same dimension \(n+1=2\), thus the number of iterations remains roughly constant (given that there are sufficiently many data points) and the runtime increases linearly with \(m\). It is interesting that the exact stepsize needs so few iterations that it is almost as fast as the standard Armijo rule, which indicates the simple structure of the problem. Nevertheless, the Armijo rule with widening is the best choice here.
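
For a quadratic objective such as \(f(c)=\|Xc-y\|^2\), the exact stepsize along the direction of steepest descent has a closed form. The derivation below assumes the unscaled objective and writes \(g\) for the gradient and \(r\) for the residual, symbols introduced here for illustration; the benchmark implementation may determine the exact stepsize differently, so its cost per iteration need not match this formula.

```latex
\begin{aligned}
g &:= \nabla f(c) = 2 X^T (Xc - y), \qquad r := Xc - y, \\
\varphi(t) &:= f(c - t g) = \|r - t\,Xg\|^2
            = \|r\|^2 - 2t\, r^T X g + t^2 \|Xg\|^2, \\
\varphi'(t^*) &= 0
  \;\Longrightarrow\;
  t^* = \frac{r^T X g}{\|Xg\|^2} = \frac{g^T g}{2\,\|Xg\|^2}
  \qquad (Xg \neq 0).
\end{aligned}
```

The last equality uses \(r^TXg = (X^Tr)^Tg = \tfrac{1}{2}g^Tg\).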

Impact of model complexity

The second test varies the number of model parameters from \(2\) to \(16\), the number of data points is fixed at \(1,000\). Here, we expect the scaling behaviour to be somewhat different.

Still, the exact stepsize produces the smallest number of iterations, but the difference is much smaller for more complex models, which indicates that inexact stepsizes are a good choice here. The price for this slight advantage in iterations, on the other hand, is prohibitively high. We can also see that not only the runtime per iteration but also the number of iterations increases with the model complexity. This is something to consider when choosing a model for a real-world problem.

Fixed stepsize

As promised we provide some results for fixed stepsizes as well. Conceptually, it does not make sense to use a fixed stepsize at all unless you do know in advance a good value that definitely works – which you typically don’t. In reality, you have to test several values in order to find something that allows the algorithm to converge in a reasonable time – which is nothing but a stepsize rule that requires a full algorithmic run in order to perform an update. But let’s look at the numbers.

Only for the one-dimensional (and almost trivial) Linear Regression problem is there a value of the stepsize for which the performance is competitive, but this choice is way too risky for non-trivial settings. Even for slightly more complex Linear Regression problems and a fixed stepsize of \(10^{-1}\) the performance is worse by a factor of \(10\). And for more complicated objective functions, e.g. the Rosenbrock function that has been introduced in the last blog post, the value would have to be smaller than \(10^{-6}\), implying that the number of iterations and the runtime would explode.
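
The risk of a fixed stepsize can be reproduced on a toy problem. The following Python sketch is an illustration only: it uses a made-up four-point data set and the unscaled objective, so the concrete thresholds differ from those in the benchmark above. For a quadratic objective, a fixed stepsize converges only below a limit determined by the largest eigenvalue of \(X^TX\); above that limit the iterates blow up, and far below it the method becomes slow.

```python
import math

# Toy data: four points on the line y = 1 + 2x (hypothetical example data).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]


def grad(c):
    """Gradient 2 X^T (Xc - y) of the unscaled objective."""
    r = [sum(a * b for a, b in zip(row, c)) - y_j for row, y_j in zip(X, y)]
    return [2.0 * sum(row[i] * r_j for row, r_j in zip(X, r)) for i in range(len(c))]


def fixed_step_descent(t, c0=(0.0, 0.0), tol=1e-6, max_iter=100_000):
    """Gradient Descent with a fixed stepsize t.

    Returns (iterate, iterations, converged).
    """
    c = list(c0)
    for k in range(max_iter):
        g = grad(c)
        g_norm = math.sqrt(sum(g_i * g_i for g_i in g))
        if not math.isfinite(g_norm):  # iterates have blown up
            return c, k, False
        if g_norm < tol:
            return c, k, True
        c = [c_i - t * g_i for c_i, g_i in zip(c, g)]
    return c, max_iter, False


for t in (0.1, 0.05, 0.001):
    c, iterations, converged = fixed_step_descent(t)
    print(t, iterations, converged)
```

On this toy problem the largest stepsize diverges, the middle one converges, and the smallest one converges but needs many more iterations.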

Summary

The results for Linear Regression already indicate that it is indeed useful to keep an eye on the underlying machinery of Optimization methods. The Armijo rule with widening shows the best performance, but the exact stepsize leads to the smallest number of iterations. These two findings imply that the direction of steepest descent is a good or at least reasonable choice for Linear Regression. The reasoning behind this is twofold. Remember that one argument for inexact stepsizes was that they can help to avoid the risk of overfitting, which the exact stepsize cannot avoid if the search direction is bad. Overfitting, which can be interpreted as a significant deviation between the “local” search direction and the “global” optimal direction pointing directly to the solution, should lead to a higher number of iterations or at least not to fewer iterations. So in reverse, if the number of iterations is smaller, there can be no overfitting and the search direction should be a reasonable approximation of the unknown optimal direction.
The second argument, which directly applies to widening but also to the exact stepsize, considers the step length. Widening leads to larger steps, which again only make sense if the local search direction is a reasonably good approximation of the global optimal direction in a larger environment of the current iterate. As a general rule: larger steps imply that the model is a “good fit” in a larger environment of the current point. The exact stepsize also produced larger steps; we did not report this here, but you can check it on your own.
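
How widening can lead to larger steps is sketched below in Python. Note that the widening rule used here is an assumption made for illustration: an accepted initial step is enlarged by a factor `growth` as long as the larger step still satisfies the Armijo condition and lowers the objective further. The rule from the earlier post and the repository may differ in its details, and all names and default values are hypothetical.

```python
def armijo_step(f, c, g, t0=0.1, beta=0.5, sigma=1e-4, widen=False, growth=2.0):
    """Return a stepsize along -g that satisfies the Armijo condition.

    With widen=True, an accepted initial step is enlarged as long as the
    larger step still satisfies the Armijo condition and lowers f further.
    This widening rule is an assumption made for illustration.
    """
    f_c = f(c)
    g_norm_sq = sum(g_i * g_i for g_i in g)

    def value(t):
        return f([c_i - t * g_i for c_i, g_i in zip(c, g)])

    def armijo_ok(t):
        return value(t) <= f_c - sigma * t * g_norm_sq

    t = t0
    if not armijo_ok(t):
        # Standard backtracking: shrink until sufficient decrease is reached.
        while not armijo_ok(t):
            t *= beta
        return t
    if widen:
        # Widening: try larger steps while they keep improving the objective.
        while armijo_ok(growth * t) and value(growth * t) < value(t):
            t *= growth
    return t


# One-dimensional example f(c) = c^2 at c = 1, where the gradient is 2.
def f(c):
    return c[0] ** 2


t_plain = armijo_step(f, [1.0], [2.0])
t_wide = armijo_step(f, [1.0], [2.0], widen=True)
print(t_plain, t_wide)
```

In this example the plain rule keeps the initial stepsize, while the widened rule accepts a larger one, because the search direction points straight at the minimizer.
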
Feel free to take a look at the code in the github repository and give it a try yourself; you can even apply it to your own data. For other choices of the parameters the numbers can look different: sometimes the widening idea has no effect and sometimes it clearly outperforms the exact stepsize in terms of number of iterations, but my goal was to show you the “average” case, as this is what mostly matters in practice.
The results for Logistic Regression are coming soon and here the impact on the runtime and the overall performance will be even more visible. Thanks for reading!

The Machinery behind Machine Learning – Part 2
The Machinery behind Machine Learning – Part 1


Blog author

Stefan Kühn
