The Machinery behind Machine Learning – Part 1

19.3.2015 | 15 minutes reading time

Machine Learning is one of the hottest topics on the web. But how do machines learn? And do they learn at all? The answer in most cases is: Machines do not learn, they optimize.

This is the first blog post in a small series about the machinery that drives most of the commonly used Machine Learning algorithms. The overall idea is to illustrate how the performance of ML algorithms depends on the underlying optimization algorithms. In the end I want to show that the use of appropriate minimization algorithms can have a significant positive influence on speed and even accuracy, and I intend to demonstrate this explicitly for a Collaborative Filtering Recommender System.

Most data scientists are somewhat familiar with basic Machine/Statistical Learning algorithms like Linear and Logistic Regression, and many know or use more sophisticated algorithms like Support Vector Machines, Neural Networks, Collaborative Filtering Recommender Systems, maybe even Restricted Boltzmann Machines, Deep Learning tools and so on. But most of these algorithms (let’s restrict ourselves to Supervised Learning here) share some general properties and rely on the same more or less hidden machinery: the toolbox of Optimization Theory. In this first post I want to introduce the general ideas of this theory and prepare the ground for applying them to real problems. Self-contained example code in R will be available at github, see Optimization Routines. The series starts by looking at the ingredients of a rather general iterative minimization algorithm, then continues with the well-known Gradient Descent method, proceeds with the Nonlinear Conjugate Gradient method and finally discusses Quasi-Newton methods, in particular BFGS. The reason for this selection is that all of these algorithms use only function values and gradients, so they compete on equal ground. A globalized Newton method would additionally require the second derivative, i.e. the Hessian matrix.

Unconstrained optimization

This part (un-?)fortunately involves a bit of mathematical notation, but I will keep it as simple as possible. We are given a function that maps points from the \(n\)-dimensional Euclidean space to the set of real numbers:

\(\displaystyle f:\mathbb{R}^n \rightarrow \mathbb{R}\)

Let’s start with a \(2\)-dimensional example. Consider points \(p=(x,y)\) in the Euclidean plane \(\mathbb{R}^2\) and the function

\(\displaystyle f(p)=f((x,y))=\frac{1}{2}\left(x^2+y^2\right) \)

that determines half of the squared length of vector \(p\), e.g. for \(p = (3, -1)\) the corresponding function value is \(\displaystyle f(p)=f((3,-1)) = 5\).
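The example function is easy to evaluate in code. The following is a minimal sketch in plain Python; the function name `f` mirrors the formula above, and representing a point as a tuple of floats is an assumption made here only for illustration.

```python
def f(p):
    """Half of the squared Euclidean length of the point p."""
    return 0.5 * sum(x * x for x in p)


# The point p = (3, -1) from the text: 0.5 * (9 + 1)
print(f((3.0, -1.0)))  # 5.0
```

Because the function sums over all coordinates, the same sketch works unchanged for points in any dimension \(n\).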

Looking for local minima

The goal of unconstrained optimization is to find local minima of such functions. The problem that has to be solved is minimizing the function \(f\) over the whole space without any constraints such as nonnegativity of some variables or integrality conditions, formally written as

\(\min\ f(p),\quad p\in \mathbb{R}^n.\)

In Optimization Theory we can restrict ourselves to minima because any maximum of \(f\) is a minimum of \(-f\), so this is in fact no restriction but simplifies things a lot. We do not always aim to find the global minimum in \(\mathbb{R}^n\). Instead we look for points whose function values are no larger than those of all their neighbors; such a point is called a local minimum.

Local minima often exist even if a global minimum does not. An example is the following function, which is unbounded below but has infinitely many local minima:

\(\displaystyle g(x)=x+2\ {\sin(x)}\).
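To make this concrete, here is a small sketch that checks the claim numerically. The derivatives \(g'(x)=1+2\cos(x)\) and \(g''(x)=-2\sin(x)\) follow from the formula above; the candidate points \(x_k = \frac{4\pi}{3}+2\pi k\) are derived here from \(\cos(x)=-\frac{1}{2}\) and are not taken from the original post. The helper names `dg` and `d2g` are illustrative.

```python
import math


def g(x):
    return x + 2.0 * math.sin(x)


def dg(x):
    """First derivative of g."""
    return 1.0 + 2.0 * math.cos(x)


def d2g(x):
    """Second derivative of g."""
    return -2.0 * math.sin(x)


# At x_k = 4*pi/3 + 2*pi*k the slope vanishes and the curvature is positive,
# so each x_k is a local minimum. The values g(x_k) keep decreasing as k
# decreases, hence none of these local minima is a global one.
for k in range(0, -4, -1):
    x_k = 4.0 * math.pi / 3.0 + 2.0 * math.pi * k
    print(f"k={k:2d}  x={x_k:9.4f}  g(x)={g(x_k):9.4f}  "
          f"g'(x)={dg(x_k):9.1e}  g''(x)={d2g(x_k):7.4f}")
```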

Finding the global minimum is sort of an ill-posed problem in its full generality: in a general setting we only have local information (function value and gradient at some points) that does not tell us anything about other points far away. In the end we would have to check the whole space in order to figure out whether a local minimum is global or not. In some important special cases we can derive this from known properties of the function. If \(f\) is convex then any local minimum is a global minimum, and if the convex function \(f\) has an isolated minimum then it is the unique global minimum, as is the case for our example function \(f\).

In order to apply the standard methods from the optimization toolbox we need some basic technical requirements: the problem and the setting have to be well-defined. Obviously there should be one or more local minima, and the function has to be continuously differentiable at least once. Since this is the case in almost all supervised learning applications we can safely assume that this is granted. The first derivative describes the slope of the function at the current point. Informally, all we have to ensure is that a well-defined notion of slope exists at all points (so there are no sharp corners or cusps as e.g. in the absolute value function) and that neither the function nor the first derivative jumps if we slightly change the current point.

A general minimization algorithm

Imagine now that you are located somewhere on a hill in a dry desert (ok, approximately all deserts are dry, this seems to be a defining property of a desert, but you know what I mean, at least it should not be a snow desert). You know that there is one valley or canyon, maybe even several, somewhere nearby and you want to find a reasonable path that leads you to the bottom of such a valley. At these spots you might find water, as all of us have learned in our regular IT survival trainings, especially in north-facing canyons. In such a situation it would be very helpful not to have to search more or less randomly or driven by gut feeling, but to follow some rules that guarantee success. You need a recipe for rescue, and this is what the meta-algorithm below will provide. In order to increase your chances of finding water at the most probable place, a local minimum, most iterative algorithms consist of an initialization procedure that chooses the starting point, and three core iteration steps:

Initialize:

Choose a starting point – an initial guess that can either be determined by your situation – your current location in the desert – or that can be actively chosen.

Iterate until convergence:

Check a termination criterion – are we close enough to the minimum?
Find a descent direction – a direction in which the function value decreases near the current point.
Determine a step size – the length of a step in the given direction that leads to a good decrease.

Concrete realisations of this prototype algorithm typically differ in how they implement the last two points of the iteration, the choice of the descent direction and the step size rule. Sometimes one even determines the step size before fixing the descent direction, as e.g. Trust Region methods do, which seems counterintuitive at first glance. But most minimization methods form a simplified model of the true function around the current point, and the name Trust Region already indicates that the step size then tells us how far we can trust our local model. In any case, Trust Region methods are beyond the scope for now, and we now want to take a closer look at the three iteration steps of the above meta-algorithm.
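The prototype algorithm can be sketched as a loop with two pluggable parts. This is only an illustration under stated assumptions: the function `minimize`, its parameters `choose_direction` and `choose_step`, the tolerance `eps` and the iteration cap `max_iter` are hypothetical names introduced here, and the negative gradient with a constant step size of 0.5 are placeholder choices for the demo, not recommendations.

```python
import math


def minimize(f, grad, p0, choose_direction, choose_step, eps=1e-8, max_iter=1000):
    """Generic iterative minimization loop (illustrative sketch)."""
    p = list(p0)                      # Initialize: the starting point
    iterations = 0
    while iterations < max_iter:
        g = grad(p)
        # Step 1: termination criterion, stop once the gradient is almost zero
        if math.sqrt(sum(gi * gi for gi in g)) < eps:
            break
        # Step 2: find a descent direction
        d = choose_direction(p, g)
        # Step 3: determine a step size along that direction
        t = choose_step(f, p, g, d)
        p = [pi + t * di for pi, di in zip(p, d)]
        iterations += 1
    return p, iterations


# Example function f(p) = 0.5 * ||p||^2 and its gradient
def f(p):
    return 0.5 * sum(x * x for x in p)


def grad_f(p):
    return list(p)


# Placeholder choices, assumed only for this demo
def negative_gradient(p, g):
    return [-gi for gi in g]


def constant_step(f, p, g, d):
    return 0.5


p_min, n_iter = minimize(f, grad_f, (3.0, -1.0), negative_gradient, constant_step)
print(p_min, n_iter)
```

Passing the direction and step size rules in as functions keeps the loop itself unchanged when a different method is plugged in.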

Checking for optimality

This refers to the first part of the iteration, the termination criterion, and it’s a bit technical. As you might have already guessed from the first plot, a candidate \(p^*\) for a minimum always has a certain property: the first derivative, i.e. the gradient, has to be zero:

\(\nabla f(p^*) = 0\)

For our \(2\)-dimensional example the gradient is a vector with two components, one for each coordinate axis, and the above equation translates into the vector equation

\(\nabla f(x,y) = (x,y)^T = (0,0)^T\)

i.e. each component of the gradient has to be \(0\). The unique solution in this simple case is \(x = y = 0\), or \(p^* = (0, 0)\). The formal notation \((…)^T\), where T stands for transposed, means that despite the row notation the vector is a column vector, which is the standard convention but not really important here. Some textbooks even omit this detail.
A point \(p^*\) with the property \(\nabla f(p^*) = 0\) is called a critical point. In general a critical point need not be a minimum, but any minimum is a critical point. To better understand why the above equation is important, let’s get an intuition for what happens to function and gradient near an isolated local minimum \(p^*\). When you approach \(p^*\) from a specific direction (we can choose e.g. the directions along the coordinate axes) the function values have to decrease until you reach the minimum \(p^*\) and then increase again. Obviously this has to be true regardless of the direction you are coming from: because \(p^*\) is an isolated minimum, all other points in a small neighborhood have strictly larger function values. The described behaviour directly implies that the gradient has to become zero exactly at the minimum (the slope is zero) and that the slope changes its sign when you pass through the minimum, due to the decrease in the one direction and the increase in the opposite direction. To see this, think of a simple univariate quadratic function like

\(g(x) = x^2\).

The graph of this function is the standard parabola, the minimum is attained at \(x^* = 0\) and the gradient is given by

\(\nabla g(x) = 2x\),

thus taking negative values for all \(x<0\) and positive values for all \(x>0\), both approaching zero when \(x\) tends to zero.

Now we can come up with a first rule for deciding when to terminate the algorithm. A standard termination criterion simply checks whether the gradient is “sufficiently close” to zero as the gradient has to vanish at a minimum:

Termination criterion: Terminate if \(\|\nabla f(p)\| < \varepsilon\) for a given tolerance \(\varepsilon > 0\).

There is no need to specify this any further here; you will see a possible implementation later on as well as some other complementary criteria.
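In the meantime, the basic criterion fits into a few lines. The sketch below is an assumption-laden illustration, not the implementation announced above: the helper names `norm` and `is_converged` and the default tolerance `eps=1e-6` are chosen here for the example only.

```python
import math


def grad_f(p):
    """Gradient of the example function f(p) = 0.5 * ||p||^2, which is p itself."""
    return list(p)


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def is_converged(grad, p, eps=1e-6):
    """Termination criterion: is the gradient norm below the tolerance eps?"""
    return norm(grad(p)) < eps


print(is_converged(grad_f, (3.0, -1.0)))  # False: the gradient norm is sqrt(10)
print(is_converged(grad_f, (1e-9, 0.0)))  # True: the gradient norm is 1e-9
```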

Finding a descent direction

A search direction \(s\) is a descent direction if the function value decreases in that direction at least “at the beginning”. The search direction defines a ray, a unidirectional path, and if we make a tiny (infinitesimal) step along that path the function value has to decrease. Intuitively the term descent direction is rather self-explanatory, but how can we test whether a given direction is a descent direction or not? In practice we cannot make an infinitesimal step and see what happens. Fortunately there is a nice geometric intuition, and based on that we derive a surprisingly simple test. Here are the contour lines of our example function:

Points on the same contour line (here: the circle segments in grey) share the same function value. The minimum is attained at the point \((x^*,y^*)=(0,0)\) and the function values increase with the radii of the circles. For one specific point the plot shows the gradient \(\nabla f(x,y)\), the negative gradient \(-\nabla f(x,y)\) and a descent direction \(d\). Furthermore, you can see the tangent to the contour line at that point, and one can immediately see that the gradient is perpendicular/orthogonal to this tangent (do not mix it up with the tangent to the function!). For the geometric interpretation the tangent to the contour line plays a crucial role. Obviously it splits the \((x,y)\)-plane into two halves. We now focus on the angle \(\alpha\) between the gradient and a possible search direction. The half-plane that the gradient points into corresponds to angles \(\alpha < 90^{\circ}\) and \(\alpha > 270^{\circ}\). The other one, which the negative gradient points into, corresponds to

\(90^{\circ} < \alpha < 270^{\circ}\).

From the plot it is immediately plausible that any search direction \(s\) with an enclosing angle \(\alpha\) between \(90^{\circ}\) and \(270^{\circ}\), thus pointing into the negative gradient’s half-plane as e.g. the blue \(d\), is a descent direction: there is at least a tiny path through the interior of the contour circle, where the function values decrease before they increase again later on.
So now we have figured out that the angle \(\alpha\) between the gradient \(\nabla f(x,y)\) and a search direction \(s\) reliably tells us whether \(s\) is a descent direction or not. Visually such a check is trivial in the given setting, but in higher dimensions and with less regular functions, i.e. more interesting shapes of the contour lines, it is practically impossible.

Descent direction – a simple test

Fortunately there is an almost trivial computation that does the trick here. Many of you will remember the geometric interpretation of the scalar or dot product of vectors. With \(\|a \|\) being the Euclidean norm or the length of the vector \(a\), it holds that

\(\langle a,b \rangle = a^Tb = \|a \|\|b \| \cos \alpha(a,b),\)

where \(\alpha(a,b)\) is the angle enclosed by \(a\) and \(b\). We now focus on the sign of this expression. Since the Euclidean norm or length of any nontrivial vector is positive (zero if and only if the vector is \((0,…,0)\)) the interesting part is the cosine. We switch to radians instead of degrees because mathematically that is the natural domain of the trigonometric functions. An angle of \(90^{\circ}\) corresponds to \(\frac{\pi}{2}\), similarly, \(180^{\circ}\) corresponds to \({\pi}\) and \(270^{\circ}\) corresponds to \(\frac{3\pi}{2}\). The roots or zeros of the cosine function in the interval \([0^{\circ},360^{\circ}]=[0,2\pi]\) in radians are \(\frac{\pi}{2}\) and \(\frac{3\pi}{2}\). In the open interval \((\frac{\pi}{2},\frac{3\pi}{2})\) the cosine function is negative, thus the scalar product is negative. Conversely, for \(\alpha \in [0,\frac{\pi}{2})\cup (\frac{3\pi}{2},2\pi]\) the cosine is positive. We have found our simple test based on the sign of the above scalar product:

Descent direction: Find a search direction \(s\) that satisfies \(s^T\nabla f(p)<0\).
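The test boils down to the sign of a single dot product. Here is a minimal sketch using the gradient of the example function at \(p=(3,-1)\), which is \((3,-1)\) itself; the helper names `dot` and `is_descent_direction` and the four candidate directions are chosen here for illustration only.

```python
def dot(a, b):
    """Scalar product of two vectors given as sequences of floats."""
    return sum(ai * bi for ai, bi in zip(a, b))


def is_descent_direction(s, gradient):
    """Return True if s^T grad f(p) < 0."""
    return dot(s, gradient) < 0.0


gradient = (3.0, -1.0)  # gradient of f(p) = 0.5 * ||p||^2 at p = (3, -1)

print(is_descent_direction((-3.0, 1.0), gradient))  # True: negative gradient, product is -10
print(is_descent_direction((-1.0, 0.0), gradient))  # True: product is -3
print(is_descent_direction((1.0, 3.0), gradient))   # False: tangent to the contour line, product is 0
print(is_descent_direction((1.0, 0.0), gradient))   # False: product is 3
```

Note that a direction along the tangent to the contour line fails the test because the scalar product is exactly zero, which matches the angle \(\alpha = 90^{\circ}\).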

Determining the step size

There is a variety of methods available for this task. The choice is rather critical, e.g. in some settings very strict requirements are necessary in order to make an algorithm work, but we will not go into detail here. We assume that we have found a reasonable descent direction \(d\) and now we need to decide how long the step in that direction shall be. Since we are only searching along a single direction, this is referred to as line search. It is essentially a one-dimensional problem, and a first idea would be to directly find the next minimum in the given descent direction, an exact line search. In some cases this is even necessary in order to make the whole minimization algorithm work, or at least for proving good convergence results, but it requires solving a one-dimensional optimization problem in every iteration.
For algorithms like Gradient Descent we can rely on one of the most basic procedures, the so-called Armijo rule, an inexact line search method. Historically, it was the very first inexact stepsize rule. The original idea needs a slight technical modification called widening, but for simplicity we omit this here. Later we will see more advanced stepsize procedures anyway. The Armijo rule takes two parameters, \(0< \sigma < 1\) and \(0 < \rho < 0.5\). Now we test for \(\ell = 0,1,2,...\) the Armijo condition:

\(f(p+\sigma^{\ell} d) < f(p) +\rho \sigma^{\ell}\nabla f(p)^Td.\)

The first \(\ell\) that passes this test defines our stepsize \(\sigma^{\ell}\) for the current iteration. We formulate our first stepsize rule:

Stepsize: Choose \(\sigma^{\ell}\) as stepsize according to the Armijo rule.
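The rule translates almost literally into a short backtracking loop. The following sketch is one possible reading of it without widening; the function name `armijo_step`, the default values `sigma=0.5` and `rho=0.01`, the safety cap `max_trials` and the two trial directions are assumptions made for this example and do not come from the post.

```python
def armijo_step(f, p, g, d, sigma=0.5, rho=0.01, max_trials=50):
    """Return the first stepsize sigma**l (l = 0, 1, 2, ...) that satisfies
    f(p + t*d) < f(p) + rho * t * g^T d, where g is the gradient at p."""
    slope = sum(gi * di for gi, di in zip(g, d))  # g^T d
    if slope >= 0.0:
        raise ValueError("d is not a descent direction")
    fp = f(p)
    t = 1.0  # sigma**0, the full step is always tried first
    for _ in range(max_trials):
        trial = [pi + t * di for pi, di in zip(p, d)]
        if f(trial) < fp + rho * t * slope:
            return t
        t *= sigma  # shrink the stepsize and test again
    raise RuntimeError("no admissible stepsize found within max_trials")


def f(p):
    return 0.5 * sum(x * x for x in p)


p = [3.0, -1.0]
g = [3.0, -1.0]  # gradient of f at p

print(armijo_step(f, p, g, [-3.0, 1.0]))   # 1.0: the full step is accepted
print(armijo_step(f, p, g, [-12.0, 4.0]))  # 0.25: two reductions are needed
```

In the second call the direction is four times the negative gradient, so the full step and the half step both overshoot the minimum and fail the test before the stepsize 0.25 passes.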

We aim at generating some progress in the descent direction \(d\), and we already know that if we make a tiny step in this direction then the function value has to decrease. But in practice we do not want to make tiny steps. We want to choose a step as large as possible (whatever that might mean). Therefore, we always try a full step first, i.e. we test the stepsize \(1=\sigma^0\). Depending on the current point this might be too ambitious and we have to accept a smaller step size; sometimes it is not ambitious enough and we could use larger stepsizes. The latter can be realised by a technique called widening that leads to better theoretical and practical properties for Armijo’s rule (efficiency).
But what exactly is the Armijo condition going to tell us? Let’s answer this question graphically.

The Armijo condition, i.e. the above inequality, compares two quantities: the function value on the left side and the cryptic value on the right side. The left side is represented by the black curve, the true function \(f\), and the concrete value \(f(p+\sigma^{\ell}d)\) is the function value when we make a step in direction \(d\) with stepsize \(\sigma^{\ell}\). The right side is represented by the red line. The green line is a straight line with exactly the same slope as the function \(f\) at the current point \(p\). Red and green line are very similar when you look at the defining functions. The one important difference between them is that the slope of the red line is smaller in terms of absolute value: for both lines the slope is negative and the red line is less steep, which is ensured by the parameter \(0<\rho<0.5\). But we know from Analysis that the function \(f\) we are interested in locally decreases like the green line near the current point, and this implies that the function has to lie below the red line at least for all stepsizes \(t\) below some unknown threshold value \(t_s>0\). There has to be a whole interval \((0,t_s)\) of admissible stepsizes. Since \(0<\sigma<1\), the trial stepsizes \(\sigma^{\ell}\) shrink towards zero, so for some finite number \(\ell\) the corresponding stepsize \(\sigma^{\ell}\) is smaller than \(t_s\) and larger than zero by definition, thus it is contained in this admissible interval

\(\sigma^{\ell}\in (0,t_s)\).

Therefore, after a finite number of trials the Armijo condition is fulfilled, and we have computed an admissible stepsize that guarantees a reasonable decrease.
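For the example function \(f(p)=\frac{1}{2}\|p\|^2\) the threshold \(t_s\) can even be computed by hand. The following worked example assumes \(p\neq 0\) and the negative gradient \(d=-\nabla f(p)=-p\) as search direction; this choice of direction is made here only for illustration. For a stepsize \(t>0\) the two sides of the Armijo condition become

\(\displaystyle f(p+td)=\frac{1}{2}(1-t)^2\|p\|^2,\qquad f(p)+\rho t\nabla f(p)^Td=\frac{1}{2}\|p\|^2-\rho t\|p\|^2,\)

so after dividing by \(\frac{1}{2}\|p\|^2\) the condition \(f(p+td) < f(p)+\rho t\nabla f(p)^Td\) is equivalent to

\(\displaystyle (1-t)^2 < 1-2\rho t \iff t^2 < 2(1-\rho)\,t \iff t < 2(1-\rho).\)

Hence \(t_s=2(1-\rho)\), and because \(\rho<0.5\) we get \(t_s>1\). Under these assumptions the very first trial stepsize \(\sigma^0=1\) already lies in the admissible interval \((0,t_s)\).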

Summary

We now have everything at hand for building our very first algorithm for unconstrained minimization of multivariate functions. We know when to stop, how to figure out whether the function decreases in a given direction or not and how to choose the length of the next step if the search direction is a descent direction.
There are several ways to find a good descent or search direction \(s\). We are going to start the next post with the most obvious choice. Stay tuned…

Other articles in this series:
Machinery – Linear Regression
Machinery – Part 2

Blog author

Stefan Kühn
