The Machinery behind Machine Learning – Part 2

20.4.2015 | 17 minutes reading time

Machine Learning is one of the hottest topics on the web. But how do machines learn? And do they learn at all? The answer in most cases is: Machines do not learn, they optimize.
This is part 2 of the series, and after all the preparation it definitely is time to do the first numerical experiments. You can find the respective R code here.

As promised in part 1 we now start with the obvious choice of the descent direction – the negative gradient. From this choice the name Gradient Descent is somewhat obvious but the method has another name: Steepest Descent. Let’s figure out why this name is adequate and what Gradient Descent can do for us.

Gradient Descent

The idea of Gradient Descent is following the path that ensures the quickest win near a given point. The direction specifying this path is given by the negative gradient at the current point:

$s = -\nabla f(p)$

The negative gradient corresponds to an angle of $\alpha=180^{\circ}=\pi$ and thus lies exactly at the center of all possible descent directions, which are given by $90^{\circ} < \alpha < 270^{\circ}$. It's an almost trivial, natural choice since it obviously satisfies $s^T\nabla f(p)<0$, the defining property of a descent direction. Due to this the negative gradient stands out, but there is more. In Optimization Theory we love to formulate anything as a minimization problem, and in fact the (scaled) negative gradient is always the solution of the following auxiliary constrained minimization problem

$\min\ s^T\nabla f(p),\quad s\in \mathbb{R}^n,\ \|s\|=1.$

We have already seen that any $s$ with $s^T\nabla f(p)<0$ is a descent direction at the point $p$. Furthermore, if we fix the length of $s$ as done here by $\|s\|=1$, the value of $s^T\nabla f(p)$ is a direct estimate for the progress that can be made near $p$ in terms of the achievable reduction of the function value when actually performing the step. This follows from the Taylor series expansion of $f$ around $p$ in direction $s$ which is beyond the scope of this post. But we keep in mind that the solution of this auxiliary problem – the negative gradient – guarantees “optimal” progress near the given point. This is the reason why it is called the direction of steepest descent, the “best” descent direction in a local sense because it allows for the maximum immediate progress.
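For readers who want at least the first-order version of this argument, here is a short sketch. It assumes that $f$ is differentiable at $p$ and that $\nabla f(p)\neq 0$; the step length $t>0$ is a symbol introduced here only for illustration. By the Cauchy-Schwarz inequality, every $s$ with $\|s\|=1$ satisfies

$s^T\nabla f(p) \geq -\|s\|\,\|\nabla f(p)\| = -\|\nabla f(p)\|,$

with equality exactly for $s = -\nabla f(p)/\|\nabla f(p)\|$, the scaled negative gradient. Combined with the first-order expansion

$f(p+t s) = f(p) + t\, s^T\nabla f(p) + o(t),$

this says that for small $t$ the function value drops by roughly $t\,\|\nabla f(p)\|$ along the scaled negative gradient, and by less along any other direction of length one.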

Greedy methods

Algorithms like Gradient Descent that are based on locally optimal steps are often called greedy because they take the quickest win offered at the current point. It is hard to argue for anything other than this greedy choice when there is an urgent need for progress – e.g. if we are just thirsty enough, to refer to our find-water-in-the-desert scenario from part 1. But – as is quite often the case with greedy choices – taking the quick win might not be the best in the long run. The most prominent example of this phenomenon probably is the Travelling Salesman Problem. Here, the greedy approach can even lead to the worst possible solution. The situation with Gradient Descent is somewhat comparable. Fortunately, it always works in the sense that you finally end up near a critical point. But depending on the curvature of the function the path to the minimum can be arbitrarily long, and the number of steps can be very high even if you start “close” to the optimum.

Test functions

We now demonstrate the typical behaviour of Gradient Descent on four test functions, starting with a trivial case and ending with a worst case scenario for Gradient Descent. For each of the functions there will be a contour plot for the 2-D case with some additional information. A blue arrow indicates the optimal search direction pointing directly towards the optimum – the red plus – and a grey arrow indicates the negative gradient.
The first test function is our simple example function, half of the squared norm, that we now call

$f_1(p)=f_1(p_1,\ldots,p_n) := 0.5 \cdot \sum\limits_{i=1}^n p_i^2=0.5 \cdot \left\|p \right\|^2.$

It is a so-called radial function, i.e. the function value only depends on the norm of $p$ and not on the concrete coordinate entries, thus minimizing it essentially boils down to a one-dimensional problem. We expect our algorithms to notice that and to benefit from it.

The contour lines are circles, and here the negative gradient always coincides with the optimal descent direction: it always points directly to the optimum. This example is the best case scenario. We will see that the method converges immediately to the unique global minimum, and that there is no dependence on the dimensionality of the problem or on the initial value. The function is invariant under all rotations that leave the origin fixed and in particular symmetric, i.e. you can interchange $p_i$ and $p_j$ and the function value remains the same.
The second function $f_2$ is an anisotropic variant of $f_1$, given by

$f_2(p)=f_2(p_1,\ldots,p_n) := 0.5 \cdot \sum\limits_{i=1}^n (i \cdot p_i)^2$,

i.e. the function value changes depending on the direction. There is no rotation invariance and no symmetry when interchanging variables. This should slow down the convergence of our algorithms a bit, but in the end the function still is, well, mostly harmless.

The contour lines are ellipses, and depending on the scaling of the different axes – in our case the scaling is represented by the factors $i$ – the negative gradient can be almost orthogonal to the optimal direction. Here, dimensionality and initial conditions start to matter significantly. In fact, Gradient Descent in general is highly sensitive to the starting conditions. Concerning $f_2$ this dependency can roughly be characterized as follows: If the starting point is located on one of the coordinate axes, then Gradient Descent is optimal, but if the starting point is far away from the axes then the negative gradient direction is clearly suboptimal, as already mentioned.
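To make the remark about $f_2$ a bit more tangible, here is a minimal Python sketch (not taken from the R code of this series) that measures the cosine of the angle between the negative gradient and the direction pointing straight at the minimum. The helper name `cosine_to_optimum`, the dimension $n=10$ and the two sample points are assumptions chosen only for illustration.

```python
import math

def grad_f2(p):
    # Gradient of f_2(p) = 0.5 * sum((i * p_i)^2), with i counted from 1.
    return [(i * i) * x for i, x in enumerate(p, start=1)]

def cosine_to_optimum(p):
    # Cosine of the angle between the negative gradient and the direction
    # from p straight to the minimum at the origin, which is -p.
    g = grad_f2(p)
    dot = sum(gi * xi for gi, xi in zip(g, p))   # (-g) . (-p) = g . p
    norm_g = math.sqrt(sum(gi * gi for gi in g))
    norm_p = math.sqrt(sum(xi * xi for xi in p))
    return dot / (norm_g * norm_p)

n = 10
on_axis = [0.0] * n
on_axis[0] = 1.0          # point on the first coordinate axis

off_axis = [0.0] * n
off_axis[0] = 1.0
off_axis[-1] = 0.1        # small component along the most strongly scaled axis

print(cosine_to_optimum(on_axis))
print(cosine_to_optimum(off_axis))
```

On the axis the cosine is 1, so the negative gradient points exactly at the optimum. For the second point the cosine drops to about 0.2, which corresponds to an angle of almost 80 degrees, although the point has barely left the axis.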
The third function is

$f_3(p)=f_3(p_1,\ldots,p_n):=\sum\limits_{i=1}^n p_i^4$,

again a simple symmetric function, but not radial or rotationally invariant and thus slightly less trivial than $f_1$.

The contour lines look like squares with rounded corners, and again the negative gradient can differ significantly from the optimal direction depending on the current point.
The last and most interesting function $f_4$ is a variant of the Rosenbrock function, a well-known test function for optimization routines, see e.g. the Wikipedia entry for the Rosenbrock function:

$f_4(p)=f_4(p_1,\ldots,p_n):=\sum\limits_{i=1}^{n-1} \left[(a-p_i)^2 + b \cdot (p_{i+1}-p_i^2)^2\right]$,

the standard parameter choice being $a=1$ and $b=100$. The Rosenbrock function is highly anisotropic, as we will see from the contour plot. Furthermore, there are “mixed” terms, i.e. nonlinear dependencies between the different variables $p_1,\ldots,p_n$; take e.g. the term $(p_2-p_1^2)^2 = p_2^2 - 2 p_2 p_1^2 + p_1^4$. The Rosenbrock function is also known as Rosenbrock’s banana function, a name that indicates the somewhat special shape which exhibits a noticeable curvature.
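For reference, the four test functions and their gradients can be written down in a few lines. The following is a Python sketch derived directly from the formulas above, not the R implementation that accompanies this series; the helper `fd_grad` is an addition for illustration that checks the hand-derived gradients against central finite differences.

```python
def f1(p):
    return 0.5 * sum(x * x for x in p)

def grad_f1(p):
    return list(p)

def f2(p):
    return 0.5 * sum((i * x) ** 2 for i, x in enumerate(p, start=1))

def grad_f2(p):
    return [i * i * x for i, x in enumerate(p, start=1)]

def f3(p):
    return sum(x ** 4 for x in p)

def grad_f3(p):
    return [4.0 * x ** 3 for x in p]

def f4(p, a=1.0, b=100.0):
    return sum((a - p[i]) ** 2 + b * (p[i + 1] - p[i] ** 2) ** 2
               for i in range(len(p) - 1))

def grad_f4(p, a=1.0, b=100.0):
    g = [0.0] * len(p)
    for i in range(len(p) - 1):
        r = p[i + 1] - p[i] ** 2              # residual of the mixed term
        g[i] += -2.0 * (a - p[i]) - 4.0 * b * p[i] * r
        g[i + 1] += 2.0 * b * r
    return g

def fd_grad(f, p, h=1e-6):
    # Central finite differences, used only to sanity-check the gradients.
    g = []
    for i in range(len(p)):
        up = list(p)
        up[i] += h
        dn = list(p)
        dn[i] -= h
        g.append((f(up) - f(dn)) / (2.0 * h))
    return g

p = [0.3, -0.7, 1.2]
for f, grad in ((f1, grad_f1), (f2, grad_f2), (f3, grad_f3), (f4, grad_f4)):
    err = max(abs(x - y) for x, y in zip(grad(p), fd_grad(f, p)))
    print(f.__name__, err)
```

Note how each interior coordinate of `grad_f4` collects contributions from two neighbouring summands. This is the coupling between the variables that the “mixed” terms introduce.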

You might have noticed that the contour plot has a different parameter setting than the standard choice mentioned in the definition of the function $f_4$. The plot shows the contour lines for $a=1,b=5$ and not $a=1,b=100$. It’s somewhat funny that I actually had to change the parameter $b$ to a much less critical value in order to make the extreme curvature visible. With $b=100$ the curvature is so extreme that a standard contour plot cannot resolve it accurately and everything seems to be symmetric. Take a look yourself.

From this plot it is rather obvious that the negative gradient – the grey arrow – can be a very poor search direction despite its property of being locally optimal. Furthermore, due to the enormous length of the gradient the stepsize will be tiny, resulting in a very high number of function evaluations as we will see later.

Stepsize methods

In order to get a feeling for the importance and the impact of stepsize methods we are going to compare three different variants: the standard Armijo rule, the Armijo rule with the widening approach I have already mentioned in part 1 of this series, and an exact stepsize method based on a simple but rather robust bisection method that basically finds the root of a univariate function in an appropriate interval. We apply this bisection method to our line search problem, i.e. finding the root of the directional derivative with the direction being the negative gradient, which is a one-dimensional problem. The exact method is in general not competitive in terms of runtime or numerical costs because it needs a huge number of gradient evaluations, but here we are mostly interested in how the number of iterations is affected by this choice, a qualitative comparison. Intuitively one would guess that solving the line search problem exactly would result in faster convergence, i.e. fewer iterations. Let’s figure out how good our intuition is.
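The two basic stepsize rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the R code of this series: the parameters `sigma` and `beta`, the initial stepsize 1 and the interval-doubling strategy in `exact_by_bisection` are common textbook choices picked here for illustration and may differ from what part 1 and the accompanying code actually use.

```python
def f2(p):
    return 0.5 * sum((i * x) ** 2 for i, x in enumerate(p, start=1))

def grad_f2(p):
    return [i * i * x for i, x in enumerate(p, start=1)]

def armijo(f, p, s, slope, sigma=1e-4, beta=0.5, max_halvings=60):
    # Backtracking: start with t = 1 and shrink until the sufficient
    # decrease condition f(p + t*s) <= f(p) + sigma * t * slope holds,
    # where slope = s^T grad f(p) < 0 for a descent direction s.
    fp = f(p)
    t = 1.0
    for _ in range(max_halvings):
        trial = [x + t * d for x, d in zip(p, s)]
        if f(trial) <= fp + sigma * t * slope:
            return t
        t *= beta
    return t

def exact_by_bisection(grad, p, s, tol=1e-10, max_iter=200):
    # Find a root of the directional derivative
    # phi'(t) = s^T grad f(p + t*s) by bisection.
    def dphi(t):
        g = grad([x + t * d for x, d in zip(p, s)])
        return sum(gi * di for gi, di in zip(g, s))

    lo, hi = 0.0, 1.0
    # Expand the interval until the slope changes sign (assumes it does).
    while dphi(hi) < 0.0 and hi < 1e12:
        lo, hi = hi, 2.0 * hi
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if dphi(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    return 0.5 * (lo + hi)

p = [1.0, 1.0]
g = grad_f2(p)
s = [-gi for gi in g]                           # negative gradient
slope = sum(gi * di for gi, di in zip(g, s))    # equals -||g||^2
t_armijo = armijo(f2, p, s, slope)
t_exact = exact_by_bisection(grad_f2, p, s)
print(t_armijo, t_exact)
```

For this two-dimensional instance of $f_2$ the Armijo rule accepts $t=0.5$ after one halving, while the bisection converges to the exact minimizer along the line, $t=17/65\approx 0.26$. Each bisection step costs one gradient evaluation, which is where the high numerical cost of the exact method comes from.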

The first numerical tests

For a sound comparison of the performance of the different methods that we intend to introduce in this series one cannot look at the runtime only. Since all our methods are iterative methods we have to consider at least the number of steps and the most important computations that are involved. To get a basic overview we use four different values: runtime, number of iterations, number of function evaluations and number of gradient evaluations. At first we try to get a feeling for how the dimension of the problem and the choice of the initial values affect the algorithmic behaviour. The optimization procedure stops if the norm of the gradient at the current iterate is smaller than $tol=10^{-6}$.
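The measurement setup described above can be sketched as a Gradient Descent loop that keeps its own counters. The following Python function is a simplified stand-in for the R code of this series: it uses a plain backtracking Armijo rule with the assumed parameters `sigma` and `beta`, and the way evaluations are counted is one possible convention, so the numbers it produces are not comparable to the tables in this post.

```python
import math

def f2(p):
    return 0.5 * sum((i * x) ** 2 for i, x in enumerate(p, start=1))

def grad_f2(p):
    return [i * i * x for i, x in enumerate(p, start=1)]

def gradient_descent(f, grad, p0, tol=1e-6, max_iter=100000,
                     sigma=1e-4, beta=0.5):
    # Returns the final point and the counters for iterations,
    # function evaluations and gradient evaluations.
    p = list(p0)
    n_iter = n_f = n_g = 0
    while n_iter < max_iter:
        g = grad(p)
        n_g += 1
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break                                # termination criterion
        s = [-gi for gi in g]                    # steepest descent direction
        slope = -sum(gi * gi for gi in g)
        fp = f(p)
        n_f += 1
        t = 1.0
        while True:                              # standard Armijo rule
            trial = [x + t * d for x, d in zip(p, s)]
            n_f += 1
            if f(trial) <= fp + sigma * t * slope or t < 1e-16:
                break
            t *= beta
        p = trial
        n_iter += 1
    return p, n_iter, n_f, n_g

p_end, n_iter, n_f, n_g = gradient_descent(f2, grad_f2, [1.0, 1.0, 1.0])
print(n_iter, n_f, n_g)
```

Runtime can be measured around the call with any timer. It is left out here because it depends on the machine and says little about the algorithm itself.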

Initial values for $f_1,f_2,f_3$

For these functions that are rather similar anyway and share the same minimum at $(0,…,0)$ we use three initial values with the meaningful names good_point / average_point / bad_point that are chosen to cause an algorithmic behaviour that hopefully justifies these names. The concrete values are

$\begin{matrix}
good\_point & = & \dfrac{1}{10} \cdot (1,\ldots,1)\\
& & \\
average\_point & = & -\dfrac{10}{n} \cdot (1,2,\ldots,n)\\
& & \\
bad\_point & = & \dfrac{100}{n} \cdot (1,2,\ldots,n)
\end{matrix}$

Initial values for Rosenbrock function $f_4$

The Rosenbrock function $f_4$ is rather different from the others, so I have prepared some special initial values as well. The minimum is at $(1,…,1)$ and due to the curvature of the function I have chosen the following set of points

$\begin{matrix}
good\_point & = & \dfrac{9}{10} \cdot (1,\ldots,1)\\
& & \\
average\_point & = & -\dfrac{100}{n} \cdot (1,2,\ldots,n) + (2,\ldots,2)\\
& & \\
bad\_point & = & \dfrac{100}{n} \cdot (1,2,\ldots,n)
\end{matrix}$

We do not use all points in all tests, but you can easily do experiments on your own.
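If you prefer to experiment outside of R, the starting points translate directly into code. The function name `start_points` and the `rosenbrock` flag in this Python sketch are made up for illustration; the values follow the two definitions above.

```python
def start_points(n, rosenbrock=False):
    ramp = [float(i) for i in range(1, n + 1)]        # (1, 2, ..., n)
    if rosenbrock:
        # Starting points for f_4, whose minimum is at (1, ..., 1).
        return {
            "good_point": [0.9] * n,
            "average_point": [-100.0 / n * r + 2.0 for r in ramp],
            "bad_point": [100.0 / n * r for r in ramp],
        }
    # Starting points for f_1, f_2, f_3, whose minimum is at the origin.
    return {
        "good_point": [0.1] * n,
        "average_point": [-10.0 / n * r for r in ramp],
        "bad_point": [100.0 / n * r for r in ramp],
    }

print(start_points(4))
print(start_points(4, rosenbrock=True))
```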

Test 1 – Impact of problem dimension

This is just to get a feeling for how things scale in higher dimensions. We have used the bad starting point for $f_1,f_2,f_3$, the good starting point for $f_4$ and always the standard Armijo rule.

The trivial problem $\min f_1$ – essentially one-dimensional – is solved in a constant number of iterations regardless of the dimension which is the perfect scaling. Due to our parameter choice it takes exactly one step since a full step in direction of the negative gradient always ends in the global minimum.
The second problem $\min f_2$ – with anisotropies getting more serious with increasing dimension – shows quadratic complexity, as the number of iterations, function and gradient evaluations increases by a factor of roughly $4=2^2$ when the problem dimension doubles. Such a scaling is problematic as it clearly limits the applicability to small or medium dimensions. Typically one wants to achieve linear or log-linear scaling, i.e. linear up to logarithmic factors.
The third problem $\min f_3$, which is supposed to be more complex as it has a higher degree, seems to be nicer. The number of iterations etc. increases only by a factor of roughly $1.26$ when the problem dimension doubles. This is sublinear growth (a factor of $1.26\approx 2^{1/3}$ per doubling is consistent with growth like $n^{1/3}$), which implies good scalability.
The fourth problem $\min f_4$ appears to be surprisingly harmless, in fact it also shows perfect scaling. Of course the runtimes always increase at a higher rate because the function/gradient evaluations etc. induce a higher number of arithmetical operations in higher dimensions but this is not our primary focus here.
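The one-step behaviour for $f_1$ mentioned above is easy to verify by hand, since $\nabla f_1(p)=p$ and therefore $p - 1\cdot\nabla f_1(p)=0$ for every $p$. The following Python lines just replay this for the bad starting point in a few dimensions; the function name `full_gradient_step_f1` is an assumption made for this sketch.

```python
def full_gradient_step_f1(p):
    # For f_1 the gradient is p itself, so a full step (stepsize t = 1)
    # along the negative gradient gives p - p.
    g = list(p)
    return [x - gi for x, gi in zip(p, g)]

for n in (10, 100, 1000):
    bad_point = [100.0 / n * i for i in range(1, n + 1)]
    after = full_gradient_step_f1(bad_point)
    print(n, max(abs(x) for x in after))
```

The distance to the minimum after the full step is zero in every dimension, which is why the iteration count for $f_1$ does not depend on $n$ or on the starting point.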

As a general note: We should be careful when drawing conclusions from our data. All we have seen so far is a single data point, so we should not overemphasize these results or generalize from these values too much. Nevertheless, there is something we can take with us: The dimensionality of the minimization problem and the complexity of the objective function are not necessarily related. This we can safely derive as we need only one counterexample – in this case $f_4$ – when disproving hypotheses. It’s much harder to prove them.
Furthermore, the results depend on factors that we tend to overlook easily. For this first test the termination criterion plays a crucial role. In higher dimensions it is much harder to reduce the norm below an absolute bound like $tol=10^{-6}$, and if we changed it, e.g. such that we apply the criterion to every component, i.e. replacing the Euclidean norm by the maximum norm, we would see a different behaviour. Especially in really high dimensions one has to think carefully about the implications of such things, as e.g. criteria using the Euclidean norm can imply accuracy requirements in each coordinate that are way beyond machine precision. Quite often it makes sense to use relative criteria or a mixture of both absolute and relative criteria, or to choose problem-specific norms.
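A small Python experiment illustrates the point about norms. The gradient used here is hypothetical: every component is set to $10^{-7}$, so each single coordinate is well below the tolerance, and only the dimension changes.

```python
import math

def euclidean_norm(g):
    return math.sqrt(sum(x * x for x in g))

def max_norm(g):
    return max(abs(x) for x in g)

tol = 1e-6
for n in (10, 10**4, 10**6):
    g = [1e-7] * n     # hypothetical gradient, all components equal to 1e-7
    print(n, euclidean_norm(g) < tol, max_norm(g) < tol)
```

The Euclidean norm of this vector is $10^{-7}\sqrt{n}$, so the criterion is satisfied for $n=10$ but fails from $n=10^4$ on, while the maximum norm stays at $10^{-7}$ regardless of the dimension.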

Test 2 – Impact of stepsize – Part 1

Now I would like to draw your attention to the stepsize rules, which are the second-most important ingredients in our algorithmic setup. First we try to evaluate how good our intuition was concerning the difference between exact and inexact stepsizes.

Actually – and if one wants to judge things in a well-balanced way this is quite often the case – we cannot say much about it in general. There are cases where the exact stepsize is dramatically better, but the opposite can happen as well, see Table 4. We take with us that it’s always a good idea to test things yourself that are not clear a priori. Each stepsize method has its applications, and you will achieve the best results when finding the one that fits your specific problem best.

Test 3 – Impact of stepsize – Part 2

In the last test there has been one case – $\min f_3$ – where the Armijo rule turned out to be a surprisingly bad choice. But what exactly is causing this? When you take a closer look you will notice that the algorithm is almost always (except 4 times) accepting the full stepsize, i.e. effectively the Armijo rule is not doing anything. The reason for this is that the length of the negative gradient is rather small, while the search direction is quite good for this problem. This is where the widening idea comes into play. If the full negative gradient step leads to sufficient decrease, why not try even larger steps? Here are the results.

As you can see here, a minimal modification led to a massive improvement, and the widening approach even outperforms the exact stepsize in terms of iterations. It turns out that the general idea of the Armijo rule – iteratively adapting the steplength – has not been “wrong”. It’s just the artificial restriction of not trying larger stepsizes that resulted in a severe degradation of performance. In some sense the standard Armijo rule is greedy as well: it always accepts the first admissible stepsize. Again, this turns out to not be the best choice in general.
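One way to implement the widening idea is sketched below in Python. The rule used here, doubling the stepsize as long as the sufficient decrease condition still holds, is an assumption about how widening can be done and not necessarily identical to the variant from part 1 or the R code; `sigma`, `beta` and the function name `armijo_widening` are likewise chosen for illustration.

```python
def f3(p):
    return sum(x ** 4 for x in p)

def grad_f3(p):
    return [4.0 * x ** 3 for x in p]

def armijo_widening(f, p, s, slope, sigma=1e-4, beta=0.5, max_steps=60):
    fp = f(p)

    def admissible(t):
        # Sufficient decrease condition of the Armijo rule.
        trial = [x + t * d for x, d in zip(p, s)]
        return f(trial) <= fp + sigma * t * slope

    t = 1.0
    if admissible(t):
        # Widening: the full step is accepted, so try larger steps and
        # keep the largest one that is still admissible.
        for _ in range(max_steps):
            if not admissible(t / beta):
                break
            t /= beta
        return t
    # Otherwise fall back to standard backtracking.
    for _ in range(max_steps):
        t *= beta
        if admissible(t):
            break
    return t

p = [0.1, 0.1]
g = grad_f3(p)
s = [-gi for gi in g]
slope = -sum(gi * gi for gi in g)
t_wide = armijo_widening(f3, p, s, slope)
print(t_wide)
```

At this point the gradient of $f_3$ is tiny, so the standard rule would stop at $t=1$ and move each coordinate from 0.1 to 0.096. With widening, the sketch accepts $t=32$ and lands at about $-0.028$, much closer to the minimum at the origin.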

Test 4 – Impact of stepsize – Part 3

This is the final test for now, and personally I think this is the most interesting case as well. I have set the parameters of the Rosenbrock function to $a=1,b=5$, which makes it less hard for Gradient Descent but still hard enough.

In dimension $n=20$ everything looks somewhat unsurprising. In terms of iterations the exact stepsize is the best choice, while widening makes almost no difference at all given the total number of iterations. But in dimension $n=25$ several interesting things start to happen.
First of all, you should forget about the exact stepsize for Gradient Descent applied to the Rosenbrock function even if it has been slightly better for $n=20$. On a general level the reason for this simply is that the search direction is bad anyway, so there is not much you can expect when optimizing in this direction. When you look at the stepsizes you will notice that the behaviour of Armijo and the exact stepsize rule is rather different. After the first few iterations Armijo takes an almost constant stepsize, while the exact method exhibits some kind of alternating behaviour. If your search direction is not good you should not overoptimize. It is better not to try making larger steps at any cost, but this is what the exact method is doing regularly. After such a long step, the next step can be orders of magnitude smaller, which implies that the situation has become even worse. The path of steps using the exact method proceeds almost at the bottom of the banana-shaped valley. The inexact methods stay slightly away from the very bottom, so the convergence is better in the end. If you want to see this on your own, just print out the first 50-100 stepsizes and compare them to the stepsizes after 10000 iterations or so. In the beginning the exact method is taking larger steps from time to time, but once it is trapped deep down in the valley the stepsizes remain very small. After some thousands of iterations the Armijo stepsize leads to larger steps on average, which is what enables faster convergence.

But the most interesting aspect is the effect of the three(!) widening steps. This tiny change – one widening step every 12000 steps – improves the performance by more than 13%. This shows the sometimes extreme sensitivity of Gradient Descent once more, and it shows that investing in this kind of optimization can pay off massively. That’s exactly the message here.

Do it yourself

If you are interested in experiencing these things on your own feel free to download the code and give it a try. And if you want to see the sometimes strange behaviour of Gradient Descent in action – even stranger than what we have already seen in the above tests, then I can recommend the following setup: Choose dimension $n=8$ (or $n=80$ if you really have a lot of spare time, don’t forget to adapt the max iterations parameter) and try to minimize the Rosenbrock function $f_4$ with the parameters $a=1,b=100$, the standard Armijo rule and the average starting point (this is important, it only works with the average point). Then slightly change the starting point by adding $10^{-2}$ ($10^{-1}$ if you can’t wait anymore and want a quick result) to any coordinate of the average starting point (R does this for you automatically, you can just add it as a constant: + 1e-2 or 1e-1). You need approximately ten minutes for this in $n=8$ dimensions, just in case you are planning to test the $n=80$ scenario first.

Fixed stepsize

One comment on fixed stepsizes: Quite often you can find descriptions of Gradient Descent without any stepsize rule, just using a fixed small stepsize, see e.g. the Wikipedia entry for Gradient Descent. While this works in theory if the stepsize is small enough, I cannot recommend it by any means. The small effort invested in choosing a reasonable stepsize definitely pays off and the method runs orders of magnitude faster. You can test this yourself if you like, just follow the instructions in the Readme. I did not present any numbers here, but if you want to have some fun then try minimizing $f_3$ in dimension $n=7$ starting from the average point in less than 1 million iterations – and don’t just relax $tol$.
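What “small enough” means can be made concrete for the quadratic function $f_2$. With a fixed stepsize $t$, one Gradient Descent step multiplies coordinate $i$ by $(1 - t\,i^2)$, so the iteration converges only if $t < 2/n^2$, and the first coordinate then shrinks by the factor $(1-t)$ per step, which is slow. The Python sketch below replays this for $n=5$; the function name and the two stepsizes are chosen only for illustration.

```python
def fixed_step_gd_f2(p0, t, steps):
    # Gradient Descent on f_2 with a constant stepsize t:
    # coordinate i is multiplied by (1 - t * i^2) in every step.
    p = list(p0)
    for _ in range(steps):
        p = [x - t * (i * i) * x for i, x in enumerate(p, start=1)]
    return p

p0 = [1.0] * 5                     # n = 5, so the threshold is 2/25 = 0.08
too_large = fixed_step_gd_f2(p0, 0.10, 50)
small_enough = fixed_step_gd_f2(p0, 0.05, 50)
print(max(abs(x) for x in too_large))
print(max(abs(x) for x in small_enough))
```

With $t=0.1$ the fifth coordinate is multiplied by $-1.5$ in every step and the iterates blow up. With $t=0.05$ the method converges, but after 50 steps the first coordinate has only been reduced by a factor of $0.95^{50}$, i.e. to roughly 8% of its initial value. The admissible fixed stepsize is dictated by the most strongly curved direction and the speed by the flattest one, which is exactly what an adaptive stepsize rule helps to mitigate.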

Summary

Finally, we have our first interesting results at hand. In particular we have learned something about stepsizes and how Gradient Descent behaves in different settings. We have seen best and worst cases, sensitivity with respect to initial values and the massive impact of minimal changes, e.g. Armijo with widening in Test 3.
Let me thank you for your attention, it’s been a long post but I hope it’s been worth the time. In the next part of this series we will see how the Nonlinear Conjugate Gradient method performs when applied to our test functions, especially the Rosenbrock function. And of course we will learn how Nonlinear CG works algorithmically and why and when it might be a good idea to choose this approach. See you soon…

Other articles in this series:
Machinery – Linear Regression
Machinery – Part 1

Blog author

Stefan Kühn
