Remote training with GitLab-CI and DVC

27.1.2020 | 15 minutes reading time

In many data science projects, there comes a point where the workstation under your desk is no longer the ideal machine for training the model. More powerful processors and GPUs are required, e.g. a suitable server in your company’s rack or a compute instance in the cloud. In this article, we show how you can build a custom remote training setup for your machine learning models. We aim for automation and team collaboration.

From a technology point of view, we use an EC2 instance on AWS to train the model. The automation is implemented via a GitLab CI pipeline that we trigger with special commit messages. We use DVC both to make the model training reproducible and to version data and model, with an S3 bucket on AWS as remote storage for DVC. However, the setup does not specifically require AWS and can be adapted, e.g. to on-premises hardware. For an introduction to DVC we refer to here.

If you are interested in our work, you have likely read the very popular article CD4ML. In this blog post, we cover one particular topic of that article in depth, namely how to conduct the actual training, with a special focus on technical aspects as well as teamwork. We do not discuss the setup of data pipelines, how to deploy the application, or monitoring.

DVC remote training: The high-level idea

The following picture gives a high-level view of the general idea of the setup.

First, we build a Docker image which we use to train the machine learning model. The Docker image provides all runtime dependencies for the training, e.g. libraries or command line programs. However, it does not contain the training data; that data is stored at the training location and must be mounted into the container when executing the training.

The image is pulled to the preferred training location. The choice of the training location is very flexible: the container could be running on your laptop, on a powerful machine in your basement, or anywhere in the cloud. We use DVC to manage the training data. In particular, we use DVC’s functionality to permanently store and version data at the training location. This way we avoid transferring the entire training data to the training location for each training. Instead, we utilize DVC to perform incremental updates. After the training, the training results as well as the associated training data are versioned and stored in the so-called DVC remote storage.

Finally, we trigger a GitLab release for any new version of the model. In this step we upload the training result to an S3 Bucket and use the GitLab releases API to generate a release page.

Our setup addresses teams where team members develop the ML pipeline simultaneously. Each team member uses a separate training location and therefore has access to exclusive compute power and a consistent training environment. However, they share a common remote storage location. The following picture visualizes the setup.

Key aspects

Before we dig into the details of our GitLab CI pipeline, we briefly discuss other key aspects of our setup. Afterwards, we discuss our project setup in more detail. Here the special focus is on the automated model training which consists of three stages in our GitLab CI pipeline.

The code repository

The complete code for the project can be found here. The code repository covers three main concerns.

A DVC project with an ML pipeline.
The runtime environment for the remote training.
Executing the training and releasing the newly trained model.

For the sake of conciseness, we decided to implement all three concerns in a single repository. However, they should in general be split into three different repositories.

The ML pipeline

As our focus is on remote training, we do not discuss details of the ML pipeline (such as model architecture, training configuration, etc.) and treat the ML pipeline as a black box. Therefore, the example code implements only a rudimentary ML pipeline for classifying the Fruits 360 image data set.

We use a simple Keras model and export the trained model in the ONNX format. Our colleague Nico Axtmann showcases the advantages of using the ONNX format in his blog post (German).

Remote training

As discussed in the section “The high-level idea”, the training does not take place in the GitLab runner. Instead, we execute the training on an EC2 instance. Dependencies of the training code, e.g. binary executables and libraries, are provided by a Docker container. After the training is completed, the container is destroyed.

However, in order to save time and bandwidth, we do not check out the DVC project at each and every container start. Instead, the project is checked out to persistent storage of the EC2 instance hosting the container and is mounted into the container. This way only incremental code and training data changes must be fetched before the training.
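The mounting described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the image name, the host and container paths, and the choice of `dvc repro train.dvc` as the training command are assumptions made for this example.

```python
import shlex


def docker_run_command(image, host_project_dir, container_project_dir="/project",
                       train_command=("dvc", "repro", "train.dvc")):
    """Build a `docker run` call that mounts the persistent checkout into the container."""
    return [
        "docker", "run",
        "--rm",  # remove the container once the training has finished
        "-v", f"{host_project_dir}:{container_project_dir}",  # mount the checkout
        "-w", container_project_dir,  # run the training inside the mounted project
        image,
        *train_command,
    ]


# Hypothetical image name and host path, for illustration only.
cmd = docker_run_command(
    "registry.example.com/dvc_example:train_master",
    "/home/ec2-user/dvc_example",
)
print(shlex.join(cmd))
```

Because the project directory lives on the host, everything DVC has already fetched survives the removal of the container.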

Working in a team

Just like for software development, tooling does not eliminate the need to communicate with your team. (After all, tooling should help us establish reliable and efficient means of communication.) Good communication is of increased importance when developing software in the same (feature) branch. When training a model remotely, each ‘trainer’ prepares the repository by committing training data to it, then triggers the training process (via a special kind of commit), and after training has finished, the produced model is automatically committed to the repository as well.

Thus, remote training for the same feature branch is even more prone to race conditions in the commit history than common software development. In particular, in case of a feature-branch-based development process, merging to master must be coordinated carefully. Moreover, each training run relies on consistent data in the working directory. Consequently, two team members must not simultaneously trigger the training process in the same training location. Therefore, we utilize a different training location for each team member. This also allows everybody to independently choose the training branch. Plus, each training run has exclusive access to compute resources.

The CI pipeline

The GitLab CI pipeline definition is contained in the file .gitlab-ci.yml. In this section, the term pipeline refers to this CI pipeline, not to the DVC pipeline which we consider a given black box. The pipeline has three main concerns, namely building the Docker image that provides the training environment, the actual training, and the release of the trained model.

stages:
  - build_train_image
  - train
  - release

In our simple pipeline, each stage contains exactly one job, which for simplicity has the same name as the stage. On each commit, only a selection of the stages is executed. We either execute the build_train_image stage alone, or the train stage followed by the release stage.

Each stage runs in a so-called GitLab runner somewhere in the cloud. In the train stage, the actual training is delegated from the GitLab runner to a dedicated machine, as we discuss below.

Stage 1: Building the training image

Since the runtime environment for the training changes less frequently than the pipeline, we do not run the build_train_image stage on every commit. Instead, a special commit message is required to run this stage. In particular, the commit message has to start with build image.

The following snippet of .gitlab-ci.yml defines this trigger, where the variable $CI_COMMIT_MESSAGE is provided by the runner and contains the commit message.

.requires-build-image-commit-message:
  only:
    variables:
      - $CI_COMMIT_MESSAGE =~ /^build image/

This snippet is referenced in the build_train_image stage as follows.

build_train_image:
  stage: build_train_image
  extends:
    - .requires-build-image-commit-message

The training image definition is contained in the Dockerfile in the root of the GitLab repository. When the build_train_image stage runs, GitLab takes care of checking out the repository contents into the GitLab runner’s working directory. From here, the runner picks up the Dockerfile to build the training image.

We use kaniko to build the training image. kaniko does not require a Docker daemon to build the image. This increases security, since the GitLab runner needs no elevated privileges, and it usually speeds up the build.

We configure kaniko by using gcr.io/kaniko-project/executor:debug as the stage’s GitLab runner’s image. The first line of the stage script is required to configure kaniko correctly. The script uses some environment variables provided by GitLab. The custom variable $DOCKER_REGISTRY points to AWS’ Elastic Container Registry (ECR, for short), where the final image will be stored. The /kaniko/executor picks up the Dockerfile from the $CI_PROJECT_DIR variable, which is provided by default by the GitLab runner and refers to the checked out Git repository. The final image will be stored in the ECR under the name dvc_example:train_ followed by the name of the current branch, e.g. dvc_example:train_example_branch (the tag train is stored in the custom GitLab variable $DOCKER_TAG, the branch name is available in the default GitLab variable $CI_COMMIT_REF_NAME).

build_train_image:
  stage: build_train_image
  …
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    # configure and run kaniko (ecr login creds come from env vars)
    - echo "{\"credHelpers\":{\"$DOCKER_REGISTRY\":\"ecr-login\"}}" > /kaniko/.docker/config.json
    - /kaniko/executor --context $CI_PROJECT_DIR \
      --dockerfile $CI_PROJECT_DIR/ \
      --destination $DOCKER_REGISTRY/dvc_example:${DOCKER_TAG}_${CI_COMMIT_REF_NAME}

Including the branch name in the image tag allows us to develop the training image without affecting team members working in other branches. (Note that, when creating a new branch, the training image must be built before the first training on this branch.)
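To make the naming scheme concrete, here is a small Python sketch that assembles the image reference and the kaniko credential configuration in the same way as the stage script. The function names and the registry host are hypothetical and only serve the illustration.

```python
import json


def training_image_reference(registry, docker_tag, branch, repository="dvc_example"):
    """Compose <registry>/<repository>:<tag>_<branch>, as in the --destination argument."""
    return f"{registry}/{repository}:{docker_tag}_{branch}"


def kaniko_docker_config(registry):
    """Produce the JSON that the stage script writes to /kaniko/.docker/config.json."""
    return json.dumps({"credHelpers": {registry: "ecr-login"}})


# Hypothetical registry host, for illustration only.
registry = "123456789012.dkr.ecr.eu-central-1.amazonaws.com"
print(training_image_reference(registry, "train", "example_branch"))
print(kaniko_docker_config(registry))
```

Note that Docker image tags may not contain characters such as `/`, so branch names like `feature/foo` would have to be sanitized before being used in this scheme.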

Stage 2: Training the model

First, we present the base setup of the train stage. Training the model might be a time-consuming procedure. Therefore, as for building the training image, we do not train the model on each and every commit, but only if the committer specifically instructs the pipeline to execute the training. Again, a special commit message has to be provided that starts with trigger training followed by a descriptive tag (the tag marks the resulting training artifacts for release, see subsection below).

.requires-trigger-training-commit-message:
  only:
    variables:
      - $CI_COMMIT_MESSAGE =~ /^trigger training [a-zA-Z0-9_\-\.]+/
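The effect of the two commit message triggers can be reproduced locally with a short Python sketch. It only approximates GitLab's pattern matching with Python's `re` module, and the function name `stages_for_commit` is our own invention for this example.

```python
import re

# The two patterns used in .gitlab-ci.yml.
BUILD_IMAGE = re.compile(r"^build image")
TRIGGER_TRAINING = re.compile(r"^trigger training [a-zA-Z0-9_\-\.]+")


def stages_for_commit(message):
    """Return the pipeline stages that a commit message would trigger."""
    if BUILD_IMAGE.search(message):
        return ["build_train_image"]
    if TRIGGER_TRAINING.search(message):
        return ["train", "release"]
    return []  # ordinary commits trigger none of the three stages


for message in ["build image with new dependencies",
                "trigger training v1.0 more epochs",
                "fix typo in README"]:
    print(message, "->", stages_for_commit(message))
```

A message that reads only `trigger training`, without a tag, matches neither pattern and therefore starts no training.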

In this stage, we use a python:3.6-alpine environment and supplement it with libraries and binaries needed to conduct the training. For example, we use the boto3 library to start and stop the EC2 instance. Credentials to communicate with AWS services are stored in custom GitLab runner environment variables, such that they are available to our calls of boto3.

train:
  stage: train
  extends:
    - .requires-trigger-training-commit-message
  image: python:3.6-alpine
  script:
    - pip3 install boto3 fire
    …

Next, we outline what is happening in the train stage. The actual training will not be executed in the GitLab runner, as, generally, the runner is located on an “all-purpose” machine, whereas training might require special hardware like GPUs. Therefore, the train stage delegates the training to another machine, namely an AWS EC2 instance. We do not discuss the delegation in detail, but we note that the script bin/orchestrate_ec2.py takes care of starting/stopping the EC2 instance for cost efficiency and monitors the running instance to detect when the training is concluded. For better inspection of the pipeline, we log the orchestration command with all its parameters before actually executing it.

train:
  …
  script:
    …
    - release_name=`bin/commit_message_to_release_name.sh …
      … "$CI_COMMIT_MESSAGE"`
    - cmd="python bin/orchestrate_ec2.py execute_orchestration …
      … $TRAIN_INSTANCE_FOR_USER $GITLAB_USER_EMAIL …
      … $CI_COMMIT_REF_NAME …
      … $DOCKER_REGISTRY/dvc_example ${DOCKER_TAG}_${CI_COMMIT_REF_NAME} $release_name"
    - echo $cmd
    - $cmd
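Since we do not show the internals of bin/orchestrate_ec2.py here, the following Python sketch only illustrates the general start/train/stop pattern; it is an assumption about how such an orchestration could look, not the script itself. The function expects an object with the interface of `boto3.client("ec2")`. To keep the example runnable without AWS access, it uses a small recording stand-in (`RecordingClient`, a hypothetical helper) instead of a real client.

```python
def run_on_instance(ec2_client, instance_id, run_training):
    """Start the instance, run the training callback, and always stop the instance."""
    ec2_client.start_instances(InstanceIds=[instance_id])
    try:
        # Block until EC2 reports the instance as running.
        ec2_client.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        return run_training(instance_id)
    finally:
        # Stop the instance even if the training failed, to limit cost.
        ec2_client.stop_instances(InstanceIds=[instance_id])


class RecordingClient:
    """Stand-in for boto3.client("ec2") that only records the calls made to it."""

    def __init__(self):
        self.calls = []

    def start_instances(self, InstanceIds):
        self.calls.append(("start", InstanceIds))

    def stop_instances(self, InstanceIds):
        self.calls.append(("stop", InstanceIds))

    def get_waiter(self, name):
        client = self

        class Waiter:
            def wait(self, InstanceIds):
                client.calls.append(("wait:" + name, InstanceIds))

        return Waiter()


client = RecordingClient()
result = run_on_instance(client, "i-0a9ec87b6ae9cf87b", lambda instance_id: "trained")
print(result, [name for name, _ in client.calls])
```

With a real client, `run_training` would be the place to connect to the instance and launch the training container. The `finally` block reflects the cost argument from above: the instance is stopped whether or not the training succeeds.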

The variables given as arguments to orchestrate_ec2.py configure the training and release, as we discuss in the following subsections:

Training configuration

The variables $DOCKER_REGISTRY and $DOCKER_TAG determine the Docker training image that is pulled to the EC2 instance before starting the container. Both variables have the same values as in the build_train_image stage, i.e., we use the most recent build of the training image.

To allow for a flexible development workflow in teams, we support branch-based development. For example, a team member might develop a new DVC pipeline stage in a branch other than master before making it available for the rest of the team (by merging to master). Since team members might conduct training for different branches simultaneously, each member uses a separate “private” EC2 instance.

The variable $GITLAB_USER_EMAIL is provided by default by the GitLab runner and identifies the committer for the pipeline run in question. The mapping stored in the file $TRAIN_INSTANCE_FOR_USER lets the orchestration script determine the committer’s private instance to forward the training to. For security, the content of the file $TRAIN_INSTANCE_FOR_USER is also a GitLab variable and not committed to the repository. This is what a mapping might look like (EC2 instance IDs are fake):

{
  "marcel.mikl@codecentric.de": "i-0a9ec87b6ae9cf87b",
  "bert.besser@codecentric.de": "i-0b07acec0ef7a8fbc"
}
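Resolving the committer's private instance from such a mapping takes only a few lines. The following Python sketch shows one possible way to do it; the function name `instance_for_user` is hypothetical and the actual orchestration script may handle this differently.

```python
import json


def instance_for_user(mapping_json, user_email):
    """Look up the EC2 instance ID assigned to the given committer."""
    mapping = json.loads(mapping_json)
    if user_email not in mapping:
        # Fail early instead of starting a training on the wrong machine.
        raise KeyError(f"no training instance configured for {user_email}")
    return mapping[user_email]


# Same fake instance IDs as in the mapping above.
mapping_json = """
{
  "marcel.mikl@codecentric.de": "i-0a9ec87b6ae9cf87b",
  "bert.besser@codecentric.de": "i-0b07acec0ef7a8fbc"
}
"""
print(instance_for_user(mapping_json, "bert.besser@codecentric.de"))
```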

The variable $CI_COMMIT_REF_NAME contains the branch name of the commit. The orchestration script instructs the EC2 instance to switch to the given branch before starting the training. Note that artifacts of all branches are pushed to the same DVC remote storage.

Release preparation

After training, as the final step on the EC2 instance, we push newly generated binary artifacts to the DVC remote and commit/tag the DVC pipeline state in the Git repository. This is where the descriptive tag of the commit message comes into play; it serves as the Git tag for future reference to this training’s artifacts. We use the script bin/commit_message_to_release_name.sh to extract the third token of the commit message and store it in the variable $release_name. The orchestration script then forwards $release_name to the EC2 instance.
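The extraction of the third token can be illustrated in Python as follows. This is a sketch of the behavior described above, not a port of the shell script, whose content is not shown in this article; the function name is an assumption.

```python
def release_name_from_commit_message(message):
    """Return the third whitespace-separated token of a 'trigger training <tag>' message."""
    tokens = message.split()
    if len(tokens) < 3 or tokens[:2] != ["trigger", "training"]:
        raise ValueError(f"not a training commit message: {message!r}")
    return tokens[2]


print(release_name_from_commit_message("trigger training v1.2 with more epochs"))
```

Everything after the third token is ignored, so the remainder of the commit message can be used for a free-text description.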

Stage 3: Releasing the model

After the train stage finishes successfully, the release stage takes care of making the training results available publicly, using the script bin/upload_and_release.sh. The script copies the file train.dvc to a public S3 bucket. It also creates a GitLab release page containing a link to the copied file (environment variables provide credentials and location information for the script). Again, the stage only runs if the committer requests a training using a commit message of the form trigger training <release tag>, where the release tag determines the name of the GitLab release.

release:
  stage: release
  image: python:3.6-alpine
  extends:
    - .requires-trigger-training-commit-message
  before_script:
    - apk add --no-cache curl
    - pip3 install awscli
  script:
    - release_name=`bin/commit_message_to_release_name.sh "$CI_COMMIT_MESSAGE"`
    - cmd="bin/upload_and_release.sh $release_name $BUCKET_NAME $CI_PROJECT_ID $GITLAB_TOKEN"
    - echo $cmd
    - $cmd
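For readers who prefer Python over shell, the release creation could be sketched as below. The example only builds the HTTP request for the GitLab releases API (`POST /projects/:id/releases`) and does not send it. The function name, the GitLab host, the project ID, and the bucket URL are assumptions for illustration; the sketch further assumes that the Git tag already exists, as created during release preparation. It does not show what bin/upload_and_release.sh actually does.

```python
import json
import urllib.request


def build_release_request(gitlab_url, project_id, token, release_name, asset_url):
    """Build (but do not send) a request that creates a GitLab release with one asset link."""
    payload = {
        "name": release_name,
        "tag_name": release_name,  # assumes this tag was pushed after the training
        "description": f"Training result for {release_name}",
        "assets": {"links": [{"name": "train.dvc", "url": asset_url}]},
    }
    return urllib.request.Request(
        url=f"{gitlab_url}/api/v4/projects/{project_id}/releases",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "PRIVATE-TOKEN": token},
        method="POST",
    )


# Hypothetical host, project ID, token, and bucket URL.
request = build_release_request(
    "https://gitlab.example.com", 42, "secret-token", "v1.2",
    "https://example-bucket.s3.amazonaws.com/v1.2/train.dvc",
)
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request)
```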

Further Thoughts

Separation of concerns

We chose to trigger the training using a special commit message, since we want training results to be tagged properly. That is, we use tags for releases of training results exclusively. Separating the GitLab pipeline orchestrating the training into another code repository would allow us to also use tags for triggering the training. In our opinion, this approach allows for better inspection into the history of who/when/… triggered a training.

Moving the code that creates the training container out of the DVC project’s repository clearly improves separation of concerns. The beneficiaries of this procedure would be e.g. data scientists, since their work environments for the ML pipeline are not ‘polluted’ with cloud and container concerns. However, this separation introduces a dependency, since the runtime environment must be prepared with the required software for the actual training.

Using the trained model in an application

Typically, the trained model will be used in an application, e.g. a web server providing predictions via a REST API. In order to build an application using the model, we propose the following method to retrieve the model’s binary file referenced in train.dvc:

Initialize a DVC repository in the application repository and configure the same remote storage as for the model training repository.
Download the train.dvc file (or rather, any particular release of train.dvc) and place it in the application project.
Finally, execute dvc pull train.dvc to download all output files, e.g. the model.onnx binary, defined in the train.dvc file.
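The three steps above translate into a handful of DVC commands. The following Python sketch merely assembles them; the remote name `storage` and the bucket URL are assumptions, and downloading train.dvc itself (step two) is left out because it depends on where the release is hosted.

```python
def model_retrieval_commands(remote_url, dvc_file="train.dvc", remote_name="storage"):
    """List the DVC commands that fetch the outputs referenced by a .dvc file."""
    return [
        ["dvc", "init"],                                         # step 1: initialize DVC
        ["dvc", "remote", "add", "-d", remote_name, remote_url],  # step 1: same remote as training
        # step 2 happens outside DVC: place the released train.dvc in the project
        ["dvc", "pull", dvc_file],                                # step 3: download the outputs
    ]


# Hypothetical bucket URL, for illustration only.
commands = model_retrieval_commands("s3://example-bucket/dvc-storage")
for command in commands:
    print(" ".join(command))
# To execute them: subprocess.run(command, check=True) for each command.
```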

In many cases, the model binary is not the only output file of the ML pipeline. For example, we also generate a model.config file which contains parameters to score data with the model. Depending on the context, various files and artifacts from the ML pipeline are required to build an application. By employing DVC ML pipelines, the .dvc file already defines the collection of the final output files of our ML pipeline and hence there is no need to additionally create a bundle (e.g. a zip file) with all relevant artifacts.

Note: The straightforward way to retrieve a particular binary file is to use the dvc get command, see the documentation. This command ‘downloads a file from a DVC project’, where the desired revision of the file is determined by the --rev parameter.

Reproducibility

The use of DVC to version the data and the training result allows all results of a specific release to be reproduced. This is particularly relevant if the training results are used inside applications. In case there is a problem with the model in production, the results can easily be reproduced for examination. In this case we use git checkout release-tag followed by dvc repro train.dvc, and DVC will carry out all steps to reproduce the results automatically.

Automated testing

Our pipeline does not employ any kind of automated testing. Our only indication of failure is a process exiting with an error, which in turn fails the entire pipeline. For productively developing an ML pipeline, a smoke test is desirable: whenever a commit is pushed to the GitLab repository, the entire pipeline should run on reduced data. If the pipeline succeeds, we can be more confident that the training process on the full data set will succeed as well (and will not fail ‘last-minute’ after days of computation).
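One way to obtain such reduced data is to draw a small, reproducible sample per class. The following Python sketch illustrates the idea; the function name, the file names, and the sample size are assumptions, and it is not part of our example repository.

```python
import random


def smoke_test_subset(samples_by_class, per_class=5, seed=0):
    """Pick a small, reproducible sample of files for each class."""
    rng = random.Random(seed)  # fixed seed: every smoke test sees the same data
    subset = {}
    for label in sorted(samples_by_class):
        samples = sorted(samples_by_class[label])
        count = min(per_class, len(samples))  # small classes are kept entirely
        subset[label] = rng.sample(samples, count)
    return subset


# Hypothetical file names, for illustration only.
data = {
    "apple": [f"apple_{i}.jpg" for i in range(10)],
    "banana": ["banana_0.jpg"],
}
subset = smoke_test_subset(data, per_class=3, seed=42)
print({label: len(files) for label, files in subset.items()})
```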

Some setups might profit from automatic detection of performance degradation. For example, an additional pipeline stage fails if, say, accuracy of the newly trained model is worse than the highest achieved previously.
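Such a check could be as simple as the following Python sketch. How the accuracy history is stored and loaded is left open; the function name, the `tolerance` parameter, and the example values are assumptions for illustration.

```python
def is_degraded(new_accuracy, previous_accuracies, tolerance=0.0):
    """Report whether the new model is worse than the best previous one."""
    if not previous_accuracies:
        return False  # nothing to compare against on the first run
    return new_accuracy < max(previous_accuracies) - tolerance


# Made-up accuracy values, for illustration only.
history = [0.91, 0.93, 0.92]
for accuracy in (0.90, 0.94):
    status = "degraded" if is_degraded(accuracy, history) else "ok"
    print(accuracy, status)
# In a CI job, a degraded result would end the stage with a non-zero exit code,
# e.g. via sys.exit(1), which fails the pipeline.
```

A small tolerance can be useful when training is not fully deterministic and tiny fluctuations should not fail the pipeline.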

Conclusion

In this blog post, we discussed a basic setup that automates the training of machine learning models in production environments. We also provided an initial implementation of the setup and gave some insight into our reasoning. The setup should be considered a possible starting point for building an automation setup for other projects. Typically, no single setup is best for all projects, and appropriate adjustments and considerations are required. We have already built similar model training pipelines based on these ideas for our customers, which are used successfully in production.

Besides customized solutions, there are also several existing cloud solutions, such as AWS SageMaker and Google Cloud AI Platform, which tackle model training (and even model serving) under given framework conditions. Depending on the use case and the data involved, it can make good sense to use cloud services. However, this discussion is a topic for an additional blog post.

Blog authors

Marcel Mikl

Service Lead Data & ML & AI

Bert Besser
