Machine learning: Moving from experiments to production

19.3.2019 | 13 minutes reading time

Judging by the many 5-minute tutorials for bringing a trained model into production, such a move should be an easy task. However, there are many different libraries and products popping up lately, indicating that everyone – including tech giants – has different opinions on how to build production-ready machine learning (ML) pipelines that support today’s fast release cycles. So, not that easy after all? It’s actually quite hard for reasons I will point out. While you could invest in an all-in-one solution, it may be difficult to justify the costs in early adoption stages.

I invite you to join me as I go back to the drawing board and think about a sane approach to planning an ML pipeline that fits your organization’s needs. This blog post is part one of a series about bringing models to production. Today, we will look at the goals you may want to achieve with an ML pipeline, different technical approaches, and an example architecture. In upcoming posts, we will focus more on hands-on technical implementations. Chances are that you will pick up valuable orientation for your own transition along the way!

Why a machine learning pipeline is important

As the topic receives more and more coverage in technical literature, an increasing number of companies are beginning to experiment with it and to evaluate possible applications in their business domains. While these proofs of concept might yield promising results, there often remains confusion about how to integrate the resulting models into existing systems and processes.

A bar diagram showing that 49% of organizations are still exploring ML, 36% have 2+ years of production experience and 16% have 5+ years.

A recent survey from O’Reilly (August 2018) shows that 49% of the participating organizations are experimenting with machine learning, but do not have models in production.

However, reaching production – even only with an MVP – is the best way to gain internal or external attention and funding. To avoid reinventing the wheel for every new application, an efficient way is to build a central pipeline that defines a clear path for making trained models available to use, while still allowing flexible experimentation during model development. Such a pipeline handles many recurring tasks and manages compute resources so that data scientists and engineers are able to focus on the specifics of their current application instead of thinking about “housekeeping”. This most likely leads to faster release cycles.

Another challenge is the knowledge gap often present in organizations: data scientists lack engineering knowledge whereas engineers lack insight into model creation. A pipeline helps to clarify the process and defines clear areas of responsibility and artifact formats.

Desirable goals for a machine learning pipeline

It’s tempting to dive right into one of the many all-in-one solutions that exist out there. While an all-in-one solution can be a viable choice, I encourage you to first think about your organization’s needs and the skills of your ML team. Here are some important aspects to consider:

Lifecycle Coverage – Ideally the pipeline spans all the way from initial experiments to the deployment of a model. This makes onboarding and collaboration easier and each model follows a clear path.

Freedom During Experimentation – Standardized procedures (such as a pipeline) tend to introduce order and coherence, but they also annoy people if they are too restrictive. Make sure to involve your data scientists to determine an appropriate level of flexibility for experimenting and creating models. Settling on a specific ML library (e.g. TensorFlow) is more restrictive than settling on one programming language (e.g. Python). Using containers is even less restrictive. However, the more restrictive the choice, the better the pipeline can be optimized for library specifics.

Tracking Experiments – Storing the model code, hyperparameters, and result metrics of every experiment is important to be able to discuss results, decide in which direction to go next, or reproduce experiments. The simplest format for this could be a shared spreadsheet, but more sophisticated options are available – though they often come with additional requirements for the rest of the pipeline. One example is MLflow’s experiment tracking server, which provides a web UI for browsing and comparing runs.

Automation – Even when a lot of steps are performed manually in the beginning, make sure that all steps can be automated later on via APIs or certain tools. Repeated manual tasks are boring and error-prone.

Model and Code Versioning – Versioning the model code is essential for reproducing training runs. The pipeline should also be capable of managing versioned artifacts for at least every model that is supposed to be deployed in production. This is necessary for rollbacks and A/B tests in production.

Model Testing – Besides general accuracy metrics acquired during model training, testing a packaged model’s general serving ability as well as typical predictions can improve the release/deployment quality.

Scalability – If you invest in building an ML pipeline, you may want to eliminate the need to rebuild it just because it can’t handle increasing volumes of data or the demands of a growing team of data scientists. Make sure that the pipeline is scalable by choosing building blocks that are scalable themselves.

Security – The pipeline should adhere to current security standards regarding disk and transfer encryption as well as access authorization, especially when the training data and the model are considered sensitive.

Monitoring – After a model has been deployed for serving, its usage and performance should be monitored to ensure proper operation, enable dynamic scaling based on load, or even use this data to improve future model versions.
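To make the Model Testing goal from the list above more tangible, here is a minimal sketch of a check that a build step could run against a packaged model. The `predict` function, the test cases, and the tolerance are hypothetical placeholders chosen for illustration; a real model and its critical cases would take their place.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a packaged model: predicts a price from size and rooms.
def predict(features: Sequence[float]) -> float:
    size_sqm, rooms = features
    return 1000.0 * size_sqm + 5000.0 * rooms

# Typical or production-critical inputs with the outputs we expect (assumed values).
TYPICAL_CASES = [
    ((50.0, 2.0), 60000.0),
    ((100.0, 4.0), 120000.0),
]

def failed_cases(model: Callable[[Sequence[float]], float],
                 cases: List[Tuple[Sequence[float], float]],
                 tolerance: float = 1e-6) -> list:
    """Return every case whose prediction deviates from the expected value."""
    failures = []
    for features, expected in cases:
        actual = model(features)
        if abs(actual - expected) > tolerance:
            failures.append((features, expected, actual))
    return failures
```

A build could simply refuse to package the model whenever `failed_cases(predict, TYPICAL_CASES)` returns a non-empty list.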

Designing a pipeline

With our goals in place, we are now able to design a pipeline that satisfies them. In the following, I will assume that we want to deploy our models as servables that expose an API. For other deployment targets (e.g. mobile on-device predictions), the build process would need to be modified accordingly. Also, I will not get into details on data sources for model training, since they are not directly related to model deployment. Here is a tech-agnostic version of the pipeline which highlights model-related data flow using yellow arrows:

Just like we wanted, the pipeline spans all the way from model training experiments to serving deployed models in production. Data scientists are able to experiment using interactive notebooks and write training jobs (1) while their code is being versioned in a source code repository (2). Ideally, they are able to execute their training jobs inside a compute cluster (3) for speeding up demanding tasks like automatic hyperparameter optimization.

All completed training jobs push the trained model along with performance metrics, used hyperparameters, dataset information, and the code revision/commit hash into a separate store (4). This store should make searching and comparing different models as easy as possible. If an organization has distinct scientist and engineer roles, this store could function as a clear interface between them.
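As a rough illustration of such a store, the following sketch keeps each run in its own directory on a file system, with the model file next to a JSON metadata file. The directory layout, the field names, and the functions `push_run` and `best_run` are assumptions made for this example, not the API of any particular product.

```python
import json
from pathlib import Path

def push_run(store_dir: Path, run_id: str, model_bytes: bytes,
             metrics: dict, hyperparameters: dict,
             dataset: str, commit: str) -> Path:
    """Store a trained model next to the metadata needed to compare it later."""
    run_dir = store_dir / run_id
    run_dir.mkdir(parents=True, exist_ok=False)  # never overwrite an existing run
    (run_dir / "model.bin").write_bytes(model_bytes)
    metadata = {
        "run_id": run_id,
        "metrics": metrics,
        "hyperparameters": hyperparameters,
        "dataset": dataset,
        "commit": commit,
    }
    (run_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return run_dir

def best_run(store_dir: Path, metric: str) -> dict:
    """Return the metadata of the run with the highest value for `metric`."""
    runs = [json.loads(path.read_text())
            for path in store_dir.glob("*/metadata.json")]
    return max(runs, key=lambda run: run["metrics"][metric])
```

Because every record carries the commit hash, an engineer who picks the best run can check out exactly the code that produced it.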

Once candidates from the model store are ready to be moved to production, a build server (5) tests the model for typical or production-critical cases and packages it as a deployable and uniquely versioned artifact (e.g. executable binary or container image). The artifact gets stored in an immutable repository/registry (6). Then, the build server may also deploy a model automatically by installing the artifact on one or multiple target machines or triggering containers to be run based on the newly created image (7). The entire build configuration of each model can be placed in the same source code repository as the training code.
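One way to picture "uniquely versioned" and "immutable" is the sketch below. It derives a tag from the model contents and the code revision and stores artifacts in a registry that refuses overwrites. The tag format and the in-memory `ImmutableRegistry` class are hypothetical; a real setup would rely on the guarantees of its actual artifact repository or container registry.

```python
import hashlib

def artifact_tag(model_name: str, model_bytes: bytes, commit: str) -> str:
    """Derive a reproducible tag from the model contents and the code revision."""
    digest = hashlib.sha256(model_bytes + commit.encode("utf-8")).hexdigest()
    return f"{model_name}:{commit[:7]}-{digest[:12]}"

class ImmutableRegistry:
    """In-memory stand-in for an artifact registry that refuses overwrites."""

    def __init__(self) -> None:
        self._artifacts = {}

    def push(self, tag: str, artifact: bytes) -> None:
        if tag in self._artifacts:
            raise ValueError(f"artifact {tag} already exists and cannot be replaced")
        self._artifacts[tag] = artifact

    def pull(self, tag: str) -> bytes:
        return self._artifacts[tag]
```

Since a tag can never point to different contents later on, rolling back means deploying an older tag again.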

By gathering feedback from users of the model, utilizing A/B testing, and collecting monitoring metrics, we can continuously improve the model and manage the lifecycle of individual versions (8).
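For the A/B testing part, a common and simple technique is to assign users to a model version deterministically by hashing a stable identifier. The sketch below assumes string user IDs and two versions named "A" and "B"; both the names and the default 10% share are illustrative choices.

```python
import hashlib

def assign_variant(user_id: str, share_b: float = 0.1) -> str:
    """Deterministically route a user to model version "A" or "B"."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "B" if bucket < share_b else "A"
```

The same user always lands on the same version without any stored state, which keeps the monitoring metrics of both versions comparable.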

If you look back on our goals, we have tackled a lot of them with this pipeline. We are versioning model code, build server configuration and released models. We are also tracking all experiments while still retaining freedom during experimentation. Scalability, automatability, and security hugely depend on the properties and capabilities of chosen technologies. However, scaling models in serving is often a trivial task, since they usually do not contain any runtime state. Running multiple instances of the same model with a load balancer in front is usually enough.
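The statelessness argument can be illustrated with a toy round-robin balancer. In practice this job is done by infrastructure such as a reverse proxy or a Kubernetes service; the class and function names below are made up for this sketch.

```python
import itertools
from typing import Callable, Sequence

def make_instance(name: str) -> Callable[[Sequence[float]], dict]:
    """Create a stateless model instance; the mean stands in for a real model."""
    def predict(features: Sequence[float]) -> dict:
        return {"served_by": name, "prediction": sum(features) / len(features)}
    return predict

class RoundRobinBalancer:
    """Forward each request to the next instance in turn."""

    def __init__(self, instances: Sequence[Callable[[Sequence[float]], dict]]) -> None:
        if not instances:
            raise ValueError("at least one instance is required")
        self._cycle = itertools.cycle(instances)

    def predict(self, features: Sequence[float]) -> dict:
        instance = next(self._cycle)
        return instance(features)
```

Because no instance keeps state between requests, every instance returns the same prediction for the same input, and adding capacity only means adding instances.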

Choosing the right building blocks

After all these generic and theoretical plans, let’s talk tech. The next step is to cover every aspect and step of this pipeline using technical solutions. This is no easy task since they have to scale with your current and future needs, interoperate with each other, and fit within your budget. Take your time, evaluate different combinations, and make sure you have a very good understanding of what knowledge already exists in your organization and which technologies your team is motivated to work with.

I will not talk about different solutions to host source code repos or run build servers. However, you can see valid choices for that in my example later on. Let’s first focus on the most popular ML-specific components.

Open Source platforms

Platform solutions span multiple stages of the ML pipeline, which results in good coherence, but each one dictates a certain way to work.

MLflow	Offers an experiment store, model serving, and supports Python-based training using all major frameworks. Makes heavy use of Anaconda. Worth a look if you are using Python and a variety of frameworks. In my opinion the most versatile open source platform solution with a very good UI to browse experiment data.
KubeFlow	Offers Jupyter notebooks, training, hyperparameter tuning, experiment store, and model serving – all hosted on Kubernetes. Focused on, but not limited to TensorFlow. Worth a look if you are using TensorFlow and Kubernetes.
Apache PredictionIO	Offers training, hyperparameter tuning, experiment store, and model serving. Works with Spark-based models. Includes an “event server” for collecting events as training data. Worth a look if you use Spark and find the concept of an included event server a good fit for your use cases.

Cloud platforms

Every major cloud provider offers a managed ML platform among their services that spans from notebooks to serving models. Using them can be beneficial when most training data already lies in the cloud or training the models requires a lot of computing power while the organization lacks suitable resources.

Google ML Engine	Offers TPUs, which can train complex models even faster than GPUs. Lacks broad framework support (full support only for TensorFlow and scikit-learn) and a comfortable experiment store. With the “AutoML” products, models can be trained for certain problems without requiring ML knowledge or programming.
Amazon SageMaker	In my opinion the most versatile cloud solution with a wide variety of supported frameworks. Additional services like “Ground Truth” for labeling datasets and pre-built algorithms.
Azure ML Service	Framework support and workflow comparable to Amazon’s SageMaker. “Machine Learning Studio” enables you to build models using a graphical editor.

Fully custom and Open Source

Of course there is the option to build a completely custom solution. This can be beneficial if platform solutions are too much overhead or not flexible enough. There may also be a lot of open-source expertise or affinity in the organization.

scikit-learn	A lightweight and easy-to-use library containing a variety of different ML algorithms, metric evaluations and visualizations.
TensorFlow + TFX	The most popular framework for neural networks while also supporting other algorithms and custom compute graphs. Can operate in a highly distributed setting. TFX adds functionality for production use, such as a generic model server and consistent feature preprocessing during training and serving.
Keras	A library that offers a high-level, user-friendly API for creating neural networks. Needs either TensorFlow, Theano or CNTK under the hood as its “backend”. Keras will be the official standard high-level API for TensorFlow 2.0+ and comes already packaged with it.
PyTorch	A distributed framework for neural networks that has capabilities similar to TensorFlow’s, but is slightly easier to use and learn. Because it is newer and less popular, its ecosystem and resources are not as extensive.
Apache Spark	A JVM-based (Scala/Java) big data processing framework with a Python API and a collection of ML algorithms (MLlib). Can be executed locally, in a Spark cluster or on Hadoop. Has especially powerful preprocessing capabilities.
Sacred	A framework-independent tool for storing experiment metrics. Typically uses MongoDB as data store. The web UI Omniboard makes stored experiments browsable and searchable.
Ray Tune	A distributed hyperparameter tuning library with a variety of different optimization techniques. Does not depend on a specific ML framework.
hyperopt	Lightweight, framework-independent hyperparameter tuning library.

The list above is by no means complete, but it surely contains the popular choices in the respective categories and is a good way to begin evaluating. Of course, writing components on your own is always an option: I’ve seen multiple times that organizations have built their own GUI to abstract away complexity.

Example: Open Source ML pipeline using Python

Here is a possible pipeline that consists entirely of either free or open source components and settles on Python as the common denominator for all models.

Each model project has its own repository in a self-hosted GitLab. Since we decided that all models will be created using Python, we have a convention that the repository has to contain a Pipfile (created using pipenv) describing the desired Python environment. Our data scientists work on the model code by experimenting using Jupyter notebooks and running training jobs locally. The training jobs use Sacred to write information about each training run (Git hash, parameters, metrics, etc.) into a MongoDB database. The resulting model gets transferred to a central Ceph file store and referenced in Sacred’s run information. The experiments can be searched and compared via the web interface of a hosted Omniboard instance. Remote execution highly depends on the framework, but we have a GPU-enabled Kubernetes cluster that can be used for accelerated and distributed training.

Our data engineers configure build pipelines in our Jenkins-based build environment that create Docker images from a given model of our model store. The image wraps the model and exposes it via an API. The build pipelines are usually different for each model, so their files and all files required for building the Docker image are checked into the code repository as well. Jenkins will also test the image, push it to a self-hosted Docker registry and deploy it in our production Kubernetes cluster.
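What "wraps the model and exposes it via an API" could look like inside such an image is sketched below, using only Python's standard library. The `predict` placeholder, the JSON request shape, and port 8080 are assumptions for this example; a real image would load the model from the model store and might use a dedicated web framework instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Hypothetical placeholder for the wrapped model."""
    return sum(features)

def handle_predict(body: bytes):
    """Turn a JSON request body into an HTTP status code and a JSON response body."""
    try:
        features = json.loads(body)["features"]
        result = {"prediction": predict(features)}
        status = 200
    except (ValueError, KeyError, TypeError):
        result = {"error": "expected a JSON object with a 'features' list"}
        status = 400
    return status, json.dumps(result).encode("utf-8")

class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port: int = 8080) -> None:
    """Start the server; this would be the container's entry point."""
    HTTPServer(("0.0.0.0", port), PredictionHandler).serve_forever()
```

Keeping the request handling in a plain function such as `handle_predict` lets the build pipeline test the image's behavior without opening a network socket.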

Key takeaways

We have seen that building an ML pipeline is not an easy task. There are countless building blocks to choose from in a field that is constantly changing. Hopefully, I was able to give you a good overview and a foundation to start planning a pipeline for your organization. I want to leave you with some key takeaways:

Involve as many people as possible in the planning phase – both scientists and engineers – since their adoption of the pipeline is crucial. Existing expertise could be a factor when choosing technologies. Also, find out which technologies your team is motivated to work with.
Make sure that your pipelines and the components involved are scalable enough to handle your organization’s ML demands for the foreseeable future.
A well-crafted ML pipeline enables fast iterations on models and brings them into production. This can be a huge advantage if you have the need for fast release cycles and the amount of data and feedback to support it.
One particular technical challenge often faced and highlighted is how to ensure that the same data transformations that are applied during training are also applied to prediction input. This includes constants that were computed during training (e.g. for normalization). Consider this early on in your planning phase. If you are using TensorFlow, for example, take a look at TensorFlow Transform, which addresses this challenge.
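The last takeaway can be illustrated in a framework-independent way: compute the normalization constants once during training, store them together with the model, and reuse exactly those constants at prediction time. The function names and the JSON format below are assumptions for this sketch and are unrelated to how TensorFlow Transform works internally.

```python
import json
import statistics

def fit_normalizer(values):
    """Compute normalization constants from the training data only."""
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}

def normalize(value, constants):
    """Apply the stored constants; used identically in training and serving."""
    return (value - constants["mean"]) / constants["stdev"]

# Training time: fit the constants and serialize them alongside the model.
training_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
serialized = json.dumps(fit_normalizer(training_values))

# Serving time: restore the constants instead of recomputing them on new data.
serving_constants = json.loads(serialized)
normalized_input = normalize(9.0, serving_constants)
```

Recomputing the mean and standard deviation on incoming prediction data would silently shift the inputs the model sees, which is exactly the mismatch this takeaway warns about.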

More on the topic

Thank you for staying with me all the way! Stay tuned for more blog articles of this series, where we will dive deeper into technical details by showing hands-on examples. If you are interested in German e-learning content on machine learning, make sure to visit codecentric.ai.

Blog author

Roman Seyffarth

Senior Software Engineer
