Thinking AI means re-thinking data

27.5.2020 | 7 minutes reading time

While doing AI is considered sexy and cool, data infrastructure typically is not. However, production-grade machine learning applications rely heavily on proper data infrastructure. Hence, solid data pipelines are required in order to generate actual business value.

A truth about data science projects

Consider the following story: We start a data science project with a time-boxed proof of concept (PoC) and a skillful, enthusiastic project team that generates insights in fast iterations. When our time box ends, we have achieved promising results and get the opportunity to present them to “the decision makers”. Luckily, we ace our presentation and it becomes more and more clear: “this idea is a winner!”. Everyone is feeling great.

However, under the hood, the promising results were achieved on a local machine with data dumps (CSV files) from multiple data sources, carefully hand-crafted by an SQL expert with a special series of magical queries. Working with data dumps is often very useful for fast iterations, but it is clearly also a shortcut.

This shortcut becomes a problem when it is ignored, and an even bigger one when it is not properly communicated to and understood by “the decision makers”. The price for fast insights and results during the PoC was that we did not consider data infrastructure and did not build proper data pipelines. Now that everyone is hyped about the successful presentation and expects business value within “a few days”, we have to face the truth: we took a shortcut, and before we can generate business value, we have to fix it.
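To make the shortcut tangible, here is a minimal sketch of the first step away from a hand-crafted dump: the query lives in code, so the extract can be re-run and reviewed. Everything in it is an assumption made for illustration, including the table names `customers` and `orders`, the query itself and the in-memory SQLite database that stands in for the real data sources.

```python
import csv
import io
import sqlite3

# Hypothetical query: kept in code (and version control) instead of being
# typed by hand each time a new dump is needed.
EXTRACT_REVENUE_SQL = """
SELECT c.customer_id, c.country, SUM(o.amount) AS revenue
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.country
ORDER BY c.customer_id
"""


def extract(connection):
    """Run the versioned query and return the result rows as tuples."""
    return connection.execute(EXTRACT_REVENUE_SQL).fetchall()


def to_csv(rows, header):
    """Serialize rows to CSV text so the dump is reproducible on demand."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# In-memory stand-in for the real source systems (illustrative data only).
connection = sqlite3.connect(":memory:")
connection.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'DE'), (2, 'FR');
INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 80.0);
""")

rows = extract(connection)
dump = to_csv(rows, ["customer_id", "country", "revenue"])
print(dump)
```

This is still far from a production-grade pipeline, but it turns the “magical queries” into something that can be repeated without the one expert who knows them.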

A heavy burden for data science projects

Because short-term success as sketched in our fictional story is possible, the AI hype often drives companies to start data science projects before evaluating the current state of their data infrastructure. This can be useful for validating ideas. However, it becomes a problem if not all stakeholders are aware of the implication: depending on the data infrastructure, building a production-grade machine learning application is more likely than not a challenging next step.

At worst, outdated or missing data infrastructure can lead to the bizarre situation where a data science project team is expected to solve company-wide data infrastructure challenges on the side. On the quest to generate business value after a successful PoC, the project team has to deal with organisational issues that few people in the company are aware of, and hence without much support. This can result in frustration and in a failure to generate business value with the once so promising idea.

Metaphorically speaking, the team is driving a sports car (the shiny new machine learning model) on a forest path (the data infrastructure), and everyone around wonders why it is not running at full speed. This creates the impression that it is simply not possible to generate actual business value with data science projects.

The attitude towards data infrastructure

Historically, setting up a new data infrastructure is associated with great pain, because it typically requires huge investments of time and money. Furthermore, there are various paradigms, e.g. data lakes, data lakehouses and data meshes, each with a different set of benefits and trade-offs to consider. However, improving the data infrastructure is explicitly not just a technical issue. In fact, it is even more important to change the mindset about the significance of data in the company. This change is likely the single most crucial step towards generating sustainable and growing business value with data science projects.

There are (at least) two reasons to rethink data infrastructure:

Faster cycle time for validation

Clearly, starting with a PoC as described above yields first insights into whether or not an idea is promising. However, without proper data infrastructure, it is usually hard and costly to validate an idea end-to-end. This often leads to lengthy discussions about if and when to start, while no valuable lessons about the idea are actually being learned. Proper data infrastructure that allows for fast end-to-end validation and experimentation makes it possible to dismiss unsuccessful ideas faster and to single out the successful ones. The ability to validate ideas with fast end-to-end experiments is particularly crucial for building successful machine learning applications, because there are many moving parts and the application typically affects various parts of the company.

True value is the ability to utilize machine learning algorithms

Typically, there are plenty of ideas on how to use data science to generate business value in a company. More often than not, there are also known machine learning algorithms that others already use to implement the same or similar ideas. Hence, it is reasonable to assume that such a known algorithm would work at least well enough for a first version of the application. However, in many cases it is not knowledge about the algorithm that decides whether business value can be created, but the availability of data and the ability to integrate the algorithm end-to-end into the company infrastructure in a painless way.

(Re-)thinking data

Starting to (re-)think data is not only a technical issue. First of all, it amounts to asking questions like:

What data is available?
What quality does the data have?
Who owns the data?
Who controls access to the data?
Who is responsible for the data and its quality?
What is the current relationship between producers and consumers of data?
How can we evolve and improve this crucial relationship?

Here, the main driver for answering these questions is to find ways to reduce the overall end-to-end cycle time for the development of machine learning applications by improving, for example, the availability, the processing speed and the quality of data. In most cases, we do not recommend solving all data infrastructure problems upfront before starting data science projects. In fact, many issues are typically unknown at this point, and it is very challenging to find suitable solutions without implementing specific applications.
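The question “What quality does the data have?” can be approached in small steps for a single use case. The following minimal sketch counts missing values per column in a CSV dump using only the Python standard library. The sample data, the column names and the helper functions `profile_columns` and `missing_ratio` are hypothetical and only serve as an illustration; real projects would likely check far more than missing values.

```python
import csv
import io


def profile_columns(rows):
    """Count total and missing values per column.

    `rows` is an iterable of dicts as produced by csv.DictReader.
    A value counts as missing when it is None or blank. The sketch
    assumes well-formed rows (no surplus fields).
    """
    report = {}
    for row in rows:
        for column, value in row.items():
            stats = report.setdefault(column, {"total": 0, "missing": 0})
            stats["total"] += 1
            if value is None or value.strip() == "":
                stats["missing"] += 1
    return report


def missing_ratio(report):
    """Share of missing values per column, between 0.0 and 1.0."""
    return {
        column: stats["missing"] / stats["total"]
        for column, stats in report.items()
    }


# Hypothetical data dump, inlined so that the sketch is self-contained.
SAMPLE_DUMP = """customer_id,country,revenue
1,DE,100.0
2,,250.5
3,FR,
4,DE,80.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_DUMP)))
report = profile_columns(rows)
ratios = missing_ratio(report)
for column, ratio in ratios.items():
    print(f"{column}: {ratio:.0%} missing")
```

Running such a check regularly for one use case gives producers and consumers of the data a shared, concrete basis for talking about quality.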

Hence, it is perfectly fine to start rethinking data on a large scale while acting small, use case by use case. For one thing, this allows validating different technical approaches. For another (and this is probably more important), it allows changing the mindset about the significance of data step by step. The focus here is on implementing use cases end-to-end and, in particular, on learning how to build production-grade machine learning applications. At this stage, using cloud infrastructure offers a good way to move quickly without much ramp-up. As soon as several use cases create business value, unifying the data infrastructure might yield additional value.

Summary

While everyone is talking about AI, in the end it takes much more than data science and machine learning algorithms to build production-grade machine learning applications and generate business value. Moreover, sustainable business value with machine learning and AI requires more than clever algorithms – it requires rethinking data.

In fact, many state-of-the-art algorithms are publicly available: the Google search algorithm is public and even its patent has expired, the state-of-the-art model for natural language processing, BERT, is open source, and the network architecture for state-of-the-art image classification, ResNet, is available and implemented in various frameworks like Keras and PyTorch.

However, the major part of the business value is not the algorithm but the capability to employ the algorithm in a production-grade machine learning application, and hence the true value lies in the underlying data (often not open source) and the data infrastructure.

Here, a big-bang attempt to change the data infrastructure, or to start by pouring all available data into a single place, is more often than not an unnecessarily challenging approach. Instead, starting to rethink data use case by use case and building a suitable data infrastructure along the way appears to be much more promising.

If you are interested in some more ideas and our approach to data science projects, see the blog post (German) and the free on-demand webinar (German).

Blog author

Marcel Mikl

Service Lead Data & ML & AI
