In the machine learning space, it was long believed that sharing trained models or weights was safe, in the sense that the input data could not be extracted from them. Researchers have repeatedly challenged this belief over the years. Today, numerous model attacks can jeopardize the training data, from membership inference attacks to property inference attacks (Stock et al.). While we won't dive into the specifics of these attacks, the following example illustrates how models can memorize data.
As you can see from the image (Ziegler et al. 13), the researchers were able to reconstruct members of the input dataset. This was done by “applying reconstruction attacks to the local model updates from individual clients” (Ziegler et al. 1).
If you have not read the first part of this blog series, where I explain what federated learning is in an industrial context, here is the link to part 1.
What can we do?
There are many mitigation techniques that reduce the effectiveness of such attacks. In practice, a combination of them is usually used to maximize privacy while maintaining feasible training time and accuracy. The available techniques fall into two broad categories: cryptography-based techniques, like homomorphic encryption and secure aggregation, and non-cryptographic methods, such as differential privacy and data anonymization.
It's worth pointing out that cryptography-based techniques are the focus of extensive research, and improvements are being made constantly. Let's have a look at what they are.
Secure aggregation is essentially a protocol that hides the origin of the model parameters and thus protects the individual clients. Its importance is clear from the attacks described above: parameters can reveal sensitive information about a client, so hiding their origin increases privacy and reduces the attack surface. The available protocols include, but are not limited to, SecAgg, SecAgg+, LightSecAgg, and FastSecAgg. Each protocol has its own pros and cons, the discussion of which is beyond the scope of this article.
Secure aggregation can be achieved by shuffling the updates among the clients so that, at aggregation time, the server does not know where each update originated. This shuffling uses mathematical methods to ensure that the aggregate can still be reconstructed once the updates reach the server. Another way to implement secure aggregation is to use a trusted execution environment as a secure worker, where only the secure worker sees the origin of the updates and the server sees only the final result. This approach has its own pros and cons, but the rest of this section focuses on the first technique, since it is the most widespread.
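To make the "mathematical methods" more concrete, here is a minimal sketch of pairwise additive masking, the core trick behind SecAgg-style protocols. This is a toy illustration, not a faithful implementation of any protocol named above: real protocols derive the masks via key agreement between clients and include machinery for handling dropouts.

```python
import random

def pairwise_masks(num_clients, dim, seed=0):
    # each pair of clients (i, j) agrees on a shared random mask vector;
    # client i will ADD it and client j will SUBTRACT it
    rng = random.Random(seed)
    return {
        (i, j): [rng.uniform(-1, 1) for _ in range(dim)]
        for i in range(num_clients)
        for j in range(i + 1, num_clients)
    }

def masked_update(client_id, update, masks, num_clients):
    # the client blinds its real update before sending it to the server
    out = list(update)
    for j in range(num_clients):
        if j == client_id:
            continue
        a, b = min(client_id, j), max(client_id, j)
        sign = 1 if client_id == a else -1
        for k in range(len(out)):
            out[k] += sign * masks[(a, b)][k]
    return out

# three clients, two model parameters each
updates = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
masks = pairwise_masks(num_clients=3, dim=2)
masked = [masked_update(i, u, masks, 3) for i, u in enumerate(updates)]

# the server only sees blinded vectors; summing them cancels every
# +mask/-mask pair, leaving exactly the aggregate of the real updates
aggregate = [sum(col) for col in zip(*masked)]
```

Each individual masked vector looks like random noise to the server, yet their sum equals the sum of the true updates, which is all the server needs for federated averaging.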
One of the biggest challenges with secure aggregation is client dropouts. First, let's understand why dropouts are a problem at all and what consequences arise from them. Afterwards, we will take a quick look at why secure aggregation over multiple rounds is not very feasible with the traditional secure aggregation protocols we have.
It is often assumed that clients have perfect availability, which is not always the case. Energy shortages, lack of internet access, and device maintenance are just a few of the reasons why clients might not always be available and drop out. In these situations, we cannot reach the client and cannot perform any training on it. As a result, the client is also unable to share the parameters it received from other clients during secure aggregation.
Multi-Round Secure Aggregation: Who cooked what?
Let's start off with the bad news: most secure aggregation protocols are not secure over multiple rounds. This problem is specific to the client-shuffling approach, but it is a big one nevertheless. Most secure aggregation algorithms only consider how to shuffle the weights securely and hide their origin in a single round. The idea is that if we can get one round right, then in a multi-round context the process will just be repeated. Unfortunately, in a multi-round context, client dropouts have to be considered too. A simple analogy is cooking. Imagine that you, the taster, are the server, and there are three people (clients) who each make a different pasta sauce: pesto, bolognese, and tomato. In the first round, you sample all the sauces, not knowing who made which one. In the following round, the person who cooked the pesto is sick and cannot present their sauce. The other individuals tweak their sauces slightly before presenting them again. Even though you don't know who cooked which sauce, you now know that pesto is missing and can identify who was absent (a client dropout). This allows you to infer a correlation between a client and their updates. This is the conundrum that many secure aggregation algorithms face. While emerging algorithms aim to address this issue, as of this writing none of them has been adopted by the major frameworks, according to my research.
Stages of encryption
Encryption can occur in three distinct stages: at-rest, in-transit, and in-use. We use the first two heavily in our lives, sometimes without even knowing it.
At-rest encryption refers to a file that, once encrypted, lies on disk until it is decrypted for use. In-transit encryption protects data while two devices are communicating: for example, when you visit a website, TLS ensures that your connection stays secure in transit and no one can intercept the traffic. In-use encryption describes the concept of performing calculations on encrypted data. Until quite recently, in-use methods were not practical, and you will see why in a second.
The role of Homomorphic encryption in Federated learning
Homomorphic encryption allows the entire aggregation process to take place on encrypted data. The server/aggregator never sees the model updates, nor, unlike in secure aggregation, the plaintext global model. The clients send their encrypted model updates to the server, the server aggregates the weights, and it sends the global model, again encrypted, back to the clients. In the standard homomorphic encryption protocol, the clients share a private key, and the public key is shared with the aggregator. The diagram below (author's work) demonstrates how the process takes place.
The specific implementation details are beyond the scope of this article but I encourage you to check out NVFlare’s implementation if you are looking for inspiration. (https://developer.nvidia.com/flare)
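To build some intuition nevertheless, the additive homomorphism at the heart of such schemes can be sketched with a toy Paillier cryptosystem. The key sizes below are far too small to be secure, and the float-to-integer scaling is a simplification; this is for illustration only.

```python
import math
import random

# --- toy Paillier keypair (tiny primes for illustration only;
# --- real deployments need an n of at least 2048 bits) ---
p, q = 251, 257
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)   # private
mu = pow(lam, -1, n)           # private (valid because we use g = n + 1)

def encrypt(m):
    # c = (n+1)^m * r^n  mod n^2, with r random and coprime to n
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    # m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) / n
    return (pow(c, lam, n2) - 1) // n * mu % n

# clients encrypt their integer-scaled model updates
updates = [0.12, 0.07, 0.31]                  # toy weight deltas
cts = [encrypt(round(u * 1000)) for u in updates]

# the aggregator MULTIPLIES ciphertexts, which ADDS the plaintexts,
# without ever seeing an individual update
agg_ct = 1
for c in cts:
    agg_ct = agg_ct * c % n2

total = decrypt(agg_ct) / 1000                # -> 0.5
```

Because multiplying ciphertexts adds the underlying plaintexts, the aggregator can compute the sum of the updates while only ever handling encrypted values; only the key-holding clients can decrypt the result.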
Limitations of Homomorphic encryption:
Homomorphic encryption is computationally expensive and requires significantly more storage space. Over the years it has become more practical, but it is still resource-intensive. Its drawbacks go beyond complexity and storage, however: there are also technical limitations and security considerations that have to be taken into account.
Data anonymization techniques:
Anonymized data is never completely anonymous. Research has repeatedly shown that anonymized datasets can be deanonymized with the help of some basic information about a person. In 2019, for example, researchers from Imperial College London and Belgium's Université catholique de Louvain (UCLouvain) built a model that allowed them to correctly re-identify 99.98% of Americans in any dataset using 15 demographic attributes (Rocher et al.). This example and others, such as the 2008 Netflix dataset case (Narayanan and Shmatikov), demonstrate that we cannot fully trust anonymization methods. At this point, I'd encourage you to take the “How unique am I?” test on AboutMyInfo.org to get a sense of how easy it is to deanonymize data and identify an individual in a group.
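To get an intuition for why a handful of attributes suffices, here is a toy version of the uniqueness measurement such studies perform. The records and attributes below are invented for illustration; real studies work with far larger datasets and more attributes.

```python
from collections import Counter

# toy "anonymized" records: (zip prefix, birth year, gender) -- no names
records = [
    ("101", 1985, "F"), ("101", 1985, "F"),
    ("102", 1990, "M"), ("103", 1972, "F"),
    ("104", 1990, "M"), ("105", 2001, "M"),
]

counts = Counter(records)
unique = [r for r in records if counts[r] == 1]
print(f"{len(unique)}/{len(records)} records are unique on "
      "just three attributes")
```

Every unique record can potentially be linked to an identified outside dataset (a voter roll, a social profile) that shares those attributes, which is exactly the kind of linkage attack the studies above performed.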
But, what if we could know how much privacy would be lost in the worst-case scenario?
Differential privacy is all about taking calculated risks. It allows us to quantify, via the privacy budget ε, the maximum amount of privacy lost when a differentially private algorithm is used. We then add noise to the data and stop when we find a reasonable balance between the privacy loss and the useful information provided to the model. The focus here is on the overall distribution of the data rather than on individual data points. Adding noise will almost always reduce the accuracy of the model, but some accuracy loss is usually tolerable, especially considering the benefits of increased privacy. Another benefit of adding noise is that it can help avoid overfitting, since the influence of any single data point is reduced. This is desirable because we are generally interested in robust models that learn the distribution and generalize well.
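As a minimal sketch of the idea, the classic Laplace mechanism adds noise with scale sensitivity/ε, so a smaller privacy budget means more noise. The helper below is illustrative only, not a production DP library.

```python
import random

def laplace_noise(scale, rng):
    # the difference of two iid exponentials is Laplace(0, scale)
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_release(value, sensitivity, epsilon, rng=None):
    # epsilon-DP release via the Laplace mechanism:
    # noise scale = sensitivity / epsilon, so smaller epsilon -> more noise
    rng = rng or random.Random()
    return value + laplace_noise(sensitivity / epsilon, rng)

# releasing a count query: adding or removing one person changes
# the count by at most 1, so the sensitivity is 1
rng = random.Random(7)
true_count = 100
noisy = dp_release(true_count, sensitivity=1, epsilon=0.5, rng=rng)
```

The released value is close to the true count on average, but any single individual's presence or absence is statistically hidden by the noise.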
Differential privacy only adds noise; it does not protect the entire training set. It can also come at the expense of model accuracy, especially for underrepresented subgroups. If such concerns are relevant, cryptography-based techniques might be better suited for your use case.
As you can see, federated learning is a paradigm shift in how we handle data. It is not a cutting-edge approach that exists only in research papers: while it is still being extensively researched and pushed to its limits, it is already usable today, with a large worldwide community of practitioners and researchers and frameworks that make it easy to adopt.
As with any emerging technology, navigating the landscape requires a mix of technical understanding and careful risk management. There is a delicate balance between security, privacy, and the usability of the data. In an industrial context, questions such as the sensitivity of the data, the availability of computational resources, and your organization's threat model are especially relevant. It is also important to point out that without traditional security measures in place, these advanced dataset-protection techniques are ineffective.
I hope the introduction was clear and you're now ready to unlock the potential of your data. If you have any questions or suggestions, don't hesitate to reach out!
- Narayanan, Arvind, and Vitaly Shmatikov. "Robust De-anonymization of Large Sparse Datasets." 2008 IEEE Symposium on Security and Privacy (SP 2008), IEEE, 2008.
- Stock, Joshua, et al. "Studie zur fachlichen Einschätzung und Prüfung des Potenzials von Federated-Learning-Algorithmen in der amtlichen Statistik." 2023.
- Rocher, Luc, Julien Hendrickx, and Yves-Alexandre de Montjoye. "Estimating the Success of Re-identifications in Incomplete Datasets Using Generative Models." Nature Communications, vol. 10, 2019, doi:10.1038/s41467-019-10933-3.
- Ziegler, Joceline, et al. "Defending Against Reconstruction Attacks Through Differentially Private Federated Learning for Classification of Heterogeneous Chest X-ray Data." Sensors, vol. 22, no. 14, 2022, p. 5195.