In the machine learning space, it was long believed that sharing trained models or weights was safe, in the sense that the input data could not be extracted from them. Researchers have repeatedly challenged this belief over the years. Today, numerous model attacks can jeopardize the training data, from membership inference attacks to property inference attacks (Stock et al.). While we won't dive into the specifics of these attacks, the following example illustrates how models can memorize data.
As you can see from the image (Ziegler et al. 13), the researchers were able to reconstruct members of the input dataset. This was done by “applying reconstruction attacks to the local model updates from individual clients” (Ziegler et al. 1).
If you have not read the first part of this blog series, where I explain what federated learning is in an industrial context, here is the link to part 1.
What can we do?
There are many mitigation techniques that reduce the effectiveness of such attacks. In practice, a combination of them is usually used to maximize privacy while maintaining feasible training time and accuracy. The available techniques fall into two broad categories: cryptography-based techniques, like homomorphic encryption and secure aggregation, and non-cryptographic methods, such as differential privacy and data anonymization.
It's worth pointing out that cryptography-based techniques are the focus of extensive research, and improvements are being made constantly. Let's have a look at what they are.
Secure aggregation is essentially a protocol that hides the origin of the model parameters and thus protects the individual clients. Its importance is clear from the attacks described above: parameters can reveal sensitive information about a client, so hiding their origin increases privacy and reduces the attack surface. The available protocols include, but are not limited to, SecAgg, SecAgg+, LightSecAgg, and FastSecAgg. Each protocol has its own pros and cons, the discussion of which is beyond the scope of this article.
Secure aggregation can be achieved by shuffling the updates among the clients so that, at aggregation time, the server does not know where each update originated. This shuffling uses mathematical methods to ensure that the aggregate can still be reconstructed once the updates reach the server. Another way to implement secure aggregation is to use a trusted execution environment as a secure worker, where only the secure worker sees the origin of the updates and the server sees only the final result. This approach has its own pros and cons, but the rest of this section focuses on the first technique, since it is the most widespread.
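To make the "mathematical methods" more concrete, here is a minimal sketch of pairwise additive masking, the core trick behind SecAgg-style protocols. This is a toy illustration, not a faithful implementation of any protocol named above: real protocols derive the masks via key agreement between clients and include machinery for handling dropouts.

```python
import random

def pairwise_masks(num_clients, dim, seed=0):
    # each pair of clients (i, j) agrees on a shared random mask vector;
    # client i will ADD it and client j will SUBTRACT it
    rng = random.Random(seed)
    return {
        (i, j): [rng.uniform(-1, 1) for _ in range(dim)]
        for i in range(num_clients)
        for j in range(i + 1, num_clients)
    }

def masked_update(client_id, update, masks, num_clients):
    # the client blinds its real update before sending it to the server
    out = list(update)
    for j in range(num_clients):
        if j == client_id:
            continue
        a, b = min(client_id, j), max(client_id, j)
        sign = 1 if client_id == a else -1
        for k in range(len(out)):
            out[k] += sign * masks[(a, b)][k]
    return out

# three clients, two model parameters each
updates = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
masks = pairwise_masks(num_clients=3, dim=2)
masked = [masked_update(i, u, masks, 3) for i, u in enumerate(updates)]

# the server only sees blinded vectors; summing them cancels every
# +mask/-mask pair, leaving exactly the aggregate of the real updates
aggregate = [sum(col) for col in zip(*masked)]
```

Each individual masked vector looks like random noise to the server, yet their sum equals the sum of the true updates, which is all the server needs for federated averaging.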
One of the biggest challenges with secure aggregation is client dropouts. First, let's understand why dropouts are a problem at all and what consequences arise from them. Afterwards, we will take a quick look at why secure aggregation over multiple rounds is not very feasible with the traditional secure aggregation protocols we have.
It is often assumed that clients have perfect availability, which is not always the case. Energy shortages, lack of internet access, and device maintenance are just a few of the reasons why clients might not always be available and drop out. In these situations, we cannot reach the client and cannot perform any training on it. As a result, the client is also unable to share the parameters it received from other clients during secure aggregation.
Multi-Round Secure Aggregation: Who cooked what?
Let's start off with the bad news: most secure aggregation protocols are not secure over multiple rounds. This problem is specific to the client-shuffling approach, but it is a big one nevertheless. Most secure aggregation algorithms only consider how to shuffle the weights securely and hide their origin in a single round. The idea is that if we can get one round right, then in a multi-round context the process will just be repeated. Unfortunately, in a multi-round context, client dropouts have to be considered too. A simple analogy is cooking. Imagine that you, the taster, are the server, and there are three people (clients) who each make a different pasta sauce: pesto, bolognese, and tomato. In the first round, you sample all the sauces, not knowing who made which one. In the following round, the person who cooked the pesto is sick and cannot present their sauce. The other individuals tweak their sauces slightly before presenting them again. Even though you don't know who cooked which sauce, you now know that pesto is missing and can identify who was absent (a client dropout). This allows you to infer a correlation between a client and their updates. This is the conundrum that many secure aggregation algorithms face. While emerging algorithms aim to address this issue, as of this writing none of them has been adopted by the major frameworks, according to my research.
Stages of encryption
Encryption can occur in three distinct stages: at-rest, in-transit, and in-use. We use the first two heavily in our lives, sometimes without even knowing it.
At-rest encryption refers to a file that, once encrypted, lies on disk until it is decrypted for use. In-transit encryption protects data while two devices are communicating: for example, when you visit a website, TLS ensures that your connection stays secure in transit and no one can intercept the traffic. In-use encryption describes the concept of performing calculations on encrypted data. Until quite recently, in-use methods were not practical, and you will see why in a second.
The role of Homomorphic encryption in Federated learning
Homomorphic encryption allows the entire aggregation process to take place on encrypted data. The server/aggregator never sees the model updates, nor, unlike in secure aggregation, the plaintext global model. The clients send their encrypted model updates to the server, the server aggregates the weights, and it sends the global model, again encrypted, back to the clients. In the standard homomorphic encryption protocol, the clients share a private key, and the public key is shared with the aggregator. The diagram below (author's work) demonstrates how the process takes place.
The specific implementation details are beyond the scope of this article but I encourage you to check out NVFlare’s implementation if you are looking for inspiration. (https://developer.nvidia.com/flare)
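To build some intuition nevertheless, the additive homomorphism at the heart of such schemes can be sketched with a toy Paillier cryptosystem. The key sizes below are far too small to be secure, and the float-to-integer scaling is a simplification; this is for illustration only.

```python
import math
import random

# --- toy Paillier keypair (tiny primes for illustration only;
# --- real deployments need an n of at least 2048 bits) ---
p, q = 251, 257
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)   # private
mu = pow(lam, -1, n)           # private (valid because we use g = n + 1)

def encrypt(m):
    # c = (n+1)^m * r^n  mod n^2, with r random and coprime to n
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    # m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) / n
    return (pow(c, lam, n2) - 1) // n * mu % n

# clients encrypt their integer-scaled model updates
updates = [0.12, 0.07, 0.31]                  # toy weight deltas
cts = [encrypt(round(u * 1000)) for u in updates]

# the aggregator MULTIPLIES ciphertexts, which ADDS the plaintexts,
# without ever seeing an individual update
agg_ct = 1
for c in cts:
    agg_ct = agg_ct * c % n2

total = decrypt(agg_ct) / 1000                # -> 0.5
```

Because multiplying ciphertexts adds the underlying plaintexts, the aggregator can compute the sum of the updates while only ever handling encrypted values; only the key-holding clients can decrypt the result.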
Limitations of Homomorphic encryption:
Homomorphic encryption is computationally expensive and requires significantly more storage space. Over the years it has become more practical, but it is still resource-intensive. Its drawbacks go beyond complexity and storage, however: there are also technical limitations and security considerations that have to be taken into account.
Data anonymization techniques:
Anonymized data is never completely anonymous. Research has repeatedly shown that anonymized datasets can be deanonymized with the help of some basic information about a person. In 2019, for example, researchers from Imperial College London and Belgium's Université catholique de Louvain (UCLouvain) built a model that allowed them to correctly re-identify 99.98% of Americans in any dataset using 15 demographic attributes (Rocher et al.). This example and others, such as the 2008 Netflix dataset case (Narayanan and Shmatikov), demonstrate that we cannot fully trust anonymization methods. At this point, I'd encourage you to take the “How unique am I?” test on AboutMyInfo.org to get a sense of how easy it is to deanonymize data and identify an individual in a group.
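To get an intuition for why a handful of attributes suffices, here is a toy version of the uniqueness measurement such studies perform. The records and attributes below are invented for illustration; real studies work with far larger datasets and more attributes.

```python
from collections import Counter

# toy "anonymized" records: (zip prefix, birth year, gender) -- no names
records = [
    ("101", 1985, "F"), ("101", 1985, "F"),
    ("102", 1990, "M"), ("103", 1972, "F"),
    ("104", 1990, "M"), ("105", 2001, "M"),
]

counts = Counter(records)
unique = [r for r in records if counts[r] == 1]
print(f"{len(unique)}/{len(records)} records are unique on "
      "just three attributes")
```

Every unique record can potentially be linked to an identified outside dataset (a voter roll, a social profile) that shares those attributes, which is exactly the kind of linkage attack the studies above performed.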
But, what if we could know how much privacy would be lost in the worst-case scenario?
Differential privacy is all about taking calculated risks. It allows us to quantify, via the privacy budget ε, the maximum amount of privacy lost when a differentially private algorithm is used. We then add noise to the data and stop when we find a reasonable balance between the privacy loss and the useful information provided to the model. The focus here is on the overall distribution of the data rather than on individual data points. Adding noise will almost always reduce the accuracy of the model, but some accuracy loss is usually tolerable, especially considering the benefits of increased privacy. Another benefit of adding noise is that it can help avoid overfitting, since the influence of any single data point is reduced. This is desirable because we are generally interested in robust models that learn the distribution and generalize well.
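As a minimal sketch of the idea, the classic Laplace mechanism adds noise with scale sensitivity/ε, so a smaller privacy budget means more noise. The helper below is illustrative only, not a production DP library.

```python
import random

def laplace_noise(scale, rng):
    # the difference of two iid exponentials is Laplace(0, scale)
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_release(value, sensitivity, epsilon, rng=None):
    # epsilon-DP release via the Laplace mechanism:
    # noise scale = sensitivity / epsilon, so smaller epsilon -> more noise
    rng = rng or random.Random()
    return value + laplace_noise(sensitivity / epsilon, rng)

# releasing a count query: adding or removing one person changes
# the count by at most 1, so the sensitivity is 1
rng = random.Random(7)
true_count = 100
noisy = dp_release(true_count, sensitivity=1, epsilon=0.5, rng=rng)
```

The released value is close to the true count on average, but any single individual's presence or absence is statistically hidden by the noise.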
Differential privacy only adds noise; it does not protect the entire training set. It can also come at the expense of model accuracy, especially for underrepresented subgroups. If such concerns are relevant, cryptography-based techniques might be better suited for your use case.
As you can see, federated learning is a paradigm shift in how we handle data. It is not a cutting-edge approach that exists only in research papers: while it is still being extensively researched and pushed to its limits, it is already usable today, with a large worldwide community of practitioners and researchers and frameworks that make it easy to adopt.
As with any emerging technology, navigating the landscape requires a mix of technical understanding and careful risk management. There is a delicate balance between security, privacy, and the usability of the data. In an industrial context, questions such as the sensitivity of the data, the availability of computational resources, and your organization's threat model are especially relevant. It is also important to point out that without traditional security measures in place, these advanced dataset-protection techniques are ineffective.
I hope the introduction was clear and you're now ready to unlock the potential of your data. If you have any questions or suggestions, don't hesitate to reach out!
- Narayanan, Arvind, and Vitaly Shmatikov. "Robust De-anonymization of Large Sparse Datasets." 2008 IEEE Symposium on Security and Privacy (SP 2008), IEEE, 2008.
- Stock, Joshua, et al. "Studie zur fachlichen Einschätzung und Prüfung des Potenzials von Federated-Learning-Algorithmen in der amtlichen Statistik." 2023.
- Rocher, Luc, Julien Hendrickx, and Yves-Alexandre de Montjoye. "Estimating the Success of Re-identifications in Incomplete Datasets Using Generative Models." Nature Communications, vol. 10, 2019, doi:10.1038/s41467-019-10933-3.
- Ziegler, Joceline, et al. "Defending Against Reconstruction Attacks Through Differentially Private Federated Learning for Classification of Heterogeneous Chest X-ray Data." Sensors, vol. 22, no. 14, 2022, p. 5195.