Is synthetic data more privacy compliant?
Federated learning, also known as collaborative learning or privacy-preserving machine learning, enables multiple entities that do not (fully) trust each other to collaborate in training a Machine Learning (ML) model on their combined dataset without actually sharing the data. This addresses critical issues such as privacy, access rights, and access to heterogeneous confidential data.
This is in contrast to traditional (centralized) ML techniques, where local datasets (belonging to different entities) first need to be brought to a common location before model training. Its applications span a number of industries, including defense, telecommunications, healthcare, advertising, and chatbots.
ML Attack Scenarios
Let us now focus on the ML related privacy risks [4, 5]. Fig. 1 illustrates the attack scenarios in an ML context. In this setting, there are two broad categories of inference attacks: membership inference and property inference attacks. A membership inference attack refers to a basic privacy violation, where the attacker’s objective is to determine if a specific user data item was present in the training dataset. In property inference attacks, the attacker’s objective is to reconstruct properties of a participant’s dataset.
When the attacker does not have access to the model training parameters, it can only run the model (via an API) to get a prediction/classification. Black-box attacks are still possible in this case: the attacker can invoke/query the model and observe the relationships between inputs and outputs.
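A black-box membership inference attack can be illustrated with a deliberately simplified sketch. The "model" below is hypothetical (it is not taken from any of the cited papers): it overfits so strongly that its confidence is highest exactly on its training points, and the attacker only calls it as a query function. The names `make_model`, `membership_guess` and the threshold value are assumptions made for illustration.

```python
import math
import random

def make_model(train_points):
    """Hypothetical overfit model: confidence decays with the distance
    between the query and the nearest training point."""
    def predict_confidence(x):
        nearest = min(abs(x - t) for t in train_points)
        return math.exp(-nearest)
    return predict_confidence

def membership_guess(query_model, x, threshold=0.99):
    """Black-box attack: only the model output is observed.
    Guess 'member' when the returned confidence is suspiciously high."""
    return query_model(x) >= threshold

random.seed(0)
train = [random.uniform(0, 100) for _ in range(20)]      # records used for training
outsiders = [random.uniform(0, 100) for _ in range(20)]  # records never seen in training
model = make_model(train)

member_hits = sum(membership_guess(model, x) for x in train)
outsider_hits = sum(membership_guess(model, x) for x in outsiders)
print("flagged as members:", member_hits, "of", len(train), "training records")
print("flagged as members:", outsider_hits, "of", len(outsiders), "outside records")
```

Real models leak far less cleanly than this toy, but the principle is the same: a systematic gap between the outputs for training records and for unseen records is what the attacker exploits.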
Secure Multiparty Computation (SMC)
Given the above attack scenarios, let us now focus on the applicable privacy preserving approaches.
Federated Learning builds on a large body of existing research in the field of Secure Multiparty Computation (SMC). SMC allows a number of mutually distrustful parties to carry out a joint computation of a function of their inputs, while preserving the privacy of the inputs. The two main SMC primitives are Homomorphic Encryption and Secret Sharing. Both schemes have their pros and cons when it comes to securely computing basic arithmetic operations, such as addition and multiplication.
Homomorphic encryption schemes allow arithmetic operations on the plaintext values to be performed locally by operating on their encrypted values. In secret sharing schemes, on the other hand, addition can be performed locally by adding the local (plaintext) shares, while multiplication requires distributed collaboration among the parties.
It is difficult to compare the performance of protocols based on the two schemes theoretically. As one empirical data point, Kerschbaum, Biswas and de Hoogh provide a performance comparison of the two schemes for a secure comparison protocol.
Let E() and D() denote encryption and decryption, respectively, in the homomorphic encryption system. We require the homomorphic property to allow (modular) addition of the plaintexts. It then holds that
$$D(E(x) \cdot E(y)) = x + y$$
From which by simple arithmetic it follows that
$$D(E(x)^{y}) = x \cdot y$$
The homomorphic encryption system is public-key, i.e. any party can perform the encryption operation E() by itself. In a threshold encryption system, the decryption key is replaced by a distributed protocol. Let m be the number of parties and t ≤ m the threshold. A ciphertext can be decrypted only if t or more parties collaborate; no coalition of fewer than t parties can decrypt it.
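The additive homomorphic property can be checked on a toy example. The sketch below uses the Paillier cryptosystem, which is one well-known additively homomorphic public-key scheme; the text above does not name a specific scheme, so this choice is an assumption. The primes are far too small to be secure, the function names are illustrative, and threshold decryption is not implemented. It needs Python 3.8+ for the modular inverse via `pow`.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier key generation (tiny primes, for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                               # inverse of lam modulo n
    return n, (lam, mu)

def encrypt(n, m, rng=random):
    """E(m) = (1 + n)^m * r^n mod n^2, using only the public key n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """D(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) // n."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

n, priv = keygen()
x, y = 42, 17
cx, cy = encrypt(n, x), encrypt(n, y)

# Multiplying ciphertexts adds the plaintexts (modulo n).
sum_plain = decrypt(n, priv, (cx * cy) % (n * n))
# Raising a ciphertext to a known constant multiplies the plaintext by it.
prod_plain = decrypt(n, priv, pow(cx, y, n * n))
print(sum_plain, prod_plain)
```

Both operations are performed on ciphertexts only; the private key is needed just once, to read the final result.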
Secret sharing refers to a method for distributing a secret amongst a group of parties, each of which is allocated a share of the secret. The secret can be reconstructed only when the shares are combined together (individual shares are of no use on their own). In Shamir’s secret sharing scheme, the sharing of a secret x is achieved as follows: each party X_i holds a value $f(i)$, where f is a random degree-t polynomial subject to the condition that $f(0) = x$.
It is easy to extend Shamir secret sharing to let the parties compute any linear combination of secrets without gaining information on intermediate results of the computation. To add (subtract) two shared secrets together, the players need only add (subtract) together individual shares at each evaluation point. Computing the product of two secrets is not so trivial, but it is still possible to reduce it to a linear computation. Thus, it is possible to compute any “arithmetic” function (i.e., function involving only addition, subtraction, and multiplication) of secrets securely and robustly.
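The sharing and the local addition of shares can be sketched as follows. This is a minimal illustration over a prime field, assuming a degree-t polynomial as described above, so that t + 1 shares are needed for reconstruction. The field modulus and the function names are choices made for this example, and secure multiplication is omitted. It needs Python 3.8+ for the modular inverse via `pow`.

```python
import random

PRIME = 2_147_483_647  # field modulus (2^31 - 1, a Mersenne prime)

def share(secret, t, m, rng=random):
    """Split `secret` into m shares (i, f(i)) using a random
    degree-t polynomial f over the field with f(0) = secret."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(t)]
    def f(i):
        return sum(c * pow(i, k, PRIME) for k, c in enumerate(coeffs)) % PRIME
    return [(i, f(i)) for i in range(1, m + 1)]

def reconstruct(shares):
    """Recover f(0) by Lagrange interpolation from at least t + 1 shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

t, m = 2, 5
shares_a = share(1234, t, m)
shares_b = share(4321, t, m)

# Each party adds its two shares locally; no communication is needed.
shares_sum = [(i, (ya + yb) % PRIME)
              for (i, ya), (_, yb) in zip(shares_a, shares_b)]

print(reconstruct(shares_a[:t + 1]))    # any t + 1 shares recover the first secret
print(reconstruct(shares_sum[1:t + 2])) # shares of the sum recover a + b
```

The sum works because adding two degree-t polynomials pointwise gives another degree-t polynomial whose value at 0 is the sum of the two secrets.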
Privacy-preserving training of Neural Networks
The advantage of Deep Learning (DL) is that the program selects the feature set by itself without supervision, i.e. feature extraction is automated. This is achieved by training large-scale neural networks, referred to as Deep Neural Nets (DNNs), over large labeled datasets.
Training a DNN occurs over multiple iterations (epochs). Each forward run is coupled with a feedback loop, in which the classification errors identified at the end of a run with respect to the ground truth (training dataset) are fed back to the previous (hidden) layers to adapt their parameter weights; this is known as ‘backpropagation’. A sample DNN architecture is illustrated in Fig. 2.
A privacy-preserving extension of the above NN training would average the locally trained models to obtain the global NN model. The distributed architecture is illustrated in Fig. 3.
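The averaging step itself is simple, as the following sketch shows. It treats each local model as a flat list of weights and weights each participant by its number of local training samples; that weighting is an assumption for this example (a plain unweighted mean is the special case of equal sample counts), and the function name is illustrative. No privacy protection is applied here; this is only the plain aggregation that the protocols discussed in this section aim to protect.

```python
def federated_average(local_models, num_samples):
    """Average locally trained weight vectors into a global model,
    weighting each participant by its local dataset size."""
    total = sum(num_samples)
    dim = len(local_models[0])
    return [
        sum(weights[k] * n for weights, n in zip(local_models, num_samples)) / total
        for k in range(dim)
    ]

# Two participants with (flattened) local model weights.
local_models = [[1.0, 2.0], [3.0, 4.0]]
num_samples = [1, 3]  # participant 2 holds three times as much data

global_model = federated_average(local_models, num_samples)
print(global_model)
```

In an actual training run this step would be repeated over many rounds, with the global model sent back to the participants as the starting point for the next round of local training.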
As explained above, the averaging can be performed by a Secret Sharing protocol, with the global model hosted by a Coordinating Server. Once trained, we can apply a Homomorphic Compiler (e.g. zama.ai) to output an encrypted model that can accept encrypted inputs, and also provide the model inference (e.g. prediction, classification) as an encrypted output value.
A privacy preserving ML pipeline can be designed using Secret Sharing for model training and Homomorphic Encryption for the inference part.
The main caveat of the above architecture is that the locally trained models need to be shared, and these may still contain proprietary information or leak insights related to the local dataset. This is because (during backpropagation) the gradients of a given layer of a neural network are computed using the layer’s feature values and the error from the next layer. For example, in the case of sequential fully connected layers $h_l$ and $h_{l+1}$,
$$h_{l+1} = W_l \cdot h_l$$
the gradient of error E with respect to W_l is defined as:
$$\frac{\partial E}{\partial W_l} = \frac{\partial E}{\partial h_{l+1}} \cdot h_l$$
That is, the gradients of W_l are inner products of the error from the next layer and the features h_l; hence the correlation between the gradients and the features. This is especially true if certain weights in the weight matrix are sensitive to specific features or values in the participants’ dataset (for example, specific words in a language prediction model).
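A small numeric sketch makes this leakage concrete for a single fully connected layer without an activation function. Each entry of the weight gradient is the product of one error component and one feature value, so a zero feature produces an all-zero column. The variable names and values are made up for illustration, and the exact recovery of the features assumes the observer also knows the error vector (for instance via a bias gradient), which is an assumption of this example.

```python
def weight_gradient(error_next, features):
    """Gradient of E w.r.t. W for h_next = W * h:
    entry (i, k) is error_next[i] * features[k]."""
    return [[e * x for x in features] for e in error_next]

# Hypothetical private input, e.g. word counts: words 1 and 3 are present.
features = [0.0, 3.0, 0.0, 1.5]
error_next = [0.25, -0.5]  # error signal from the next layer

grad = weight_gradient(error_next, features)

# An observer of the gradient sees which input features were non-zero...
active = [k for k in range(len(features))
          if any(row[k] != 0.0 for row in grad)]

# ...and, if the error vector is also known, recovers the features exactly.
recovered = [g / error_next[0] for g in grad[0]]

print(active)
print(recovered)
```

Even without knowing the error vector, every row of the gradient is a scaled copy of the feature vector, which is why sharing raw gradients or model updates can reveal properties of the local data.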
To overcome this, Bonawitz et al. propose a ‘Secure Aggregation’ protocol, where the aggregating server only learns about model updates in aggregate. The ∝MDL protocol proposed by Zhang et al. further encrypts the gradients using homomorphic encryption. The summation of the encrypted gradients over all participants gives an encrypted global gradient, which can only be decrypted once a threshold number of participants have shared their gradients. Sav et al. propose POSEIDON, a Multiparty Homomorphic Encryption based NN training protocol that relies on mini-batch gradient descent and protects the intermediate NN models by keeping the weights and gradients encrypted throughout the training phase. The protocol can be applied to build different types of layers, such as fully connected, convolution, and pooling. In terms of model accuracy, the authors show that their model performance is comparable to that of a centrally trained model.
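The idea that a server learns only the aggregate can be sketched with pairwise masks that cancel in the sum. This is a heavily simplified illustration in the spirit of secure aggregation, not the published protocol: it assumes updates are already quantized to integers, that no participant drops out, and it generates every pair's mask in one function for brevity, whereas in practice each pair of participants would agree on its mask privately. All names are illustrative.

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value

def mask_updates(updates, rng):
    """For every pair (i, j) with i < j, participant i adds a shared random
    mask and participant j subtracts it, so all masks cancel in the sum."""
    masked = [list(u) for u in updates]
    dim = len(updates[0])
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            pair_mask = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] = (masked[i][k] + pair_mask[k]) % MODULUS
                masked[j][k] = (masked[j][k] - pair_mask[k]) % MODULUS
    return masked

def aggregate(masked):
    """Server side: sum the masked updates; only the total is revealed."""
    dim = len(masked[0])
    return [sum(u[k] for u in masked) % MODULUS for k in range(dim)]

updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # integer-encoded updates
masked = mask_updates(updates, random.Random(0))
total = aggregate(masked)

print(masked[0])  # looks random to the server
print(total)      # equals the element-wise sum of the original updates
```

Each individual masked update is statistically uninformative on its own, yet the server still obtains the exact sum it needs for averaging.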
To summarize, this is an active area of research and we will see different SMC protocols capable of training different NN architectures in the near future — with different trade-offs.
Synthetic Data — Privacy
The availability of good quality data (in significant volumes) remains a concern for the success of ML/DL projects. Synthetic data generation aims to provide high quality data that is synthetically generated to closely resemble the original data.
Generative Adversarial Networks (GANs) have proven quite effective for synthetic data generation. Intuitively, a GAN can be considered as a game between two networks: a Generator network and a second Classifier network (commonly called the Discriminator). The Classifier can, e.g., be a Convolutional Neural Network (CNN) based image classification network that distinguishes samples as coming either from the actual distribution or from the Generator. Every time the Classifier is able to tell a fake image, i.e. it notices a difference between the two distributions, the Generator adjusts its parameters accordingly. At the end (in theory), the Classifier will be unable to distinguish the two, implying that the Generator is then able to reproduce the original data set.
Privacy regulations (e.g. EU GDPR) restrict the Personally Identifiable Information (PII) that can be used for analytics. As such, there has been renewed interest in the ability to generate privacy-preserving synthetic data, i.e. synthetic data that is close to (and generated from) the original training data, is compliant with privacy regulations, and still allows similar insights to be derived as from the original training data.
The premise is promising, and this has been accompanied by very optimistic messaging from both governmental organizations and commercial entities.
- NIST Differential Privacy Synthetic Data Challenge (link): “Propose an algorithm to develop differentially private synthetic datasets to enable the protection of personally identifiable information (PII) while maintaining a dataset’s utility for analysis.”
- Diagnosing the NHS — SynÆ (link): “ODI Leeds and NHS England will be working together to explore the potential of ‘synthetic data.’ This is data that has been created following the patterns identified in a real dataset but it contains no personal data, making it suitable to release as open data.”
- Statice (link): “Statice generates synthetic data — just like real data, but privacy-compliant”
- Hazy (link): “Hazy’s synthetic data generation lets you create business insight across company, legal and compliance boundaries — without moving or exposing your data.”
While the promise of privacy-preserving synthetic data is valid, such claims need to be taken with a ‘grain of salt’, as there are currently numerous challenges both in making and in evaluating them. For example, there is no agreement today (nor a standard framework) on even which privacy metric to use to validate such claims.
With current synthetic data generation techniques, the protection level varies by user. Due to randomness in the generation algorithms (e.g., GANs), it is difficult to predict the features that the model will learn and those that the adversary will attack, implying that we cannot guarantee privacy protection for all users. Stadler et al. show that synthetic data generated by a number of generative models actually leak more information, i.e. they perform worse than the original (training) dataset with respect to privacy metrics such as Linkability and Attribute Inference.
References
- [1] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine, 37, pp. 50–60, 2020.
- [2] D. Biswas, S. Haller, and F. Kerschbaum. Privacy-Preserving Outsourced Profiling. 12th IEEE Conference on Commerce and Enterprise Computing, Shanghai, 2010, pp. 136–143, doi: 10.1109/CEC.2010.39.
- [3] D. Biswas. Privacy preserving Chatbot Conversation. 3rd NeurIPS Workshop on Privacy-preserving Machine Learning (PPML), 2020 (Medium), https://ppml-workshop.github.io/pdfs/Biswas.pdf
- [4] M. Rigaki and S. Garcia. A Survey of Privacy Attacks in Machine Learning, 2020, https://arxiv.org/abs/2007.07646
- [5] C. Briggs, Z. Fan, and P. Andras. A Review of Privacy-preserving Federated Learning for the Internet-of-Things, 2020, https://arxiv.org/abs/2004.11794
- [6] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box Adversarial Attacks with Limited Queries and Information. In Proceedings of the 35th International Conference on Machine Learning, pp. 2137–2146, PMLR, 2018, http://proceedings.mlr.press/v80/ilyas18a.html
- [7] F. Kerschbaum, D. Biswas, and S. de Hoogh. Performance Comparison of Secure Comparison Protocols. 20th International Workshop on Database and Expert Systems Application, Linz, 2009, pp. 133–136, doi: 10.1109/DEXA.2009.37.
- [8] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas. Federated Learning of Deep Networks using Model Averaging. CoRR, abs/1602.05629, 2016.
- [9] M. Nasr, R. Shokri, and A. Houmansadr. Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning. 2019 IEEE Symposium on Security and Privacy (SP), pp. 739–753, 2019.
- [10] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data, 2017, https://arxiv.org/abs/1602.05629
- [11] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical Secure Aggregation for Privacy-Preserving Machine Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pp. 1175–1191, 2017, https://dl.acm.org/doi/10.1145/3133956.3133982
- [12] X. Zhang, S. Ji, H. Wang, and T. Wang. Private, Yet Practical, Multiparty Deep Learning. In 37th IEEE International Conference on Distributed Computing Systems (ICDCS), 2017, pp. 1442–1452.
- [13] S. Sav, A. Pyrgelis, J. Troncoso-Pastoriza, D. Froelicher, J. Bossuat, J. S. Sousa, and J. Hubaux. POSEIDON: Privacy-Preserving Federated Neural Network Learning, 2020, https://arxiv.org/abs/2009.00349
- [14] T. Stadler, B. Oprisanu, and C. Troncoso. Synthetic Data - A Privacy Mirage, 2020, https://arxiv.org/abs/2011.07018