Background
Though many datasets contain sensitive, private information, they are nevertheless often released due to the benefits they can provide to researchers and the general public. For example, medical datasets are published to allow researchers to develop improved data-driven medical diagnosis techniques. Due to the sensitive information they contain (e.g. medical records), these datasets should be sufficiently anonymised before release – ensuring that no persons contained within the dataset are harmed by the release.
Data de-identification is often the default technique for anonymising data. This approach consists of removing any obvious “personally identifiable information” (Pii) from the dataset. For example, names, addresses, and date-of-births are often obscured. However, recent privacy attacks and advancements in privacy-preserving research have revealed that datasets anonymised via de-identification can be often compromised via “linkage attacks”. In a linkage attack, the published data is de-anonymised by linking it to auxiliary information obtained from another source. A famous example is the Netflix/IMDB linkage attack [1, 2], where persons in an anonymised database of Netflix movie ratings were re-identified by linking the Netflix data to IMBD.
In this challenge, you are provided with a set of sensitive datasets that have been anonymised via de-identification before publication. You will investigate whether this de-identification is robust to privacy attacks. Your goal is to de-anonymise the sensitive data by linking it back to some Pii data, thus revealing the private information of individuals in the Pii dataset and exposing any flaws in the process used to anonymise the sensitive data. The team that successfully de-anonymises the most data wins!
This challenge will demonstrate the limitations of data de-identification methods and thus promote an appreciation of the utility of more sophisticated Privacy Enhancing Technologies (PETs), such as differential privacy (for more details on differential privacy, see Challenge 1). It is also an excellent opportunity for participants to practice and improve upon their Python data manipulation and data analysis skills.
Requirements
Teams will require a minimum of 1 person with some technical knowledge of manipulating data and analysing data with python and pandas. We would encourage a minimum of 2 per team, in the spirit of collaboration.
Each team must have a laptop with an internet connection to compete in the challenge. The challenge can be performed online via Google Colab (so no software need to be installed in advance). To run on Colab at least one member of the team will need a Google account.
Pre-Challenge Materials
Reading Materials
- This blog post on Oblivious’ site gives a nice introduction to attacks on privacy.
- The programming differential privacy guide is helpful for the implementation of differentially private mechanisms. The De-identification notebook is particularly relevant to this challenge.
- The PPML playlist created by CeADAR gives a high-level outline of how privacy-preserving techniques can be applied to data.
- For a deeper dive into differential privacy and privacy attacks, see Section 1 of The Algorithmic Foundations of Differential Privacy, and Section 1 of Exposed! A Survey of Attacks on Private Data.
- You should brush up on your python and pandas skills before the challenge. See the 10 minutes to pandas tutorial if you have little prior pandas experience. Also, see the ‘Example Code’ section below.
Example Code
We provide an example notebook to introduce you to linkage attacks, and the python and pandas operations required to implement them. We recommend participants implement Challenge 2 via Colab, so it is worth familiarising yourself with its workings.
Instructions for running notebook online on Colab:
- Open toy_linkage_attack.ipynb
- If you wish to save your progress:
- Click “Changes will not be saved”
- Click “Save a copy in drive”:
- Change to the “Copy of toy_linkage_attack.ipynb” notebook
- Click “Changes will not be saved”
- Begin running the notebook!