Assessing Identity Disclosure Risk in the Absence of Identified Datasets in the Public Domain

  • Peter N. Muturi University of Nairobi
  • Andrew M. Kahonge University of Nairobi
  • Christopher Kipchumba Chepken University of Nairobi
Keywords: Anonymisation, De-Identification, Re-Identification, Privacy, Data Release, Data Analytics, Analytical Utility

Abstract

Data release is essential for supporting data analytics and secondary data analyses. However, data curators need to ensure that released datasets preserve data subjects’ privacy and retain analytical utility. Data privacy is achieved by anonymising datasets before release. The disclosure risk facing a dataset should inform the level of anonymisation applied. While anonymisation protects privacy, it also reduces the dataset’s analytical utility because it alters the original data values. Therefore, data curators require an appropriate estimate of the dataset’s identity disclosure risk to inform the anonymisation needed to balance privacy and utility. The disclosure risk varies from one geographical region to another because the enabling factors differ. This paper assesses the disclosure risk and the enabling factors in an environment lacking identified datasets in the public domain. The study used a quasi-experimental design to carry out an empirical identity disclosure test in which respondents were given an anonymised dataset and asked to disclose the identity of any of its records. The findings show that background knowledge of the released datasets was the primary enabler of disclosure in the absence of identified datasets. Respondents could disclose records only in a dataset they were familiar with. However, the disclosure risk was within an acceptable threshold. Therefore, the study concluded that in an environment lacking identified datasets in the public domain, reasonable anonymisation could achieve a balance of privacy and utility in datasets. The findings justify privacy-preserving data release that can support data analytics and secondary data analyses in such environments.
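
The empirical test summarised in the abstract can be illustrated with a minimal sketch, assuming the disclosure risk is taken as the share of released records that at least one respondent identified correctly. The record layout, the function names, the toy data, and the threshold value are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One respondent's claim about a released record (hypothetical layout)."""
    respondent: str
    record_id: int
    claimed_identity: str
    familiar: bool  # respondent reported prior familiarity with the dataset


def correct_attempts(attempts, truth):
    """Return the attempts whose claimed identity matches the held-back truth."""
    return [a for a in attempts if truth.get(a.record_id) == a.claimed_identity]


def disclosure_risk(attempts, truth, n_records):
    """Share of released records correctly identified by at least one respondent."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    disclosed = {a.record_id for a in correct_attempts(attempts, truth)}
    return len(disclosed) / n_records


# Assumed value for illustration only; the study's actual threshold is not given here.
ASSUMED_THRESHOLD = 0.1

# Toy data: 20 released records and three disclosure attempts.
truth = {i: f"person-{i}" for i in range(1, 21)}
attempts = [
    Attempt("r1", 3, "person-3", familiar=True),   # correct
    Attempt("r2", 3, "person-3", familiar=True),   # correct, same record as r1
    Attempt("r3", 7, "person-9", familiar=False),  # incorrect
]

risk = disclosure_risk(attempts, truth, n_records=len(truth))
within_threshold = risk <= ASSUMED_THRESHOLD
all_correct_were_familiar = all(a.familiar for a in correct_attempts(attempts, truth))
```

Counting distinct records rather than attempts keeps two respondents who disclose the same record from inflating the estimate. The `familiar` flag allows a check of whether correct disclosures came only from respondents with background knowledge of the dataset.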

Published
5 August, 2022
How to Cite
Muturi, P., Kahonge, A., & Chepken, C. (2022). Assessing Identity Disclosure Risk in the Absence of Identified Datasets in the Public Domain. East African Journal of Information Technology, 5(1), 62-75. https://doi.org/10.37284/eajit.5.1.773