Assessing Identity Disclosure Risk in the Absence of Identified Datasets in the Public Domain

  • Peter N. Muturi University of Nairobi
  • Andrew M. Kahonge University of Nairobi
  • Christopher Kipchumba Chepken University of Nairobi
Keywords: Anonymisation, De-Identification, Re-Identification, Privacy, Data Release, Data Analytics, Analytical Utility


Data release is essential for supporting data analytics and secondary data analyses. However, data curators must ensure that released datasets preserve data subjects’ privacy while retaining analytical utility. Data privacy is achieved by anonymising datasets before release, and the level of anonymisation should be informed by the disclosure risk the dataset faces. Anonymisation protects privacy, but it also reduces the analytical utility of the dataset because it alters the original data values. Data curators therefore need an appropriate estimate of a dataset’s identity disclosure risk to choose a level of anonymisation that balances privacy and utility. Disclosure risk varies from one geographical region to another because the enabling factors vary. This paper assesses the disclosure risk and its enabling factors in an environment lacking identified datasets in the public domain. The study used a quasi-experimental design to carry out an empirical identity disclosure test, in which respondents were given an anonymised dataset and asked to disclose the identity of any of its records. The findings show that background knowledge of the released datasets was the primary enabler in the absence of identified datasets: respondents could disclose only records in datasets they were familiar with. However, the disclosure risk was within an acceptable threshold. The study therefore concluded that, in an environment lacking identified datasets in the public domain, reasonable anonymisation can achieve a balance of privacy and utility in datasets. The findings justify privacy-preserving data release that can support data analytics and secondary data analyses in such environments.
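As a rough illustration of how the outcome of an empirical identity disclosure test might be summarised, the sketch below computes the share of released records that respondents identified correctly and compares it with a threshold. It is not the authors' procedure: the `Attempt` record, the function names, the sample data and the 0.05 threshold are all hypothetical, since the abstract reports neither the measure nor the threshold value used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One respondent's attempt to name the person behind a released record (hypothetical structure)."""
    record_id: str
    claimed_identity: str
    true_identity: str


def empirical_disclosure_risk(attempts, dataset_size):
    """Return the fraction of released records that were correctly identified at least once."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    # Count each record once, even if several respondents identified it.
    disclosed = {a.record_id for a in attempts if a.claimed_identity == a.true_identity}
    return len(disclosed) / dataset_size


def within_threshold(risk, threshold):
    """True when the observed risk does not exceed the chosen acceptable threshold."""
    return risk <= threshold


# Hypothetical data: three attempts against a 100-record release.
attempts = [
    Attempt("r1", "alice", "alice"),  # correct
    Attempt("r2", "bob", "carol"),    # incorrect
    Attempt("r7", "dave", "dave"),    # correct
]
risk = empirical_disclosure_risk(attempts, dataset_size=100)
acceptable = within_threshold(risk, threshold=0.05)  # 0.05 is an assumed value
```

Counting each record once keeps the measure a property of the released dataset, not of how many respondents took part. A study could equally report risk per respondent or per attempt.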





5 August, 2022
How to Cite
Muturi, P., Kahonge, A., & Chepken, C. (2022). Assessing Identity Disclosure Risk in the Absence of Identified Datasets in the Public Domain. East African Journal of Information Technology, 5(1), 62-75.