
A proposed de-identification framework for a cohort of children presenting at a health facility in Uganda

Author

Listed:
  • Alishah Mawji
  • Holly Longstaff
  • Jessica Trawin
  • Dustin Dunsmuir
  • Clare Komugisha
  • Stefanie K Novakowski
  • Matthew O Wiens
  • Samuel Akech
  • Abner Tagoola
  • Niranjan Kissoon
  • J Mark Ansermino

Abstract

Data sharing has enormous potential to accelerate and improve the accuracy of research, strengthen collaborations, and restore trust in the clinical research enterprise. Nevertheless, there remains a reluctance to openly share raw data sets, in part due to concerns regarding research participant confidentiality and privacy. Statistical data de-identification is an approach that can be used to preserve privacy and facilitate open data sharing. We have proposed a standardized framework for the de-identification of data generated from cohort studies in children in a low- and middle-income country. We applied a standardized de-identification framework to a data set comprising 241 health-related variables collected from a cohort of 1750 children with acute infections from Jinja Regional Referral Hospital in Eastern Uganda. Variables were labeled as direct and quasi-identifiers based on conditions of replicability, distinguishability, and knowability with consensus from two independent evaluators. Direct identifiers were removed from the data set, while a statistical risk-based de-identification approach using the k-anonymity model was applied to quasi-identifiers. Qualitative assessment of the level of privacy invasion associated with data set disclosure was used to determine an acceptable re-identification risk threshold and the corresponding k-anonymity requirement. A de-identification model using generalization followed by suppression was applied using a logical stepwise approach to achieve k-anonymity. The utility of the de-identified data was demonstrated using a typical clinical regression example. The de-identified data set was published on the Pediatric Sepsis Data CoLaboratory Dataverse, which provides moderated data access. Researchers are faced with many challenges when providing access to clinical data. We provide a standardized de-identification framework that can be adapted and refined based on specific context and risks. This process will be combined with moderated access to foster coordination and collaboration in the clinical research community.

Author summary: Open Data is data that anyone can access, use, and share. Open Data has the potential to facilitate collaboration, enrich research, and advance the analytic capacity to inform decisions. Importantly, Open Data plays a role in fulfilling obligations to research participants and honoring the nature of medical research as a public good. Leaders in industry, academia, and regulatory agencies recognize the value in increased transparency and are focusing on how to openly share data while minimizing the safety risks to research participants. For example, making data open can pose a privacy risk to research participants who have shared personal health information. This risk can be mitigated using data de-identification, a process of removing personal information from a data set so that an individual's identity is no longer apparent or cannot be reasonably ascertained from the data. We introduce a simple, statistical risk-based framework for de-identification of clinical data that can be followed by any researcher. This framework will guide open data sharing while improving the protection of research participants.
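
The generalization-then-suppression route to k-anonymity mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy records, the field names, the 12-month age bands, and the choice of k = 2 are hypothetical and are not taken from the study. Suppression is done here at the record level for simplicity; the published framework may instead suppress individual values.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative
# assumptions, not variables from the study's data set.
records = [
    {"age_months": 7, "sex": "F", "district": "Jinja"},
    {"age_months": 9, "sex": "F", "district": "Jinja"},
    {"age_months": 11, "sex": "F", "district": "Jinja"},
    {"age_months": 14, "sex": "M", "district": "Jinja"},
    {"age_months": 20, "sex": "M", "district": "Jinja"},
    {"age_months": 40, "sex": "M", "district": "Iganga"},
]

# Quasi-identifiers as they appear after generalization (assumed set).
QUASI_IDENTIFIERS = ("age_band", "sex", "district")


def generalize(record):
    """Generalization: replace exact age in months with a 12-month band."""
    low = (record["age_months"] // 12) * 12
    out = {name: value for name, value in record.items() if name != "age_months"}
    out["age_band"] = f"{low}-{low + 11}"
    return out


def class_sizes(rows, keys):
    """Count rows sharing each combination of quasi-identifier values."""
    return Counter(tuple(row[name] for name in keys) for row in rows)


def suppress(rows, keys, k):
    """Suppression: drop rows whose equivalence class has fewer than k rows."""
    sizes = class_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[name] for name in keys)] >= k]


K = 2  # assumed requirement; the study derives k from a risk threshold
generalized = [generalize(record) for record in records]
anonymized = suppress(generalized, QUASI_IDENTIFIERS, K)

# After both steps, every remaining equivalence class has at least K rows.
smallest_class = min(class_sizes(anonymized, QUASI_IDENTIFIERS).values())
print(len(anonymized), smallest_class)
```

Generalizing first keeps more records: coarser categories merge small equivalence classes, so fewer rows fall below k and need to be suppressed. In this toy example only the single record with a unique combination of quasi-identifiers is dropped.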

Suggested Citation

  • Alishah Mawji & Holly Longstaff & Jessica Trawin & Dustin Dunsmuir & Clare Komugisha & Stefanie K Novakowski & Matthew O Wiens & Samuel Akech & Abner Tagoola & Niranjan Kissoon & J Mark Ansermino, 2022. "A proposed de-identification framework for a cohort of children presenting at a health facility in Uganda," PLOS Digital Health, Public Library of Science, vol. 1(8), pages 1-17, August.
  • Handle: RePEc:plo:pdig00:0000027
    DOI: 10.1371/journal.pdig.0000027

    Download full text from publisher

    File URL: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000027
    Download Restriction: no

    File URL: https://journals.plos.org/digitalhealth/article/file?id=10.1371/journal.pdig.0000027&type=printable
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Liwei Zhang & Liang Ma, 2023. "Is open science a double-edged sword?: data sharing and the changing citation pattern of Chinese economics articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(5), pages 2803-2818, May.
    2. repec:plo:pone00:0037552 is not listed on IDEAS
    3. repec:plo:pone00:0188511 is not listed on IDEAS
    4. Garret Christensen & Allan Dafoe & Edward Miguel & Don A Moore & Andrew K Rose, 2019. "A study of the impact of data sharing on article citations using journal policies as a natural experiment," PLOS ONE, Public Library of Science, vol. 14(12), pages 1-13, December.
    5. Denis Huschka & Claudia Oellers & Notburga Ott & Gert G. Wagner, 2011. "Datenmanagement und Data Sharing. Erfahrungen in den Sozial- und Wirtschaftswissenschaften," RatSWD Working Papers 184, German Data Forum (RatSWD).
    6. Edward Miguel, 2021. "Evidence on Research Transparency in Economics," Journal of Economic Perspectives, American Economic Association, vol. 35(3), pages 193-214, Summer.
    7. Andreoli-Versbach, Patrick & Mueller-Langer, Frank, 2014. "Open access to data: An ideal professed but not practised," Research Policy, Elsevier, vol. 43(9), pages 1621-1633.
    8. Stefan Reichmann & Thomas Klebel & Ilire Hasani‐Mavriqi & Tony Ross‐Hellauer, 2021. "Between administration and research: Understanding data management practices in an institutional context," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 72(11), pages 1415-1431, November.
    9. Benedikt Fecher & Sascha Friesike & Marcel Hebing, 2014. "What Drives Academic Data Sharing?," SOEPpapers on Multidisciplinary Panel Data Research 655, DIW Berlin, The German Socio-Economic Panel (SOEP).
    10. Javier Martínez-Vega & David Rodríguez-Rodríguez, 2022. "Protected Area Effectiveness in the Scientific Literature: A Decade-Long Bibliometric Analysis," Land, MDPI, vol. 11(6), pages 1-14, June.
    11. repec:plo:pone00:0110268 is not listed on IDEAS
    12. Mark J. McCabe & Frank Mueller-Langer, 2019. "Does Data Disclosure Increase Citations? Empirical Evidence from a Natural Experiment in Leading Economics Journals," JRC Working Papers on Digital Economy 2019-02, Joint Research Centre.
    13. Benedikt Fecher & Sascha Friesike & Marcel Hebing, 2015. "What Drives Academic Data Sharing?," PLOS ONE, Public Library of Science, vol. 10(2), pages 1-25, February.
    14. Klebel, Thomas & Traag, Vincent, 2024. "Introduction to causality in science studies," SocArXiv 4bw9e, Center for Open Science.
    15. Isabella Peters & Peter Kraker & Elisabeth Lex & Christian Gumpenberger & Juan Gorraiz, 2016. "Research data explored: an extended analysis of citations and altmetrics," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(2), pages 723-744, May.
    16. Ale Ebrahim, Nader & Salehi, Hadi & Embi, Mohamed Amin & Habibi Tanha, Farid & Gholizadeh, Hossein & Motahar, Seyed Mohammad & Ordi, Ali, 2013. "Effective Strategies for Increasing Citation Frequency," MPRA Paper 50919, University Library of Munich, Germany, revised 12 Oct 2013.
    17. repec:plo:pbio00:1002506 is not listed on IDEAS
    18. Marjan Bakker & Jelte M Wicherts, 2014. "Outlier Removal and the Relation with Reporting Errors and Quality of Psychological Research," PLOS ONE, Public Library of Science, vol. 9(7), pages 1-9, July.
    19. Harper, Lindsey M. & Kim, Youngseek, 2018. "Attitudinal, normative, and resource factors affecting psychologists’ intentions to adopt an open data badge: An empirical analysis," International Journal of Information Management, Elsevier, vol. 41(C), pages 23-32.
    20. repec:plo:pbio00:2002846 is not listed on IDEAS
    21. Benedikt Fecher & Sascha Friesike & Marcel Hebing, 2014. "What Drives Academic Data Sharing?," RatSWD Working Papers 236, German Data Forum (RatSWD).
    22. Stefan Stieglitz & Konstantin Wilms & Milad Mirbabaie & Lennart Hofeditz & Bela Brenger & Ania López & Stephanie Rehwald, 2020. "When are researchers willing to share their data? – Impacts of values and uncertainty on open data in academia," PLOS ONE, Public Library of Science, vol. 15(7), pages 1-20, July.
    23. repec:plo:pcbi00:1003285 is not listed on IDEAS
    24. Kwon, Seokbeom, 2025. "Competition or diversion? Effect of public sharing of data on research productivity of data provider," Research Policy, Elsevier, vol. 54(9).
    25. Xie, Qing & Wang, Jiamin & Kim, Giyeong & Lee, Soobin & Song, Min, 2021. "A sensitivity analysis of factors influential to the popularity of shared data in data repositories," Journal of Informetrics, Elsevier, vol. 15(3).
    26. Jan H. Höffler, 2017. "Replication and Economics Journal Policies," American Economic Review, American Economic Association, vol. 107(5), pages 52-55, May.
