
A proposed de-identification framework for a cohort of children presenting at a health facility in Uganda

Author

Listed:
  • Alishah Mawji
  • Holly Longstaff
  • Jessica Trawin
  • Dustin Dunsmuir
  • Clare Komugisha
  • Stefanie K Novakowski
  • Matthew O Wiens
  • Samuel Akech
  • Abner Tagoola
  • Niranjan Kissoon
  • J Mark Ansermino

Abstract

Data sharing has enormous potential to accelerate and improve the accuracy of research, strengthen collaborations, and restore trust in the clinical research enterprise. Nevertheless, there remains reluctance to openly share raw data sets, in part due to concerns regarding research participant confidentiality and privacy. Statistical data de-identification is an approach that can be used to preserve privacy and facilitate open data sharing. We have proposed a standardized framework for the de-identification of data generated from cohort studies in children in a low- and middle-income country. We applied a standardized de-identification framework to a data set comprising 241 health-related variables collected from a cohort of 1750 children with acute infections from Jinja Regional Referral Hospital in Eastern Uganda. Variables were labeled as direct identifiers or quasi-identifiers based on conditions of replicability, distinguishability, and knowability, with consensus from two independent evaluators. Direct identifiers were removed from the data set, while a statistical risk-based de-identification approach using the k-anonymity model was applied to quasi-identifiers. Qualitative assessment of the level of privacy invasion associated with data set disclosure was used to determine an acceptable re-identification risk threshold and the corresponding k-anonymity requirement. A de-identification model using generalization followed by suppression was applied in a logical stepwise approach to achieve k-anonymity. The utility of the de-identified data was demonstrated using a typical clinical regression example. The de-identified data set was published on the Pediatric Sepsis Data CoLaboratory Dataverse, which provides moderated data access. Researchers are faced with many challenges when providing access to clinical data. We provide a standardized de-identification framework that can be adapted and refined based on specific context and risks. This process will be combined with moderated access to foster coordination and collaboration in the clinical research community.

Author summary

Open Data is data that anyone can access, use, and share. Open Data has the potential to facilitate collaboration, enrich research, and advance the analytic capacity to inform decisions. Importantly, Open Data plays a role in fulfilling obligations to research participants and honoring the nature of medical research as a public good. Leaders in industry, academia, and regulatory agencies recognize the value in increased transparency and are focusing on how to openly share data while minimizing the safety risks to research participants. For example, making data open can pose a privacy risk to research participants who have shared personal health information. This risk can be mitigated using data de-identification, a process of removing personal information from a data set so that an individual’s identity is no longer apparent or cannot be reasonably ascertained from the data. We introduce a simple, statistical risk-based framework for de-identification of clinical data that can be followed by any researcher. This framework will guide open data sharing while improving the protection of research participants.
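
The k-anonymity steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the quasi-identifier names (`age_months`, `sex`, `district`), the 12-month age bands, the toy records, and the choice of record-level suppression are all assumptions made for illustration, since the abstract does not list the actual variables or rules. The sketch relies on the standard relation that a k-anonymous data set has a worst-case re-identification risk of 1/k, so a risk threshold t implies k = ceil(1/t).

```python
import math
from collections import Counter

# Hypothetical quasi-identifiers; the study's real variables are not given in the abstract.
QUASI_IDENTIFIERS = ("age_months", "sex", "district")


def required_k(risk_threshold):
    """Smallest k whose worst-case re-identification risk (1/k) meets the threshold."""
    return math.ceil(1 / risk_threshold)


def class_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[key] for key in keys) for r in records)


def is_k_anonymous(records, k, keys=QUASI_IDENTIFIERS):
    """True if every equivalence class contains at least k records."""
    return all(size >= k for size in class_sizes(records, keys).values())


def generalize_age(record, band_months=12):
    """Generalization step (assumed rule): replace exact age with a fixed-width band."""
    lower = (record["age_months"] // band_months) * band_months
    generalized = dict(record)
    generalized["age_months"] = f"{lower}-{lower + band_months - 1}"
    return generalized


def suppress_small_classes(records, k, keys=QUASI_IDENTIFIERS):
    """Suppression step (assumed record-level): drop records in classes smaller than k."""
    sizes = class_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[key] for key in keys)] >= k]


def deidentify(records, k):
    """Generalize first, then suppress whatever still violates k-anonymity."""
    generalized = [generalize_age(r) for r in records]
    return suppress_small_classes(generalized, k)


# Made-up toy records, not study data.
toy_records = [
    {"age_months": 3, "sex": "F", "district": "A"},
    {"age_months": 7, "sex": "F", "district": "A"},
    {"age_months": 11, "sex": "F", "district": "A"},
    {"age_months": 14, "sex": "M", "district": "A"},
    {"age_months": 20, "sex": "M", "district": "A"},
    {"age_months": 40, "sex": "M", "district": "A"},
]

released = deidentify(toy_records, k=2)
```

In this toy example every raw record is unique, so the raw data is not 2-anonymous. Banding age merges the first three records into one class and the next two into another, and the single remaining record is suppressed. Generalizing before suppressing, as the abstract describes, keeps more records than suppressing the raw data would.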

Suggested Citation

  • Alishah Mawji & Holly Longstaff & Jessica Trawin & Dustin Dunsmuir & Clare Komugisha & Stefanie K Novakowski & Matthew O Wiens & Samuel Akech & Abner Tagoola & Niranjan Kissoon & J Mark Ansermino, 2022. "A proposed de-identification framework for a cohort of children presenting at a health facility in Uganda," PLOS Digital Health, Public Library of Science, vol. 1(8), pages 1-17, August.
  • Handle: RePEc:plo:pdig00:0000027
    DOI: 10.1371/journal.pdig.0000027

    Download full text from publisher

    File URL: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000027
    Download Restriction: no

    File URL: https://journals.plos.org/digitalhealth/article/file?id=10.1371/journal.pdig.0000027&type=printable
    Download Restriction: no

