Enabling interpretable machine learning for biological data with reliability scores

Authors

  • K D Ahlquist
  • Lauren A Sugden
  • Sohini Ramachandran

Abstract

Machine learning tools have proven useful across biological disciplines, allowing researchers to draw conclusions from large datasets, and opening up new opportunities for interpreting complex and heterogeneous biological data. Alongside the rapid growth of machine learning, there have also been growing pains: some models that appear to perform well have later been revealed to rely on features of the data that are artifactual or biased; this feeds into the general criticism that machine learning models are designed to optimize model performance over the creation of new biological insights. A natural question arises: how do we develop machine learning models that are inherently interpretable or explainable? In this manuscript, we describe the SWIF(r) reliability score (SRS), a method building on the SWIF(r) generative framework that reflects the trustworthiness of the classification of a specific instance. The concept of the reliability score has the potential to generalize to other machine learning methods. We demonstrate the utility of the SRS when faced with common challenges in machine learning including: 1) an unknown class present in testing data that was not present in training data, 2) systemic mismatch between training and testing data, and 3) instances of testing data that have missing values for some attributes. We explore these applications of the SRS using a range of biological datasets, from agricultural data on seed morphology, to 22 quantitative traits in the UK Biobank, and population genetic simulations and 1000 Genomes Project data. With each of these examples, we demonstrate how the SRS can allow researchers to interrogate their data and training approach thoroughly, and to pair their domain-specific knowledge with powerful machine-learning frameworks. We also compare the SRS to related tools for outlier and novelty detection, and find that it has comparable performance, with the advantage of being able to operate when some data are missing. 
The SRS, and the broader discussion of interpretable scientific machine learning, will aid researchers in the biological machine learning space as they seek to harness the power of machine learning without sacrificing rigor and biological insight.

Author summary

Machine learning methods are incredibly powerful at performing tasks such as classification and clustering, but they also pose unique problems that can limit new insights. Complex machine learning models may reach conclusions that are difficult or impossible for researchers to understand after the fact, sometimes producing biased or meaningless results. It is therefore essential that researchers have tools that allow them to understand how machine learning tools reach their conclusions, so that they can effectively design models. This paper builds on the machine learning method SWIF(r), originally designed to detect regions in the genome targeted by natural selection. Our new method, the SWIF(r) Reliability Score (SRS), can help researchers evaluate how trustworthy the prediction of a SWIF(r) model is when classifying a specific instance of data. We also show how SWIF(r) and the SRS can be used for biological problems outside the original scope of SWIF(r). We show that the SRS is helpful in situations where the data used to train the machine learning model fails to represent the testing data in some way. The SRS can be used across many different disciplines, and has unique properties for scientific machine learning research.

Suggested Citation

  • K D Ahlquist & Lauren A Sugden & Sohini Ramachandran, 2023. "Enabling interpretable machine learning for biological data with reliability scores," PLOS Computational Biology, Public Library of Science, vol. 19(5), pages 1-24, May.
  • Handle: RePEc:plo:pcbi00:1011175
    DOI: 10.1371/journal.pcbi.1011175

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011175
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011175&type=printable
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. repec:plo:pgen00:1002326 is not listed on IDEAS
    2. Vasili Pankratov & Milyausha Yunusbaeva & Sergei Ryakhovsky & Maksym Zarodniuk & Bayazit Yunusbayev, 2022. "Prioritizing autoimmunity risk variants for functional analyses by fine-mapping mutations under natural selection," Nature Communications, Nature, vol. 13(1), pages 1-13, December.
    3. Chen, Hua & Hey, Jody & Slatkin, Montgomery, 2015. "A hidden Markov model for investigating recent positive selection through haplotype structure," Theoretical Population Biology, Elsevier, vol. 99(C), pages 18-30.
    4. Mohammad Hossein Olyaee & Alireza Khanteymoori & Khosrow Khalifeh, 2020. "A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model," PLOS ONE, Public Library of Science, vol. 15(10), pages 1-19, October.
    5. Michael DeGiorgio & Zachary A Szpiech, 2022. "A spatially aware likelihood test to detect sweeps from haplotype distributions," PLOS Genetics, Public Library of Science, vol. 18(4), pages 1-37, April.
    6. Roy N. Platt II & Egie E. Enabulele & Ehizogie Adeyemi & Marian O. Agbugui & Oluwaremilekun G. Ajakaye & Ebube C. Amaechi & Chika P. Ejikeugwu & Christopher Igbeneghu & Victor S. Njom & Precious Dlami, 2025. "Genomic data reveal a north-south split and introgression history of blood fluke populations across Africa," Nature Communications, Nature, vol. 16(1), pages 1-14, December.
    7. Xiao Zhang & Mark Blaxter & Jonathan M. D. Wood & Alan Tracey & Shane McCarthy & Peter Thorpe & Jack G. Rayner & Shangzhe Zhang & Kirstin L. Sikkink & Susan L. Balenger & Nathan W. Bailey, 2024. "Temporal genomics in Hawaiian crickets reveals compensatory intragenomic coadaptation during adaptive evolution," Nature Communications, Nature, vol. 15(1), pages 1-19, December.
    8. Lauren A. Choate & Gilad Barshad & Pierce W. McMahon & Iskander Said & Edward J. Rice & Paul R. Munn & James J. Lewis & Charles G. Danko, 2021. "Multiple stages of evolutionary change in anthrax toxin receptor expression in humans," Nature Communications, Nature, vol. 12(1), pages 1-12, December.
    9. Pol Solé-Navais & Julius Juodakis & Karin Ytterberg & Xiaoping Wu & Jonathan P. Bradfield & Marc Vaudel & Abigail L. LaBella & Øyvind Helgeland & Christopher Flatley & Frank Geller & Moshe Finel & Men, 2024. "Genome-wide analyses of neonatal jaundice reveal a marked departure from adult bilirubin metabolism," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
    10. repec:plo:pone00:0010207 is not listed on IDEAS
    11. Gabrielle C. Ngwana-Joseph & Jody E. Phelan & Emilia Manko & Jamille G. Dombrowski & Simone Silva Santos & Martha Suarez-Mutis & Gabriel Vélez-Tobón & Alberto Tobón Castaño & Ricardo Luiz Dantas Macha, 2024. "Genomic analysis of global Plasmodium vivax populations reveals insights into the evolution of drug resistance," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    12. Liye Zhang & Neahga Leonard & Rick Passaro & Mai Sy Luan & Pham Tuyen & Le Thi Ngoc Han & Nguyen Huy Cam & Larry Vogelnest & Michael Lynch & Amanda E. Fine & Nguyen Thi Thanh Nga & Nguyen Long & Benja, 2024. "Genomic adaptation to small population size and saltwater consumption in the critically endangered Cat Ba langur," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    13. Yichen Zheng & Thomas Wiehe, 2019. "Adaptation in structured populations and fuzzy boundaries between hard and soft sweeps," PLOS Computational Biology, Public Library of Science, vol. 15(11), pages 1-32, November.
    14. Hyeongmin Kim & Ki Duk Song & Hyeon Jeong Kim & WonCheoul Park & Jaemin Kim & Taeheon Lee & Dong-Hyun Shin & Woori Kwak & Young-jun Kwon & Samsun Sung & Sunjin Moon & Kyung-Tai Lee & Namshin Kim & Joo, 2015. "Exploring the Genetic Signature of Body Size in Yucatan Miniature Pig," PLOS ONE, Public Library of Science, vol. 10(4), pages 1-16, April.
    15. Ran Tian & Yaolei Zhang & Hui Kang & Fan Zhang & Zhihong Jin & Jiahao Wang & Peijun Zhang & Xuming Zhou & Janet M. Lanyon & Helen L. Sneath & Lucy Woolford & Guangyi Fan & Songhai Li & Inge Seim, 2024. "Sirenian genomes illuminate the evolution of fully aquatic species within the mammalian superorder afrotheria," Nature Communications, Nature, vol. 15(1), pages 1-19, December.
