
On convolutional neural networks for selection inference: Revealing the effect of preprocessing on model learning and the capacity to discover novel patterns

Author

Listed:
  • Ryan M Cecil
  • Lauren A Sugden

Abstract

A central challenge in population genetics is the detection of genomic footprints of selection. As machine learning tools including convolutional neural networks (CNNs) have become more sophisticated and applied more broadly, these provide a logical next step for increasing our power to learn and detect such patterns; indeed, CNNs trained on simulated genome sequences have recently been shown to be highly effective at this task. Unlike previous approaches, which rely upon human-crafted summary statistics, these methods are able to be applied directly to raw genomic data, allowing them to potentially learn new signatures that, if well-understood, could improve the current theory surrounding selective sweeps. Towards this end, we examine a representative CNN from the literature, paring it down to the minimal complexity needed to maintain comparable performance; this low-complexity CNN allows us to directly interpret the learned evolutionary signatures. We then validate these patterns in more complex models using metrics that evaluate feature importance. Our findings reveal that preprocessing steps, which determine how the population genetic data is presented to the model, play a central role in the learned prediction method. This results in models that mimic previously-defined summary statistics; in one case, the summary statistic itself achieves similarly high accuracy. For evolutionary processes that are less well understood than selective sweeps, we hope this provides an initial framework for using CNNs in ways that go beyond simply achieving high classification performance. Instead, we propose that CNNs might be useful as tools for learning novel patterns that can translate to easy-to-implement summary statistics available to a wider community of researchers.

Author summary: The ever-increasing power and complexity of machine learning tools presents the scientific community with both unique opportunities and unique challenges. On the one hand, these data-driven approaches have led to state-of-the-art advances on a variety of research problems spanning many fields. On the other, these apparent performance improvements come at the cost of interpretability: it is difficult to know how a model makes its predictions. This is compounded by the computational sophistication of machine learning models which can lend an air of objectivity, often masking ways in which bias may be baked into the modeling decisions or the data itself. We present here a case study, examining these issues in the context of a central problem in population genetics: detecting patterns of selection from genome data. Through this application, we show how human decision-making can encourage the model to see what we want it to see in various ways. By understanding how these models work, and how they respond to the particular way in which data is presented, we have a chance of creating new frameworks that are capable of discovering novel patterns.

Suggested Citation

  • Ryan M Cecil & Lauren A Sugden, 2023. "On convolutional neural networks for selection inference: Revealing the effect of preprocessing on model learning and the capacity to discover novel patterns," PLOS Computational Biology, Public Library of Science, vol. 19(11), pages 1-20, November.
  • Handle: RePEc:plo:pcbi00:1010979
    DOI: 10.1371/journal.pcbi.1010979

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010979
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1010979&type=printable
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Champagnat, Nicolas & Lambert, Amaury, 2013. "Splitting trees with neutral Poissonian mutations II: Largest and oldest families," Stochastic Processes and their Applications, Elsevier, vol. 123(4), pages 1368-1414.
    2. Devansh Pandey & Mariana Harris & Nandita R. Garud & Vagheesh M. Narasimhan, 2024. "Leveraging ancient DNA to uncover signals of natural selection in Europe lost due to admixture or drift," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    3. Chen, Hua & Hey, Jody & Slatkin, Montgomery, 2015. "A hidden Markov model for investigating recent positive selection through haplotype structure," Theoretical Population Biology, Elsevier, vol. 99(C), pages 18-30.
    4. Bing Guo & Victor Borda & Roland Laboulaye & Michele D. Spring & Mariusz Wojnarski & Brian A. Vesely & Joana C. Silva & Norman C. Waters & Timothy D. O’Connor & Shannon Takala-Harrison, 2024. "Strong positive selection biases identity-by-descent-based inferences of recent demography and population structure in Plasmodium falciparum," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    5. Yupeng Sang & Zhiqin Long & Xuming Dan & Jiajun Feng & Tingting Shi & Changfu Jia & Xinxin Zhang & Qiang Lai & Guanglei Yang & Hongying Zhang & Xiaoting Xu & Huanhuan Liu & Yuanzhong Jiang & Pär K. In, 2022. "Genomic insights into local adaptation and future climate-induced vulnerability of a keystone forest tree in East Asia," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    6. Xinkai Tong & Dong Chen & Jianchao Hu & Shiyao Lin & Ziqi Ling & Huashui Ai & Zhiyan Zhang & Lusheng Huang, 2023. "Accurate haplotype construction and detection of selection signatures enabled by high quality pig genome sequences," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    7. Rafajlović, M. & Klassmann, A. & Eriksson, A. & Wiehe, T. & Mehlig, B., 2014. "Demography-adjusted tests of neutrality based on genome-wide SNP data," Theoretical Population Biology, Elsevier, vol. 95(C), pages 1-12.
