
Ten quick tips for sequence-based prediction of protein properties using machine learning

Authors

  • Qingzhen Hou
  • Katharina Waury
  • Dea Gogishvili
  • K Anton Feenstra

Abstract

The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work and reading submitted manuscripts and published papers, we have noticed several recurring issues that make some reported findings hard to understand and replicate. We suspect this is because biologists may be unfamiliar with machine learning methodology or, conversely, because machine learning experts may lack some of the knowledge needed to apply their methods correctly to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how the annotations of interest were obtained, which benchmark metrics were used, and how positives and negatives are defined. Others relate to a lack of rigor: if you sneak in structural information, your method is not sequence-based; if you compare your own model to the “state-of-the-art,” take the best methods; if you want to conclude that one method is better than another, obtain a significance estimate to support this claim. We cover these and other issues in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Many computational biologists who develop such methods should therefore benefit from a concise overview of what to avoid and what to do instead.
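
The advice to support a claim that one method outperforms another with a significance estimate can be illustrated with a paired permutation test. The sketch below is not taken from the article: the function name `paired_permutation_test`, the choice of a sign-flip test, and the example scores are assumptions made purely for illustration. It assumes both methods were scored on the same proteins, so that each pair of scores can be compared directly.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Estimate a two-sided p-value for the mean paired score difference.

    Hypothetical helper: scores_a[i] and scores_b[i] are the scores of two
    methods on the same i-th test protein (for example, per-protein accuracy).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired (same length)")
    if not scores_a:
        raise ValueError("at least one pair of scores is required")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two methods are exchangeable,
        # so the sign of each paired difference is flipped at random.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Made-up per-protein scores, for illustration only.
    method_a = [0.81, 0.77, 0.90, 0.68, 0.74, 0.88, 0.79, 0.83]
    method_b = [0.78, 0.75, 0.86, 0.70, 0.71, 0.85, 0.77, 0.80]
    print("estimated p-value:", paired_permutation_test(method_a, method_b))
```

A small p-value suggests that the observed difference is unlikely to arise from chance alone on this test set; it says nothing about whether the test set itself is representative or free of overlap with the training data.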

Suggested Citation

  • Qingzhen Hou & Katharina Waury & Dea Gogishvili & K Anton Feenstra, 2022. "Ten quick tips for sequence-based prediction of protein properties using machine learning," PLOS Computational Biology, Public Library of Science, vol. 18(12), pages 1-15, December.
  • Handle: RePEc:plo:pcbi00:1010669
    DOI: 10.1371/journal.pcbi.1010669

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010669
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1010669&type=printable
    Download Restriction: no

