
Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species

Author

Listed:
  • Yumin Zheng
  • Haohan Wang
  • Yang Zhang
  • Xin Gao
  • Eric P Xing
  • Min Xu

Abstract

In eukaryotes, polyadenylation (poly(A)) is an essential process during mRNA maturation. Identifying the cis-determinants of the poly(A) signal (PAS) on the DNA sequence is key to understanding the mechanism of translation regulation and mRNA metabolism. Although machine learning methods have been widely used to identify PAS computationally, their need for large amounts of annotated data hinders the application of existing methods to species without experimental PAS data. Cross-species PAS identification, which makes it possible to predict PAS in species absent from the training data, is therefore a promising direction. In this work, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolutional Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use four species, build cross-species training sets from two of them, and evaluate performance on the remaining ones. Moreover, we test our method under insufficient-data and imbalanced-data conditions and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy when the training set is smaller or imbalanced.

Author summary: The key to understanding the mechanism of translation regulation and mRNA metabolism is to identify the cis-determinants of PAS on the DNA sequence. PAS leads to correct identification of poly(A) sites, which play an essential role in understanding human diseases. While many researchers have employed deep learning methods to improve the performance of PAS identification, an underlying problem is that PAS data collection is expensive and time-consuming, which makes it difficult to apply deep learning models to PAS identification across a broad range of species.

We use domain generalization methods, inspired by their success in computer vision, to overcome the shortage of annotated PAS data. Empirical results suggest that our proposed model, Poly(A)-DG, can extract species-invariant features from multiple training species and be applied directly to the target species without fine-tuning. Furthermore, Poly(A)-DG is a promising practical tool for PAS identification because its performance remains stable on insufficient or species-imbalanced training data. We share the implementation of our proposed model on GitHub (https://github.com/Szym29/PolyADG).
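The evaluation protocol described in the abstract (train on two of four species, evaluate on the remaining ones) can be illustrated with a minimal sketch. The species names below are hypothetical placeholders, since the abstract does not name the four species, and the helper `cross_species_splits` is an illustrative assumption rather than part of the authors' implementation. Whether the paper evaluates every possible pairing is also not stated here; the sketch simply enumerates all of them.

```python
from itertools import combinations

# Hypothetical placeholder names; the abstract does not name the four species.
SPECIES = ["species_A", "species_B", "species_C", "species_D"]


def cross_species_splits(species, n_train=2):
    """Yield (training_species, target_species) pairs.

    Each split trains on `n_train` species and holds out the rest,
    so no data from a target species is seen during training.
    """
    for train in combinations(species, n_train):
        targets = tuple(s for s in species if s not in train)
        yield train, targets


splits = list(cross_species_splits(SPECIES))
for train, targets in splits:
    print(f"train on {train}, evaluate on {targets}")
```

Keeping the target species entirely out of the training tuple mirrors the "without prior knowledge from target species" setting in the title: the model would be fit on the first tuple and scored on the second, with no fine-tuning in between.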

Suggested Citation

  • Yumin Zheng & Haohan Wang & Yang Zhang & Xin Gao & Eric P Xing & Min Xu, 2020. "Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species," PLOS Computational Biology, Public Library of Science, vol. 16(11), pages 1-21, November.
  • Handle: RePEc:plo:pcbi00:1008297
    DOI: 10.1371/journal.pcbi.1008297

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008297
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1008297&type=printable
    Download Restriction: no

