Author
Listed:
- Yansong Wang
- Zilong Hou
- Yuning Yang
- Ka-chun Wong
- Xiangtao Li
Abstract
Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene’s transcriptional activity, and they play a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel stacked multivariate fusion framework called SMFM, which enables comprehensive identification, analysis, and interpretation of enhancers from regulatory DNA sequences. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning-based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods on several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results.
In addition, we conducted an interpretability analysis of the identified enhancer sequences based on the attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework.
Author summary: Numerous lines of evidence suggest that genes regulated by enhancers located in non-coding DNA regions are involved in a myriad of biological activities. To fully understand the regulatory role and mechanisms of enhancers on genes, the localization and identification of enhancers are essential. Several experimental biological methods are capable of localizing enhancers; however, these methods are resource intensive. To address this limitation, we developed a stacked multivariate fusion framework, called SMFM, to identify and analyze enhancers with high accuracy and efficiency based on enhancer-specific dynamic semantic information and multi-source biological properties. The performance of the model is verified by experiments comparing different feature algorithms and classification algorithms. The superiority of our method is demonstrated by comparing it with several state-of-the-art algorithms. In addition, several analytical experiments demonstrate that SMFM is capable of recognizing enhancers in different tissues and detecting motifs in enhancers. To the best of our knowledge, this is the first computational approach that uses enhancer-specific dynamic semantic information to identify enhancers from regulatory DNA sequences and interpret them. It is expected that the SMFM model will effectively target enhancers and provide valid candidates for further biological experiments.
Suggested Citation
Yansong Wang & Zilong Hou & Yuning Yang & Ka-chun Wong & Xiangtao Li, 2022.
"Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework,"
PLOS Computational Biology, Public Library of Science, vol. 18(12), pages 1-33, December.
Handle: RePEc:plo:pcbi00:1010779
DOI: 10.1371/journal.pcbi.1010779