Abstract
Identifying novel and functional RNA structures remains a significant challenge in RNA motif design and is crucial for developing RNA-based therapeutics. Here we introduce a computational topology-based approach with unsupervised machine-learning algorithms to estimate the database size and content of RNA-like graph topologies. Specifically, we apply graph theory enumeration to generate all 110,667 possible 2D dual graphs for vertex numbers ranging from 2 to 9. Among them, only 0.11% (121 dual graphs) correspond to approximately 200,000 known RNA atomic fragments/substructures (collected in 2021) using the RNA-as-Graphs (RAG) framework. The remaining 99.89% of the dual graphs may be RNA-like or non-RNA-like. To determine which dual graphs in the 99.89% hypothetical set are more likely to be associated with RNA structures, we apply computational topology descriptors using the Persistent Spectral Graphs (PSG) method to characterize each graph using 19 PSG-based features and use clustering algorithms that partition all possible dual graphs into two clusters. The cluster with the higher percentage of known dual graphs for RNA is defined as the “RNA-like” cluster, while the other is considered “non-RNA-like”. The distance between each dual graph and the center of the RNA-like cluster represents the likelihood of it belonging to RNA structures. In validation, our PSG-based RNA-like cluster includes 97.3% of the 121 known RNA dual graphs, suggesting good performance. Furthermore, 46.017% of the hypothetical RNAs are predicted to be RNA-like. Among the top 15 graphs identified as high-likelihood candidates for novel RNA motifs, 4 were confirmed from the RNA dataset collected in 2022. Significantly, we observe that all the top 15 RNA-like dual graphs can be separated into multiple subgraphs, whereas the top 15 non-RNA-like dual graphs tend not to have any subgraphs (subgraphs preserve pseudoknots and junctions).
Moreover, a significant topological difference between top RNA-like and non-RNA-like graphs is evident when comparing their topological features (e.g., Betti-0 and Betti-1 numbers). These findings provide valuable insights into the size of the RNA motif universe and RNA design strategies, offering a novel framework for predicting RNA graph topologies and guiding the discovery of novel RNA motifs, and perhaps anti-viral therapeutics, by subgraph assembly.

Author summary: This work tackles a key question in RNA motif design: how large is the universe of RNA-like structures? To explore this, we develop a computational framework that uses graph theory and topological data analysis to estimate the size and content of the universe of RNA-like graph topologies. Specifically, we generate 110,667 possible 2D dual graph topologies. Among these, only 0.11% of the dual graphs correspond to approximately 200,000 known RNA atomic fragments/substructures. To evaluate the remaining 99.89% of the dual graphs that may or may not correspond to RNA structures, we use persistent spectral graph features and machine learning to partition all possible dual graphs into “RNA-like” and “non-RNA-like” clusters. Our method accurately identifies 97.3% of known RNA structures as RNA-like and predicts that 46.017% of the hypothetical RNAs could be potential RNA motifs. We also identify 15 high-likelihood candidates for novel RNA structures, four of which were confirmed in newly collected data in 2022. Importantly, we discover that all top RNA-like graphs tend to break down into smaller functional substructures that preserve pseudoknots and junctions. This framework opens new directions for rational RNA design and the discovery of RNA-based therapeutics.
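
The cluster-labeling step described in the abstract (two clusters, the one richer in known RNA graphs named RNA-like, distance to its center used as a likelihood proxy) can be illustrated with a minimal sketch. Several things here are assumptions rather than the authors' method: the abstract says only "clustering algorithms", so plain k-means with k=2 is a stand-in; the eight 2-D points are made-up toy data replacing the 19 PSG features; the farthest-point initialization and the function names `two_means` and `pick_rna_like` are hypothetical; and "higher percentage of known dual graphs" is read here as the fraction of a cluster's members that are known.

```python
import math

def two_means(points, iters=100):
    """Plain Lloyd's k-means with k=2 (an assumed stand-in clustering method)."""
    # Deterministic init: the first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: math.dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min((0, 1), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in (0, 1):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def pick_rna_like(points, known, labels, centers):
    """Name the cluster with the larger known fraction RNA-like; return it and distances."""
    frac = []
    for c in (0, 1):
        members = [i for i, l in enumerate(labels) if l == c]
        frac.append(sum(i in known for i in members) / len(members) if members else 0.0)
    rna = 0 if frac[0] >= frac[1] else 1
    # Smaller distance to the RNA-like center = more RNA-like under this reading.
    distances = [math.dist(p, centers[rna]) for p in points]
    return rna, distances

# Toy feature vectors (hypothetical); indices 0-2 play the role of known RNA graphs.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
known = {0, 1, 2}

labels, centers = two_means(points)
rna_cluster, distances = pick_rna_like(points, known, labels, centers)
ranking = sorted(range(len(points)), key=lambda i: distances[i])
```

Sorting by `distances` gives a candidate ranking analogous to the "top 15" lists mentioned in the abstract, though the real ranking depends on the actual PSG features and clustering choices.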
Suggested Citation
Rui Wang & Tamar Schlick, 2025.
"How large is the universe of RNA-like motifs? A clustering analysis of RNA graph motifs using topological descriptors,"
PLOS Computational Biology, Public Library of Science, vol. 21(7), pages 1-19, July.
Handle: RePEc:plo:pcbi00:1013230
DOI: 10.1371/journal.pcbi.1013230