Author
Listed:
- Yun Zuo
- Xingze Fang
- Jiayong Wan
- Wenying He
- Xiangrong Liu
- Xiangxiang Zeng
- Zhaohong Deng
Abstract
The translated protein undergoes a specific modification process, which involves the formation of covalent bonds on lysine residues and the attachment of small chemical moieties. The protein’s fundamental physicochemical properties undergo a significant alteration. The change significantly alters the proteins’ 3D structure and activity, enabling them to modulate key physiological processes. The modulation encompasses inhibiting cancer cell growth, delaying ovarian aging, regulating metabolic diseases, and ameliorating depression. Consequently, the identification and comprehension of post-translational lysine modifications hold substantial value in the realms of biological research and drug development. Post-translational modifications (PTMs) at lysine (K) sites are among the most common protein modifications. However, research on K-PTMs has been largely centered on identifying individual modification types, with a relative scarcity of balanced data analysis techniques. In this study, a classification system is developed for the prediction of concurrent multiple modifications at a single lysine residue. Initially, a well-established multi-label position-specific triad amino acid propensity algorithm is utilized for feature encoding. Subsequently, in PreMLS, a novel ClusterCentroids undersampling algorithm based on MiniBatchKMeans was introduced to eliminate redundant or similar majority-class samples, thereby mitigating the issue of class imbalance. A convolutional neural network architecture was specifically constructed for the analysis of biological sequences to predict multiple lysine modification sites. The model, evaluated through five-fold cross-validation and independent testing, was found to significantly outperform existing models such as iMul-kSite and predML-Site. The results presented here aid in prioritizing potential lysine modification sites, facilitating subsequent biological assays and advancing pharmaceutical research.
To enhance accessibility, an open-access predictive script has been crafted for the multi-label predictive model developed in this study.
Author summary
Proteins undergo a variety of post-translational modifications (PTMs) after synthesis, such as lysine modifications, which significantly influence their structure and function. These modifications of lysine are known to regulate physiological processes, including the inhibition of cancer cell growth, the delay of aging, the regulation of metabolic diseases, and the improvement of depressive disorders. Abnormal modifications are closely associated with the occurrence and progression of a multitude of diseases. Therefore, the identification and comprehension of these modifications are of paramount importance for biological research and drug development. A multitude of studies have focused on a single type of lysine modification, with prediction methods for multiple lysine modification sites being relatively scarce. In this research, a novel multi-label prediction model named PreMLS has been developed for the simultaneous identification of four lysine modifications: methylation, acetylation, crotonylation, and succinylation. The imbalance issue in the dataset was addressed utilizing the ClusterCentroids undersampling algorithm, following which a predictive model, PreMLS, was constructed using a CNN to forecast multiple lysine modification sites. Compared to existing models, this new approach has significantly enhanced the accuracy and reliability of the predictions.
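The cluster-centroid undersampling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in plain Lloyd's k-means from the standard library in place of MiniBatchKMeans, and the function names (`kmeans_centroids`, `undersample_majority`) and the toy 2-D feature vectors are assumptions made only for illustration. The majority class is replaced by as many cluster centroids as there are minority samples, so the two classes end up the same size.

```python
import random

def kmeans_centroids(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (stand-in for MiniBatchKMeans); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def undersample_majority(majority, minority, seed=0):
    """Hypothetical helper: shrink the majority class to len(minority) centroids."""
    k = len(minority)
    if len(majority) <= k:
        return list(majority), list(minority)
    return kmeans_centroids(majority, k, seed=seed), list(minority)

# Toy feature vectors (assumed data, not taken from the paper).
majority = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [9.0, 0.0], [9.1, 0.0], [9.0, 0.1], [9.1, 0.1],
]
minority = [[1.0, 1.0], [4.0, 4.0], [8.0, 1.0]]

balanced_major, balanced_minor = undersample_majority(majority, minority)
print(len(balanced_major), len(balanced_minor))
```

In practice, a library implementation such as the `ClusterCentroids` sampler in imbalanced-learn, configured with a scikit-learn `MiniBatchKMeans` estimator, would likely be closer to what the abstract describes; the sketch above only shows the underlying idea of replacing redundant majority samples with cluster centroids.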
Suggested Citation
Yun Zuo & Xingze Fang & Jiayong Wan & Wenying He & Xiangrong Liu & Xiangxiang Zeng & Zhaohong Deng, 2024.
"PreMLS: The undersampling technique based on ClusterCentroids to predict multiple lysine sites,"
PLOS Computational Biology, Public Library of Science, vol. 20(10), pages 1-21, October.
Handle:
RePEc:plo:pcbi00:1012544
DOI: 10.1371/journal.pcbi.1012544