Printed from https://ideas.repec.org/a/gam/jdataj/v8y2023i2p41-d1072576.html

VPAgs-Dataset4ML: A Dataset to Predict Viral Protective Antigens for Machine Learning-Based Reverse Vaccinology

Author

Listed:
  • Zakia Salod

    (Discipline of Public Health Medicine, University of KwaZulu-Natal, Durban 4051, South Africa)

  • Ozayr Mahomed

    (Discipline of Public Health Medicine, University of KwaZulu-Natal, Durban 4051, South Africa;
    Dasman Diabetes Institute, P.O. Box 1180, Dasman 15462, Kuwait City, Kuwait)

Abstract

Reverse vaccinology (RV) is a computer-aided approach to vaccine development that identifies a subset of pathogen proteins as protective antigens (PAgs), or potential vaccine candidates. Machine learning (ML)-based RV is promising, but it requires a dataset of PAgs (positives) and non-protective protein sequences (negatives). This study aimed to create an ML dataset, VPAgs-Dataset4ML, for predicting viral PAgs, based on PAgs obtained from Protegen. We performed seven steps to identify PAgs from the Protegen website and non-protective protein sequences from the Universal Protein Resource (UniProt). The seven steps included downloading viral PAgs from Protegen; performing quality checks on the PAgs using the standard BLASTp identity threshold of ≤30%, applied via MMseqs2; and computational steps, run on Google Colaboratory and the Ubuntu terminal, to retrieve non-protective protein sequences from UniProt as negatives and to apply quality checks to them similar to those applied to the PAgs. VPAgs-Dataset4ML contains 2145 viral protein sequences, with 210 PAgs in positive.fasta and 1935 non-protective protein sequences in negative.fasta. This dataset can be used to train ML models to predict antigens for various viral pathogens, with the aim of developing effective vaccines.
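
As a rough illustration of how the two files named in the abstract might be combined into a labelled training set, the following Python sketch parses FASTA-formatted text and assigns label 1 to sequences from positive.fasta and label 0 to sequences from negative.fasta. The function names (read_fasta, build_labelled_set, load_dataset) and the 1/0 label encoding are assumptions made for this example; they are not taken from the paper, and the authors' own tooling may differ.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_labelled_set(
    positive_lines: Iterable[str], negative_lines: Iterable[str]
) -> Tuple[List[str], List[int]]:
    """Return (sequences, labels); 1 = protective antigen, 0 = negative.

    The 1/0 encoding is an assumption of this sketch.
    """
    sequences: List[str] = []
    labels: List[int] = []
    for label, source in ((1, positive_lines), (0, negative_lines)):
        for _header, seq in read_fasta(source):
            sequences.append(seq)
            labels.append(label)
    return sequences, labels


def load_dataset(pos_path: str, neg_path: str) -> Tuple[List[str], List[int]]:
    """Read the two FASTA files from disk (paths supplied by the caller)."""
    with open(pos_path) as pos, open(neg_path) as neg:
        return build_labelled_set(pos, neg)
```

Calling load_dataset("positive.fasta", "negative.fasta") on local copies of the files would be expected to return 2145 sequences if the files match the counts stated in the abstract. Because positives (210) are far fewer than negatives (1935), anyone training a model on this data would probably need to account for the class imbalance.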

Suggested Citation

  • Zakia Salod & Ozayr Mahomed, 2023. "VPAgs-Dataset4ML: A Dataset to Predict Viral Protective Antigens for Machine Learning-Based Reverse Vaccinology," Data, MDPI, vol. 8(2), pages 1-12, February.
  • Handle: RePEc:gam:jdataj:v:8:y:2023:i:2:p:41-:d:1072576

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/8/2/41/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/8/2/41/
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Nikolett Orosz & Tünde Tóthné Tóth & Gyöngyi Vargáné Gyuró & Zsoltné Tibor Nábrádi & Klára Hegedűsné Sorosi & Zsuzsa Nagy & Éva Rigó & Ádám Kaposi & Gabriella Gömöri & Cornelia Melinda Adi Santoso & A, 2022. "Comparison of Length of Hospital Stay for Community-Acquired Infections Due to Enteric Pathogens, Influenza Viruses and Multidrug-Resistant Bacteria: A Cross-Sectional Study in Hungary," IJERPH, MDPI, vol. 19(23), pages 1-16, November.
    2. Mudassar Arsalan & Omar Mubin & Fady Alnajjar & Belal Alsinglawi, 2020. "COVID-19 Global Risk: Expectation vs. Reality," IJERPH, MDPI, vol. 17(15), pages 1-10, August.
    3. Prabal Das & D. A. Sachindra & Kironmala Chanda, 2022. "Machine Learning-Based Rainfall Forecasting with Multiple Non-Linear Feature Selection Algorithms," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 36(15), pages 6043-6071, December.
    4. Jie Zhao & Ji Chen & Damien Beillouin & Hans Lambers & Yadong Yang & Pete Smith & Zhaohai Zeng & Jørgen E. Olesen & Huadong Zang, 2022. "Global systematic review with meta-analysis reveals yield advantage of legume-based rotations and its drivers," Nature Communications, Nature, vol. 13(1), pages 1-9, December.
    5. Piaopiao Chen & Agnès H. Michel & Jianzhi Zhang, 2022. "Transposon insertional mutagenesis of diverse yeast strains suggests coordinated gene essentiality polymorphisms," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    6. Paulo Infante & Gonçalo Jacinto & Anabela Afonso & Leonor Rego & Pedro Nogueira & Marcelo Silva & Vitor Nogueira & José Saias & Paulo Quaresma & Daniel Santos & Patrícia Góis & Paulo Rebelo Manuel, 2023. "Factors That Influence the Type of Road Traffic Accidents: A Case Study in a District of Portugal," Sustainability, MDPI, vol. 15(3), pages 1-16, January.
    7. Ephrem Habyarimana & Faheem S Baloch, 2021. "Machine learning models based on remote and proximal sensing as potential methods for in-season biomass yields prediction in commercial sorghum fields," PLOS ONE, Public Library of Science, vol. 16(3), pages 1-23, March.
    8. Banks, Jonathan & Rabbani, Arif & Nadkarni, Kabir & Renaud, Evan, 2020. "Estimating parasitic loads related to brine production from a hot sedimentary aquifer geothermal project: A case study from the Clarke Lake gas field, British Columbia," Renewable Energy, Elsevier, vol. 153(C), pages 539-552.
    9. Ceddia, M.G. & Bardsley, N.O. & Goodwin, R. & Holloway, G.J. & Nocella, G. & Stasi, A., 2013. "A complex system perspective on the emergence and spread of infectious diseases: Integrating economic and ecological aspects," Ecological Economics, Elsevier, vol. 90(C), pages 124-131.
    10. John M Drake & Tobias S Brett & Shiyang Chen & Bogdan I Epureanu & Matthew J Ferrari & Éric Marty & Paige B Miller & Eamon B O’Dea & Suzanne M O’Regan & Andrew W Park & Pejman Rohani, 2019. "The statistics of epidemic transitions," PLOS Computational Biology, Public Library of Science, vol. 15(5), pages 1-14, May.
    11. Ongolo, Symphorien & Giessen, Lukas & Karsenty, Alain & Tchamba, Martin & Krott, Max, 2021. "Forestland policies and politics in Africa: Recent evidence and new challenges," Forest Policy and Economics, Elsevier, vol. 127(C).
    12. Alexander Wettstein & Gabriel Jenni & Ida Schneider & Fabienne Kühne & Martin grosse Holtforth & Roberto La Marca, 2023. "Predictors of Psychological Strain and Allostatic Load in Teachers: Examining the Long-Term Effects of Biopsychosocial Risk and Protective Factors Using a LASSO Regression Approach," IJERPH, MDPI, vol. 20(10), pages 1-20, May.
    13. Tang, Kayu & Parsons, David J. & Jude, Simon, 2019. "Comparison of automatic and guided learning for Bayesian networks to analyse pipe failures in the water distribution system," Reliability Engineering and System Safety, Elsevier, vol. 186(C), pages 24-36.
    14. Paige, Sarah B. & Malavé, Carly & Mbabazi, Edith & Mayer, Jonathan & Goldberg, Tony L., 2015. "Uncovering zoonoses awareness in an emerging disease ‘hotspot’," Social Science & Medicine, Elsevier, vol. 129(C), pages 78-86.
    15. Jianhua Wang & Guan-Zhu Han, 2023. "Genome mining shows that retroviruses are pervasively invading vertebrate genomes," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    16. Livia Marchetti & Valentina Cattivelli & Claudia Cocozza & Fabio Salbitano & Marco Marchetti, 2020. "Beyond Sustainability in Food Systems: Perspectives from Agroecology and Social Innovation," Sustainability, MDPI, vol. 12(18), pages 1-24, September.
    17. Daifeng Xiang & Gangsheng Wang & Jing Tian & Wanyu Li, 2023. "Global patterns and edaphic-climatic controls of soil carbon decomposition kinetics predicted from incubation experiments," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    18. Ivan Montiel & Junghoon Park & Bryan W. Husted & Andres Velez-Calle, 2022. "Tracing the connections between international business and communicable diseases," Journal of International Business Studies, Palgrave Macmillan;Academy of International Business, vol. 53(8), pages 1785-1804, October.
    19. Maxwell B Joseph & William E Stutz & Pieter T J Johnson, 2016. "Multilevel Models for the Distribution of Hosts and Symbionts," PLOS ONE, Public Library of Science, vol. 11(11), pages 1-15, November.
    20. Bellotti, Anthony & Brigo, Damiano & Gambetti, Paolo & Vrins, Frédéric, 2021. "Forecasting recovery rates on non-performing loans with machine learning," International Journal of Forecasting, Elsevier, vol. 37(1), pages 428-444.
