Printed from https://ideas.repec.org/a/spr/scient/v127y2022i5d10.1007_s11192-022-04314-9.html

Why was this cited? Explainable machine learning applied to COVID-19 research literature

Author

Listed:
  • Lucie Beranová

    (VSE Praha)

  • Marcin P. Joachimiak

    (Environmental Genomics and Systems Biology Division at Lawrence Berkeley National Laboratory)

  • Tomáš Kliegr

    (VSE Praha)

  • Gollam Rabby

    (VSE Praha)

  • Vilém Sklenák

    (VSE Praha)

Abstract

Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus, which contains research articles, mostly from biology and medicine, applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for the detection and semantic expansion of entities: ConceptNet, Pubtator, and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by “black-box” machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered that referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus receive few citations. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus, both species linked to high citation counts harbor coronaviruses that are more phylogenetically similar to human SARS viruses.
On the other hand, feline (FIPV, FCOV) and canine (CCOV) coronaviruses are in the alpha coronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include the detection of an apparent citation bias favouring authors with western-sounding names. TF-IDF weights and a binary word incidence matrix performed equally well, with the latter resulting in better interpretability. The best predictive performance was obtained with a “black-box” method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target.
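
The abstract ranks rules by their lift measure. As a rough illustration only, the sketch below computes support, confidence, and lift for a rule of the form "term present -> highly cited" over a binary word incidence representation. The toy rows, the `ARTICLES` name, and the `rule_metrics` helper are assumptions made for this example; they are not the authors' data or code, and the study's actual rule learners (CORELS, CBA) are not reproduced here.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical toy corpus: (terms detected in an article, highly cited or not).
# These rows are invented for illustration and are not taken from CORD-19.
Article = Tuple[FrozenSet[str], bool]

ARTICLES: List[Article] = [
    (frozenset({"dpp4", "camel"}), True),
    (frozenset({"dpp4", "bat"}), True),
    (frozenset({"bat"}), True),
    (frozenset({"feline"}), False),
    (frozenset({"canine"}), False),
    (frozenset({"feline", "bat"}), False),
    (frozenset({"camel"}), True),
    (frozenset({"canine", "feline"}), False),
]


def rule_metrics(
    articles: List[Article], antecedent: FrozenSet[str]
) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent -> highly cited."""
    n = len(articles)
    # Articles containing every term of the antecedent (binary incidence).
    n_ante = sum(1 for terms, _ in articles if antecedent <= terms)
    # Articles that are highly cited, regardless of their terms.
    n_cons = sum(1 for _, cited in articles if cited)
    # Articles that satisfy both sides of the rule.
    n_both = sum(1 for terms, cited in articles if cited and antecedent <= terms)
    if n_ante == 0 or n_cons == 0:
        raise ValueError("rule is undefined: antecedent or consequent never occurs")
    support = n_both / n
    confidence = n_both / n_ante
    # Lift compares the rule's confidence with the base rate of the consequent.
    lift = confidence / (n_cons / n)
    return support, confidence, lift


if __name__ == "__main__":
    for term in ("dpp4", "bat", "feline"):
        print(term, rule_metrics(ARTICLES, frozenset({term})))
```

A lift above 1 means the term co-occurs with high citation counts more often than the base rate would suggest, and a lift below 1 means less often. In this invented data the `dpp4` rule has lift 2.0 and the `feline` rule has lift 0.0, which only mirrors the direction of the patterns the abstract describes.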

Suggested Citation

  • Lucie Beranová & Marcin P. Joachimiak & Tomáš Kliegr & Gollam Rabby & Vilém Sklenák, 2022. "Why was this cited? Explainable machine learning applied to COVID-19 research literature," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(5), pages 2313-2349, May.
  • Handle: RePEc:spr:scient:v:127:y:2022:i:5:d:10.1007_s11192-022-04314-9
    DOI: 10.1007/s11192-022-04314-9

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-022-04314-9
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11192-022-04314-9?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    As the access to this document is restricted, you may want to search for a different version of it.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Peter Sjögårde & Fereshteh Didegah, 2022. "The association between topic growth and citation impact of research publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(4), pages 1903-1921, April.
    2. Paul Donner, 2021. "Validation of the Astro dataset clustering solutions with external data," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1619-1645, February.
    3. Carusi, Chiara & Bianchi, Giuseppe, 2019. "Scientific community detection via bipartite scholar/journal graph co-clustering," Journal of Informetrics, Elsevier, vol. 13(1), pages 354-386.
    4. Yan-Li Liu & Wen-Juan Yuan & Shao-Hong Zhu, 2022. "The state of social science research on COVID-19," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(1), pages 369-383, January.
    5. Shome, Samik & Hassan, M. Kabir & Verma, Sushma & Panigrahi, Tushar Ranjan, 2023. "Impact investment for sustainable development: A bibliometric analysis," International Review of Economics & Finance, Elsevier, vol. 84(C), pages 770-800.
    6. Sukrit Vinayavekhin & Feng Li & Aneesh Banerjee & Andrea Caputo, 2023. "The academic landscape of sustainability in management literature: Towards a more interdisciplinary research agenda," Business Strategy and the Environment, Wiley Blackwell, vol. 32(8), pages 5748-5784, December.
    7. Lu Liu & Benjamin F. Jones & Brian Uzzi & Dashun Wang, 2023. "Data, measurement and empirical methods in the science of science," Nature Human Behaviour, Nature, vol. 7(7), pages 1046-1058, July.
    8. Xu, Haiyun & Winnink, Jos & Yue, Zenghui & Zhang, Huiling & Pang, Hongshen, 2021. "Multidimensional Scientometric indicators for the detection of emerging research topics," Technological Forecasting and Social Change, Elsevier, vol. 163(C).
    9. Lutz Bornmann & Robin Haunschild & Sven E. Hug, 2018. "Visualizing the context of citations referencing papers published by Eugene Garfield: a new type of keyword co-occurrence analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(2), pages 427-437, February.
    10. Ananthan Nambiar & Tobias Rubel & James McCaull & Jon deVries & Mark Bedau, 2021. "Dropping diversity of products of large US firms: Models and measures," Papers 2110.08367, arXiv.org.
    11. Ma, Chao & Li, Yiwei & Guo, Feng & Si, Kao, 2019. "The citation trap: Papers published at year-end receive systematically fewer citations," Journal of Economic Behavior & Organization, Elsevier, vol. 166(C), pages 667-687.
    12. Luis Araya-Castillo & Felipe Hernández-Perlines & Hugo Moraga & Antonio Ariza-Montes, 2021. "Scientometric Analysis of Research on Socioemotional Wealth," Sustainability, MDPI, vol. 13(7), pages 1-26, March.
    13. Gao, Qiang & Liang, Zhentao & Wang, Ping & Hou, Jingrui & Chen, Xiuxiu & Liu, Manman, 2021. "Potential index: Revealing the future impact of research topics based on current knowledge networks," Journal of Informetrics, Elsevier, vol. 15(3).
    14. Takahiro Kawamura & Katsutaro Watanabe & Naoya Matsumoto & Shusaku Egami & Mari Jibu, 2018. "Funding map using paragraph embedding based on semantic diversity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 941-958, August.
    15. Serhat Burmaoglu & Ozcan Saritas, 2019. "An evolutionary analysis of the innovation policy domain: Is there a paradigm shift?," Scientometrics, Springer;Akadémiai Kiadó, vol. 118(3), pages 823-847, March.
    16. Bornmann, Lutz & Leydesdorff, Loet & Wang, Jian, 2014. "How to improve the prediction based on citation impact percentiles for years shortly after the publication date?," Journal of Informetrics, Elsevier, vol. 8(1), pages 175-180.
    17. Cai, Ya-Jun & Lo, Chris K.Y., 2020. "Omni-channel management in the new retailing era: A systematic review and future research agenda," International Journal of Production Economics, Elsevier, vol. 229(C).
    18. Rosa Lombardi & Raffaele Trequattrini & Federico Schimperna & Myriam Cano-Rubio, 2021. "The Impact of Smart Technologies on the Management and Strategic Control: A Structured Literature Review," MANAGEMENT CONTROL, FrancoAngeli Editore, vol. 2021(suppl. 1), pages 11-30.
    19. Jason Youn & Navneet Rai & Ilias Tagkopoulos, 2022. "Knowledge integration and decision support for accelerated discovery of antibiotic resistance genes," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    20. Kong, Ling & Wang, Dongbo, 2020. "Comparison of citations and attention of cover and non-cover papers," Journal of Informetrics, Elsevier, vol. 14(4).

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.