A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art

Author

Listed:
  • Alicia Lara-Clares
  • Juan J Lastra-Díaz
  • Ana Garcia-Serrano

Abstract

This registered report introduces the largest and, for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time a previously unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to address the lack of software and data reproducibility resources for methods and experiments in this line of research. Our reproducible experimental survey is based on a single software platform, which is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods, and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure establishes the new state of the art in sentence similarity analysis in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool for ontology-based methods, have a very significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and highlight the need to refine the current benchmarks.
Finally, a notable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning (ML) models evaluated herein.
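
For readers unfamiliar with string-based similarity, the following is a minimal illustrative sketch and not the authors' LiBlock method, whose definition is not given in this abstract. It assumes a generic block-distance (L1) similarity over token counts with one common normalization, and a hypothetical `lowercase` switch standing in for a pre-processing stage. The function names `tokenize` and `block_similarity` are made up for this example.

```python
import re
from collections import Counter


def tokenize(sentence: str, lowercase: bool = True) -> list[str]:
    """Split a sentence into alphanumeric tokens.

    Lowercasing is an optional pre-processing step (an assumption made
    for this sketch, not a setting taken from the paper).
    """
    if lowercase:
        sentence = sentence.lower()
    return re.findall(r"[A-Za-z0-9]+", sentence)


def block_similarity(s1: str, s2: str, lowercase: bool = True) -> float:
    """Similarity in [0, 1] derived from the block (L1) distance
    between the token-count vectors of two sentences."""
    c1 = Counter(tokenize(s1, lowercase))
    c2 = Counter(tokenize(s2, lowercase))
    total = sum(c1.values()) + sum(c2.values())
    if total == 0:
        # Two empty sentences are treated as identical.
        return 1.0
    # Counter returns 0 for missing tokens, so this is the L1 distance.
    distance = sum(abs(c1[t] - c2[t]) for t in c1.keys() | c2.keys())
    return 1.0 - distance / total


if __name__ == "__main__":
    a = "BRCA1 binds DNA"
    b = "brca1 binds dna"
    print(block_similarity(a, b, lowercase=True))
    print(block_similarity(a, b, lowercase=False))
```

Comparing the two calls at the bottom shows how a single pre-processing choice can change the score for the same sentence pair, which is the kind of effect the abstract reports at a much larger scale.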

Suggested Citation

  • Alicia Lara-Clares & Juan J Lastra-Díaz & Ana Garcia-Serrano, 2022. "A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art," PLOS ONE, Public Library of Science, vol. 17(11), pages 1-44, November.
  • Handle: RePEc:plo:pone00:0276539
    DOI: 10.1371/journal.pone.0276539

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276539
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0276539&type=printable
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Alicia Lara-Clares & Juan J Lastra-Díaz & Ana Garcia-Serrano, 2021. "Protocol for a reproducible experimental survey on biomedical sentence similarity," PLOS ONE, Public Library of Science, vol. 16(3), pages 1-28, March.
    2. Peter Sjögårde & Fereshteh Didegah, 2022. "The association between topic growth and citation impact of research publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(4), pages 1903-1921, April.
    3. Paul Donner, 2021. "Validation of the Astro dataset clustering solutions with external data," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1619-1645, February.
    4. Lin Zhang & Beibei Sun & Fei Shu & Ying Huang, 2022. "Comparing paper level classifications across different methods and systems: an investigation of Nature publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 7633-7651, December.
    5. Mike Thelwall & Stephen Pinfield, 2024. "The accuracy of field classifications for journals in Scopus," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(2), pages 1097-1117, February.
    6. Manuel A. Vázquez & Jorge Pereira-Delgado & Jesús Cid-Sueiro & Jerónimo Arenas-García, 2022. "Validation of scientific topic models using graph analysis and corpus metadata," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5441-5458, September.
    7. Lovro Šubelj & Nees Jan van Eck & Ludo Waltman, 2016. "Clustering Scientific Publications Based on Citation Relations: A Systematic Comparison of Different Methods," PLOS ONE, Public Library of Science, vol. 11(4), pages 1-23, April.
    8. Ballester, Omar & Penner, Orion, 2022. "Robustness, replicability and scalability in topic modelling," Journal of Informetrics, Elsevier, vol. 16(1).
    9. Milad Dehghani & Ki Joon Kim, 2019. "Past and Present Research on Wearable Technologies: Bibliometric and Cluster Analyses of Published Research from 2000 to 2016," International Journal of Innovation and Technology Management (IJITM), World Scientific Publishing Co. Pte. Ltd., vol. 16(01), pages 1-21, February.
    10. Juste Raimbault, 2019. "Exploration of an interdisciplinary scientific landscape," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(2), pages 617-641, May.
    11. Renchu Guan & Chen Yang & Maurizio Marchese & Yanchun Liang & Xiaohu Shi, 2014. "Full Text Clustering and Relationship Network Analysis of Biomedical Publications," PLOS ONE, Public Library of Science, vol. 9(9), pages 1-9, September.
    12. repec:plo:pone00:0104244 is not listed on IDEAS
    13. Xuejian Huang & Zhibin Wu & Gensheng Wang & Zhipeng Li & Yuansheng Luo & Xiaofang Wu, 2024. "ResGAT: an improved graph neural network based on multi-head attention mechanism and residual network for paper classification," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(2), pages 1015-1036, February.
    14. Francesco Giovanni Avallone & Alberto Quagli & Paola Ramassa, 2022. "Interdisciplinary research by accounting scholars: An exploratory study," FINANCIAL REPORTING, FrancoAngeli Editore, vol. 2022(2), pages 5-34.
    15. Michael Rennings & Philipp Baaden & Carolin Block & Marcus John & Stefanie Bröring, 2024. "Assessing emerging sustainability-oriented technologies: the case of precision agriculture," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(6), pages 2969-2998, June.
    16. Yun, Jinhyuk, 2022. "Generalization of bibliographic coupling and co-citation using the node split network," Journal of Informetrics, Elsevier, vol. 16(2).
    17. Rey-Long Liu, 2015. "Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles," PLOS ONE, Public Library of Science, vol. 10(10), pages 1-22, October.
    18. Hanwen Xu & Addie Woicik & Hoifung Poon & Russ B. Altman & Sheng Wang, 2023. "Multilingual translation for zero-shot biomedical classification using BioTranslator," Nature Communications, Nature, vol. 14(1), pages 1-13, December.
    19. Xu, Shuo & Hao, Liyuan & An, Xin & Yang, Guancan & Wang, Feifei, 2019. "Emerging research topics detection with multiple machine learning models," Journal of Informetrics, Elsevier, vol. 13(4).
    20. Sitaram Devarakonda & Dmitriy Korobskiy & Tandy Warnow & George Chacko, 2020. "Viewing computer science through citation analysis: Salton and Bergmark Redux," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(1), pages 271-287, October.
    21. Urdiales, Cristina & Guzmán, Eduardo, 2024. "An automatic and association-based procedure for hierarchical publication subject categorization," Journal of Informetrics, Elsevier, vol. 18(1).
