
Journal article classification using abstracts: a comparison of classical and transformer-based machine learning methods

Authors

  • Cristina Arhiliuc (University of Antwerp)
  • Raf Guns (University of Antwerp)
  • Walter Daelemans (University of Antwerp)
  • Tim C. E. Engels (University of Antwerp)

Abstract

In this article we analyze the performance of existing models that classify journal articles into disciplines from a predefined classification scheme (i.e., supervised learning), based on their abstracts. The first part analyzes scenarios with ample labeled data, comparing the performance of the Support Vector Machine algorithm (SVM) combined with TF-IDF and with SPECTER embeddings (Cohan et al. SPECTER: Document-level representation learning using citation-informed transformers, https://doi.org/10.48550/arXiv.2004.07180 , 2020) and Bidirectional Encoder Representations from Transformers (BERT) models. The second part employs the Generative Pre-trained Transformer model 3.5 turbo (GPT-3.5-turbo) for zero- and few-shot learning situations. Using GPT-3.5-turbo, we examine how different characterizations of disciplines (such as names, descriptions, and examples) affect the model's ability to classify articles. The data set comprises journal articles published in 2022 and indexed in the Web of Science, with subject categories aligned to a modified version of the OECD Fields of Research and Development (FoRD) classification scheme. We find that BERT models surpass the SVM + TF-IDF baseline and SVM + SPECTER in all areas. For all disciplinary areas except Humanities, we observe minimal variation among the models fine-tuned on larger datasets, and greater variability with smaller training datasets. The GPT-3.5-turbo results show significant fluctuations across disciplines, influenced by the clarity of their definition and their distinctiveness as research topics compared to other fields. Although the two approaches are not directly comparable, we conclude that the classification models show promising results in their specific scenarios, with variations across disciplines.
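The supervised baseline described above represents each abstract as a TF-IDF vector before classification. As a rough illustration of that pipeline, the sketch below builds TF-IDF vectors in pure Python; to stay dependency-free it swaps the paper's SVM for a simple nearest-centroid classifier, so it illustrates only the representation step, not the actual model. All abstracts, labels, and function names here are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors (as sparse dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    du = math.sqrt(sum(w * w for w in u.values()))
    dv = math.sqrt(sum(w * w for w in v.values()))
    return num / (du * dv) if du and dv else 0.0

def centroid(vecs):
    """Mean of a list of sparse vectors."""
    c = Counter()
    for v in vecs:
        for t, w in v.items():
            c[t] += w / len(vecs)
    return dict(c)

# Toy labeled "abstracts" (hypothetical data, two coarse disciplinary areas):
train = [
    ("the enzyme catalyses protein folding in the cell", "Natural Sciences"),
    ("gene expression in bacterial cells and protein synthesis", "Natural Sciences"),
    ("survey data on voting behaviour and social attitudes", "Social Sciences"),
    ("interviews about social identity and political participation", "Social Sciences"),
]
vecs, idf = tfidf_vectors([text.split() for text, _ in train])
labels = [label for _, label in train]
centroids = {lab: centroid([v for v, l in zip(vecs, labels) if l == lab])
             for lab in set(labels)}

def classify(abstract):
    """Assign the discipline whose TF-IDF centroid is most similar."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(abstract.split()).items()}
    return max(centroids, key=lambda lab: cosine(q, centroids[lab]))

print(classify("protein folding in the cell"))   # -> Natural Sciences
```

In the study itself the TF-IDF vectors feed an SVM, which learns a discriminative boundary rather than comparing against class centroids; the representation step, however, is the same in spirit.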

Suggested Citation

  • Cristina Arhiliuc & Raf Guns & Walter Daelemans & Tim C. E. Engels, 2025. "Journal article classification using abstracts: a comparison of classical and transformer-based machine learning methods," Scientometrics, Springer;Akadémiai Kiadó, vol. 130(1), pages 313-342, January.
  • Handle: RePEc:spr:scient:v:130:y:2025:i:1:d:10.1007_s11192-024-05217-7
    DOI: 10.1007/s11192-024-05217-7

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-024-05217-7
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11192-024-05217-7?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    As the access to this document is restricted, you may want to search for a different version of it.

    References listed on IDEAS

    1. Wang, Qi & Waltman, Ludo, 2016. "Large-scale analysis of the accuracy of the journal classification systems of Web of Science and Scopus," Journal of Informetrics, Elsevier, vol. 10(2), pages 347-364.
    2. Xiaoming Huang & Peihu Zhu & Yuwen Chen & Jian Ma, 2023. "A transfer learning approach to interdisciplinary document classification with keyword-based explanation," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(12), pages 6449-6469, December.
    3. Lin Zhang & Beibei Sun & Zaida Chinchilla-Rodríguez & Lixin Chen & Ying Huang, 2018. "Interdisciplinarity and collaboration: on the relationship between disciplinary diversity in departmental affiliations and reference lists," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 271-291, October.
4. Ludo Waltman & Nees Jan van Eck, 2012. "A new methodology for constructing a publication-level classification system of science," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 63(12), pages 2378-2392, December.
    5. Richard Klavans & Kevin W. Boyack, 2017. "Which Type of Citation Analysis Generates the Most Accurate Taxonomy of Scientific and Technical Knowledge?," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 68(4), pages 984-998, April.
    6. Raf Guns & Linda Sīle & Joshua Eykens & Frederik T. Verleysen & Tim C. E. Engels, 2018. "A comparison of cognitive and organizational classification of publications in the social sciences and humanities," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1093-1111, August.
    7. Small, Henry & Boyack, Kevin W. & Klavans, Richard, 2014. "Identifying emerging topics in science and technology," Research Policy, Elsevier, vol. 43(8), pages 1450-1467.
    8. Abramo, Giovanni & D’Angelo, Ciriaco Andrea & Zhang, Lin, 2018. "A comparison of two approaches for measuring interdisciplinary research output: The disciplinary diversity of authors vs the disciplinary diversity of the reference list," Journal of Informetrics, Elsevier, vol. 12(4), pages 1182-1193.
    9. Cristina Arhiliuc & Raf Guns, 2023. "Disciplinary collaboration rates in the social sciences and humanities: what is the influence of classification type?," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(6), pages 3419-3436, June.
    10. Baccini, Federica & Barabesi, Lucio & Baccini, Alberto & Khelfaoui, Mahdi & Gingras, Yves, 2022. "Similarity network fusion for scholarly journals," Journal of Informetrics, Elsevier, vol. 16(1).
    11. Yeow Chong Goh & Xin Qing Cai & Walter Theseira & Giovanni Ko & Khiam Aik Khor, 2020. "Evaluating human versus machine learning performance in classifying research abstracts," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1197-1212, November.
    12. Urdiales, Cristina & Guzmán, Eduardo, 2024. "An automatic and association-based procedure for hierarchical publication subject categorization," Journal of Informetrics, Elsevier, vol. 18(1).
    13. Shu, Fei & Julien, Charles-Antoine & Zhang, Lin & Qiu, Junping & Zhang, Jing & Larivière, Vincent, 2019. "Comparing journal and paper level classifications of science," Journal of Informetrics, Elsevier, vol. 13(1), pages 202-225.
    14. Gerson Pech & Catarina Delgado & Silvio Paolo Sorella, 2022. "Classifying papers into subfields using Abstracts, Titles, Keywords and KeyWords Plus through pattern detection and optimization procedures: An application in Physics," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 73(11), pages 1513-1528, November.
    15. W. Glänzel & A. Schubert & H. -J. Czerwon, 1999. "An item-by-item subject classification of papers published in multidisciplinary and general journals using reference analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 44(3), pages 427-439, March.
    16. Lutz Bornmann, 2018. "Field classification of publications in Dimensions: a first case study testing its reliability and validity," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 637-640, October.
    17. Fei Shu & Yue Ma & Junping Qiu & Vincent Larivière, 2020. "Classifications of science and their effects on bibliometric evaluations," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2727-2744, December.
    Full references (including those not matched with items on IDEAS)

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lin Zhang & Beibei Sun & Fei Shu & Ying Huang, 2022. "Comparing paper level classifications across different methods and systems: an investigation of Nature publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 7633-7651, December.
    2. Gabriele Sampagnaro, 2023. "Keyword occurrences and journal specialization," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(10), pages 5629-5645, October.
    3. Gerson Pech & Catarina Delgado & Silvio Paolo Sorella, 2022. "Classifying papers into subfields using Abstracts, Titles, Keywords and KeyWords Plus through pattern detection and optimization procedures: An application in Physics," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 73(11), pages 1513-1528, November.
    4. Alfonso Ávila-Robinson & Cristian Mejia & Shintaro Sengoku, 2021. "Are bibliometric measures consistent with scientists’ perceptions? The case of interdisciplinarity in research," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(9), pages 7477-7502, September.
    5. Jiandong Zhang & Zhesi Shen, 2024. "Analyzing journal category assignment using a paper-level classification system: multidisciplinary sciences journals," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(10), pages 5963-5978, October.
    6. Michael Gusenbauer, 2022. "Search where you will find most: Comparing the disciplinary coverage of 56 bibliographic databases," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(5), pages 2683-2745, May.
    7. Sjögårde, Peter & Ahlgren, Per, 2018. "Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics," Journal of Informetrics, Elsevier, vol. 12(1), pages 133-152.
    8. Sitaram Devarakonda & Dmitriy Korobskiy & Tandy Warnow & George Chacko, 2020. "Viewing computer science through citation analysis: Salton and Bergmark Redux," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(1), pages 271-287, October.
    9. Ugo Moschini & Elena Fenialdi & Cinzia Daraio & Giancarlo Ruocco & Elisa Molinari, 2020. "A comparison of three multidisciplinarity indices based on the diversity of Scopus subject areas of authors’ documents, their bibliography and their citing papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1145-1158, November.
    10. Maryam Nakhoda & Peter Whigham & Sander Zwanenburg, 2023. "Quantifying and addressing uncertainty in the measurement of interdisciplinarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(11), pages 6107-6127, November.
    11. Li, Menghui & Yang, Liying & Zhang, Huina & Shen, Zhesi & Wu, Chensheng & Wu, Jinshan, 2017. "Do mathematicians, economists and biomedical scientists trace large topics more strongly than physicists?," Journal of Informetrics, Elsevier, vol. 11(2), pages 598-607.
    12. Bornmann, Lutz & Haunschild, Robin, 2022. "Empirical analysis of recent temporal dynamics of research fields: Annual publications in chemistry and related areas as an example," Journal of Informetrics, Elsevier, vol. 16(2).
    13. Matthias Held & Grit Laudel & Jochen Gläser, 2021. "Challenges to the validity of topic reconstruction," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 4511-4536, May.
    14. Waltman, Ludo, 2016. "A review of the literature on citation impact indicators," Journal of Informetrics, Elsevier, vol. 10(2), pages 365-391.
    15. Nees Jan Eck & Ludo Waltman, 2017. "Citation-based clustering of publications using CitNetExplorer and VOSviewer," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1053-1070, May.
    16. Haunschild, Robin & Daniels, Angela D. & Bornmann, Lutz, 2022. "Scores of a specific field-normalized indicator calculated with different approaches of field-categorization: Are the scores different or similar?," Journal of Informetrics, Elsevier, vol. 16(1).
    17. Fang Han & Christopher L. Magee, 2018. "Testing the science/technology relationship by analysis of patent citations of scientific papers after decomposition of both science and technology," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 767-796, August.
    18. Abramo, Giovanni & D’Angelo, Ciriaco Andrea & Zhang, Lin, 2018. "A comparison of two approaches for measuring interdisciplinary research output: The disciplinary diversity of authors vs the disciplinary diversity of the reference list," Journal of Informetrics, Elsevier, vol. 12(4), pages 1182-1193.
    19. Xu, Haiyun & Winnink, Jos & Yue, Zenghui & Zhang, Huiling & Pang, Hongshen, 2021. "Multidimensional Scientometric indicators for the detection of emerging research topics," Technological Forecasting and Social Change, Elsevier, vol. 163(C).
    20. Fei Shu & Yue Ma & Junping Qiu & Vincent Larivière, 2020. "Classifications of science and their effects on bibliometric evaluations," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2727-2744, December.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:spr:scient:v:130:y:2025:i:1:d:10.1007_s11192-024-05217-7. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to register here. This allows you to link your profile to this item and to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Sonal Shukla or Springer Nature Abstracting and Indexing (email available below). General contact details of provider: http://www.springer.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.