Printed from https://ideas.repec.org/a/spr/scient/v125y2020i2d10.1007_s11192-020-03614-2.html

Evaluating human versus machine learning performance in classifying research abstracts

Author

Listed:
  • Yeow Chong Goh

    (Nanyang Technological University)

  • Xin Qing Cai

    (Nanyang Technological University)

  • Walter Theseira

    (Singapore University of Social Sciences)

  • Giovanni Ko

    (Singapore Management University)

  • Khiam Aik Khor

    (Nanyang Technological University)

Abstract

We study whether humans or machine learning (ML) classification models are better at classifying scientific research abstracts according to a fixed set of discipline groups. We recruit undergraduate and postgraduate assistants for this task in separate stages, and compare their performance against a support vector machine (SVM) algorithm at assigning European Research Council Starting Grant project abstracts to their actual evaluation panels, which are organised by discipline groups. On average, ML is more accurate than human classifiers across a variety of training and test datasets and across evaluation panels. ML classifiers trained on different training sets are also more reliable than human classifiers: different ML classifiers assign the same classification to any given abstract more consistently than different human classifiers do. While the top five percent of human classifiers can outperform ML in limited cases, selecting and training such classifiers is likely costly and difficult compared to training ML models. Our results suggest ML models are a cost-effective and highly accurate method for addressing problems in comparative bibliometric analysis, such as harmonising the discipline classifications of research from different funding agencies or countries.
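The two quantities the abstract compares, accuracy against the actual evaluation panel and consistency between classifiers, can be illustrated with a minimal sketch. The abstract does not say how reliability was measured, so mean pairwise percent agreement is used here purely as an assumption. The function names, the panel labels `A`, `B`, `C`, and all label lists are hypothetical toy data, not results from the paper.

```python
from itertools import combinations


def accuracy(predicted, actual):
    """Share of abstracts whose predicted panel equals the actual panel."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def mean_pairwise_agreement(label_sets):
    """Average share of identical labels over every pair of classifiers.

    Assumption: this simple percent-agreement measure stands in for
    whatever reliability statistic the paper actually reports.
    """
    pairs = list(combinations(label_sets, 2))
    if not pairs:
        raise ValueError("need at least two classifiers")
    return sum(accuracy(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: six abstracts, three placeholder panels.
actual = ["A", "B", "C", "B", "A", "C"]

# Three ML models, imagined as trained on different training sets.
ml_runs = [
    ["A", "B", "C", "B", "A", "B"],
    ["A", "B", "C", "B", "A", "C"],
    ["A", "B", "C", "B", "B", "C"],
]

# Three human classifiers labelling the same abstracts.
humans = [
    ["A", "C", "C", "B", "A", "B"],
    ["B", "B", "C", "A", "A", "C"],
    ["A", "B", "B", "B", "C", "C"],
]

for name, group in (("ML", ml_runs), ("human", humans)):
    mean_acc = sum(accuracy(labels, actual) for labels in group) / len(group)
    agreement = mean_pairwise_agreement(group)
    print(f"{name}: mean accuracy {mean_acc:.2f}, pairwise agreement {agreement:.2f}")
```

In this toy data the three human classifiers are equally accurate as individuals, yet they agree with one another on only a third of the abstracts. That is why accuracy and reliability are worth reporting separately.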

Suggested Citation

  • Yeow Chong Goh & Xin Qing Cai & Walter Theseira & Giovanni Ko & Khiam Aik Khor, 2020. "Evaluating human versus machine learning performance in classifying research abstracts," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1197-1212, November.
  • Handle: RePEc:spr:scient:v:125:y:2020:i:2:d:10.1007_s11192-020-03614-2
    DOI: 10.1007/s11192-020-03614-2

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-020-03614-2
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11192-020-03614-2?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item



    Citations

    Cited by:

    1. A. M. Soehartono & L. G. Yu & K. A. Khor, 2022. "Essential signals in publication trends and collaboration patterns in global Research Integrity and Research Ethics (RIRE)," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 7487-7497, December.
    2. Guo Chen & Jing Chen & Yu Shao & Lu Xiao, 2023. "Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(2), pages 1187-1204, February.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Ying Huang & Wolfgang Glänzel & Lin Zhang, 2021. "Tracing the development of mapping knowledge domains," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 6201-6224, July.
    2. Dejian Yu & Wanru Wang & Shuai Zhang & Wenyu Zhang & Rongyu Liu, 2017. "Hybrid self-optimized clustering model based on citation links and textual features to detect research topics," PLOS ONE, Public Library of Science, vol. 12(10), pages 1-21, October.
    3. Ding, Ying, 2011. "Community detection: Topological vs. topical," Journal of Informetrics, Elsevier, vol. 5(4), pages 498-514.
    4. Xie, Qing & Zhang, Xinyuan & Song, Min, 2021. "A network embedding-based scholar assessment indicator considering four facets: Research topic, author credit allocation, field-normalized journal impact, and published time," Journal of Informetrics, Elsevier, vol. 15(4).
    5. Rey-Long Liu, 2015. "Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles," PLOS ONE, Public Library of Science, vol. 10(10), pages 1-22, October.
    6. Mora, Luca & Deakin, Mark & Reid, Alasdair, 2019. "Combining co-citation clustering and text-based analysis to reveal the main development paths of smart cities," Technological Forecasting and Social Change, Elsevier, vol. 142(C), pages 56-69.
    7. Wullum Nielsen, Mathias & Börjeson, Love, 2019. "Gender diversity in the management field: Does it matter for research outcomes?," Research Policy, Elsevier, vol. 48(7), pages 1617-1632.
    8. Hyeonchae Yang & Woo-Sung Jung, 2015. "A strategic management approach for Korean public research institutes based on bibliometric investigation," Quality & Quantity: International Journal of Methodology, Springer, vol. 49(4), pages 1437-1464, July.
    9. Maruša Premru & Matej Černe & Saša Batistič, 2022. "The Road to the Future: A Multi-Technique Bibliometric Review and Development Projections of the Leader–Member Exchange (LMX) Research," SAGE Open, vol. 12(2), pages 21582440221, May.
    10. Trappey, Amy J.C. & Wei, Ann Y.E. & Chen, Neil K.T. & Li, Kuo-An & Hung, L.P. & Trappey, Charles V., 2023. "Patent landscape and key technology interaction roadmap using graph convolutional network – Case of mobile communication technologies beyond 5G," Journal of Informetrics, Elsevier, vol. 17(1).
    11. Kamal Sanguri & Atanu Bhuyan & Sabyasachi Patra, 2020. "A semantic similarity adjusted document co-citation analysis: a case of tourism supply chain," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(1), pages 233-269, October.
    12. Bonaccorsi, Andrea & Belingheri, Paola & Secondi, Luca, 2021. "The research productivity of universities. A multilevel and multidisciplinary analysis on European institutions," Journal of Informetrics, Elsevier, vol. 15(2).
    13. David Chavalarias & Quentin Lobbé & Alexandre Delanoë, 2022. "Draw me Science," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(1), pages 545-575, January.
    14. David Chavalarias & Jean-Philippe Cointet, 2008. "Bottom-up scientific field detection for dynamical and hierarchical science mapping, methodology and case study," Scientometrics, Springer;Akadémiai Kiadó, vol. 75(1), pages 37-50, April.
    15. McLevey, John & McIlroy-Young, Reid, 2017. "Introducing metaknowledge: Software for computational research in information science, network analysis, and science of science," Journal of Informetrics, Elsevier, vol. 11(1), pages 176-197.
    16. Guan-Can Yang & Gang Li & Chun-Ya Li & Yun-Hua Zhao & Jing Zhang & Tong Liu & Dar-Zen Chen & Mu-Hsuan Huang, 2015. "Using the comprehensive patent citation network (CPC) to evaluate patent value," Scientometrics, Springer;Akadémiai Kiadó, vol. 105(3), pages 1319-1346, December.
    17. Gad Yair & Keith Goldstein & Nir Rotem & Anthony J. Olejniczak, 2022. "The three cultures in American science: publication productivity in physics, history and economics," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(6), pages 2967-2980, June.
    18. Xinhai Liu & Wolfgang Glänzel & Bart De Moor, 2011. "Hybrid clustering of multi-view data via Tucker-2 model and its application," Scientometrics, Springer;Akadémiai Kiadó, vol. 88(3), pages 819-839, September.
    19. Marek Kwiek, 2018. "High research productivity in vertically undifferentiated higher education systems: Who are the top performers?," Scientometrics, Springer;Akadémiai Kiadó, vol. 115(1), pages 415-462, April.
    20. Michel Zitt, 2015. "Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2223-2245, March.
