
Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning


  • Guo Chen

    (Nanjing University of Science and Technology)

  • Jing Chen

    (Nanjing University of Science and Technology)

  • Yu Shao

    (Northwest Engineering Corporation Limited)

  • Lu Xiao

    (Nanjing University of Finance and Economics)


Constructing a bibliographic dataset is fundamental for domain analysis in bibliometric research. However, irrelevant documents (so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm that cleans the initial bibliographic dataset automatically. The basic idea is to use a batch of “absolutely positive samples” already available in the dataset to obtain a collection of “reliable negative samples,” from which a training set can be constructed for downstream supervised classification. We conducted a comparative experiment using reports from the Artificial Intelligence (AI) domain of the US National Technical Reports Library (NTIS) as an example, comparing schemes with different variables to explore how various technical aspects influence the final noise reduction performance. Our approach achieved significant improvements over the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm is crucial, while the classification strategy and the s_ratio parameter in PU-Learning are not. Our approach needs no manually annotated data and can thus provide powerful help to bibliometric researchers constructing high-quality bibliographic datasets.
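The two-step PU-Learning idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and the TF-IDF representation, cosine-similarity centroid heuristic for selecting reliable negatives, and logistic-regression classifier are assumptions for the sketch (the paper compares several representations and classification strategies).

```python
# Sketch of two-step PU-Learning noise reduction: known positives are
# used to pick "reliable negatives" from the unlabeled pool, which then
# supply training data for an ordinary supervised classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

def pu_denoise(positive_docs, unlabeled_docs, s_ratio=0.3):
    """Flag each unlabeled document as relevant (True) or noise (False)."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(positive_docs + unlabeled_docs)
    X_pos, X_unl = X[:len(positive_docs)], X[len(positive_docs):]

    # Step 1: rank unlabeled docs by similarity to the positive centroid;
    # the least-similar s_ratio fraction becomes the reliable-negative set.
    centroid = np.asarray(X_pos.mean(axis=0))
    sims = cosine_similarity(X_unl, centroid).ravel()
    n_neg = max(1, int(s_ratio * len(unlabeled_docs)))
    neg_idx = np.argsort(sims)[:n_neg]

    # Step 2: train a supervised classifier on positives vs. reliable
    # negatives, then classify every unlabeled document.
    X_train = np.vstack([X_pos.toarray(), X_unl[neg_idx].toarray()])
    y_train = np.array([1] * X_pos.shape[0] + [0] * n_neg)
    clf = LogisticRegression().fit(X_train, y_train)
    return clf.predict(X_unl.toarray()) == 1
```

In this scheme s_ratio controls how many unlabeled documents are treated as reliable negatives; the paper's finding is that this parameter matters far less than how the documents are represented.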

Suggested Citation

  • Guo Chen & Jing Chen & Yu Shao & Lu Xiao, 2023. "Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(2), pages 1187-1204, February.
  • Handle: RePEc:spr:scient:v:128:y:2023:i:2:d:10.1007_s11192-022-04598-x
    DOI: 10.1007/s11192-022-04598-x



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lin Zhang & Beibei Sun & Fei Shu & Ying Huang, 2022. "Comparing paper level classifications across different methods and systems: an investigation of Nature publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 7633-7651, December.
    2. Muñoz-Écija, Teresa & Vargas-Quesada, Benjamín & Chinchilla Rodríguez, Zaida, 2019. "Coping with methods for delineating emerging fields: Nanoscience and nanotechnology as a case study," Journal of Informetrics, Elsevier, vol. 13(4).
    3. Fei Shu & Yue Ma & Junping Qiu & Vincent Larivière, 2020. "Classifications of science and their effects on bibliometric evaluations," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2727-2744, December.
    4. Haiko Lietz, 2020. "Drawing impossible boundaries: field delineation of Social Network Science," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2841-2876, December.
    5. Ying Huang & Wolfgang Glänzel & Lin Zhang, 2021. "Tracing the development of mapping knowledge domains," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 6201-6224, July.
    6. Michel Zitt, 2015. "Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2223-2245, March.
    7. Ruiz-Castillo, Javier & Costas, Rodrigo, 2018. "Individual and field citation distributions in 29 broad scientific fields," Journal of Informetrics, Elsevier, vol. 12(3), pages 868-892.
    8. Gabriele Sampagnaro, 2023. "Keyword occurrences and journal specialization," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(10), pages 5629-5645, October.
    9. Li, Yunrong & Ruiz-Castillo, Javier, 2013. "The comparison of normalization procedures based on different classification systems," Journal of Informetrics, Elsevier, vol. 7(4), pages 945-958.
    10. Ricardo Arencibia-Jorge & Rosa Lidia Vega-Almeida & José Luis Jiménez-Andrade & Humberto Carrillo-Calvet, 2022. "Evolutionary stages and multidisciplinary nature of artificial intelligence research," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5139-5158, September.
    11. Gerson Pech & Catarina Delgado & Silvio Paolo Sorella, 2022. "Classifying papers into subfields using Abstracts, Titles, Keywords and KeyWords Plus through pattern detection and optimization procedures: An application in Physics," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 73(11), pages 1513-1528, November.
    12. Waltman, Ludo, 2016. "A review of the literature on citation impact indicators," Journal of Informetrics, Elsevier, vol. 10(2), pages 365-391.
    13. Chiara Carusi & Giuseppe Bianchi, 2020. "A look at interdisciplinarity using bipartite scholar/journal networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(2), pages 867-894, February.
    14. Shu-Hao Chang, 2018. "A pilot study on the connection between scientific fields and patent classification systems," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(3), pages 951-970, March.
    15. Abramo, Giovanni & D’Angelo, Ciriaco Andrea & Zhang, Lin, 2018. "A comparison of two approaches for measuring interdisciplinary research output: The disciplinary diversity of authors vs the disciplinary diversity of the reference list," Journal of Informetrics, Elsevier, vol. 12(4), pages 1182-1193.
    16. Bordoloi, Tausif & Shapira, Philip & Mativenga, Paul, 2022. "Policy interactions with research trajectories: The case of cyber-physical convergence in manufacturing and industrials," Technological Forecasting and Social Change, Elsevier, vol. 175(C).
    17. Loet Leydesdorff & Lutz Bornmann & Caroline S. Wagner, 2017. "Generating clustered journal maps: an automated system for hierarchical classification," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(3), pages 1601-1614, March.
    18. Juan Miguel Campanario, 2018. "Are leaders really leading? Journals that are first in Web of Science subject categories in the context of their groups," Scientometrics, Springer;Akadémiai Kiadó, vol. 115(1), pages 111-130, April.
    19. Jielan Ding & Per Ahlgren & Liying Yang & Ting Yue, 2018. "Disciplinary structures in Nature, Science and PNAS: journal and country levels," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(3), pages 1817-1852, September.
    20. Carusi, Chiara & Bianchi, Giuseppe, 2019. "Scientific community detection via bipartite scholar/journal graph co-clustering," Journal of Informetrics, Elsevier, vol. 13(1), pages 354-386.



    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.