
An Improved Reference Paper Collection System Using Web Scraping with Three Enhancements

Author

Listed:
  • Tresna Maulana Fahrudin

    (Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)

  • Nobuo Funabiki

    (Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)

  • Komang Candra Brata

    (Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan
    Department of Informatics Engineering, Universitas Brawijaya, Malang 65145, Indonesia)

  • Inzali Naing

    (Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)

  • Soe Thandar Aung

    (Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)

  • Amri Muhaimin

    (Department of Data Science, Universitas Pembangunan Nasional Veteran Jawa Timur, Surabaya 60294, Indonesia)

  • Dwi Arman Prasetya

    (Department of Data Science, Universitas Pembangunan Nasional Veteran Jawa Timur, Surabaya 60294, Indonesia)

Abstract

Access to academic papers has improved significantly with electronic publication on the internet, where open access has become common. At the same time, this has increased the workload of literature surveys for researchers, who usually download PDF files and check their contents manually. To address this drawback, we previously proposed a reference paper collection system using web scraping technology and natural language models. However, that system often finds only a limited number of relevant reference papers after a long run time, since it relies on one paper search website and runs on a single thread on a multi-core CPU. In this paper, we present an improved reference paper collection system with three enhancements that address these limitations: (1) integrating the APIs of multiple paper search websites, namely the bulk search endpoint in the Semantic Scholar API, the article search endpoint in the DOAJ API, and the search and fetch endpoint in the PubMed API, to retrieve article metadata; (2) running the program on multiple threads on a multi-core CPU; and (3) implementing Dynamic URL Redirection, Regex-based URL Parsing, and HTML Scraping with URL Extraction for fast checking of PDF file accessibility, along with sentence embedding to assess relevance based on semantic similarity. For evaluation, we compare the number of obtained reference papers and the response time of the proposal, our previous work, and common literature search tools on five reference paper queries. The results show that the proposal increases the number of relevant reference papers by 64.38% and reduces the time by 59.78% on average compared to our previous work, while outperforming common literature search tools in reference papers obtained. Thus, the effectiveness of the proposed system has been demonstrated in our experiments.
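The pipeline outlined in the abstract (several metadata sources queried in parallel, then a relevance filter based on semantic similarity) can be illustrated with a minimal sketch. This is not the authors' implementation. The three `search_source_*` functions are hypothetical stubs standing in for the Semantic Scholar, DOAJ, and PubMed clients, the DOIs and titles are made-up sample data, and `embed` is a toy bag-of-words vector used in place of a real sentence-embedding model. The `threshold` value is likewise an assumption.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import math
import re

# Hypothetical stubs for the three metadata sources named in the abstract.
# Real clients would call the respective HTTP APIs; the records are made up.
def search_source_a(query):
    return [{"doi": "10.1000/a1", "title": "Web scraping for literature surveys"}]

def search_source_b(query):
    return [{"doi": "10.1000/a1", "title": "Web scraping for literature surveys"},
            {"doi": "10.1000/b2", "title": "Soil chemistry of volcanic islands"}]

def search_source_c(query):
    return [{"doi": "10.1000/c3", "title": "Automated paper collection with web scraping"}]

SOURCES = [search_source_a, search_source_b, search_source_c]

def collect(query, max_workers=3):
    """Query every source on its own thread and merge results, de-duplicated by DOI."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of SOURCES in its results.
        batches = list(pool.map(lambda fn: fn(query), SOURCES))
    merged = {}
    for batch in batches:
        for paper in batch:
            merged.setdefault(paper["doi"], paper)  # keep the first copy seen
    return list(merged.values())

def embed(text):
    """Toy bag-of-words vector; a placeholder for a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, papers, threshold=0.2):
    """Keep papers whose title similarity to the query meets the threshold, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(p["title"])), p) for p in papers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored if score >= threshold]

if __name__ == "__main__":
    query = "web scraping paper collection"
    for paper in rank(query, collect(query)):
        print(paper["doi"], paper["title"])
```

Threads suit this kind of workload because each request spends most of its time waiting on the network, so the calls overlap even under Python's global interpreter lock. Swapping `embed` for an actual sentence-embedding model would leave `collect` and `rank` unchanged.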

Suggested Citation

  • Tresna Maulana Fahrudin & Nobuo Funabiki & Komang Candra Brata & Inzali Naing & Soe Thandar Aung & Amri Muhaimin & Dwi Arman Prasetya, 2025. "An Improved Reference Paper Collection System Using Web Scraping with Three Enhancements," Future Internet, MDPI, vol. 17(5), pages 1-28, April.
  • Handle: RePEc:gam:jftint:v:17:y:2025:i:5:p:195-:d:1644623

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/17/5/195/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/17/5/195/
    Download Restriction: no

    References listed on IDEAS

    1. Quirin Schiermeier, 2017. "Science publishers try new tack to combat unauthorized paper sharing," Nature, Nature, vol. 545(7653), pages 145-146, May.
    2. Hamid R. Jamali & Majid Nabavi, 2015. "Open access and sources of full-text articles in Google Scholar in different subject fields," Scientometrics, Springer;Akadémiai Kiadó, vol. 105(3), pages 1635-1651, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Sergio Copiello & Pietro Bonifaci, 2018. "A few remarks on ResearchGate score and academic reputation," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(1), pages 301-306, January.
    2. Vivek Kumar Singh & Satya Swarup Srichandan & Hiran H. Lathabai, 2022. "ResearchGate and Google Scholar: how much do they differ in publications, citations and different metrics and why?," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(3), pages 1515-1542, March.
    3. Li Zhang & Erin Watson, 2018. "The prevalence of green and grey open access: Where do physical science researchers archive their publications?," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(3), pages 2021-2035, December.
    4. Moed, Henk F. & Bar-Ilan, Judit & Halevi, Gali, 2016. "A new methodology for comparing Google Scholar and Scopus," Journal of Informetrics, Elsevier, vol. 10(2), pages 533-551.
    5. Mikael Laakso & Andrea Polonioli, 2018. "Open access in ethics research: an analysis of open access availability and author self-archiving behaviour in light of journal copyright restrictions," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(1), pages 291-317, July.
    6. Sergio Copiello, 2019. "The open access citation premium may depend on the openness and inclusiveness of the indexing database, but the relationship is controversial because it is ambiguous where the open access boundary lie," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(2), pages 995-1018, November.
    7. Susanne Mikki, 2017. "Scholarly publications beyond pay-walls: increased citation advantage for open publishing," Scientometrics, Springer;Akadémiai Kiadó, vol. 113(3), pages 1529-1538, December.
    8. Cristòfol Rovira & Lluís Codina & Frederic Guerrero-Solé & Carlos Lopezosa, 2019. "Ranking by Relevance and Citation Counts, a Comparative Study: Google Scholar, Microsoft Academic, WoS and Scopus," Future Internet, MDPI, vol. 11(9), pages 1-21, September.
    9. Cristòfol Rovira & Lluís Codina & Carlos Lopezosa, 2021. "Language Bias in the Google Scholar Ranking Algorithm," Future Internet, MDPI, vol. 13(2), pages 1-17, January.
    10. Debarshi Kumar Sanyal & Plaban Kumar Bhowmick & Partha Pratim Das & Samiran Chattopadhyay & T. Y. S. S. Santosh, 2019. "Enhancing access to scholarly publications with surrogate resources," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(2), pages 1129-1164, November.
    11. Lepori, Benedetto & Thelwall, Michael & Hoorani, Bareerah Hafeez, 2018. "Which US and European Higher Education Institutions are visible in ResearchGate and what affects their RG score?," Journal of Informetrics, Elsevier, vol. 12(3), pages 806-818.
    12. Halevi, Gali & Moed, Henk & Bar-Ilan, Judit, 2017. "Suitability of Google Scholar as a source of scientific information and as a source of data for scientific evaluation—Review of the Literature," Journal of Informetrics, Elsevier, vol. 11(3), pages 823-834.
    13. Hamid R. Jamali, 2017. "Copyright compliance and infringement in ResearchGate full-text journal articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 112(1), pages 241-254, July.
    14. Mikael Laakso & Juho Lindman, 2016. "Journal copyright restrictions and actual open access availability: a study of articles published in eight top information systems journals (2010–2014)," Scientometrics, Springer;Akadémiai Kiadó, vol. 109(2), pages 1167-1189, November.
    15. Martín-Martín, Alberto & Costas, Rodrigo & van Leeuwen, Thed & Delgado López-Cózar, Emilio, 2018. "Evidence of open access of scientific publications in Google Scholar: A large-scale analysis," Journal of Informetrics, Elsevier, vol. 12(3), pages 819-841.
