An Improved Reference Paper Collection System Using Web Scraping with Three Enhancements

Author

Listed:

Tresna Maulana Fahrudin
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)
Nobuo Funabiki
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)
Komang Candra Brata
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan; Department of Informatics Engineering, Universitas Brawijaya, Malang 65145, Indonesia)
Inzali Naing
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)
Soe Thandar Aung
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)
Amri Muhaimin
(Department of Data Science, Universitas Pembangunan Nasional Veteran Jawa Timur, Surabaya 60294, Indonesia)
Dwi Arman Prasetya
(Department of Data Science, Universitas Pembangunan Nasional Veteran Jawa Timur, Surabaya 60294, Indonesia)

Abstract

Access to academic papers has improved significantly with electronic publication on the internet, where open access has become common. At the same time, this has increased the workload of literature surveys for researchers, who usually download PDF files manually and check their contents. To address this drawback, we previously proposed a reference paper collection system using web scraping technology and natural language models. However, our previous system often finds only a limited number of relevant reference papers after a long running time, since it relies on a single paper search website and runs on a single thread on a multi-core CPU. In this paper, we present an improved reference paper collection system with three enhancements to solve these problems: (1) integrating the APIs of multiple paper search websites, namely the bulk search endpoint in the Semantic Scholar API, the article search endpoint in the DOAJ API, and the search and fetch endpoints in the PubMed API, to retrieve article metadata; (2) running the program on multiple threads on a multi-core CPU; and (3) implementing Dynamic URL Redirection, Regex-based URL Parsing, and HTML Scraping with URL Extraction for fast checking of PDF file accessibility, along with sentence embedding to assess relevance based on semantic similarity. For evaluation, we compare the number of obtained reference papers and the response time between the proposal, our previous work, and common literature search tools for five reference paper queries. The results show that the proposal increases the number of relevant reference papers by 64.38% and reduces the time by 59.78% on average compared to our previous work, while outperforming common literature search tools in the number of reference papers. Thus, the effectiveness of the proposed system has been demonstrated in our experiments.
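
The three enhancements named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three `search_*` functions are hypothetical offline stand-ins for the Semantic Scholar, DOAJ, and PubMed clients and return made-up records, the regular expression is one plausible pattern for PDF links, and the cosine similarity operates on plain lists where a real system would use vectors from a sentence embedding model.

```python
from concurrent.futures import ThreadPoolExecutor
import math
import re

# Hypothetical stand-ins for the three API clients. Real clients would issue
# HTTP requests; these return fixed, made-up records so the sketch runs offline.
def search_semantic_scholar(query):
    return [{"title": "Web scraping for literature surveys", "doi": "10.1000/a1"}]

def search_doaj(query):
    return [
        {"title": "Web Scraping for Literature Surveys", "doi": "10.1000/A1"},
        {"title": "Open access indexing", "doi": "10.1000/b2"},
    ]

def search_pubmed(query):
    return [{"title": "Sentence embeddings in biomedical search", "doi": "10.1000/c3"}]

SOURCES = [search_semantic_scholar, search_doaj, search_pubmed]

def collect(query, max_workers=3):
    """Query every source on its own thread and merge results by DOI."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the batches in the same order as SOURCES.
        batches = list(pool.map(lambda search: search(query), SOURCES))
    merged = {}
    for batch in batches:
        for paper in batch:
            # DOIs are case-insensitive, so normalise before deduplicating.
            merged.setdefault(paper["doi"].lower(), paper)
    return list(merged.values())

# Assumed pattern: an absolute http(s) URL ending in ".pdf".
PDF_URL = re.compile(r"https?://[^\s\"'<>]+\.pdf\b", re.IGNORECASE)

def extract_pdf_urls(html):
    """Return candidate PDF links found in a page's HTML."""
    return PDF_URL.findall(html)

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Threads suit this workload because each source call is I/O-bound in practice, so waiting on one API does not block the others. A caller might run `collect("web scraping")`, fetch each landing page, pass the HTML to `extract_pdf_urls`, and rank papers by `cosine_similarity` between the query embedding and each abstract embedding.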

Suggested Citation

Tresna Maulana Fahrudin & Nobuo Funabiki & Komang Candra Brata & Inzali Naing & Soe Thandar Aung & Amri Muhaimin & Dwi Arman Prasetya, 2025. "An Improved Reference Paper Collection System Using Web Scraping with Three Enhancements," Future Internet, MDPI, vol. 17(5), pages 1-28, April.

Handle: RePEc:gam:jftint:v:17:y:2025:i:5:p:195-:d:1644623

References listed on IDEAS

Quirin Schiermeier, 2017. "Science publishers try new tack to combat unauthorized paper sharing," Nature, Nature, vol. 545(7653), pages 145-146, May.
Hamid R. Jamali & Majid Nabavi, 2015. "Open access and sources of full-text articles in Google Scholar in different subject fields," Scientometrics, Springer;Akadémiai Kiadó, vol. 105(3), pages 1635-1651, December.

Citations

Cited by:

Manuel Blázquez-Ochando & Juan José Prieto-Gutiérrez & María Antonia Ovalle-Perandones, 2025. "Prompt engineering for bibliographic web-scraping," Scientometrics, Springer;Akadémiai Kiadó, vol. 130(7), pages 3433-3453, July.

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Sergio Copiello & Pietro Bonifaci, 2018. "A few remarks on ResearchGate score and academic reputation," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(1), pages 301-306, January.
Vivek Kumar Singh & Satya Swarup Srichandan & Hiran H. Lathabai, 2022. "ResearchGate and Google Scholar: how much do they differ in publications, citations and different metrics and why?," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(3), pages 1515-1542, March.
Kevin Riehl, 2025. "On the scientometric value of full-text, beyond abstracts and titles: evidence from the business and economic literature," Management Review Quarterly, Springer, vol. 75(3), pages 2459-2513, September.
Li Zhang & Erin Watson, 2018. "The prevalence of green and grey open access: Where do physical science researchers archive their publications?," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(3), pages 2021-2035, December.
Moed, Henk F. & Bar-Ilan, Judit & Halevi, Gali, 2016. "A new methodology for comparing Google Scholar and Scopus," Journal of Informetrics, Elsevier, vol. 10(2), pages 533-551.
Mikael Laakso & Andrea Polonioli, 2018. "Open access in ethics research: an analysis of open access availability and author self-archiving behaviour in light of journal copyright restrictions," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(1), pages 291-317, July.
Sergio Copiello, 2019. "The open access citation premium may depend on the openness and inclusiveness of the indexing database, but the relationship is controversial because it is ambiguous where the open access boundary lies," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(2), pages 995-1018, November.
Susanne Mikki, 2017. "Scholarly publications beyond pay-walls: increased citation advantage for open publishing," Scientometrics, Springer;Akadémiai Kiadó, vol. 113(3), pages 1529-1538, December.
Cristòfol Rovira & Lluís Codina & Frederic Guerrero-Solé & Carlos Lopezosa, 2019. "Ranking by Relevance and Citation Counts, a Comparative Study: Google Scholar, Microsoft Academic, WoS and Scopus," Future Internet, MDPI, vol. 11(9), pages 1-21, September.
Cristòfol Rovira & Lluís Codina & Carlos Lopezosa, 2021. "Language Bias in the Google Scholar Ranking Algorithm," Future Internet, MDPI, vol. 13(2), pages 1-17, January.
Debarshi Kumar Sanyal & Plaban Kumar Bhowmick & Partha Pratim Das & Samiran Chattopadhyay & T. Y. S. S. Santosh, 2019. "Enhancing access to scholarly publications with surrogate resources," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(2), pages 1129-1164, November.
Lepori, Benedetto & Thelwall, Michael & Hoorani, Bareerah Hafeez, 2018. "Which US and European Higher Education Institutions are visible in ResearchGate and what affects their RG score?," Journal of Informetrics, Elsevier, vol. 12(3), pages 806-818.
Halevi, Gali & Moed, Henk & Bar-Ilan, Judit, 2017. "Suitability of Google Scholar as a source of scientific information and as a source of data for scientific evaluation—Review of the Literature," Journal of Informetrics, Elsevier, vol. 11(3), pages 823-834.
Hamid R. Jamali, 2017. "Copyright compliance and infringement in ResearchGate full-text journal articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 112(1), pages 241-254, July.
Mikael Laakso & Juho Lindman, 2016. "Journal copyright restrictions and actual open access availability: a study of articles published in eight top information systems journals (2010–2014)," Scientometrics, Springer;Akadémiai Kiadó, vol. 109(2), pages 1167-1189, November.
Martín-Martín, Alberto & Costas, Rodrigo & van Leeuwen, Thed & Delgado López-Cózar, Emilio, 2018. "Evidence of open access of scientific publications in Google Scholar: A large-scale analysis," Journal of Informetrics, Elsevier, vol. 12(3), pages 819-841.
