Authors
- Tresna Maulana Fahrudin
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)
- Nobuo Funabiki
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)
- Komang Candra Brata
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan; Department of Informatics Engineering, Universitas Brawijaya, Malang 65145, Indonesia)
- Inzali Naing
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)
- Soe Thandar Aung
(Department of Information and Communication Systems, Okayama University, Okayama 700-8530, Japan)
- Amri Muhaimin
(Department of Data Science, Universitas Pembangunan Nasional Veteran Jawa Timur, Surabaya 60294, Indonesia)
- Dwi Arman Prasetya
(Department of Data Science, Universitas Pembangunan Nasional Veteran Jawa Timur, Surabaya 60294, Indonesia)
Abstract
Access to academic papers has improved significantly with electronic publication on the internet, where open access has become common. At the same time, this has increased the workload of literature surveys for researchers, who usually download PDF files and check their contents manually. To address this drawback, we previously proposed a reference paper collection system using web scraping and natural language models. However, that system often finds only a limited number of relevant reference papers and takes a long time, since it relies on a single paper search website and runs on a single thread on a multi-core CPU. In this paper, we present an improved reference paper collection system with three enhancements that address these limitations: (1) integrating the APIs of multiple paper search websites, namely the bulk search endpoint in the Semantic Scholar API, the article search endpoint in the DOAJ API, and the search and fetch endpoints in the PubMed API, to retrieve article metadata; (2) running the program on multiple threads on a multi-core CPU; and (3) implementing Dynamic URL Redirection, Regex-based URL Parsing, and HTML Scraping with URL Extraction for fast checking of PDF file accessibility, along with sentence embedding to assess relevance based on semantic similarity. For evaluation, we compare the number of reference papers obtained and the response time of the proposal, our previous work, and common literature search tools on five reference paper queries. The results show that the proposal increases the number of relevant reference papers by 64.38% and reduces the time by 59.78% on average compared to our previous work, while outperforming common literature search tools in the number of reference papers obtained. These experiments demonstrate the effectiveness of the proposed system.
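The three enhancements described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three `search_*` functions are hypothetical stand-ins that return canned records instead of calling the Semantic Scholar, DOAJ, or PubMed APIs, the `.pdf` pattern is an assumed heuristic for regex-based URL parsing, and a toy bag-of-words vector replaces a real sentence-embedding model. The sketch only shows how concurrent querying, a URL check, and similarity ranking might fit together.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three API clients; a real client would
# issue HTTP requests to Semantic Scholar, DOAJ, or PubMed instead.
def search_semantic_scholar(query):
    return [{"title": "Web scraping for literature surveys",
             "url": "https://example.org/papers/a1.pdf"}]

def search_doaj(query):
    return [{"title": "Open access journals in Asia",
             "url": "https://example.org/article/view/17"}]

def search_pubmed(query):
    return [{"title": "Scraping web pages for paper collection",
             "url": "https://example.org/pmc/articles/PMC1/pdf/main.pdf"}]

SOURCES = [search_semantic_scholar, search_doaj, search_pubmed]

# Assumed heuristic: a URL ending in ".pdf" (optionally followed by a
# query string or fragment) is treated as a direct PDF link.
PDF_URL = re.compile(r"\.pdf(?:[?#].*)?$", re.IGNORECASE)

def looks_like_pdf(url):
    return PDF_URL.search(url) is not None

def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence-embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def collect(query, max_workers=3):
    # Query every source concurrently, one worker thread per source.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(lambda search: search(query), SOURCES))
    papers = [paper for batch in batches for paper in batch]
    query_vec = embed(query)
    for paper in papers:
        paper["pdf"] = looks_like_pdf(paper["url"])
        paper["score"] = cosine(query_vec, embed(paper["title"]))
    # Most relevant titles first.
    return sorted(papers, key=lambda p: p["score"], reverse=True)

results = collect("web scraping paper collection")
for paper in results:
    print(f'{paper["score"]:.3f}  pdf={paper["pdf"]}  {paper["title"]}')
```

Threads suit this kind of workload because most of the time is spent waiting on network responses; in a real system the stubs would be replaced by HTTP clients and `embed` by an actual sentence-embedding model.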
Suggested Citation
Tresna Maulana Fahrudin & Nobuo Funabiki & Komang Candra Brata & Inzali Naing & Soe Thandar Aung & Amri Muhaimin & Dwi Arman Prasetya, 2025.
"An Improved Reference Paper Collection System Using Web Scraping with Three Enhancements,"
Future Internet, MDPI, vol. 17(5), pages 1-28, April.
Handle:
RePEc:gam:jftint:v:17:y:2025:i:5:p:195-:d:1644623