Optimization of scientific publications clustering with ensemble approach for topic extraction

Author

Listed:
  • Mohammed Azmi Al-Betar

    (Ajman University
    Al-Huson University College)

  • Ammar Kamal Abasi

    (Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI))

  • Ghazi Al-Naymat

    (Ajman University)

  • Kamran Arshad

    (Ajman University)

  • Sharif Naser Makhadmeh

    (Ajman University)

Abstract

The continually expanding Internet generates a considerable amount of text data. Handling such a large volume of unstructured text is a major challenge when attempting to extract general topics or themes from a massive corpus of documents. Text document clustering (TDC) is a technique for grouping texts based on their content similarity. Partitioning a text collection according to the significance of the documents’ content is one of the most challenging tasks in TDC. This study proposes the Bare-Bones Based Salp Swarm Algorithm (BBSSA) to solve the TDC problem. In addition, an ensemble approach for automatic topic extraction (TE) is proposed to extract the topics from the clusters. The proposed BBSSA and the ensemble TE approach are tested on six standard benchmarks and six scientific publication datasets from top QS-ranked UAE universities. BBSSA’s results are compared with sixteen well-known techniques: eleven metaheuristic algorithms, namely the Whale Optimization Algorithm (WOA), Firefly Algorithm (FFA), Bat Algorithm (BAT), Harmony Search (HS), Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Multi-Verse Optimizer (MVO), Grey Wolf Optimizer (GWO), Moth-Flame Optimization (MFO), Krill Herd Algorithm (KHA), and Salp Swarm Algorithm (SSA), and five clustering methods, namely K-means++, K-means, Density-based Spatial Clustering of Applications with Noise (DBSCAN), Spectral, and Agglomerative. The results of the ensemble TE approach are compared with those of seven well-known statistical methods, namely Mutual Information (MI), TextRank (TR), Co-Occurrence Statistical Information-based Keyword Extraction (CSI), Term Frequency-Inverse Document Frequency (TF-IDF), most-frequent-based keyword extraction (TF), YAKE!, and RAKE. According to the experiments, the BBSSA outperforms all other approaches and is highly competitive. The results also reveal that, for most datasets, the proposed ensemble TE strategy outperforms all existing TE methods based on external metrics. Thus, the ensemble TE approach can be seen as a supplement to the other methods.
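
To make the clustering-then-topic-extraction pipeline described in the abstract more concrete, the following is a minimal sketch using two of the baselines it names: K-means for grouping documents and TF-IDF for picking representative terms per cluster. It is not the BBSSA or the ensemble TE method, whose details are not given on this page. The toy documents, the whitespace tokenizer, the fixed seed documents, the use of cosine similarity, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (raw TF x log(N/df))."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, seed_indices, max_iter=20):
    """K-means with cosine similarity; seeds are fixed to keep the run deterministic."""
    centroids = [dict(vectors[i]) for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
            for vec in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = centroid(members)
    return labels


def top_terms(vectors, labels, cluster, n=3):
    """Highest-weighted centroid terms, used here as a crude topic label."""
    members = [v for v, lab in zip(vectors, labels) if lab == cluster]
    ranked = sorted(centroid(members).items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


docs = [
    "swarm optimization algorithm for clustering",
    "particle swarm optimization algorithm",
    "keyword extraction from text documents",
    "topic keyword extraction for text",
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, seed_indices=[0, 2])
for cluster in sorted(set(labels)):
    print(cluster, top_terms(vectors, labels, cluster))
```

A metaheuristic such as the one proposed in the paper would replace the assign-and-update loop in `kmeans` with a population-based search over candidate centroids; the document representation and the similarity measure could stay the same.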

Suggested Citation

  • Mohammed Azmi Al-Betar & Ammar Kamal Abasi & Ghazi Al-Naymat & Kamran Arshad & Sharif Naser Makhadmeh, 2023. "Optimization of scientific publications clustering with ensemble approach for topic extraction," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(5), pages 2819-2877, May.
  • Handle: RePEc:spr:scient:v:128:y:2023:i:5:d:10.1007_s11192-023-04674-w
    DOI: 10.1007/s11192-023-04674-w

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-023-04674-w
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jochen Gläser & Wolfgang Glänzel & Andrea Scharnhorst, 2017. "Same data—different results? Towards a comparative approach to the identification of thematic structures in science," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 981-998, May.
    2. Paul Donner, 2021. "Validation of the Astro dataset clustering solutions with external data," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1619-1645, February.
    3. Rob Koopman & Shenghui Wang & Andrea Scharnhorst, 2017. "Contextualization of topics: browsing through the universe of bibliographic information," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1119-1139, May.
    4. Theresa Velden & Kevin W. Boyack & Jochen Gläser & Rob Koopman & Andrea Scharnhorst & Shenghui Wang, 2017. "Comparison of topic extraction approaches and their results," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1169-1221, May.
    5. Theresa Velden & Shiyan Yan & Carl Lagoze, 2017. "Mapping the cognitive structure of astrophysics by infomap clustering of the citation network and topic affinity analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1033-1051, May.
    6. Rob Koopman & Shenghui Wang, 2017. "Mutual information based labelling and comparing clusters," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1157-1167, May.
    7. Frank Havemann & Jochen Gläser & Michael Heinz, 2017. "Memetic search for overlapping topics based on a local evaluation of link communities," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1089-1118, May.
    8. Shuo Xu & Junwan Liu & Dongsheng Zhai & Xin An & Zheng Wang & Hongshen Pang, 2018. "Overlapping thematic structures extraction with mixed-membership stochastic blockmodel," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 61-84, October.
    9. Takahiro Kawamura & Katsutaro Watanabe & Naoya Matsumoto & Shusaku Egami & Mari Jibu, 2018. "Funding map using paragraph embedding based on semantic diversity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 941-958, August.
    10. Sabrina L. Woltmann & Lars Alkærsig, 2018. "Tracing university–industry knowledge transfer through a text mining approach," Scientometrics, Springer;Akadémiai Kiadó, vol. 117(1), pages 449-472, October.
    11. Hong Shi & Mengmeng Cheng & Yi Feng & Chenghui Qiu & Caiyue Song & Nenglin Yuan & Chuanzhi Kang & Kaijie Yang & Jie Yuan & Yonghao Li, 2023. "Thermal Management Techniques for Lithium-Ion Batteries Based on Phase Change Materials: A Systematic Review and Prospective Recommendations," Energies, MDPI, vol. 16(2), pages 1-23, January.
    12. Shenghui Wang & Rob Koopman, 2017. "Clustering articles based on semantic similarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1017-1031, May.
    13. Matthias Held & Grit Laudel & Jochen Gläser, 2021. "Challenges to the validity of topic reconstruction," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 4511-4536, May.
    14. Kevin W. Boyack, 2017. "Investigating the effect of global data on topic detection," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 999-1015, May.
    15. Zhang, Yi & Lu, Jie & Liu, Feng & Liu, Qian & Porter, Alan & Chen, Hongshu & Zhang, Guangquan, 2018. "Does deep learning help topic extraction? A kernel k-means clustering method with word embedding," Journal of Informetrics, Elsevier, vol. 12(4), pages 1099-1117.
    16. Christian Weismayer & Ilona Pezenka, 2017. "Identifying emerging research fields: a longitudinal latent semantic keyword analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 113(3), pages 1757-1785, December.
    17. Ebadi, Ashkan & Tremblay, Stéphane & Goutte, Cyril & Schiffauerova, Andrea, 2020. "Application of machine learning techniques to assess the trends and alignment of the funded research output," Journal of Informetrics, Elsevier, vol. 14(2).
    18. Samira Ranaei & Arho Suominen & Alan Porter & Stephen Carley, 2020. "Evaluating technological emergence using text analytics: two case technologies and three approaches," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(1), pages 215-247, January.
    19. Carlos Olmeda-Gómez & Carlos Romá-Mateo & Maria-Antonia Ovalle-Perandones, 2019. "Overview of trends in global epigenetic research (2009–2017)," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(3), pages 1545-1574, June.
    20. de Carvalho, Gustavo Dambiski Gomes & Sokulski, Carla Cristiane & da Silva, Wesley Vieira & de Carvalho, Hélio Gomes & de Moura, Rafael Vignoli & de Francisco, Antonio Carlos & da Veiga, Claudimar Per, 2020. "Bibliometrics and systematic reviews: A comparison between the Proknow-C and the Methodi Ordinatio," Journal of Informetrics, Elsevier, vol. 14(3).
