Printed from https://ideas.repec.org/a/eee/infome/v17y2023i4s1751157723000780.html

A Zipf's law-based text generation approach for addressing imbalance in entity extraction

Author

Listed:
  • Wang, Zhenhua
  • Ren, Ming
  • Gao, Dong
  • Li, Zhuang

Abstract

Entity extraction is critical to intelligent advancement across diverse domains. Its effectiveness, however, is limited by data imbalance: certain entities are common while others are scarce. To address this issue, this study proposes a novel text generation approach that harnesses Zipf's law, a powerful tool from informetrics for studying human language. Using characteristics of Zipf's law, words within the documents are classified as common or rare. Sentences are then likewise classified as common or rare and processed by text generation models accordingly. Rare entities within the generated sentences are labeled using human-designed rules, and the labeled sentences supplement the raw dataset, thereby mitigating the imbalance problem. The study presents a case of extracting entities from technical documents, and extensive experimental results on two datasets demonstrate the effectiveness of the proposed method. Furthermore, the significance and potential of Zipf's law in driving the progress of artificial intelligence (AI) are discussed, broadening the scope and coverage of informetrics. By incorporating the foundational principles of informetrics into text generation, this study showcases the pivotal role of informetrics in shaping the design and development of AI systems.
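
The word and sentence classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the criterion used in the paper, so everything below is an assumption made for illustration: words are ranked by frequency (the rank-frequency view that Zipf's law describes), the top-ranked words that together cover a chosen share of all tokens are treated as common, and a sentence counts as rare if it contains at least one rare word. The function names, the `coverage` parameter, and the sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_vocabulary(sentences, coverage=0.8):
    """Split the vocabulary into common and rare words.

    Assumption (not from the paper): walking down the frequency ranking,
    a word is common while the tokens covered by higher-ranked words are
    still below `coverage` of the corpus; all remaining words are rare.
    """
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(counts.values())
    common, rare = set(), set()
    covered = 0
    # Sort by descending frequency; break ties alphabetically for determinism.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered < coverage * total:
            common.add(word)
        else:
            rare.add(word)
        covered += freq
    return common, rare


def split_sentences(sentences, rare_words):
    """A sentence is rare if it contains at least one rare word (assumed rule)."""
    common_sents, rare_sents = [], []
    for s in sentences:
        if rare_words.intersection(tokenize(s)):
            rare_sents.append(s)
        else:
            common_sents.append(s)
    return common_sents, rare_sents


# Hypothetical sample corpus, for illustration only.
corpus = [
    "the pump failed",
    "the pump leaked",
    "the valve failed",
    "the pump failed again",
]
common_words, rare_words = split_vocabulary(corpus, coverage=0.7)
common_sents, rare_sents = split_sentences(corpus, rare_words)
```

In a pipeline of the kind the abstract outlines, the two sentence groups would then be passed to text generation models and the rare entities in the generated text labeled by rules; those steps are not sketched here because the abstract gives no detail on them.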

Suggested Citation

  • Wang, Zhenhua & Ren, Ming & Gao, Dong & Li, Zhuang, 2023. "A Zipf's law-based text generation approach for addressing imbalance in entity extraction," Journal of Informetrics, Elsevier, vol. 17(4).
  • Handle: RePEc:eee:infome:v:17:y:2023:i:4:s1751157723000780
    DOI: 10.1016/j.joi.2023.101453

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S1751157723000780
    Download Restriction: Full text for ScienceDirect subscribers only


    References listed on IDEAS

    1. Yi Zhang & Chengzhi Zhang & Philipp Mayr & Arho Suominen, 2022. "An editorial of “AI + informetrics”: multi-disciplinary interactions in the era of big data," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6503-6507, November.
    2. Song, Min & Heo, Go Eun & Ding, Ying, 2015. "SemPathFinder: Semantic path analysis for discovering publicly unknown knowledge," Journal of Informetrics, Elsevier, vol. 9(4), pages 686-703.
    3. Fernandez Martinez, Roberto & Lostado Lorza, Ruben & Santos Delgado, Ana Alexandra & Piedra, Nelson, 2021. "Use of classification trees and rule-based models to optimize the funding assignment to research projects: A case study of UTPL," Journal of Informetrics, Elsevier, vol. 15(1).
    4. Chowdhury, K.P., 2021. "Functional analysis of generalized linear models under non-linear constraints with applications to identifying highly-cited papers," Journal of Informetrics, Elsevier, vol. 15(1).
    5. Wang, Qiuping A., 2021. "Principle of least effort vs. maximum efficiency: deriving Zipf-Pareto's laws," Chaos, Solitons & Fractals, Elsevier, vol. 153(P1).
    6. Ting‐Hao Yang & Yu‐Lun Hsieh & Shih‐Hung Liu & Yung‐Chun Chang & Wen‐Lian Hsu, 2021. "A flexible template generation and matching method with applications for publication reference metadata extraction," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 72(1), pages 32-45, January.
    7. Jeong, Yoo Kyung & Xie, Qing & Yan, Erjia & Song, Min, 2020. "Examining drug and side effect relation using author–entity pair bipartite networks," Journal of Informetrics, Elsevier, vol. 14(1).
    8. Anil, Akash & Singh, Sanasam Ranbir, 2020. "Effect of class imbalance in heterogeneous network embedding: An empirical study," Journal of Informetrics, Elsevier, vol. 14(2).
    9. Wang, Yuzhuo & Zhang, Chengzhi, 2020. "Using the full-text content of academic articles to identify and evaluate algorithm entities in the domain of natural language processing," Journal of Informetrics, Elsevier, vol. 14(4).
    10. Chen, Liang & Xu, Shuo & Zhu, Lijun & Zhang, Jing & Yang, Guancan & Xu, Haiyun, 2022. "A deep learning based method benefiting from characteristics of patents for semantic relation classification," Journal of Informetrics, Elsevier, vol. 16(3).
    11. An, Xin & Li, Jinghong & Xu, Shuo & Chen, Liang & Sun, Wei, 2021. "An improved patent similarity measurement based on entities and semantic relations," Journal of Informetrics, Elsevier, vol. 15(2).
    12. Andreas Vlachidis & Douglas Tudhope, 2016. "A knowledge‐based approach to Information Extraction for semantic interoperability in the archaeology domain," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 67(5), pages 1138-1152, May.
    13. Valero, Jordi & Pérez-Casany, Marta & Duarte-López, Ariel, 2022. "The Zipf-Polylog distribution: Modeling human interactions through social networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 603(C).
    14. Song, Min & Kim, Erin Hea-Jin & Kim, Ha Jin, 2015. "Exploring author name disambiguation on PubMed-scale," Journal of Informetrics, Elsevier, vol. 9(4), pages 924-941.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Ciriaco Andrea D’Angelo & Nees Jan Eck, 2020. "Collecting large-scale publication data at the level of individual researchers: a practical proposal for author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(2), pages 883-907, May.
    2. Li, Heyang & Wu, Meijun & Wang, Yougui & Zeng, An, 2022. "Bibliographic coupling networks reveal the advantage of diversification in scientific projects," Journal of Informetrics, Elsevier, vol. 16(3).
    3. Jinseok Kim & Jinmo Kim & Jason Owen-Smith, 2019. "Generating automatically labeled data for author name disambiguation: an iterative clustering method," Scientometrics, Springer;Akadémiai Kiadó, vol. 118(1), pages 253-280, January.
    4. Lee, O-Joun & Jeon, Hyeon-Ju & Jung, Jason J., 2021. "Learning multi-resolution representations of research patterns in bibliographic networks," Journal of Informetrics, Elsevier, vol. 15(1).
    5. Chen, Liang & Xu, Shuo & Zhu, Lijun & Zhang, Jing & Yang, Guancan & Xu, Haiyun, 2022. "A deep learning based method benefiting from characteristics of patents for semantic relation classification," Journal of Informetrics, Elsevier, vol. 16(3).
    6. Guangtong Li & L. Siddharth & Jianxi Luo, 2023. "Embedding knowledge graph of patent metadata to measure knowledge proximity," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 74(4), pages 476-490, April.
    7. Yuzhuo Wang & Chengzhi Zhang & Kai Li, 2022. "A review on method entities in the academic literature: extraction, evaluation, and application," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(5), pages 2479-2520, May.
    8. Xiaorui Jiang & Jingqiang Chen, 2023. "Contextualised segment-wise citation function classification," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(9), pages 5117-5158, September.
    9. Jinseok Kim & Jenna Kim & Jason Owen‐Smith, 2021. "Ethnicity‐based name partitioning for author name disambiguation using supervised machine learning," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 72(8), pages 979-994, August.
    10. Chowdhury, K.P., 2023. "Nonparametric functional analysis under joint estimation with applications to identifying highly cited papers," Journal of Informetrics, Elsevier, vol. 17(4).
    11. Lv, Yanhua & Ding, Ying & Song, Min & Duan, Zhiguang, 2018. "Topology-driven trend analysis for drug discovery," Journal of Informetrics, Elsevier, vol. 12(3), pages 893-905.
    12. Agouzal, Abdellatif & Lafouge, Thierry & Bertin, Marc, 2024. "Relationship between the principle of least effort and the average cost of information in a zipfian context," Journal of Informetrics, Elsevier, vol. 18(1).
    13. Jinseok Kim & Jason Owen-Smith, 2021. "ORCID-linked labeled data for evaluating author name disambiguation at scale," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(3), pages 2057-2083, March.
    14. Jinseok Kim & Jenna Kim, 2020. "Effect of forename string on author name disambiguation," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 71(7), pages 839-855, July.
    15. Wei Du & Yibo Wang & Wei Xu & Jian Ma, 2021. "A personalized recommendation system for high-quality patent trading by leveraging hybrid patent analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(12), pages 9369-9391, December.
    16. Li Zhang & Wei Lu & Jinqing Yang, 2023. "LAGOS‐AND: A large gold standard dataset for scholarly author name disambiguation," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 74(2), pages 168-185, February.
    17. Wang, Ruby W. & Wei, Shelia X. & Ye, Fred Y., 2021. "Extracting a core structure from heterogeneous information network using h-subnet and meta-path strength," Journal of Informetrics, Elsevier, vol. 15(3).
    18. Mehmet Ali Abdulhayoglu & Bart Thijs, 2017. "Use of ResearchGate and Google CSE for author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(3), pages 1965-1985, June.
    19. Jaewoong Choi & Jiho Lee & Janghyeok Yoon & Sion Jang & Jaeyoung Kim & Sungchul Choi, 2022. "A two-stage deep learning-based system for patent citation recommendation," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6615-6636, November.
    20. Grazia Sveva Ascione & Laura Ciucci & Claudio Detotto & Valerio Sterzi, 2022. "Universities involvement in patent litigation: an analysis of the characteristics of US litigated patents," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 6855-6879, December.
