Printed from https://ideas.repec.org/a/gam/jsusta/v14y2022i5p2802-d760423.html

Layout Aware Semantic Element Extraction for Sustainable Science & Technology Decision Support

Author

Listed:
  • Hyuntae Kim

    (Department of Computer Engineering, Kumoh National Institute of Technology, Gumi 39177, Korea)

  • Jongyun Choi

    (Department of Computer Engineering, Kumoh National Institute of Technology, Gumi 39177, Korea)

  • Soyoung Park

    (Department of Computer Engineering, Kumoh National Institute of Technology, Gumi 39177, Korea)

  • Yuchul Jung

    (Department of Computer Engineering, Kumoh National Institute of Technology, Gumi 39177, Korea)

Abstract

New scientific and technological (S&T) knowledge is being introduced rapidly, and the effort required to understand and analyze newly published S&T documents is growing accordingly. Automated text mining and vision recognition techniques alleviate the burden somewhat, but the variety of document layout formats and knowledge content granularities across the S&T field makes the task challenging. Therefore, this paper proposes LA-SEE (LAME and Vi-SEE), a knowledge graph construction framework that simultaneously extracts meta-information and useful image objects from S&T documents in various layout formats. We adopt Layout-aware Metadata Extraction (LAME), which can accurately extract metadata from various layout formats, and implement transformer-based instance segmentation (i.e., Vision based Semantic Elements Extraction (Vi-SEE)) to maximize vision-based semantic element recognition. Moreover, to construct a scientific knowledge graph spanning multiple S&T documents, we define an extensible Semantic Elements Knowledge Graph (SEKG) structure. So far, we have extracted about 6 million semantic elements from 49,649 PDFs. In addition, to illustrate the potential of SEKG, we provide two promising application scenarios: a scientific knowledge guide across multiple S&T documents and question answering over scientific tables.

Suggested Citation

  • Hyuntae Kim & Jongyun Choi & Soyoung Park & Yuchul Jung, 2022. "Layout Aware Semantic Element Extraction for Sustainable Science & Technology Decision Support," Sustainability, MDPI, vol. 14(5), pages 1-18, February.
  • Handle: RePEc:gam:jsusta:v:14:y:2022:i:5:p:2802-:d:760423

    Download full text from publisher

    File URL: https://www.mdpi.com/2071-1050/14/5/2802/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2071-1050/14/5/2802/
    Download Restriction: no

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.