
Addressing structural hurdles for metadata extraction from environmental impact statements

Author

  • Egoitz Laparra
  • Alex Binford‐Walsh
  • Kirk Emerson
  • Marc L. Miller
  • Laura López‐Hoffman
  • Faiz Currim
  • Steven Bethard

Abstract

Natural language processing techniques can be used to analyze the linguistic content of a document to extract missing pieces of metadata. However, accurate metadata extraction may not depend solely on the linguistics, but also on structural problems such as extremely large documents, unordered multi‐file documents, and inconsistency in manually labeled metadata. In this work, we start from two standard machine learning solutions to extract pieces of metadata from Environmental Impact Statements, environmental policy documents that are regularly produced under the US National Environmental Policy Act of 1969. We present a series of experiments where we evaluate how these standard approaches are affected by different issues derived from real‐world data. We find that metadata extraction can be strongly influenced by nonlinguistic factors such as document length and volume ordering and that the standard machine learning solutions often do not scale well to long documents. We demonstrate how such solutions can be better adapted to these scenarios, and conclude with suggestions for other NLP practitioners cataloging large document collections.

Suggested Citation

  • Egoitz Laparra & Alex Binford‐Walsh & Kirk Emerson & Marc L. Miller & Laura López‐Hoffman & Faiz Currim & Steven Bethard, 2023. "Addressing structural hurdles for metadata extraction from environmental impact statements," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 74(9), pages 1124-1139, September.
  • Handle: RePEc:bla:jinfst:v:74:y:2023:i:9:p:1124-1139
    DOI: 10.1002/asi.24809

    Download full text from publisher

    File URL: https://doi.org/10.1002/asi.24809
    Download Restriction: no

