
From a Smoking Gun to Spent Fuel: Principled Subsampling Methods for Building Big Language Data Corpora from Monitor Corpora

Author

Listed:
  • Jacqueline Hettel Tidwell

    (Department of English, Franklin College of Arts and Sciences, University of Georgia, Athens, GA 30602, USA)

Abstract

With the influence of Big Data culture on qualitative data collection, acquisition, and processing, it is becoming increasingly important that social scientists understand the complexity underlying data collection and the resulting models and analyses. Systematic approaches for creating computationally tractable models need to be employed in order to create representative, specialized reference corpora subsampled from Big Language Data sources. Even more importantly, any such method must be tested and vetted for its reproducibility and consistency in generating a representative model of the particular population in question. This article considers and tests one such method for downsampling digitally accessible Big Language Data, to determine both how to operationalize this form of corpus model creation and whether the method is reproducible. Using the U.S. Nuclear Regulatory Commission’s public documentation database as a test source, the sampling procedure was evaluated to assess variation in the rate at which documents were deemed fit for inclusion in or exclusion from the corpus across four iterations. After multiple sampling iterations, the approach pioneered by the creators of the Tobacco Documents Corpus was deemed reproducible and valid using a two-proportion z-test at the 99% confidence level at each stage of the evaluation process, leading to a final mean rejection ratio of 23.5875 and a variance of 0.891 for the documents sampled and evaluated for inclusion in the final text-based model. The findings of this study indicate that such a principled sampling method is viable, underscoring the need for an approach to creating language-based models that account for the extralinguistic factors and linguistic characteristics of documents.
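For readers unfamiliar with the test named in the abstract, the following is a minimal sketch of a pooled, two-sided two-proportion z-test comparing the rejection rates of two sampling iterations. It is not the article's own code: the function name `two_proportion_z_test`, the pooled-variance form, and the document counts are all assumptions chosen for illustration, since the abstract does not report per-iteration counts.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided two-proportion z-test.

    x1, x2: number of rejected documents in each iteration (hypothetical).
    n1, n2: number of documents sampled in each iteration (hypothetical).
    Returns (z, p_value).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both samples are all-rejected or all-accepted: no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts only; these are NOT figures from the article.
ALPHA = 0.01  # corresponds to the 99% confidence level
z, p = two_proportion_z_test(47, 200, 50, 200)
consistent = p >= ALPHA  # True means no significant difference was detected
print(f"z = {z:.3f}, p = {p:.3f}, consistent at 99%: {consistent}")
```

Under this reading, failing to reject the null hypothesis at each stage is what would support the claim that the rejection rate is stable from one iteration to the next.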

Suggested Citation

  • Jacqueline Hettel Tidwell, 2019. "From a Smoking Gun to Spent Fuel: Principled Subsampling Methods for Building Big Language Data Corpora from Monitor Corpora," Data, MDPI, vol. 4(2), pages 1-20, April.
  • Handle: RePEc:gam:jdataj:v:4:y:2019:i:2:p:48-:d:219221

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/4/2/48/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/4/2/48/
    Download Restriction: no
