
Supervised clustering for automated document classification and prioritization: a case study using toxicological abstracts

Authors

  • Arun Varghese (ICF)
  • Michelle Cawley (ICF)
  • Tao Hong (ICF)

Abstract

Machine learning and natural language processing algorithms are currently widely used to retrieve relevant documents in a variety of contexts, including literature review and systematic review. Supervised machine learning algorithms perform well in terms of retrieval metrics such as recall and precision, but require the use of a sizeable training dataset, which is typically expensive to develop. Unsupervised machine learning algorithms do not require a training dataset and may perform well in terms of recall, but are typically lower in precision, and do not offer a transparent means for decision-makers to justify selection choices. In this paper, we illustrate the use of a hybrid document classification method based on semi-supervised learning that we refer to as “supervised clustering.” We show that supervised clustering combines the ease of use of unsupervised algorithms with the retrieval efficiency and transparency of supervised algorithms. We demonstrate through simulations the high performance and unbiased predictions of supervised clustering when provided even with only minimal training data. We further propose the use of ensemble learning as a means to maximize retrieval efficiency and to prioritize the review of those documents that are not eliminated by the supervised clustering algorithm.
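The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of "supervised clustering": documents are first grouped by some unsupervised method, and a handful of manually labelled seed documents then decide which clusters are kept for review. It is not the authors' implementation. The cluster assignments, the `min_share` threshold, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def select_clusters(cluster_of, seed_labels, min_share=0.5):
    """Return clusters whose labelled seeds are mostly relevant.

    cluster_of:  dict doc_id -> cluster id (assumed to come from any
                 unsupervised clustering step run beforehand)
    seed_labels: dict doc_id -> bool for the few manually labelled documents
    min_share:   hypothetical threshold on the relevant share among seeds
    """
    counts = defaultdict(lambda: [0, 0])  # cluster -> [relevant seeds, all seeds]
    for doc_id, is_relevant in seed_labels.items():
        tally = counts[cluster_of[doc_id]]
        tally[0] += int(is_relevant)
        tally[1] += 1
    # Clusters without any seed never enter `counts`, so they are not kept.
    return {c for c, (pos, total) in counts.items() if pos / total >= min_share}


def predict_relevant(cluster_of, kept_clusters):
    """Mark every document in a kept cluster as predicted relevant."""
    return {doc_id for doc_id, c in cluster_of.items() if c in kept_clusters}


def recall_precision(predicted, truly_relevant):
    """Recall = TP / all relevant; precision = TP / all predicted."""
    tp = len(predicted & truly_relevant)
    recall = tp / len(truly_relevant) if truly_relevant else 0.0
    precision = tp / len(predicted) if predicted else 0.0
    return recall, precision


# Toy data, invented for illustration: 8 abstracts in 3 clusters, 4 seeds.
cluster_of = {"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1,
              "d6": 2, "d7": 2, "d8": 2}
seed_labels = {"d1": True, "d4": False, "d6": True, "d7": False}
truly_relevant = {"d1", "d2", "d3", "d6"}

kept = select_clusters(cluster_of, seed_labels)
predicted = predict_relevant(cluster_of, kept)
recall, precision = recall_precision(predicted, truly_relevant)
print(sorted(kept), round(recall, 2), round(precision, 2))
```

In this toy case the mixed cluster is kept because half of its seeds are relevant, which favours recall over precision. Raising `min_share` would trade the other way, and the documents that remain could then be ranked for review by a separate model.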

Suggested Citation

  • Arun Varghese & Michelle Cawley & Tao Hong, 2018. "Supervised clustering for automated document classification and prioritization: a case study using toxicological abstracts," Environment Systems and Decisions, Springer, vol. 38(3), pages 398-414, September.
  • Handle: RePEc:spr:envsyd:v:38:y:2018:i:3:d:10.1007_s10669-017-9670-5
    DOI: 10.1007/s10669-017-9670-5



    Citations



    Cited by:

    1. Bhavna Singichetti & Adam Dodd & Jamie L. Conklin & Kristen Hassmiller Lich & Nasim S. Sabounchi & Rebecca B. Naumann, 2022. "Trends and Insights from Transportation Congestion Pricing Policy Research: A Bibliometric Analysis," IJERPH, MDPI, vol. 19(12), pages 1-13, June.
    2. Elizabeth C. Christenson & Ryan Cronk & Helen Atkinson & Aayush Bhatt & Emilio Berdiel & Michelle Cawley & Grace Cho & Collin Knox Coleman & Cailee Harrington & Kylie Heilferty & Don Fejfar & Emily J., 2021. "Evidence Map and Systematic Review of Disinfection Efficacy on Environmental Surfaces in Healthcare Facilities," IJERPH, MDPI, vol. 18(21), pages 1-22, October.
    3. Darcy M. Anderson & Ryan Cronk & Donald Fejfar & Emily Pak & Michelle Cawley & Jamie Bartram, 2021. "Safe Healthcare Facilities: A Systematic Review on the Costs of Establishing and Maintaining Environmental Health in Facilities in Low- and Middle-Income Countries," IJERPH, MDPI, vol. 18(2), pages 1-22, January.
    4. Annika M. Schoene & Ioannis Basinas & Martie van Tongeren & Sophia Ananiadou, 2022. "A Narrative Literature Review of Natural Language Processing Applied to the Occupational Exposome," IJERPH, MDPI, vol. 19(14), pages 1-14, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Paul Fogel & Yann Gaston-Mathé & Douglas Hawkins & Fajwel Fogel & George Luta & S. Stanley Young, 2016. "Applications of a Novel Clustering Approach Using Non-Negative Matrix Factorization to Environmental Research in Public Health," IJERPH, MDPI, vol. 13(5), pages 1-14, May.
    2. Nicolas Gillis & François Glineur, 2011. "Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization," LIDAM Discussion Papers CORE 2011030, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
    3. Flavia Esposito, 2021. "A Review on Initialization Methods for Nonnegative Matrix Factorization: Towards Omics Data Experiments," Mathematics, MDPI, vol. 9(9), pages 1-17, April.
    4. Haixuan Yang & Cathal Seoighe, 2016. "Impact of the Choice of Normalization Method on Molecular Cancer Class Discovery Using Nonnegative Matrix Factorization," PLOS ONE, Public Library of Science, vol. 11(10), pages 1-17, October.
    5. Nicolas Gillis & François Glineur, 2008. "Nonnegative factorization and the maximum edge biclique problem," LIDAM Discussion Papers CORE 2008064, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
    6. Jingu Kim & Yunlong He & Haesun Park, 2014. "Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework," Journal of Global Optimization, Springer, vol. 58(2), pages 285-319, February.
    7. Minghao Li & Zicheng Zhang & Qianrong Wang & Yan Yi & Baosheng Li, 2022. "Integrated cohort of esophageal squamous cell cancer reveals genomic features underlying clinical characteristics," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    8. José M. Maisog & Andrew T. DeMarco & Karthik Devarajan & Stanley Young & Paul Fogel & George Luta, 2021. "Assessing Methods for Evaluating the Number of Components in Non-Negative Matrix Factorization," Mathematics, MDPI, vol. 9(22), pages 1-13, November.
    9. Hui-Min Wang & Ching-Lin Hsiao & Ai-Ru Hsieh & Ying-Chao Lin & Cathy S J Fann, 2012. "Constructing Endophenotypes of Complex Diseases Using Non-Negative Matrix Factorization and Adjusted Rand Index," PLOS ONE, Public Library of Science, vol. 7(7), pages 1-12, July.
    10. Richard Nock & Natalia Polouliakh & Frank Nielsen & Keigo Oka & Carlin R Connell & Cedric Heimhofer & Kazuhiro Shibanai & Samik Ghosh & Ken-ichi Aisaki & Satoshi Kitajima & Jun Kanno & Taketo Akama & , 2020. "A Geometric Clustering Tool (AGCT) to robustly unravel the inner cluster structures of time-series gene expressions," PLOS ONE, Public Library of Science, vol. 15(7), pages 1-19, July.
