
An expert‐in‐the‐loop method for domain‐specific document categorization based on small training data

Authors

  • Kanyao Han
  • Rezvaneh Rezapour
  • Katia Nakamura
  • Dikshya Devkota
  • Daniel C. Miller
  • Jana Diesner

Abstract

Automated text categorization methods are of broad relevance for domain experts since they free researchers and practitioners from manual labeling, save their resources (e.g., time, labor), and enrich the data with information helpful to study substantive questions. Despite a variety of newly developed categorization methods that require substantial amounts of annotated data, little is known about how to build models when (a) labeling texts with categories requires substantial domain expertise and/or in‐depth reading, (b) only a few annotated documents are available for model training, and (c) no relevant computational resources, such as pretrained models, are available. In a collaboration with environmental scientists who study the socio‐ecological impact of funded biodiversity conservation projects, we develop a method that integrates deep domain expertise with computational models to automatically categorize project reports based on a small sample of 93 annotated documents. Our results suggest that domain expertise can improve automated categorization and that the magnitude of these improvements is influenced by the experts' understanding of categories and their confidence in their annotation, as well as data sparsity and additional category characteristics such as the portion of exclusive keywords that can identify a category.

Suggested Citation

  • Kanyao Han & Rezvaneh Rezapour & Katia Nakamura & Dikshya Devkota & Daniel C. Miller & Jana Diesner, 2023. "An expert‐in‐the‐loop method for domain‐specific document categorization based on small training data," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 74(6), pages 669-684, June.
  • Handle: RePEc:bla:jinfst:v:74:y:2023:i:6:p:669-684
    DOI: 10.1002/asi.24714


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mohamed M. Mostafa, 2023. "A one-hundred-year structural topic modeling analysis of the knowledge structure of international management research," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3905-3935, August.
    2. van Loon, Austin, 2022. "Three Families of Automated Text Analysis," SocArXiv htnej, Center for Open Science.
    3. Keith Carlson & Michael A. Livermore & Daniel N. Rockmore, 2020. "The Problem of Data Bias in the Pool of Published U.S. Appellate Court Opinions," Journal of Empirical Legal Studies, John Wiley & Sons, vol. 17(2), pages 224-261, June.
    4. Wen Shi & Diyi Liu & Jing Yang & Jing Zhang & Sanmei Wen & Jing Su, 2020. "Social Bots’ Sentiment Engagement in Health Emergencies: A Topic-Based Analysis of the COVID-19 Pandemic Discussions on Twitter," IJERPH, MDPI, vol. 17(22), pages 1-18, November.
    5. David Ardia & Keven Bluteau & Mohammad‐Abbas Meghani, 2024. "Thirty years of academic finance," Journal of Economic Surveys, Wiley Blackwell, vol. 38(3), pages 1008-1042, July.
    6. Ben Cormier & Mark S. Manger, 2022. "Power, ideas, and World Bank conditionality," The Review of International Organizations, Springer, vol. 17(3), pages 397-425, July.
    7. Matthew Gentzkow & Bryan T. Kelly & Matt Taddy, 2017. "Text as Data," NBER Working Papers 23276, National Bureau of Economic Research, Inc.
    8. Pranav Goel & Nikolay Malkin & SoRelle W Gaynor & Nebojsa Jojic & Kristina Miler & Philip Resnik, 2023. "Donor activity is associated with US legislators’ attention to political issues," PLOS ONE, Public Library of Science, vol. 18(9), pages 1-24, September.
    9. Bernhardt, Lea & Dewenter, Ralf & Thomas, Tobias, 2023. "Measuring partisan media bias in US newscasts from 2001 to 2012," European Journal of Political Economy, Elsevier, vol. 78(C).
    10. Jeetendra Prakash Aryal & Tek B. Sapkota & Ritika Khurana & Arun Khatri-Chhetri & Dil Bahadur Rahut & M. L. Jat, 2020. "Climate change and agriculture in South Asia: adaptation options in smallholder production systems," Environment, Development and Sustainability: A Multidisciplinary Approach to the Theory and Practice of Sustainable Development, Springer, vol. 22(6), pages 5045-5075, August.
    11. Rauh, Christian, 2015. "Communicating supranational governance? The salience of EU affairs in the German Bundestag, 1991–2013," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 16(1), pages 116-138.
    12. Julia Seiermann, 2018. "Only Words? How Power in Trade Agreement Texts Affects International Trade Flows," UNCTAD Blue Series Papers 80, United Nations Conference on Trade and Development.
    13. Simplice Asongu & Uchenna Efobi & Ibukun Beecroft, 2015. "Inclusive Human Development in Pre-crisis Times of Globalization-driven Debts," African Development Review, African Development Bank, vol. 27(4), pages 428-442, December.
    14. Arthur Dyevre & Nicolas Lampach, 2021. "Issue attention on international courts: Evidence from the European Court of Justice," The Review of International Organizations, Springer, vol. 16(4), pages 793-815, October.
    15. Dewenter, Ralf & Dulleck, Uwe & Thomas, Tobias, 2018. "The political coverage index and its application to government capture," Research Papers 6, EcoAustria – Institute for Economic Research.
    16. Pastwa, Anna M. & Shrestha, Prabal & Thewissen, James & Torsin, Wouter, 2021. "Unpacking the black box of ICO white papers: a topic modeling approach," LIDAM Discussion Papers LFIN 2021018, Université catholique de Louvain, Louvain Finance (LFIN).
    17. Maksym Polyakov & Morteza Chalak & Md. Sayed Iftekhar & Ram Pandit & Sorada Tapsuwan & Fan Zhang & Chunbo Ma, 2018. "Authorship, Collaboration, Topics, and Research Gaps in Environmental and Resource Economics 1991–2015," Environmental & Resource Economics, Springer;European Association of Environmental and Resource Economists, vol. 71(1), pages 217-239, September.
    18. Milena Djourelova & Ruben Durante, 2019. "Media attention and strategic timing in politics: Evidence from U.S. presidential executive orders," Economics Working Papers 1675, Department of Economics and Business, Universitat Pompeu Fabra.
    19. Erkan Işığıçok & Sadullah Çelik & Dilek Özdemir Yılmaz, 2023. "Analysis of Skills and Qualifications Required in Data Scientist Job Postings Based on the Pareto Analysis Perspective Using Text Mining," EKOIST Journal of Econometrics and Statistics, Istanbul University, Faculty of Economics, vol. 0(39), pages 10-25, December.
    20. Susan, Enyang Besong & Pan, Yanchun, 2024. "Trust as a determinant of green finance through information sharing and technological penetration: Integrating the moderation of governance for sustainable growth," Technology in Society, Elsevier, vol. 77(C).
