Printed from https://ideas.repec.org/p/arx/papers/2511.01680.html

Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach

Author

Listed:
  • Jacob Carlson

Abstract

Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights, e.g., estimating descriptive statistics of or causal effects on quantitative measures derived from text, audio, or video data. In many such settings, unsupervised analysis is of primary interest, in that the researcher does not want to (or cannot) pre-specify all important aspects of the unstructured data to measure; they are interested in "discovery." This paper proposes a general and flexible framework for pursuing discovery from unstructured data in a statistically principled way. The framework leverages recent methods from the literature on machine learning interpretability to map unstructured data points to high-dimensional, sparse, and interpretable "dictionaries" of concepts; computes statistics of dictionary entries for testing relevant concept-level hypotheses; performs selective inference on these hypotheses using algorithms validated by new results in high-dimensional central limit theory, producing a selected set ("discoveries"); and both generates and evaluates human-interpretable natural language descriptions of these discoveries. The proposed framework has few researcher degrees of freedom, is fully replicable, and is cheap to implement -- both in terms of financial cost and researcher time. Applications to recent descriptive and causal analyses of unstructured data in empirical economics are explored. An open source Jupyter notebook is provided for researchers to implement the framework in their own projects.
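The testing step described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the Benjamini-Hochberg step-up rule and normal-approximation p-values are swapped in as standard stand-ins for the paper's selective inference procedure, and the sparse "dictionary" is simulated rather than produced by an interpretability model. All function names, concept indices, and activation rates below are hypothetical and chosen only for illustration.

```python
import math
import random


def two_sample_z(x_treated, x_control):
    """Welch-style z-statistic for a difference in mean concept activation."""
    n1, n0 = len(x_treated), len(x_control)
    m1, m0 = sum(x_treated) / n1, sum(x_control) / n0
    v1 = sum((x - m1) ** 2 for x in x_treated) / (n1 - 1)
    v0 = sum((x - m0) ** 2 for x in x_control) / (n0 - 1)
    se = math.sqrt(v1 / n1 + v0 / n0)
    # A concept that never varies carries no evidence; treat it as z = 0.
    return 0.0 if se == 0.0 else (m1 - m0) / se


def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values, q=0.05):
    """Return the sorted indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    cutoff = 0
    for rank, j in enumerate(order, start=1):
        # Step-up: keep the largest rank whose p-value clears its threshold.
        if p_values[j] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


def simulate(n=400, p=200, shifted=(3, 17, 42), seed=0):
    """Simulate sparse binary concept activations for n documents (assumed setup)."""
    rng = random.Random(seed)
    treated = [i % 2 == 1 for i in range(n)]
    rows = []
    for i in range(n):
        row = []
        for j in range(p):
            # Baseline activation is rare; treatment raises it for a few concepts.
            rate = 0.45 if (treated[i] and j in shifted) else 0.05
            row.append(1.0 if rng.random() < rate else 0.0)
        rows.append(row)
    return rows, treated


def discover(rows, treated, q=0.05):
    """Test every concept for a treated-vs-control difference, then apply BH."""
    p_values = []
    for j in range(len(rows[0])):
        x1 = [r[j] for r, t in zip(rows, treated) if t]
        x0 = [r[j] for r, t in zip(rows, treated) if not t]
        p_values.append(two_sided_p(two_sample_z(x1, x0)))
    return benjamini_hochberg(p_values, q=q)


if __name__ == "__main__":
    rows, treated = simulate()
    print("Selected concept indices:", discover(rows, treated))
```

In this sketch each selected index corresponds to a "discovery" in the abstract's sense; the subsequent step of generating and evaluating natural language descriptions of the selected concepts is not shown.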

Suggested Citation

  • Jacob Carlson, 2025. "Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach," Papers 2511.01680, arXiv.org, revised Jan 2026.
  • Handle: RePEc:arx:papers:2511.01680

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2511.01680
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Ingar Haaland & Christopher Roth & Stefanie Stantcheva & Johannes Wohlfart, 2025. "Understanding Economic Behavior Using Open-Ended Survey Data," Journal of Economic Literature, American Economic Association, vol. 63(4), pages 1244-1280, December.
    2. Jens Ludwig & Sendhil Mullainathan, 2024. "Machine Learning as a Tool for Hypothesis Generation," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 139(2), pages 751-827.
    3. Clément Gorin & Stephan Heblich & Yanos Zylberberg, 2025. "State of the Art: Economic Development Through the Lens of Paintings," Bristol Economics Discussion Papers 25/793, School of Economics, University of Bristol, UK.
    4. Sendhil Mullainathan & Jann Spiess, 2017. "Machine Learning: An Applied Econometric Approach," Journal of Economic Perspectives, American Economic Association, vol. 31(2), pages 87-106, Spring.
    5. Stefanie Stantcheva, 2024. "Why Do We Dislike Inflation?," Brookings Papers on Economic Activity, Economic Studies Program, The Brookings Institution, vol. 55(1 (Spring)), pages 1-65.
    6. Melissa Dell, 2025. "Deep Learning for Economists," Journal of Economic Literature, American Economic Association, vol. 63(1), pages 5-58, March.
    7. Stefanie Stantcheva, 2024. "Why Do We Dislike Inflation?," NBER Working Papers 32300, National Bureau of Economic Research, Inc.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Olena Kostyshyna & Isabelle Salle & Hung Truong, 2025. "Anchored Inflation Expectations: What Recent Data Reveal," Staff Working Papers 25-5, Bank of Canada.
    2. Kantorowicz, Jaroslaw & Metelska-Szaniawska, Katarzyna, 2025. "Debt beliefs and public support for restrictive fiscal rules," Economics Letters, Elsevier, vol. 247(C).
    3. Binetti, Alberto & Nuzzi, Francesco & Stantcheva, Stefanie, 2024. "People’s understanding of inflation," Journal of Monetary Economics, Elsevier, vol. 148(S).
    4. Ina Hajdini & Edward S. Knotek & John Leer & Mathieu Pedemonte & Robert W. Rich & Raphael Schoenle, 2022. "Low Passthrough from Inflation Expectations to Income Growth Expectations: Why People Dislike Inflation," Working Papers 22-21R, Federal Reserve Bank of Cleveland, revised 27 Mar 2023.
    5. DiGiuseppe, Matthew & Garriga, Ana Carolina & Kern, Andreas, 2025. "Partisan Bias in Inflation Expectations," MPRA Paper 124391, University Library of Munich, Germany.
    6. Samuel Chang & Andrew Kennedy & Aaron Leonard & John A. List, 2024. "12 Best Practices for Leveraging Generative AI in Experimental Research," NBER Working Papers 33025, National Bureau of Economic Research, Inc.
    7. Krämer Andreas, 2025. "Lücke zwischen gefühlter und gemessener Inflation? Eine empirische Bestandsaufnahme," Wirtschaftsdienst, Sciendo, vol. 105(11), pages 821-827.
    8. Sukjin Han & Kyungho Lee, 2025. "Copyright and Competition: Estimating Supply and Demand with Unstructured Data," Bristol Economics Discussion Papers 25/816, School of Economics, University of Bristol, UK.
    9. Trebbi, Giovanni, 2025. "Inflation narratives and expectations," Working Paper Series 3158, European Central Bank.
    10. Carola Conces Binder & Rupal Kamdar & Jane M. Ryngaert, 2024. "Partisan Expectations and COVID-Era Inflation," NBER Chapters, in: Inflation in the COVID Era and Beyond, National Bureau of Economic Research, Inc.
    11. Çekin, Semih Emre & Polattimur, Hamza, 2025. "Televised inflation: Measuring TV news coverage and its effect on household expectations," ZEW Discussion Papers 25-051, ZEW - Leibniz Centre for European Economic Research.
    12. Chenyu Hou & Tao Wang, 2025. "Uncovering Subjective Models from Survey Expectations," Staff Working Papers 25-31, Bank of Canada.
    13. Hashim, Zeeshan & Fidrmuc, Jan & Ghosh, Sugata, 2025. "Political parties’ ideological bias and convergence in economic outcomes," European Journal of Political Economy, Elsevier, vol. 87(C).
    14. Feyzollahi, Maryam & Rafizadeh, Nima, 2025. "The adoption of Large Language Models in economics research," Economics Letters, Elsevier, vol. 250(C).
    15. Annie Liang, 2025. "Using Machine Learning to Generate, Clarify, and Improve Economic Models," Papers 2508.19136, arXiv.org.
    16. Ajay Agrawal & John McHale & Alexander Oettl, 2025. "Comment on "Science in the Age of Algorithms"," NBER Chapters, in: The Economics of Transformative AI, National Bureau of Economic Research, Inc.
    17. Chenyu Hou & Tao Wang, 2024. "Uncovering Subjective Models from Survey Expectations," Discussion Papers dp24-09, Department of Economics, Simon Fraser University.
    18. Afrouzi, Hassan & Dietrich, Alexander & Myrseth, Kristian & Priftis, Romanos & Schoenle, Raphael, 2024. "Inflation Preferences," CEPR Discussion Papers 19006, C.E.P.R. Discussion Papers.
    19. Paker, Meredith & Stephenson, Judy & Wallis, Patrick, 2025. "Predictive modeling the past," LSE Research Online Documents on Economics 128852, London School of Economics and Political Science, LSE Library.
    20. Shiller, Robert J., 2024. "Comments on Alberto Binetti, Francesco Nuzzi, and Stefanie Stantcheva “people's understanding of inflation”," Journal of Monetary Economics, Elsevier, vol. 148(S).
