Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach

Author

Listed:

Jacob Carlson

Abstract

Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights, e.g., estimating descriptive statistics of or causal effects on quantitative measures derived from text, audio, or video data. In many such settings, unsupervised analysis is of primary interest, in that the researcher does not want to (or cannot) pre-specify all important aspects of the unstructured data to measure; they are interested in "discovery." This paper proposes a general and flexible framework for pursuing discovery from unstructured data in a statistically principled way. The framework leverages recent methods from the literature on machine learning interpretability to map unstructured data points to high-dimensional, sparse, and interpretable "dictionaries" of concepts; computes statistics of dictionary entries for testing relevant concept-level hypotheses; performs selective inference on these hypotheses using algorithms validated by new results in high-dimensional central limit theory, producing a selected set ("discoveries"); and both generates and evaluates human-interpretable natural language descriptions of these discoveries. The proposed framework has few researcher degrees of freedom, is fully replicable, and is cheap to implement -- both in terms of financial cost and researcher time. Applications to recent descriptive and causal analyses of unstructured data in empirical economics are explored. An open source Jupyter notebook is provided for researchers to implement the framework in their own projects.
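The testing and selection steps described in the abstract can be illustrated with a minimal sketch. It assumes the unstructured data have already been mapped to a matrix of concept activations (one row per document, one column per dictionary entry) and that the concept-level hypothesis is a difference in means between two groups. The abstract does not name the selective inference algorithm, so the Benjamini-Hochberg step-up procedure is used here purely as a familiar stand-in. The function names (`welch_z`, `two_sided_p`, `benjamini_hochberg`, `discover`) and the simulated data are hypothetical and are not taken from the paper or its notebook.

```python
import math
import random


def welch_z(x, y):
    """Welch-type statistic for a difference in means between two samples."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    return (mx - my) / se if se > 0 else 0.0


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvals, alpha=0.05):
    """Indices rejected by the BH step-up rule (stand-in selection procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= alpha * rank / m:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])


def discover(activations, groups, alpha=0.05):
    """Test every concept for a group difference and return the selected set."""
    n_concepts = len(activations[0])
    pvals = []
    for j in range(n_concepts):
        treated = [row[j] for row, g in zip(activations, groups) if g == 1]
        control = [row[j] for row, g in zip(activations, groups) if g == 0]
        pvals.append(two_sided_p(welch_z(treated, control)))
    return benjamini_hochberg(pvals, alpha)


# Simulated (hypothetical) data: 400 documents, 50 concepts, and only
# concepts 0, 1, and 2 truly differ between the two groups.
rng = random.Random(0)
n_docs, n_concepts = 400, 50
groups = [i % 2 for i in range(n_docs)]
activations = []
for g in groups:
    row = []
    for j in range(n_concepts):
        shift = 1.0 if (g == 1 and j < 3) else 0.0
        # Clip at zero so many entries are exactly zero, as in a sparse dictionary.
        row.append(max(0.0, rng.gauss(shift, 1.0)))
    activations.append(row)

discoveries = discover(activations, groups)
print("Selected concept indices:", discoveries)
```

In this sketch the selected indices play the role of the "discoveries"; generating and evaluating natural language descriptions of them, the last step in the abstract, is not shown.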

Suggested Citation

Jacob Carlson, 2025. "Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach," Papers 2511.01680, arXiv.org, revised Jan 2026.

Handle: RePEc:arx:papers:2511.01680

References listed on IDEAS

Ingar Haaland & Christopher Roth & Stefanie Stantcheva & Johannes Wohlfart, 2025. "Understanding Economic Behavior Using Open-Ended Survey Data," Journal of Economic Literature, American Economic Association, vol. 63(4), pages 1244-1280, December.
- Ingar K. Haaland & Christopher Roth & Stefanie Stantcheva & Johannes Wohlfart, 2024. "Understanding Economic Behavior Using Open-ended Survey Data," NBER Working Papers 32421, National Bureau of Economic Research, Inc.
- Ingar Haaland & Christopher Roth & Stefanie Stantcheva & Johannes Wohlfart, 2025. "Understanding Economic Behavior Using Open-Ended Survey Data," ECONtribute Discussion Papers Series 362, University of Bonn and University of Cologne, Germany.
Jens Ludwig & Sendhil Mullainathan, 2024. "Machine Learning as a Tool for Hypothesis Generation," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 139(2), pages 751-827.
- Jens Ludwig & Sendhil Mullainathan, 2023. "Machine Learning as a Tool for Hypothesis Generation," NBER Working Papers 31017, National Bureau of Economic Research, Inc.
Clément Gorin & Stephan Heblich & Yanos Zylberberg, 2025. "State of the Art: Economic Development Through the Lens of Paintings," Bristol Economics Discussion Papers 25/793, School of Economics, University of Bristol, UK.
- Clément Gorin & Stephan Heblich & Yanos Zylberberg, 2025. "State of the Art: Economic Development Through the Lens of Paintings," NBER Working Papers 33976, National Bureau of Economic Research, Inc.
Sendhil Mullainathan & Jann Spiess, 2017. "Machine Learning: An Applied Econometric Approach," Journal of Economic Perspectives, American Economic Association, vol. 31(2), pages 87-106, Spring.
Stefanie Stantcheva, 2024. "Why Do We Dislike Inflation?," Brookings Papers on Economic Activity, Economic Studies Program, The Brookings Institution, vol. 55(1 (Spring)), pages 1-65.
Melissa Dell, 2025. "Deep Learning for Economists," Journal of Economic Literature, American Economic Association, vol. 63(1), pages 5-58, March.
- Melissa Dell, 2024. "Deep Learning for Economists," NBER Working Papers 32768, National Bureau of Economic Research, Inc.
- Melissa Dell, 2024. "Deep Learning for Economists," Papers 2407.15339, arXiv.org, revised Nov 2024.
Stefanie Stantcheva, 2024. "Why Do We Dislike Inflation?," NBER Working Papers 32300, National Bureau of Economic Research, Inc.

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Olena Kostyshyna & Isabelle Salle & Hung Truong, 2025. "Anchored Inflation Expectations: What Recent Data Reveal," Staff Working Papers 25-5, Bank of Canada.
Kantorowicz, Jaroslaw & Metelska-Szaniawska, Katarzyna, 2025. "Debt beliefs and public support for restrictive fiscal rules," Economics Letters, Elsevier, vol. 247(C).
Binetti, Alberto & Nuzzi, Francesco & Stantcheva, Stefanie, 2024. "People’s understanding of inflation," Journal of Monetary Economics, Elsevier, vol. 148(S).
Ina Hajdini & Edward S. Knotek & John Leer & Mathieu Pedemonte & Robert W. Rich & Raphael Schoenle, 2022. "Low Passthrough from Inflation Expectations to Income Growth Expectations: Why People Dislike Inflation," Working Papers 22-21R, Federal Reserve Bank of Cleveland, revised 27 Mar 2023.
- Hajdini, Ina & Knotek, Edward S & Leer, John & Pedemonte, Mathieu & Rich, Robert & Schoenle, Raphael, 2025. "Low Pass-Through from Inflation Expectations to Income Growth Expectations: Why People Dislike Inflation," IDB Publications (Working Papers) 13937, Inter-American Development Bank.
- Hajdini, Ina & Knotek, Edward & Leer, John & Pedemonte, Mathieu & Rich, Robert & Schoenle, Raphael, 2022. "Low Passthrough from Inflation Expectations to Income Growth Expectations: Why People Dislike Inflation," CEPR Discussion Papers 17356, C.E.P.R. Discussion Papers.
DiGiuseppe, Matthew & Garriga, Ana Carolina & Kern, Andreas, 2025. "Partisan Bias in Inflation Expectations," MPRA Paper 124391, University Library of Munich, Germany.
Samuel Chang & Andrew Kennedy & Aaron Leonard & John A. List, 2024. "12 Best Practices for Leveraging Generative AI in Experimental Research," NBER Working Papers 33025, National Bureau of Economic Research, Inc.
- Samuel Chang & Andrew Kennedy & Aaron Leonard & John List, 2024. "12 Best Practices for Leveraging Generative AI in Experimental Research," Artefactual Field Experiments 00796, The Field Experiments Website.
Krämer Andreas, 2025. "Lücke zwischen gefühlter und gemessener Inflation? Eine empirische Bestandsaufnahme," Wirtschaftsdienst, Sciendo, vol. 105(11), pages 821-827.
Sukjin Han & Kyungho Lee, 2025. "Copyright and Competition: Estimating Supply and Demand with Unstructured Data," Bristol Economics Discussion Papers 25/816, School of Economics, University of Bristol, UK.
Trebbi, Giovanni, 2025. "Inflation narratives and expectations," Working Paper Series 3158, European Central Bank.
Carola Conces Binder & Rupal Kamdar & Jane M. Ryngaert, 2024. "Partisan Expectations and COVID-Era Inflation," NBER Chapters, in: Inflation in the COVID Era and Beyond, National Bureau of Economic Research, Inc.
- Binder, Carola Conces & Kamdar, Rupal & Ryngaert, Jane M., 2024. "Partisan expectations and COVID-era inflation," Journal of Monetary Economics, Elsevier, vol. 148(S).
- Carola Binder & Rupal Kamdar & Jane M. Ryngaert, 2024. "Partisan Expectations and COVID-Era Inflation," NBER Working Papers 32650, National Bureau of Economic Research, Inc.
Çekin, Semih Emre & Polattimur, Hamza, 2025. "Televised inflation: Measuring TV news coverage and its effect on household expectations," ZEW Discussion Papers 25-051, ZEW - Leibniz Centre for European Economic Research.
Chenyu Hou & Tao Wang, 2025. "Uncovering Subjective Models from Survey Expectations," Staff Working Papers 25-31, Bank of Canada.
Hashim, Zeeshan & Fidrmuc, Jan & Ghosh, Sugata, 2025. "Political parties’ ideological bias and convergence in economic outcomes," European Journal of Political Economy, Elsevier, vol. 87(C).
- Zeeshan Hashim & Jan Fidrmuc & Sugata Ghosh, 2025. "Political parties’ ideological bias and convergence in economic outcomes," Post-Print hal-05108116, HAL.
Feyzollahi, Maryam & Rafizadeh, Nima, 2025. "The adoption of Large Language Models in economics research," Economics Letters, Elsevier, vol. 250(C).
Annie Liang, 2025. "Using Machine Learning to Generate, Clarify, and Improve Economic Models," Papers 2508.19136, arXiv.org.
Ajay Agrawal & John McHale & Alexander Oettl, 2025. "Comment on "Science in the Age of Algorithms"," NBER Chapters, in: The Economics of Transformative AI, National Bureau of Economic Research, Inc.
Chenyu Hou & Tao Wang, 2024. "Uncovering Subjective Models from Survey Expectations," Discussion Papers dp24-09, Department of Economics, Simon Fraser University.
Afrouzi, Hassan & Dietrich, Alexander & Myrseth, Kristian & Priftis, Romanos & Schoenle, Raphael, 2024. "Inflation Preferences," CEPR Discussion Papers 19006, C.E.P.R. Discussion Papers.
- Afrouzi, Hassan & Priftis, Romanos & Dietrich, Alexander M. & Myrseth, Kristian Ove R. & Schoenle, Raphael S., 2024. "Inflation preferences," Working Paper Series 2957, European Central Bank.
- Hassan Afrouzi & Alexander Dietrich & Kristian Myrseth & Romanos Priftis & Raphael Schoenle, 2024. "Inflation Preferences," NBER Working Papers 32379, National Bureau of Economic Research, Inc.
Paker, Meredith & Stephenson, Judy & Wallis, Patrick, 2025. "Predictive modeling the past," LSE Research Online Documents on Economics 128852, London School of Economics and Political Science, LSE Library.
Shiller, Robert J., 2024. "Comments on Alberto Binetti, Francesco Nuzzi, and Stefanie Stantcheva “people's understanding of inflation”," Journal of Monetary Economics, Elsevier, vol. 148(S).

More about this item

NEP fields

This paper has been announced in the following NEP Reports:

NEP-CMP-2025-11-10 (Computational Economics)
NEP-ECM-2025-11-10 (Econometrics)
