Printed from https://ideas.repec.org/p/arx/papers/2511.01680.html

Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach

Author

Listed:
  • Jacob Carlson

Abstract

Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights, e.g., estimating causal effects on text outcomes, measuring beliefs from open-ended survey responses. In such settings, unsupervised analysis is often of interest, in that the researcher does not want to pre-specify the objects of measurement or otherwise artificially delimit the space of measurable concepts; they are interested in discovery. This paper proposes a general and flexible framework for pursuing discovery from unstructured data in a statistically principled way. The framework leverages recent methods from the literature on machine learning interpretability to map unstructured data points to high-dimensional, sparse, and interpretable dictionaries of concepts; computes (test) statistics of these dictionary entries; and then performs selective inference on them using newly developed statistical procedures for high-dimensional exceedance control of the $k$-FWER under arbitrary dependence. The proposed framework has few researcher degrees of freedom, is fully replicable, and is cheap to implement -- both in terms of financial cost and researcher time. Applications to recent descriptive and causal analyses of unstructured data in empirical economics are explored. An open source Jupyter notebook is provided for researchers to implement the framework in their own projects.
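
The final step described in the abstract, controlling the $k$-FWER across many dictionary-entry test statistics under arbitrary dependence, can be illustrated with a minimal sketch. The code below is not the paper's newly developed procedure, which is not specified on this page. It uses the classical single-step generalized Bonferroni rule of Lehmann and Romano (2005), which rejects hypothesis $i$ when $p_i \le k\alpha/m$ and also controls the $k$-FWER under arbitrary dependence. The function name `k_fwer_bonferroni` and the example p-values are hypothetical, chosen only for illustration.

```python
from typing import List


def k_fwer_bonferroni(p_values: List[float], k: int = 1, alpha: float = 0.05) -> List[int]:
    """Indices rejected by the single-step generalized Bonferroni rule.

    Illustrative stand-in only (NOT the paper's procedure): reject H_i when
    p_i <= k * alpha / m, which keeps the probability of k or more false
    rejections at or below alpha under arbitrary dependence.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    m = len(p_values)
    if m == 0:
        return []
    threshold = k * alpha / m  # k = 1 recovers the ordinary Bonferroni cutoff
    return [i for i, p in enumerate(p_values) if p <= threshold]


# Hypothetical p-values, one per dictionary entry (concept).
example_p_values = [0.0001, 0.004, 0.02, 0.2, 0.5]

strict = k_fwer_bonferroni(example_p_values, k=1, alpha=0.05)   # cutoff 0.01
relaxed = k_fwer_bonferroni(example_p_values, k=3, alpha=0.05)  # cutoff 0.03
print(strict, relaxed)
```

Raising `k` tolerates a few more false rejections in exchange for a looser cutoff, which is the trade-off that makes $k$-FWER attractive when the number of candidate concepts is very large.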

Suggested Citation

  • Jacob Carlson, 2025. "Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach," Papers 2511.01680, arXiv.org.
  • Handle: RePEc:arx:papers:2511.01680

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2511.01680
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Ingar Haaland & Christopher Roth & Stefanie Stantcheva & Johannes Wohlfart, 2025. "Understanding Economic Behavior Using Open-Ended Survey Data," Journal of Economic Literature, American Economic Association, vol. 63(4), pages 1244-1280, December.
    2. Jens Ludwig & Sendhil Mullainathan, 2024. "Machine Learning as a Tool for Hypothesis Generation," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 139(2), pages 751-827.
    3. Clément Gorin & Stephan Heblich & Yanos Zylberberg, 2025. "State of the Art: Economic Development Through the Lens of Paintings," Bristol Economics Discussion Papers 25/793, School of Economics, University of Bristol, UK.
    4. Melissa Dell, 2025. "Deep Learning for Economists," Journal of Economic Literature, American Economic Association, vol. 63(1), pages 5-58, March.
    5. Sendhil Mullainathan & Jann Spiess, 2017. "Machine Learning: An Applied Econometric Approach," Journal of Economic Perspectives, American Economic Association, vol. 31(2), pages 87-106, Spring.
    6. Stefanie Stantcheva, 2024. "Why Do We Dislike Inflation?," Brookings Papers on Economic Activity, Economic Studies Program, The Brookings Institution, vol. 55(1 (Spring)), pages 1-65.
    7. Stefanie Stantcheva, 2024. "Why Do We Dislike Inflation?," NBER Working Papers 32300, National Bureau of Economic Research, Inc.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Hajdini, Ina & Knotek, Edward & Leer, John & Pedemonte, Mathieu & Rich, Robert & Schoenle, Raphael, 2022. "Low Passthrough from Inflation Expectations to Income Growth Expectations: Why People Dislike Inflation," CEPR Discussion Papers 17356, C.E.P.R. Discussion Papers.
    2. Afrouzi, Hassan & Priftis, Romanos & Dietrich, Alexander M. & Myrseth, Kristian Ove R. & Schoenle, Raphael S., 2024. "Inflation preferences," Working Paper Series 2957, European Central Bank.
    3. Olena Kostyshyna & Isabelle Salle & Hung Truong, 2025. "Anchored Inflation Expectations: What Recent Data Reveal," Staff Working Papers 25-5, Bank of Canada.
    4. Kantorowicz, Jaroslaw & Metelska-Szaniawska, Katarzyna, 2025. "Debt beliefs and public support for restrictive fiscal rules," Economics Letters, Elsevier, vol. 247(C).
    5. Binetti, Alberto & Nuzzi, Francesco & Stantcheva, Stefanie, 2024. "People’s understanding of inflation," Journal of Monetary Economics, Elsevier, vol. 148(S).
    6. DiGiuseppe, Matthew & Garriga, Ana Carolina & Kern, Andreas, 2025. "Partisan Bias in Inflation Expectations," MPRA Paper 124391, University Library of Munich, Germany.
    7. Samuel Chang & Andrew Kennedy & Aaron Leonard & John A. List, 2024. "12 Best Practices for Leveraging Generative AI in Experimental Research," NBER Working Papers 33025, National Bureau of Economic Research, Inc.
    8. Carola Conces Binder & Rupal Kamdar & Jane M. Ryngaert, 2024. "Partisan Expectations and COVID-Era Inflation," NBER Chapters, in: Inflation in the COVID Era and Beyond, National Bureau of Economic Research, Inc.
    9. Çekin, Semih Emre & Polattimur, Hamza, 2025. "Televised inflation: Measuring TV news coverage and its effect on household expectations," ZEW Discussion Papers 25-051, ZEW - Leibniz Centre for European Economic Research.
    10. Chenyu Hou & Tao Wang, 2025. "Uncovering Subjective Models from Survey Expectations," Staff Working Papers 25-31, Bank of Canada.
    11. Hashim, Zeeshan & Fidrmuc, Jan & Ghosh, Sugata, 2025. "Political parties’ ideological bias and convergence in economic outcomes," European Journal of Political Economy, Elsevier, vol. 87(C).
    12. Feyzollahi, Maryam & Rafizadeh, Nima, 2025. "The adoption of Large Language Models in economics research," Economics Letters, Elsevier, vol. 250(C).
    13. Annie Liang, 2025. "Using Machine Learning to Generate, Clarify, and Improve Economic Models," Papers 2508.19136, arXiv.org.
    14. Ajay Agrawal & John McHale & Alexander Oettl, 2025. "Comment on "Science in the Age of Algorithms"," NBER Chapters, in: The Economics of Transformative AI, National Bureau of Economic Research, Inc.
    15. Chenyu Hou & Tao Wang, 2024. "Uncovering Subjective Models from Survey Expectations," Discussion Papers dp24-09, Department of Economics, Simon Fraser University.
    16. Paker, Meredith & Stephenson, Judy & Wallis, Patrick, 2025. "Predictive modeling the past," LSE Research Online Documents on Economics 128852, London School of Economics and Political Science, LSE Library.
    17. Shiller, Robert J., 2024. "Comments on Alberto Binetti, Francesco Nuzzi, and Stefanie Stantcheva “people's understanding of inflation”," Journal of Monetary Economics, Elsevier, vol. 148(S).
    18. Zehao Lin & Ying Liu & Congrong Pan & Lutz Sager, 2025. "Can Air Pollution Affect Our Sentiments: Social Media Evidence from Japan," CESifo Working Paper Series 12030, CESifo.
    19. Joshua Foster & Fredrik Odegaard, 2025. "Decoding Consumer Preferences Using Attention-Based Language Models," Papers 2507.17564, arXiv.org.
    20. Sophie-Charlotte Klose & Johannes Lederer, 2020. "A Pipeline for Variable Selection and False Discovery Rate Control With an Application in Labor Economics," Papers 2006.12296, arXiv.org, revised Jun 2020.
