Printed from https://ideas.repec.org/p/arx/papers/2601.05374.html

From Unstructured Data to Demand Counterfactuals: Theory and Practice

Author

Listed:
  • Timothy Christensen
  • Giovanni Compiani

Abstract

Empirical models of demand for differentiated products rely on low-dimensional product representations to capture substitution patterns. These representations are increasingly proxied by applying ML methods to high-dimensional, unstructured data, including product descriptions and images. When proxies fail to capture the true dimensions of differentiation that drive substitution, standard workflows will deliver biased counterfactuals and invalid inference. We develop a practical toolkit that corrects this bias and ensures valid inference for a broad class of counterfactuals. Our approach applies to market-level and/or individual data, requires minimal additional computation, is efficient, delivers simple formulas for standard errors, and accommodates data-dependent proxies, including embeddings from fine-tuned ML models. It can also be used with standard quantitative attributes when mismeasurement is a concern. In addition, we propose diagnostics to assess the adequacy of the proxy construction and dimension. The approach yields meaningful improvements in predicting counterfactual substitution in both simulations and an empirical application.

Suggested Citation

  • Timothy Christensen & Giovanni Compiani, 2026. "From Unstructured Data to Demand Counterfactuals: Theory and Practice," Papers 2601.05374, arXiv.org.
  • Handle: RePEc:arx:papers:2601.05374

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2601.05374
    File Function: Latest version
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Steven T. Berry & Philip A. Haile, 2021. "Foundations of Demand Estimation," NBER Working Papers 29305, National Bureau of Economic Research, Inc.
    2. Steven T. Berry & Philip A. Haile, 2024. "Nonparametric Identification of Differentiated Products Demand Using Micro Data," Econometrica, Econometric Society, vol. 92(4), pages 1135-1162, July.
    3. David P. Byrne & Susumu Imai & Vasilis Sarafidis & Masayuki Hirukawa, 2015. "Instrument-free Identification and Estimation of Differentiated Products Models," Working Paper Series 26, Economics Discipline Group, UTS Business School, University of Technology, Sydney.
    4. Byrne, David P. & Imai, Susumu & Jain, Neelam & Sarafidis, Vasilis, 2022. "Instrument-free identification and estimation of differentiated products models using cost data," Journal of Econometrics, Elsevier, vol. 228(2), pages 278-301.
    5. Byrne, D. P. & Imai, S. & Jain, N. & Sarafidis, V. & Hirukawa, M., 2020. "Identification and Estimation of Differentiated Products Models using Cost Data," Working Papers 15/05, Department of Economics, City St George's, University of London.
    6. Wang, Ao, 2021. "A BLP Demand Model of Product-Level Market Shares with Complementarity," The Warwick Economics Research Paper Series (TWERPS) 1351, University of Warwick, Department of Economics.
    7. Amit Gandhi & Jean-François Houde, 2019. "Measuring Substitution Patterns in Differentiated-Products Industries," NBER Working Papers 26375, National Bureau of Economic Research, Inc.
    8. David P. Byrne & Susumu Imai & Neelam Jain & Vasilis Sarafidis & Masayuki Hirukawa, 2019. "Identification and Estimation of Differentiated Products Models," Monash Econometrics and Business Statistics Working Papers 33/19, Monash University, Department of Econometrics and Business Statistics.
    9. Christopher Conlon & Jeff Gortmaker, 2020. "Best practices for differentiated products demand estimation with PyBLP," RAND Journal of Economics, RAND Corporation, vol. 51(4), pages 1108-1161, December.
    10. Wang, Ao, 2023. "Sieve BLP: A semi-nonparametric model of demand for differentiated products," Journal of Econometrics, Elsevier, vol. 235(2), pages 325-351.
    11. Steven T. Berry & Philip A. Haile, 2009. "Nonparametric Identification of Multinomial Choice Demand Models with Heterogeneous Consumers," NBER Working Papers 15276, National Bureau of Economic Research, Inc.
    12. Isaiah Andrews & Matthew Gentzkow & Jesse M. Shapiro, 2017. "Measuring the Sensitivity of Parameter Estimates to Estimation Moments," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 132(4), pages 1553-1592.
    13. Steven T. Berry & Philip A. Haile, 2014. "Identification in Differentiated Products Markets Using Market Level Data," Econometrica, Econometric Society, vol. 82(5), pages 1749-1797, September.
    14. Miravete, Eugenio J. & Seim, Katja & Thurk, Jeff, 2023. "Pass-through and tax incidence in differentiated product markets," International Journal of Industrial Organization, Elsevier, vol. 90(C).
    15. Gautam Gowrisankaran & Marc Rysman, 2012. "Dynamics of Consumer Demand for New Durable Goods," Journal of Political Economy, University of Chicago Press, vol. 120(6), pages 1173-1219.
    16. Iaria, Alessandro & Wang, Ao, 2020. "Identification and Estimation of Demand for Bundles," CEPR Discussion Papers 14363, C.E.P.R. Discussion Papers.
    17. Victor Aguirregabiria & Margaret Slade, 2017. "Empirical models of firms and industries," Canadian Journal of Economics, Canadian Economics Association, vol. 50(5), pages 1445-1488, December.
    18. Pietro Tebaldi & Alexander Torgovitsky & Hanbin Yang, 2023. "Nonparametric Estimates of Demand in the California Health Insurance Exchange," Econometrica, Econometric Society, vol. 91(1), pages 107-146, January.
    19. Lu, Zhentong & Shi, Xiaoxia & Tao, Jing, 2023. "Semi-nonparametric estimation of random coefficients logit model for aggregate demand," Journal of Econometrics, Elsevier, vol. 235(2), pages 2245-2265.
    20. Victor Aguirregabiria & Hui Liu & Yao Luo, 2026. "Nested Pseudo-GMM Estimation of Demand for Differentiated Products," Papers 2602.05137, arXiv.org, revised Feb 2026.
