Printed from https://ideas.repec.org/p/arx/papers/2012.10790.html

Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem

Authors

  • Mochen Yang
  • Edward McFowland III
  • Gordon Burtch
  • Gediminas Adomavicius

Abstract

Combining machine learning with econometric analysis is becoming increasingly prevalent in both research and practice. A common empirical strategy involves the application of predictive modeling techniques to 'mine' variables of interest from available data, followed by the inclusion of those variables into an econometric framework, with the objective of estimating causal effects. Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error. We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest. We propose employing random forest not just for prediction, but also for generating instrumental variables to address the measurement error embedded in the prediction. The random forest algorithm performs best when comprised of a set of trees that are individually accurate in their predictions, yet which also make 'different' mistakes, i.e., have weakly correlated prediction errors. A key observation is that these properties are closely related to the relevance and exclusion requirements of valid instrumental variables. We design a data-driven procedure to select tuples of individual trees from a random forest, in which one tree serves as the endogenous covariate and the other trees serve as its instruments. Simulation experiments demonstrate the efficacy of the proposed approach in mitigating estimation biases and its superior performance over three alternative methods for bias correction.
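The link the abstract draws between weakly correlated prediction errors and valid instruments can be illustrated with a small simulation. The sketch below is not the authors' tree-selection procedure and does not train a random forest. It assumes two hypothetical predictions, `pred_a` and `pred_b`, that each equal the true covariate plus independent classical noise, standing in for two trees that are individually accurate but make different mistakes. All function names and parameter values are illustrative assumptions. Under these assumptions, regressing the outcome on `pred_a` alone attenuates the slope, while using `pred_b` as an instrument for `pred_a` recovers it approximately.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, beta=2.0, noise_sd=0.7, seed=0):
    """Simulate an outcome and two noisy predictions of one true covariate.

    Assumption: prediction errors are classical (independent of the truth,
    of the outcome noise, and of each other). Errors of real trees in a
    forest are typically correlated to some degree.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    y = [1.0 + beta * x + rng.gauss(0, 1) for x in x_true]
    pred_a = [x + rng.gauss(0, noise_sd) for x in x_true]  # endogenous covariate
    pred_b = [x + rng.gauss(0, noise_sd) for x in x_true]  # candidate instrument
    return y, pred_a, pred_b


def ols_slope(x, y):
    """Slope from a simple regression of y on x."""
    return cov(x, y) / cov(x, x)


def iv_slope(x, z, y):
    """Simple IV (Wald) slope with a single instrument z for x."""
    return cov(z, y) / cov(z, x)


if __name__ == "__main__":
    y, pred_a, pred_b = simulate()
    print("OLS slope on pred_a:", ols_slope(pred_a, y))
    print("IV slope, pred_b as instrument:", iv_slope(pred_a, pred_b, y))
```

In this setup, relevance holds because both predictions track the same true covariate, and exclusion holds because the instrument's error is independent of the covariate's error by construction. If the two errors were correlated, the IV estimate would also be biased, which is why a procedure that selects trees with weakly correlated errors matters.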

Suggested Citation

  • Mochen Yang & Edward McFowland III & Gordon Burtch & Gediminas Adomavicius, 2020. "Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem," Papers 2012.10790, arXiv.org.
  • Handle: RePEc:arx:papers:2012.10790

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2012.10790
    File Function: Latest version
    Download Restriction: no