Printed from https://ideas.repec.org/p/arx/papers/2012.10790.html

Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem

Author

Listed:
  • Mochen Yang
  • Edward McFowland III
  • Gordon Burtch
  • Gediminas Adomavicius

Abstract

Combining machine learning with econometric analysis is becoming increasingly prevalent in both research and practice. A common empirical strategy involves the application of predictive modeling techniques to 'mine' variables of interest from available data, followed by the inclusion of those variables into an econometric framework, with the objective of estimating causal effects. Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error. We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest. We propose employing random forest not just for prediction, but also for generating instrumental variables to address the measurement error embedded in the prediction. The random forest algorithm performs best when comprised of a set of trees that are individually accurate in their predictions, yet which also make 'different' mistakes, i.e., have weakly correlated prediction errors. A key observation is that these properties are closely related to the relevance and exclusion requirements of valid instrumental variables. We design a data-driven procedure to select tuples of individual trees from a random forest, in which one tree serves as the endogenous covariate and the other trees serve as its instruments. Simulation experiments demonstrate the efficacy of the proposed approach in mitigating estimation biases and its superior performance over three alternative methods for bias correction.
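The intuition behind using one tree as the endogenous covariate and another as its instrument can be illustrated with a small simulation. What follows is a minimal sketch under strong simplifying assumptions, not the authors' procedure: no random forest is trained, and the two "tree predictions" (`tree_a`, `tree_b`) are hypothetical stand-ins built as the true variable plus independent Gaussian noise, i.e., classical measurement error with uncorrelated mistakes. The function names, the noise level, and the sample size are all illustrative choices. The sketch compares a naive regression slope on one noisy proxy with a simple instrumental-variable ratio estimator that uses the second proxy as the instrument.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, beta=2.0, noise_sd=0.7, seed=0):
    """Return (naive OLS slope, IV slope) for a simulated outcome.

    Assumption: tree_a and tree_b are hypothetical proxies for the
    predictions of two individual trees. Each equals the true variable
    plus its own independent error, so the errors are uncorrelated.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]

    # Two imperfect "predictions" of x_true that make different mistakes.
    tree_a = [x + rng.gauss(0, noise_sd) for x in x_true]
    tree_b = [x + rng.gauss(0, noise_sd) for x in x_true]

    # The outcome depends on the true variable, which the analyst never observes.
    y = [1.0 + beta * x + rng.gauss(0, 1) for x in x_true]

    # Naive slope: regress y on the mismeasured covariate tree_a.
    ols = cov(tree_a, y) / cov(tree_a, tree_a)

    # IV slope: tree_b is relevant (correlated with tree_a through x_true)
    # and, by construction here, uncorrelated with the error in tree_a.
    iv = cov(tree_b, y) / cov(tree_b, tree_a)
    return ols, iv


if __name__ == "__main__":
    ols, iv = simulate()
    print(f"true beta = 2.0, naive OLS = {ols:.3f}, IV = {iv:.3f}")
```

In this setup the naive slope is pulled toward zero by the noise in `tree_a`, while the IV slope lands close to the true coefficient. With real trees the prediction errors are usually correlated to some degree and need not be classical, which is why the abstract describes a data-driven procedure for choosing which trees to pair.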

Suggested Citation

  • Mochen Yang & Edward McFowland III & Gordon Burtch & Gediminas Adomavicius, 2020. "Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem," Papers 2012.10790, arXiv.org.
  • Handle: RePEc:arx:papers:2012.10790

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2012.10790
    File Function: Latest version
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mochen Yang & Edward McFowland & Gordon Burtch & Gediminas Adomavicius, 2022. "Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem," INFORMS Journal on Data Science, INFORMS, vol. 1(2), pages 138-155, October.
    2. Mammen, Enno & Rothe, Christoph & Schienle, Melanie, 2016. "Semiparametric Estimation With Generated Covariates," Econometric Theory, Cambridge University Press, vol. 32(5), pages 1140-1177, October.
    3. Gordon Burtch & Edward McFowland III & Mochen Yang & Gediminas Adomavicius, 2023. "EnsembleIV: Creating Instrumental Variables from Ensemble Learners for Robust Statistical Inference," Papers 2303.02820, arXiv.org.
    4. DUFOUR, Jean-Marie & JASIAK, Joanna, 1998. "Finite-Sample Inference Methods for Simultaneous Equations and Models with Unobserved and Generated Regressors," Cahiers de recherche 9812, Universite de Montreal, Departement de sciences economiques.
    5. Susanne M. Schennach, 2012. "Measurement error in nonlinear models - a review," CeMMAP working papers 41/12, Institute for Fiscal Studies.
    6. Anish Agarwal & Rahul Singh, 2021. "Causal Inference with Corrupted Data: Measurement Error, Missing Values, Discretization, and Differential Privacy," Papers 2107.02780, arXiv.org, revised Feb 2024.
    7. Victor Chernozhukov & Juan Carlos Escanciano & Hidehiko Ichimura & Whitney K. Newey & James M. Robins, 2022. "Locally Robust Semiparametric Estimation," Econometrica, Econometric Society, vol. 90(4), pages 1501-1535, July.
    8. Patrick Saart & Jiti Gao & Nam Hyun Kim, 2014. "Semiparametric methods in nonlinear time series analysis: a selective review," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 26(1), pages 141-169, March.
    9. Stoker, Thomas M. & Berndt, Ernst R. & Denny Ellerman, A. & Schennach, Susanne M., 2005. "Panel data analysis of U.S. coal productivity," Journal of Econometrics, Elsevier, vol. 127(2), pages 131-164, August.
    10. Prokhorov, Artem & Schmidt, Peter, 2009. "GMM redundancy results for general missing data problems," Journal of Econometrics, Elsevier, vol. 151(1), pages 47-55, July.
    11. Yingyao Hu & Geert Ridder, 2012. "Estimation of nonlinear models with mismeasured regressors using marginal information," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 27(3), pages 347-385, April.
    12. Karun Adusumilli & Taisuke Otsu, 2015. "Nonparametric instrumental regression with errors in variables," STICERD - Econometrics Paper Series /2015/585, Suntory and Toyota International Centres for Economics and Related Disciplines, LSE.
    13. Xiaohong Chen & Yingyao Hu, 2006. "Identification and Inference of Nonlinear Models Using Two Samples with Arbitrary Measurement Errors," Cowles Foundation Discussion Papers 1590, Cowles Foundation for Research in Economics, Yale University.
    14. Hu, Yingyao, 2008. "Identification and estimation of nonlinear models with misclassification error using instrumental variables: A general solution," Journal of Econometrics, Elsevier, vol. 144(1), pages 27-61, May.
    15. Jayeeta Bhattacharya, 2020. "Quantile regression with generated dependent variable and covariates," Papers 2012.13614, arXiv.org.
    16. Song, Suyong, 2015. "Semiparametric estimation of models with conditional moment restrictions in the presence of nonclassical measurement errors," Journal of Econometrics, Elsevier, vol. 185(1), pages 95-109.
    17. Jiaming Mao & Jingzhi Xu, 2020. "Ensemble Learning with Statistical and Structural Models," Papers 2006.05308, arXiv.org.
    18. Yingyao Hu & Susanne M. Schennach, 2008. "Instrumental Variable Treatment of Nonclassical Measurement Error Models," Econometrica, Econometric Society, vol. 76(1), pages 195-216, January.
    19. Wang, Liqun & Hsiao, Cheng, 2011. "Method of moments estimation and identifiability of semiparametric nonlinear errors-in-variables models," Journal of Econometrics, Elsevier, vol. 165(1), pages 30-44.
    20. Susanne M. Schennach, 2013. "Regressions with Berkson errors in covariates - A nonparametric approach," Papers 1308.2836, arXiv.org.
