
How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do About It

Author

Listed:
  • Zhang, Han

    (The Hong Kong University of Science and Technology)

Abstract

Social scientists have increasingly been applying machine learning algorithms to "big data" to measure theoretical concepts they could not easily measure before, and then using these machine-predicted variables in regressions. This article first demonstrates that directly inserting binary predictions (i.e., classifications) without regard for prediction error will generally lead to attenuation bias in either slope coefficients or marginal effect estimates. We then propose several estimators to obtain consistent estimates of coefficients. The estimators require validation data, for which researchers have both machine predictions and true values. Such validation data is either automatically available while training the algorithms or can be easily obtained. Monte Carlo simulations demonstrate the effectiveness of the proposed estimators. Finally, we summarize how machine learning predictions are used in 18 recent publications in top social science journals, apply our proposed estimators to two of them, and offer some practical recommendations.
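
The attenuation described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: the data-generating process, the error rates (`fpr`, `fnr`), the function name `simulate`, and the correction step are all illustrative assumptions. The correction is a simple moment-based adjustment in the spirit of classical results on misclassified binary regressors; it assumes the misclassification is unrelated to the outcome once the true value is known, and it should not be read as one of the estimators the paper proposes.

```python
import numpy as np


def simulate(n=200_000, beta=2.0, p=0.3, fpr=0.1, fnr=0.2, n_val=5_000, seed=0):
    """Compare a naive slope on predicted labels with a validation-based correction.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)

    # True binary regressor and outcome: y = 1 + beta * x + noise.
    x = rng.random(n) < p
    y = 1.0 + beta * x + rng.normal(size=n)

    # Machine "prediction": true positives are missed with probability fnr,
    # true negatives are flagged with probability fpr.
    u = rng.random(n)
    x_hat = np.where(x, u >= fnr, u < fpr)

    # With a single binary regressor, the OLS slope equals the difference
    # in mean outcomes between the two predicted groups.
    naive = y[x_hat].mean() - y[~x_hat].mean()

    # Validation subsample: both x and x_hat are observed for these rows.
    x_val, x_hat_val = x[:n_val], x_hat[:n_val]
    share_true_among_pred_1 = x_val[x_hat_val].mean()
    share_true_among_pred_0 = x_val[~x_hat_val].mean()

    # Under outcome-independent misclassification, the naive slope equals
    # beta * (P(x=1 | x_hat=1) - P(x=1 | x_hat=0)), so divide that factor out.
    corrected = naive / (share_true_among_pred_1 - share_true_among_pred_0)
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = simulate()
    print(f"true slope: 2.0, naive: {naive:.3f}, corrected: {corrected:.3f}")
```

In this setup the naive slope is pulled toward zero because each predicted group mixes true zeros and true ones, while the rescaled estimate should land near the true slope, with extra noise coming from the size of the validation subsample.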

Suggested Citation

  • Zhang, Han, 2021. "How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do About It," SocArXiv 453jk, Center for Open Science.
  • Handle: RePEc:osf:socarx:453jk
    DOI: 10.31219/osf.io/453jk

    Download full text from publisher

    File URL: https://osf.io/download/60b0940a3a6df10031d4e9ff/
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Wossen, Tesfamicheal & Abay, Kibrom A. & Abdoulaye, Tahirou, 2022. "Misperceiving and misreporting input quality: Implications for input use and productivity," Journal of Development Economics, Elsevier, vol. 157(C).
    2. Takahide Yanagi, 2019. "Inference on local average treatment effects for misclassified treatment," Econometric Reviews, Taylor & Francis Journals, vol. 38(8), pages 938-960, September.
    3. Tommasi, Denni & Zhang, Lina, 2024. "Bounding program benefits when participation is misreported," Journal of Econometrics, Elsevier, vol. 238(1).
    4. Akanksha Negi & Digvijay Singh Negi, 2022. "Difference-in-Differences with a Misclassified Treatment," Papers 2208.02412, arXiv.org.
    5. Brachet, Tanguy, 2008. "Maternal Smoking, Misclassification, and Infant Health," MPRA Paper 21466, University Library of Munich, Germany.
    6. Steven J. Haider & Melvin Stephens Jr., 2020. "Correcting for Misclassified Binary Regressors Using Instrumental Variables," NBER Working Papers 27797, National Bureau of Economic Research, Inc.
    7. Adele Bergin, 2015. "Employer Changes and Wage Changes: Estimation with Measurement Error in a Binary Variable," LABOUR, CEIS, vol. 29(2), pages 194-223, June.
    8. Christian vom Lehn & Cache Ellsworth & Zachary Kroff, 2022. "Reconciling Occupational Mobility in the Current Population Survey," Journal of Labor Economics, University of Chicago Press, vol. 40(4), pages 1005-1051.
    9. Adele Bergin, 2013. "Job Changes and Wage Changes: Estimation with Measurement Error in a Binary Variable," Economics Department Working Paper Series n240-13.pdf, Department of Economics, National University of Ireland - Maynooth.
    10. Molinari, Francesca, 2008. "Partial identification of probability distributions with misclassified data," Journal of Econometrics, Elsevier, vol. 144(1), pages 81-117, May.
    11. Francis DiTraglia & Camilo Garcia-Jimeno, 2015. "On Mis-measured Binary Regressors: New Results And Some Comments on the Literature, Third Version," PIER Working Paper Archive 15-040, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, revised 24 Nov 2015.
    12. Nguimkeu, Pierre & Denteh, Augustine & Tchernis, Rusty, 2019. "On the estimation of treatment effects with endogenous misreporting," Journal of Econometrics, Elsevier, vol. 208(2), pages 487-506.
    13. Lundberg, Ian & Brand, Jennie E. & Jeon, Nanum, 2022. "Researcher reasoning meets computational capacity: Machine learning for social science," SocArXiv s5zc8, Center for Open Science.
    14. Francis J. DiTraglia & Camilo Garcia-Jimeno, 2020. "Identifying the effect of a mis-classified, binary, endogenous regressor," Papers 2011.07272, arXiv.org.
    15. DiTraglia, Francis J. & García-Jimeno, Camilo, 2019. "Identifying the effect of a mis-classified, binary, endogenous regressor," Journal of Econometrics, Elsevier, vol. 209(2), pages 376-390.
    16. Francis DiTraglia & Camilo Garcia-Jimeno, 2015. "On Mis-measured Binary Regressors: New Results And Some Comments on the Literature, Second Version," PIER Working Paper Archive 15-039, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, revised 11 Nov 2015.
    17. Orville Mondal & Rui Wang, 2024. "Partial Identification of Binary Choice Models with Misreported Outcomes," Papers 2401.17137, arXiv.org.
    18. Frazis, Harley & Loewenstein, Mark A., 2003. "Estimating linear regressions with mismeasured, possibly endogenous, binary explanatory variables," Journal of Econometrics, Elsevier, vol. 117(1), pages 151-178, November.
    19. Arthur Lewbel, 2007. "Estimation of Average Treatment Effects with Misclassification," Econometrica, Econometric Society, vol. 75(2), pages 537-551, March.
    20. Aprajit Mahajan, 2006. "Identification and Estimation of Regression Models with Misclassification," Econometrica, Econometric Society, vol. 74(3), pages 631-665, May.
