Printed from https://ideas.repec.org/p/umc/wpaper/2114.html

Addressing Sample Selection Bias for Machine Learning Methods

Authors

  • Dylan Brewer
  • Alyssa Carlson

Abstract

We study approaches for adjusting machine learning methods when the training sample differs from the prediction sample on unobserved dimensions. The machine learning literature predominantly assumes selection only on observed dimensions, and the common remedies for selection on observables are to re-weight or to control for variables that influence selection. Simulation results show that selection on unobservables increases the mean squared prediction error of common machine learning algorithms. Common machine learning practices such as re-weighting or controlling for variables that influence selection into the training or testing sample often worsen sample selection bias. We suggest two control-function approaches that remove the effects of selection bias before training and find that they reduce mean squared prediction error in simulations with a high degree of selection. We apply these approaches to predicting the vote share of the incumbent in gubernatorial elections using previously observed re-election bids. We find that ignoring selection on unobservables leads to substantially higher predicted vote shares for the incumbent than when the control function approach is used.
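The abstract does not spell out the two control-function approaches, so the following is only a minimal sketch of the general idea, using a textbook Heckman-type two-step correction rather than the authors' own estimators. It assumes jointly normal errors and an excluded variable `z` that shifts selection but not the outcome. The data-generating process, the function names (`simulate`, `fit_probit_slope`, `inverse_mills`, `ols`) and all parameter values are hypothetical, and plain least squares stands in for the machine learning algorithm.

```python
import random
from statistics import NormalDist

N = NormalDist()  # standard normal distribution

def inverse_mills(a):
    """lambda(a) = phi(a) / Phi(a): mean of a standard normal v given v > -a."""
    return N.pdf(a) / N.cdf(a)

def fit_probit_slope(z, s, lo=0.0, hi=5.0, iters=40):
    """One-parameter probit P(s=1|z) = Phi(g*z), solved by bisection on the score.
    Assumes the true slope lies in [lo, hi]; the probit log-likelihood is concave."""
    def score(g):
        total = 0.0
        for zi, si in zip(z, s):
            p = min(max(N.cdf(g * zi), 1e-12), 1 - 1e-12)
            total += zi * N.pdf(g * zi) * (si - p) / (p * (1 - p))
        return total
    for _ in range(iters):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def ols(rows, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda q: abs(a[q][c]))  # partial pivoting
        a[c], a[p] = a[p], a[c]
        for q in range(k):
            if q != c:
                f = a[q][c] / a[c][c]
                a[q] = [aq - f * ac for aq, ac in zip(a[q], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def simulate(n=5000, rho=0.8, seed=0):
    """Hypothetical design: y = 1 + 2x + u, observed only when z + v > 0,
    with corr(u, v) = rho, i.e. selection on unobservables."""
    rng = random.Random(seed)
    x, z, y, s = [], [], [], []
    for _ in range(n):
        xi, zi, v, e = (rng.gauss(0, 1) for _ in range(4))
        u = rho * v + (1 - rho ** 2) ** 0.5 * e
        x.append(xi)
        z.append(zi)
        y.append(1.0 + 2.0 * xi + u)
        s.append(1 if zi + v > 0 else 0)
    return x, z, y, s

x, z, y, s = simulate()
sel = [i for i in range(len(s)) if s[i] == 1]  # the training sample
y_sel = [y[i] for i in sel]
features = [[1.0, x[i]] for i in sel]

# Naive learner: train on the selected sample and ignore selection.
naive = ols(features, y_sel)

# Step 1: estimate the selection equation and build the control function.
g_hat = fit_probit_slope(z, s)
lam = [inverse_mills(g_hat * z[i]) for i in sel]

# Step 2: estimate the loading on the control function, then strip it
# from the outcome before training.
cf = ols([row + [l] for row, l in zip(features, lam)], y_sel)
y_adj = [yi - cf[2] * l for yi, l in zip(y_sel, lam)]

# Any learner could be trained on (features, y_adj); OLS is the stand-in here.
adjusted = ols(features, y_adj)

naive_bias = naive[0] - 1.0        # true intercept is 1.0
adjusted_bias = adjusted[0] - 1.0
print(f"intercept bias: naive={naive_bias:.3f}, adjusted={adjusted_bias:.3f}")
```

In this design the naive fit recovers the slope on `x` but overstates the intercept, so its predictions for the full population are too high. Removing the estimated selection term before training pulls the intercept back toward its true value.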

Suggested Citation

  • Dylan Brewer & Alyssa Carlson, 2021. "Addressing Sample Selection Bias for Machine Learning Methods," Working Papers 2114, Department of Economics, University of Missouri.
  • Handle: RePEc:umc:wpaper:2114

    Download full text from publisher

    File URL: https://drive.google.com/file/d/1N-3QpN8hsmLne0PAH2oA7hebqAvLEf6b/view?usp=sharingview?usp=sharing
    Download Restriction: no
    More about this item

    Keywords

    sample selection; machine learning; control function; inverse probability weighting

    JEL classification:

    • C13: Mathematical and Quantitative Methods > Econometric and Statistical Methods and Methodology: General > Estimation: General
    • C31: Mathematical and Quantitative Methods > Multiple or Simultaneous Equation Models; Multiple Variables > Cross-Sectional Models; Spatial Models; Treatment Effect Models; Quantile Regressions; Social Interaction Models
    • C55: Mathematical and Quantitative Methods > Econometric Modeling > Large Data Sets: Modeling and Analysis
    • D72: Microeconomics > Analysis of Collective Decision-Making > Political Processes: Rent-seeking, Lobbying, Elections, Legislatures, and Voting Behavior


    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.