Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy
Abstract
We develop and illustrate a methodology for fitting models to large, complex data sets. The methodology uses standard regression techniques that make few assumptions about the structure of the data. We accomplish this with three small modifications to stepwise regression: (1) we add interactions to capture non-linearities and indicator functions to capture missing values; (2) we exploit modern decision-theoretic variable selection criteria; and (3) we estimate standard errors using a conservative approach that works for heteroscedastic data. Omitting any one of these modifications leads to poor performance. We illustrate our methodology by predicting the onset of personal bankruptcy among users of credit cards. This application presents many challenges, ranging from the rarity of bankruptcy to the size of the available database. Only 2,244 bankruptcy events appear among some 3 million months of customer activity. To predict these, we begin with 255 features, to which we add missing-value indicators and pairwise interactions that expand to a set of over 67,000 potential predictors. From these, our method selects a model with 39 predictors chosen by sequentially comparing estimates of their significance to a series of thresholds. The resulting model not only avoids over-fitting the data, it also predicts well out of sample. To find half of the 1,800 bankruptcies hidden in a validation sample of 2.3 million observations, one need only search the 8,500 cases having the largest model predictions.
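The three modifications described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `expand_features` and `forward_select`, the White-style robust t-statistic, the normal approximation for p-values, and the threshold sequence `alpha * (k + 1) / p` are all assumptions chosen for illustration, and the data are synthetic.

```python
import numpy as np
from itertools import combinations
from statistics import NormalDist

def expand_features(X):
    """Center columns, zero-fill missing values, add missing-value
    indicators, then append all pairwise products (illustrative choice)."""
    miss = np.isnan(X)
    centered = X - np.nanmean(X, axis=0)
    filled = np.where(miss, 0.0, centered)            # missing -> column mean (0 after centering)
    indicators = miss[:, miss.any(axis=0)].astype(float)
    base = np.hstack([filled, indicators])
    pairs = [base[:, i] * base[:, j]
             for i, j in combinations(range(base.shape[1]), 2)]
    return np.column_stack([base] + pairs)

def forward_select(Z, y, alpha=0.05, max_terms=10):
    """Forward stepwise search using a heteroscedasticity-robust t-statistic.
    A candidate enters only if its two-sided p-value is below a
    step-dependent threshold (an assumed sequence, not the paper's)."""
    n, p = Z.shape
    Q = np.ones((n, 1)) / np.sqrt(n)                  # orthonormal basis: intercept only
    resid = y - y.mean()
    chosen = []
    norm_z = np.sqrt((Z ** 2).sum(axis=0))
    for k in range(max_terms):
        Zt = Z - Q @ (Q.T @ Z)                        # sweep current model out of candidates
        num = Zt.T @ resid
        den = np.sqrt((Zt ** 2).T @ resid ** 2)       # White-style standard error (numerator scale)
        norm_t = np.sqrt((Zt ** 2).sum(axis=0))
        usable = (norm_t > 1e-8 * np.maximum(norm_z, 1e-12)) & (den > 0)
        t = np.zeros(p)
        t[usable] = num[usable] / den[usable]
        j = int(np.argmax(np.abs(t)))
        p_value = 2.0 * (1.0 - NormalDist().cdf(abs(float(t[j]))))
        if p_value > alpha * (k + 1) / p:             # hypothetical threshold sequence
            break
        q = Zt[:, j] / norm_t[j]
        Q = np.hstack([Q, q[:, None]])
        resid = resid - q * (q @ resid)
        chosen.append(j)
    return chosen

# Synthetic data: the signal is an interaction plus a missingness effect,
# with noise whose variance depends on a feature (heteroscedastic).
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 4))
noise = rng.normal(size=n) * (1.0 + np.abs(X[:, 3]))
X[rng.random(n) < 0.2, 0] = np.nan
y = 2.0 * X[:, 1] * X[:, 2] + 3.0 * np.isnan(X[:, 0]) + noise

Z = expand_features(X)         # 4 features + 1 indicator + 10 pairwise products
chosen = forward_select(Z, y)
print(Z.shape[1], chosen)
```

In this toy setup, column 4 of `Z` is the missing-value indicator for the first feature and column 9 is the product of the second and third features, so a search that works as intended should include both. The same expansion applied to a few hundred features produces tens of thousands of candidates, which is why the entry threshold must tighten as the candidate pool grows.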
Bibliographic Info
Paper provided by Wharton School Center for Financial Institutions, University of Pennsylvania, in its series Center for Financial Institutions Working Papers, number 01-05.
Date of creation: Feb 2001
Contact details of provider:
Postal: 3301 Steinberg Hall-Dietrich Hall, 3620 Locust Walk, Philadelphia, PA 19104-6367
Web page: http://fic.wharton.upenn.edu/fic/
This paper has been announced in the following NEP Reports:
- NEP-ALL-2001-07-23 (All new papers)
- NEP-CMP-2001-07-23 (Computational Economics)
- NEP-ECM-2001-07-23 (Econometrics)
- NEP-IAS-2001-07-23 (Insurance Economics)
Cited by (citations extracted by the CitEc Project):
- Barrios, Erniel B. & Mina, Christian D., 2009. "Profiling Poverty with Multivariate Adaptive Regression Splines," Discussion Papers DP 2009-29, Philippine Institute for Development Studies.
- Schwarz, Alexandra, 2011. "Measurement, Monitoring, and Forecasting of Consumer Credit Default Risk - An Indicator Approach Based on Individual Payment Histories," Schumpeter Discussion Papers sdp11004, Universitätsbibliothek Wuppertal, University Library.
- Khandani, Amir E. & Kim, Adlar J. & Lo, Andrew W., 2010. "Consumer credit-risk models via machine-learning algorithms," Journal of Banking & Finance, Elsevier, vol. 34(11), pages 2767-2787, November.