Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy

Author

Listed:

Dean P. Foster
Robert A. Stine

Registered:

Dean P. Foster

Abstract

We develop and illustrate a methodology for fitting models to large, complex data sets. The methodology uses standard regression techniques that make few assumptions about the structure of the data. We accomplish this with three small modifications to stepwise regression: (1) we add interactions to capture non-linearities and indicator functions to capture missing values; (2) we exploit modern decision-theoretic variable selection criteria; and (3) we estimate standard errors using a conservative approach that works for heteroscedastic data. Omitting any one of these modifications leads to poor performance. We illustrate our methodology by predicting the onset of personal bankruptcy among users of credit cards. This application presents many challenges, ranging from the rarity of bankruptcy to the size of the available database. Only 2,244 bankruptcy events appear among some 3 million months of customer activity. To predict these, we begin with 255 features, to which we add missing-value indicators and pairwise interactions that expand the set to over 67,000 potential predictors. From these, our method selects a model with 39 predictors chosen by sequentially comparing estimates of their significance to a series of thresholds. The resulting model not only avoids over-fitting the data but also predicts well out of sample. To find half of the 1,800 bankruptcies hidden in a validation sample of 2.3 million observations, one need only search the 8,500 cases having the largest model predictions.
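
The three modifications described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names expand_features and forward_stepwise are hypothetical, mean imputation is an assumed way of handling missing entries, the single fixed cutoff sqrt(2 log m) stands in for the paper's series of thresholds, and the score-type statistic is only one possible conservative choice for heteroscedastic data.

```python
import itertools
import numpy as np


def expand_features(X):
    """Mean-fill missing values, then add missing-value indicators and pairwise products."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumption: missing entries are replaced by the column mean of observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    names = [f"x{j}" for j in range(X.shape[1])]
    cols = [filled[:, j] for j in range(X.shape[1])]
    # One 0/1 indicator for each column that has any missing entries.
    for j in np.flatnonzero(missing.any(axis=0)):
        names.append(f"miss_x{j}")
        cols.append(missing[:, j].astype(float))
    # Products of every distinct pair of the columns built so far.
    for a, b in itertools.combinations(range(len(cols)), 2):
        names.append(f"{names[a]}*{names[b]}")
        cols.append(cols[a] * cols[b])
    return np.column_stack(cols), names


def forward_stepwise(Z, y, max_steps=20):
    """Greedy forward selection that stops when no candidate clears the cutoff."""
    n, m = Z.shape
    # Assumption: one fixed cutoff on |t| rather than a sequence of thresholds.
    threshold = np.sqrt(2.0 * np.log(m))
    selected = []
    for _ in range(max_steps):
        # Current design: intercept plus the columns chosen so far.
        Q = np.column_stack([np.ones(n)] + [Z[:, j] for j in selected])
        resid = y - Q @ np.linalg.lstsq(Q, y, rcond=None)[0]
        Zres = Z - Q @ np.linalg.lstsq(Q, Z, rcond=None)[0]
        # Score-type statistic whose denominator weights each squared
        # candidate value by the squared residual of the current model,
        # so it does not assume constant error variance.
        num = Zres.T @ resid
        den = np.sqrt((Zres ** 2).T @ (resid ** 2))
        ok = den > 1e-8  # skip columns already explained by the design
        ok[np.array(selected, dtype=int)] = False
        t = np.zeros(m)
        t[ok] = num[ok] / den[ok]
        best = int(np.argmax(np.abs(t)))
        if abs(t[best]) < threshold:
            break
        selected.append(best)
    return selected
```

Calling expand_features on a raw feature matrix and passing the result to forward_stepwise returns the indices of the chosen columns in the order they entered the model. At the scale in the abstract (millions of rows, tens of thousands of candidates) a dense computation like this would need to be replaced by something far more economical.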

Suggested Citation

Dean P. Foster & Robert A. Stine, 2001. "Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy," Center for Financial Institutions Working Papers 01-05, Wharton School Center for Financial Institutions, University of Pennsylvania.

Handle: RePEc:wop:pennin:01-05

Other versions of this item:

Foster D.P. & Stine R.A., 2004. "Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy," Journal of the American Statistical Association, American Statistical Association, vol. 99, pages 303-313, January.

Citations

Cited by:

Haughton, Dominique & Le, Thanh Loan Thi, 2007. "Shifts in Living Standards: The Case of Vietnamese Households 1992-1998," Philippine Journal of Development, Philippine Institute for Development Studies.
Margherita Doria & Elisa Luciano & Patrizia Semeraro, 2022. "Machine learning techniques in joint default assessment," Papers 2205.01524, arXiv.org, revised Sep 2023.
- Edoardo Fadda & Elisa Luciano & Patrizia Semeraro, 2024. "Machine Learning techniques in joint default assessment," Carlo Alberto Notebooks 723, Collegio Carlo Alberto.
Khandani, Amir E. & Kim, Adlar J. & Lo, Andrew W., 2010. "Consumer credit-risk models via machine-learning algorithms," Journal of Banking & Finance, Elsevier, vol. 34(11), pages 2767-2787, November.
E.B. Nkemnole & A.A. Akinsete, 2021. "Hidden Markov Model using transaction patterns for ATM card fraud detection," Theoretical and Applied Economics, Asociatia Generala a Economistilor din Romania / Editura Economica, vol. 0(4(629), W), pages 51-70, Winter.
Carlos Serrano-Cinca & Begoña Gutiérrez-Nieto, 2011. "Partial Least Square Discriminant Analysis (PLS-DA) for bankruptcy prediction," Working Papers CEB 11-024, ULB -- Universite Libre de Bruxelles.
Barrios, Erniel B. & Mina, Christian D., 2009. "Profiling Poverty with Multivariate Adaptive Regression Splines," Discussion Papers DP 2009-29, Philippine Institute for Development Studies.
- Barrios, Erniel B. & Mina, Christian D., 2013. "Profiling Poverty with Multivariate Adaptive Regression Splines," Philippine Journal of Development PJD 2010 Vol. 37 No. 2d, Philippine Institute for Development Studies.
Alexandra Schwarz, 2011. "Measurement, Monitoring, and Forecasting of Consumer Credit Default Risk - An Indicator Approach Based on Individual Payment Histories," Schumpeter Discussion Papers sdp11004, Universitätsbibliothek Wuppertal, University Library.

More about this item

NEP fields

This paper has been announced in the following NEP Reports:

NEP-CMP-2001-07-23 (Computational Economics)
NEP-ECM-2001-07-23 (Econometrics)
NEP-IAS-2001-07-23 (Insurance Economics)
