Printed from https://ideas.repec.org/a/eee/intfor/v28y2012i1p224-238.html

Instance sampling in credit scoring: An empirical study of sample size and balancing

Author

Listed:
  • Crone, Sven F.
  • Finlay, Steven

Abstract

To date, best practice in sampling credit applicants has been established based largely on expert opinion, which generally recommends that small samples of 1500 instances each of both goods and bads are sufficient, and that the heavily biased datasets observed should be balanced by undersampling the majority class. Consequently, the topics of sample sizes and sample balance have not been subject to either formal study in credit scoring, or empirical evaluations across different data conditions and algorithms of varying efficiency. This paper describes an empirical study of instance sampling in predicting consumer repayment behaviour, evaluating the relative accuracies of logistic regression, discriminant analysis, decision trees and neural networks on two datasets across 20 samples of increasing size and 29 rebalanced sample distributions created by gradually under- and over-sampling the goods and bads respectively. The paper makes a practical contribution to model building on credit scoring datasets, and provides evidence that using samples larger than those recommended in credit scoring practice provides a significant increase in accuracy across algorithms.
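The two balancing strategies named in the abstract, under-sampling the goods and over-sampling the bads, can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `rebalance`, its parameters, and the use of simple random sampling are assumptions made for illustration, and it assumes the goods are the majority class.

```python
import random


def rebalance(goods, bads, target_bad_fraction, mode="under", seed=0):
    """Resample so that bads make up roughly target_bad_fraction of the result.

    Hypothetical helper for illustration only.
    mode="under": randomly drop goods (majority class), without replacement.
    mode="over":  randomly duplicate bads (minority class), with replacement.
    """
    if not 0 < target_bad_fraction < 1:
        raise ValueError("target_bad_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)

    if mode == "under":
        # Number of goods needed to reach the target ratio with all bads kept.
        n_goods = round(len(bads) * (1 - target_bad_fraction) / target_bad_fraction)
        n_goods = min(n_goods, len(goods))  # cannot keep more goods than exist
        return rng.sample(goods, n_goods), list(bads)

    if mode == "over":
        # Number of bads needed to reach the target ratio with all goods kept.
        n_bads = round(len(goods) * target_bad_fraction / (1 - target_bad_fraction))
        extra = max(n_bads - len(bads), 0)
        return list(goods), list(bads) + rng.choices(bads, k=extra)

    raise ValueError("mode must be 'under' or 'over'")


if __name__ == "__main__":
    goods = list(range(9000))        # placeholder IDs for good accounts
    bads = list(range(9000, 10000))  # placeholder IDs for bad accounts
    for mode in ("under", "over"):
        g, b = rebalance(goods, bads, 0.5, mode=mode)
        print(mode, len(g), len(b))
```

Stepping `target_bad_fraction` through a grid of values would give a series of rebalanced distributions similar in spirit to those the abstract describes, though the exact grid used in the paper is not stated here.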

Suggested Citation

  • Crone, Sven F. & Finlay, Steven, 2012. "Instance sampling in credit scoring: An empirical study of sample size and balancing," International Journal of Forecasting, Elsevier, vol. 28(1), pages 224-238.
  • Handle: RePEc:eee:intfor:v:28:y:2012:i:1:p:224-238
    DOI: 10.1016/j.ijforecast.2011.07.006

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0169207011001403
    Download Restriction: Full text for ScienceDirect subscribers only

    File URL: https://libkey.io/10.1016/j.ijforecast.2011.07.006?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    References listed on IDEAS

    1. מחקר - ביטוח לאומי, 2008. "Annual Survey 2007," Working Papers 19, National Insurance Institute of Israel.
    2. Anderson, Raymond, 2007. "The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation," OUP Catalogue, Oxford University Press, number 9780199226405, December.
    3. Y Liu & M Schumann, 2005. "Data mining feature selection for credit scoring models," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 56(9), pages 1099-1108, September.
    4. Hand, David J., 2009. "Mining the past to determine the future: Problems and possibilities," International Journal of Forecasting, Elsevier, vol. 25(3), pages 441-451, July.
    5. Wu, I-Ding & Hand, David J., 2007. "Handling selection bias when choosing actions in retail credit applications," European Journal of Operational Research, Elsevier, vol. 183(3), pages 1560-1568, December.
    6. B Baesens & T Van Gestel & S Viaene & M Stepanova & J Suykens & J Vanthienen, 2003. "Benchmarking state-of-the-art classification algorithms for credit scoring," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 54(6), pages 627-635, June.
    7. D. J. Hand & W. E. Henley, 1997. "Statistical Classification Methods in Consumer Credit Scoring: a Review," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 160(3), pages 523-541, September.
    8. Banasik, John & Crook, Jonathan, 2007. "Reject inference, augmentation, and sample selection," European Journal of Operational Research, Elsevier, vol. 183(3), pages 1582-1594, December.
    9. S M Finlay, 2006. "Predictive models of expenditure and over-indebtedness for assessing the affordability of new consumer credit applications," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 57(6), pages 655-669, June.
    10. Thomas, Lyn C., 2000. "A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers," International Journal of Forecasting, Elsevier, vol. 16(2), pages 149-172.
    11. G Verstraeten & D Van den Poel, 2005. "The impact of sample bias on consumer credit scoring performance and profitability," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 56(8), pages 981-992, August.
    12. Steven Finlay, 2008. "The Management of Consumer Credit," Palgrave Macmillan Books, Palgrave Macmillan, number 978-0-230-58250-7.
    13. D J Hand, 2005. "Good practice in retail credit scorecard assessment," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 56(9), pages 1109-1117, September.
    14. Crone, Sven F. & Lessmann, Stefan & Stahlbock, Robert, 2006. "The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing," European Journal of Operational Research, Elsevier, vol. 173(3), pages 781-800, September.
    15. Hand, David J., 2009. "Mining the past to determine the future: Rejoinder," International Journal of Forecasting, Elsevier, vol. 25(3), pages 461-462, July.
    16. Crook, Jonathan N. & Edelman, David B. & Thomas, Lyn C., 2007. "Recent developments in consumer credit risk assessment," European Journal of Operational Research, Elsevier, vol. 183(3), pages 1447-1465, December.
    17. Y Kim & S Y Sohn, 2007. "Technology scoring model considering rejected applicants and effect of reject inference," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 58(10), pages 1341-1347, October.
    18. L C Thomas & R W Oliver & D J Hand, 2005. "A survey of the issues in consumer credit modelling research," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 56(9), pages 1006-1015, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Hussein A. Abdou & John Pointon, 2011. "Credit Scoring, Statistical Techniques And Evaluation Criteria: A Review Of The Literature," Intelligent Systems in Accounting, Finance and Management, John Wiley & Sons, Ltd., vol. 18(2-3), pages 59-88, April.
    2. Finlay, Steven, 2010. "Credit scoring for profitability objectives," European Journal of Operational Research, Elsevier, vol. 202(2), pages 528-537, April.
    3. L C Thomas, 2010. "Consumer finance: challenges for operational research," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 61(1), pages 41-52, January.
    4. Rogelio A. Mancisidor & Michael Kampffmeyer & Kjersti Aas & Robert Jenssen, 2019. "Deep Generative Models for Reject Inference in Credit Scoring," Papers 1904.11376, arXiv.org, revised Sep 2021.
    5. Fang, Fang & Chen, Yuanyuan, 2019. "A new approach for credit scoring by directly maximizing the Kolmogorov–Smirnov statistic," Computational Statistics & Data Analysis, Elsevier, vol. 133(C), pages 180-194.
    6. Lessmann, Stefan & Baesens, Bart & Seow, Hsin-Vonn & Thomas, Lyn C., 2015. "Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research," European Journal of Operational Research, Elsevier, vol. 247(1), pages 124-136.
    7. Rais Ahmad Itoo & A. Selvarasu & José António Filipe, 2015. "Loan Products and Credit Scoring by Commercial Banks (India)," International Journal of Finance, Insurance and Risk Management, International Journal of Finance, Insurance and Risk Management, vol. 5(1), pages 851-851.
    8. Runchi Zhang & Zhiyi Qiu, 2020. "Optimizing hyper-parameters of neural networks with swarm intelligence: A novel framework for credit scoring," PLOS ONE, Public Library of Science, vol. 15(6), pages 1-35, June.
    9. Martin Rezac & Frantisek Rezac, 2011. "How to Measure the Quality of Credit Scoring Models," Czech Journal of Economics and Finance (Finance a uver), Charles University Prague, Faculty of Social Sciences, vol. 61(5), pages 486-507, November.
    10. Finlay, Steven, 2011. "Multiple classifier architectures and their application to credit risk assessment," European Journal of Operational Research, Elsevier, vol. 210(2), pages 368-378, April.
    11. Chen, Shunqin & Guo, Zhengfeng & Zhao, Xinlei, 2021. "Predicting mortgage early delinquency with machine learning methods," European Journal of Operational Research, Elsevier, vol. 290(1), pages 358-372.
    12. Huei-Wen Teng & Michael Lee, 2019. "Estimation Procedures of Using Five Alternative Machine Learning Methods for Predicting Credit Card Default," Review of Pacific Basin Financial Markets and Policies (RPBFMP), World Scientific Publishing Co. Pte. Ltd., vol. 22(03), pages 1-27, September.
    13. Fitzpatrick, Trevor & Mues, Christophe, 2016. "An empirical comparison of classification algorithms for mortgage default prediction: evidence from a distressed mortgage market," European Journal of Operational Research, Elsevier, vol. 249(2), pages 427-439.
    14. Pérez-Martín, A. & Pérez-Torregrosa, A. & Vaca, M., 2018. "Big Data techniques to measure credit banking risk in home equity loans," Journal of Business Research, Elsevier, vol. 89(C), pages 448-454.
    15. Hong Wang & Qingsong Xu & Lifeng Zhou, 2015. "Large Unbalanced Credit Scoring Using Lasso-Logistic Regression Ensemble," PLOS ONE, Public Library of Science, vol. 10(2), pages 1-20, February.
    16. Thomas Wainwright, 2011. "Elite Knowledges: Framing Risk and the Geographies of Credit," Environment and Planning A, , vol. 43(3), pages 650-665, March.
    17. Ha-Thu Nguyen, 2016. "Reject inference in application scorecards: evidence from France," EconomiX Working Papers 2016-10, University of Paris Nanterre, EconomiX.
    18. Dumitrescu, Elena & Hué, Sullivan & Hurlin, Christophe & Tokpavi, Sessi, 2022. "Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects," European Journal of Operational Research, Elsevier, vol. 297(3), pages 1178-1192.
    19. Juan Laborda & Seyong Ryoo, 2021. "Feature Selection in a Credit Scoring Model," Mathematics, MDPI, vol. 9(7), pages 1-22, March.
    20. Ha Thu Nguyen, 2016. "Reject inference in application scorecards: evidence from France," Working Papers hal-04141601, HAL.
