Printed from https://ideas.repec.org/a/eee/csdana/v55y2011i1p168-183.html

Robust weighted kernel logistic regression in imbalanced and rare events data

Authors

  • Maalouf, Maher
  • Trafalis, Theodore B.

Abstract

Recent developments in computing and technology, along with the availability of large amounts of raw data, have contributed to the creation of many effective techniques and algorithms in the fields of pattern recognition and machine learning. The main objectives for developing these algorithms include identifying patterns within the available data or making predictions, or both. Great success has been achieved with many classification techniques in real-life applications. With regard to binary data classification in particular, analysis of data containing rare events or disproportionate class distributions poses a great challenge to industry and to the machine learning community. This study examines rare events (REs) with binary dependent variables containing many more non-events (zeros) than events (ones). These variables are difficult to predict and to explain as has been evidenced in the literature. This research combines rare events corrections to Logistic Regression (LR) with truncated Newton methods and applies these techniques to Kernel Logistic Regression (KLR). The resulting model, Rare Event Weighted Kernel Logistic Regression (RE-WKLR), is a combination of weighting, regularization, approximate numerical methods, kernelization, bias correction, and efficient implementation, all of which are critical to enabling RE-WKLR to be an effective and powerful method for predicting rare events. Comparing RE-WKLR to SVM and TR-KLR, using non-linearly separable, small and large binary rare event datasets, we find that RE-WKLR is as fast as TR-KLR and much faster than SVM. In addition, according to the statistical significance test, RE-WKLR is more accurate than both SVM and TR-KLR.
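The ingredients named in the abstract (case weighting, regularization, kernelization, Newton-type fitting) can be illustrated with a minimal sketch. This is not the authors' RE-WKLR implementation. It assumes the rare-events weights take the form given by King and Zeng for logistic regression, with `tau` an assumed population event rate. It omits the bias correction, and it swaps the truncated Newton (conjugate gradient) solver for a direct linear solve with step halving. All function names, the toy data, and the parameter values are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def rare_event_weights(y, tau):
    """Case weights tau/ybar for events and (1 - tau)/(1 - ybar) for non-events.

    tau is an ASSUMED population event rate; ybar is the sample event rate.
    """
    ybar = y.mean()
    return np.where(y == 1, tau / ybar, (1.0 - tau) / (1.0 - ybar))


def objective(alpha, K, y, w, lam):
    """Weighted negative log-likelihood plus the penalty (lam / 2) * alpha' K alpha."""
    f = K @ alpha
    nll = np.sum(w * (np.logaddexp(0.0, f) - y * f))
    return nll + 0.5 * lam * alpha @ K @ alpha


def fit_weighted_klr(K, y, w, lam=0.1, n_iter=30, tol=1e-8):
    """Fit dual coefficients alpha by Newton steps with step halving.

    Uses a direct solve, NOT the truncated Newton method of the paper.
    """
    alpha = np.zeros(K.shape[0])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(K @ alpha, -30.0, 30.0)))
        d = w * p * (1.0 - p)
        # Newton direction, written so that K itself is never inverted:
        # (D K + lam I) delta = w * (p - y) + lam * alpha
        delta = np.linalg.solve(d[:, None] * K + lam * np.eye(len(y)),
                                w * (p - y) + lam * alpha)
        # Halve the step until the penalized objective does not increase.
        step, current = 1.0, objective(alpha, K, y, w, lam)
        while objective(alpha - step * delta, K, y, w, lam) > current and step > 1e-6:
            step *= 0.5
        alpha = alpha - step * delta
        if step * np.max(np.abs(delta)) < tol:
            break
    return alpha


# Toy imbalanced sample: 200 non-events and 20 events (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.concatenate([np.zeros(200), np.ones(20)])

K = rbf_kernel(X, X)
w = rare_event_weights(y, tau=0.05)  # tau = 0.05 is an assumed population rate
alpha = fit_weighted_klr(K, y, w)
p_hat = 1.0 / (1.0 + np.exp(-(K @ alpha)))  # fitted event probabilities
```

Because the sample here over-represents events relative to the assumed `tau`, the weights shrink the contribution of events and inflate that of non-events, which pulls the fitted probabilities toward the assumed population rate.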

Suggested Citation

  • Maalouf, Maher & Trafalis, Theodore B., 2011. "Robust weighted kernel logistic regression in imbalanced and rare events data," Computational Statistics & Data Analysis, Elsevier, vol. 55(1), pages 168-183, January.
  • Handle: RePEc:eee:csdana:v:55:y:2011:i:1:p:168-183

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167-9473(10)00259-8
    Download Restriction: Full text for ScienceDirect subscribers only.

    References listed on IDEAS

    1. Yu Xie & Charles F. Manski, 1989. "The Logit Model and Response-Based Samples," Sociological Methods & Research, vol. 17(3), pages 283-302, February.
    2. Georg Heinze & Michael Schemper, 2001. "A Solution to the Problem of Monotone Likelihood in Cox Regression," Biometrics, The International Biometric Society, vol. 57(1), pages 114-119, March.
    3. Imbens, Guido W. & Lancaster, Tony, 1996. "Efficient estimation and stratified sampling," Journal of Econometrics, Elsevier, vol. 74(2), pages 289-318, October.
    4. Cameron, A. Colin & Trivedi, Pravin K., 2005. "Microeconometrics," Cambridge Books, Cambridge University Press, number 9780521848053.
    5. King, Gary & Zeng, Langche, 2001. "Logistic Regression in Rare Events Data," Political Analysis, Cambridge University Press, vol. 9(2), pages 137-163, January.
    6. King, Gary & Zeng, Langche, 2001. "Explaining Rare Events in International Relations," International Organization, Cambridge University Press, vol. 55(3), pages 693-715, July.
    7. Cramer, J. S., 2011. "Logit Models from Economics and Other Fields," Cambridge Books, Cambridge University Press, number 9780521188036.
    8. Manski, Charles F & Lerman, Steven R, 1977. "The Estimation of Choice Probabilities from Choice Based Samples," Econometrica, Econometric Society, vol. 45(8), pages 1977-1988, November.
    9. Quigley, John & Bedford, Tim & Walls, Lesley, 2007. "Estimating rate of occurrence of rare events with empirical bayes: A railway application," Reliability Engineering and System Safety, Elsevier, vol. 92(5), pages 619-627.

    Citations



    Cited by:

    1. Dustin C.S. Wagner & Kash Barker, 2014. "Statistical methods for modeling the risk of runway excursions," Journal of Risk Research, Taylor & Francis Journals, vol. 17(7), pages 885-901, August.
    2. Wolfgang Karl Härdle & Dedy Dwi Prastyo & Christian Hafner, 2012. "Support Vector Machines with Evolutionary Feature Selection for Default Prediction," SFB 649 Discussion Papers SFB649DP2012-030, Sonderforschungsbereich 649, Humboldt University, Berlin, Germany.
    3. Peter D. Brandon & Danielle George-Lucas & Oleg Ivashchenko, 2022. "How architectural principles can help conceptualize and analyze breakups among intergenerational households," Palgrave Communications, Palgrave Macmillan, vol. 9(1), pages 1-10, December.
    4. Hani M. Samawi & Haresh Rochani & Daniel Linder & Arpita Chatterjee, 2017. "More efficient logistic analysis using moving extreme ranked set sampling," Journal of Applied Statistics, Taylor & Francis Journals, vol. 44(4), pages 753-766, March.
    5. Henry R. Scharf & Xinyi Lu & Perry J. Williams & Mevin B. Hooten, 2022. "Constructing Flexible, Identifiable and Interpretable Statistical Models for Binary Data," International Statistical Review, International Statistical Institute, vol. 90(2), pages 328-345, August.
    6. Neuberg Richard & Hannah Lauren, 2017. "Loan pricing under estimation risk," Statistics & Risk Modeling, De Gruyter, vol. 34(1-2), pages 69-87, June.
    7. Maher Maalouf & Theodore Trafalis & Indra Adrianto, 2011. "Kernel logistic regression using truncated Newton method," Computational Management Science, Springer, vol. 8(4), pages 415-428, November.
    8. Jessica Pesantez-Narvaez & Montserrat Guillen & Manuela Alcañiz, 2021. "RiskLogitboost Regression for Rare Events in Binary Response: An Econometric Approach," Mathematics, MDPI, vol. 9(5), pages 1-21, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lahiri, Kajal & Yang, Liu, 2013. "Forecasting Binary Outcomes," Handbook of Economic Forecasting, in: G. Elliott & C. Granger & A. Timmermann (ed.), Handbook of Economic Forecasting, edition 1, volume 2, chapter 0, pages 1025-1106, Elsevier.
    2. Stock, Ruth Maria & von Hippel, Eric & Gillert, Nils Lennart, 2016. "Impacts of personality traits on consumer innovation success," Research Policy, Elsevier, vol. 45(4), pages 757-769.
    3. Tomz, Michael & King, Gary & Zeng, Langche, 2003. "ReLogit: Rare Events Logistic Regression," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 8(i02).
    4. Springborn, Michael & Romagosa, Christina M. & Keller, Reuben P., 2011. "The value of nonindigenous species risk assessment in international trade," Ecological Economics, Elsevier, vol. 70(11), pages 2145-2153, September.
    5. Abhirup Chakrabarti & Will Mitchell, 2013. "The Persistent Effect of Geographic Distance in Acquisition Target Selection," Organization Science, INFORMS, vol. 24(6), pages 1805-1826, December.
    6. Avanzini, Diego & Martínez, Juan Francisco & Pérez, Víctor, 2020. "Assessing mortgage default risk in full-recourse economies, with an application to the case of Chile," Latin American Journal of Central Banking (previously Monetaria), Elsevier, vol. 1(1).
    7. Ahmed, M.S. & Attouch, M.K. & Dabo-Niang, S., 2018. "Binary functional linear models under choice-based sampling," Econometrics and Statistics, Elsevier, vol. 7(C), pages 134-152.
    8. Sung Jae Jun & Sokbae Lee, 2020. "Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions," Papers 2004.08318, arXiv.org, revised Oct 2023.
    9. Lorenzo Cassi & Anne Plunket, 2014. "Proximity, network formation and inventive performance: in search of the proximity paradox," The Annals of Regional Science, Springer;Western Regional Science Association, vol. 53(2), pages 395-422, September.
    10. Trent Geisler & Herman Ray & Ying Xie, 2023. "Finding the Proverbial Needle: Improving Minority Class Identification Under Extreme Class Imbalance," Journal of Classification, Springer;The Classification Society, vol. 40(1), pages 192-212, April.
    11. Sarlin, Peter & von Schweinitz, Gregor, 2021. "Optimizing Policymakers’ Loss Functions In Crisis Prediction: Before, Within Or After?," Macroeconomic Dynamics, Cambridge University Press, vol. 25(1), pages 100-123, January.
    12. Joachim Wagner, 2005. "Der Noth gehorchend, nicht dem eignen Trieb Nascent Necessity and Opportunity Entrepreneurs in Germany Evidence from the Regional Entrepreneurship Monitor (REM)," Working Paper Series in Economics 10, University of Lüneburg, Institute of Economics.
    13. Lancaster, Tony & Imbens, Guido, 1996. "Case-control studies with contaminated controls," Journal of Econometrics, Elsevier, vol. 71(1-2), pages 145-160.
    14. Amanda Coston & Edward H. Kennedy, 2022. "The role of the geometric mean in case-control studies," Papers 2207.09016, arXiv.org.
    15. Joachim Wagner, 2005. "Nascent and infant entrepreneurs in Germany. Evidence from the Regional Entrepreneurship Monitor (REM)," Labor and Demography 0504010, University Library of Munich, Germany.
    16. Daniel McFadden, 2001. "Economic Choices," American Economic Review, American Economic Association, vol. 91(3), pages 351-378, June.
    17. Michael Horowitz & Rose McDermott & Allan C. Stam, 2005. "Leader Age, Regime Type, and Violent International Relations," Journal of Conflict Resolution, Peace Science Society (International), vol. 49(5), pages 661-685, October.
    18. Merz, Joachim & Paic, Peter, 2006. "Start-up success of freelancers New microeconometric evidence from the German Socio-Economic Panel," MPRA Paper 5737, University Library of Munich, Germany.
    19. Pamela Giustinelli, 2016. "Group Decision Making With Uncertain Outcomes: Unpacking Child–Parent Choice Of The High School Track," International Economic Review, Department of Economics, University of Pennsylvania and Osaka University Institute of Social and Economic Research Association, vol. 57(2), pages 573-602, May.
    20. He, Xuan & Xiao, Weicheng, 2022. "What drives family SMEs to internationalize? An integrated perspective of community institutions and knowledge resources," Journal of International Financial Markets, Institutions and Money, Elsevier, vol. 81(C).
