Printed from https://ideas.repec.org/p/ant/wpaper/2017005.html

A benchmarking study of classification techniques for behavioral data

Author

Listed:
  • DE CNUDDE, Sofie
  • MARTENS, David
  • EVGENIOU, Theodoros
  • PROVOST, Foster

Abstract

Previous academic research has emphasized the predictive power of ubiquitous big behavioral data. Its ultra-high-dimensional and sparse characteristics, however, pose significant challenges for state-of-the-art classification techniques. Moreover, no consensus exists regarding a feasible trade-off between classification performance and computational complexity. This work contributes in this direction through a systematic benchmarking study. Forty-three fine-grained behavioral data sets are analyzed with 11 classification techniques. Statistical performance comparisons enriched with learning curve analyses demonstrate two important findings. First, an inherent AUC-time trade-off becomes clear, making the choice of an appropriate classifier dependent on time restrictions and data set characteristics. Logistic regression achieves the best AUC, but takes the most time. Also, L2 regularization proves better than sparse L1 regularization. An attractive trade-off is found in a similarity-based technique called PSN. Second, the results illustrate that significant value lies in collecting and analyzing even more data, in both the instance and the feature dimension, in contrast to findings on traditional data. The results of this study provide guidance for researchers and practitioners in selecting appropriate classification techniques, sample sizes and data features, while also providing focus for scalable algorithm design in the face of large behavioral data.
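
The AUC-time trade-off mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's benchmarking code: the function names (`auc`, `timed_auc`, `overlap_scorer`), the toy data and the scoring rule are all hypothetical and chosen only for illustration. Sparse behavioral instances are assumed to be represented as sets of active feature ids, and AUC is computed with the standard pairwise (Mann-Whitney) definition.

```python
import time


def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    A positive scored above a negative counts as 1, a tie counts as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def timed_auc(score_fn, rows, labels):
    """Return (AUC, seconds spent scoring) for one hypothetical scorer."""
    start = time.perf_counter()
    scores = [score_fn(row) for row in rows]
    elapsed = time.perf_counter() - start
    return auc(labels, scores), elapsed


# Toy sparse data (assumption): each instance is the set of its active features.
rows = [{1, 5}, {2}, {1, 2, 7}, {3}]
labels = [1, 0, 1, 0]


def overlap_scorer(row):
    """Hypothetical scorer: count overlap with features assumed to be predictive."""
    return len(row & {1, 7})


value, seconds = timed_auc(overlap_scorer, rows, labels)
print(f"AUC={value:.3f}, scoring time={seconds:.6f}s")
```

Running the same `timed_auc` call for several scorers on the same data gives one (AUC, time) pair per technique, which is the kind of comparison the abstract describes.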

Suggested Citation

  • DE CNUDDE, Sofie & MARTENS, David & EVGENIOU, Theodoros & PROVOST, Foster, 2017. "A benchmarking study of classification techniques for behavioral data," Working Papers 2017005, University of Antwerp, Faculty of Business and Economics.
  • Handle: RePEc:ant:wpaper:2017005

    Download full text from publisher

    File URL: https://repository.uantwerpen.be/docman/irua/f3979a/142910.pdf
    Download Restriction: no

    References listed on IDEAS

    1. DE CNUDDE, Sofie & MOEYERSOMS, Julie & STANKOVA, Marija & TOBBACK, Ellen & JAVALY, Vinayak & MARTENS, David, 2015. "Who cares about your Facebook friends? Credit scoring for microfinance," Working Papers 2015018, University of Antwerp, Faculty of Business and Economics.
    2. K. W. De Bock & D. Van Den Poel & S. Manigart, 2009. "Predicting web site audience demographics for web advertising targeting using multi-web site clickstream data," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 09/618, Ghent University, Faculty of Economics and Business Administration.
    3. Qiang Yang & Xindong Wu, 2006. "10 Challenging Problems In Data Mining Research," International Journal of Information Technology & Decision Making (IJITDM), World Scientific Publishing Co. Pte. Ltd., vol. 5(04), pages 597-604.
    4. David J. Hand & Keming Yu, 2001. "Idiot's Bayes—Not So Stupid After All?," International Statistical Review, International Statistical Institute, vol. 69(3), pages 385-398, December.

    Citations

    Citations are extracted by the CitEc Project, subscribe to its RSS feed for this item.


    Cited by:

    1. PRAET, Stiene & VAN AELST, Peter & MARTENS, David, 2018. "I like, therefore I am. Predictive modeling to gain insights in political preference in a multi-party system," Working Papers 2018014, University of Antwerp, Faculty of Business and Economics.
    2. DE CNUDDE, Sofie & MARTENS, David & PROVOST, Foster, 2018. "An exploratory study towards applying and demystifying deep learning classification on behavioral big data," Working Papers 2018002, University of Antwerp, Faculty of Business and Economics.
    3. Arno de Caigny & Kristof Coussement & Koen de Bock, 2020. "Leveraging fine-grained transaction data for customer life event predictions," Post-Print hal-02507998, HAL.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Ulf Römer & Oliver Musshoff, 2017. "Can agricultural credit scoring for microfinance institutions be implemented and improved by weather data?," Agricultural Finance Review, Emerald Group Publishing Limited, vol. 78(1), pages 83-97, December.
    2. Li, Yibei & Wang, Ximei & Djehiche, Boualem & Hu, Xiaoming, 2020. "Credit scoring by incorporating dynamic networked information," European Journal of Operational Research, Elsevier, vol. 286(3), pages 1103-1112.
    3. Ionuţ ŢĂRANU, 2016. "Data mining in healthcare: decision making and precision," Database Systems Journal, Academy of Economic Studies - Bucharest, Romania, vol. 6(4), pages 33-40, May.
    4. Li, Hailin, 2017. "Distance measure with improved lower bound for multivariate time series," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 468(C), pages 622-637.
    5. Sascha O. Becker & Luigi Pascali, 2019. "Religion, Division of Labor, and Conflict: Anti-semitism in Germany over 600 Years," American Economic Review, American Economic Association, vol. 109(5), pages 1764-1804, May.
    6. Rajeev D S Raizada & Yune-Sang Lee, 2013. "Smoothness without Smoothing: Why Gaussian Naive Bayes Is Not Naive for Multi-Subject Searchlight Studies," PLOS ONE, Public Library of Science, vol. 8(7), pages 1-10, July.
    7. Marbac, Matthieu & Vandewalle, Vincent, 2019. "A tractable multi-partitions clustering," Computational Statistics & Data Analysis, Elsevier, vol. 132(C), pages 167-179.
    8. Harshita Patel & Dharmendra Singh Rajput & G Thippa Reddy & Celestine Iwendi & Ali Kashif Bashir & Ohyun Jo, 2020. "A review on classification of imbalanced data for wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 16(4), pages 15501477209, April.
    9. Qi Liu & Gengzhong Feng & Nengmin Wang & Giri Kumar Tayi, 2018. "A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge," Information Systems Frontiers, Springer, vol. 20(2), pages 401-416, April.
    10. Aletti, Giacomo, 2018. "Generation of discrete random variables in scalable frameworks," Statistics & Probability Letters, Elsevier, vol. 132(C), pages 99-106.
    11. Brighton, Henry, 2020. "Statistical foundations of ecological rationality," Economics - The Open-Access, Open-Assessment E-Journal (2007-2020), Kiel Institute for the World Economy (IfW Kiel), vol. 14, pages 1-32.
    12. Liao, Jui-Jung & Shih, Ching-Hui & Chen, Tai-Feng & Hsu, Ming-Fu, 2014. "An ensemble-based model for two-class imbalanced financial problem," Economic Modelling, Elsevier, vol. 37(C), pages 175-183.
    13. Keng-Hoong Ng & Chin-Kuan Ho & Somnuk Phon-Amnuaisuk, 2012. "A Hybrid Distance Measure for Clustering Expressed Sequence Tags Originating from the Same Gene Family," PLOS ONE, Public Library of Science, vol. 7(10), pages 1-14, October.
    14. Vilém Novák & Soheyla Mirshahi, 2021. "On the Similarity and Dependence of Time Series," Mathematics, MDPI, vol. 9(5), pages 1-14, March.
    15. Riesgo García, María Victoria & Krzemień, Alicja & Manzanedo del Campo, Miguel Ángel & Escanciano García-Miranda, Carmen & Sánchez Lasheras, Fernando, 2018. "Rare earth elements price forecasting by means of transgenic time series developed with ARIMA models," Resources Policy, Elsevier, vol. 59(C), pages 95-102.
    16. D. Thorleuchter & D. Van Den Poel & A. Prinzie, 2011. "Analyzing existing customers’ websites to improve the customer acquisition process as well as the profitability prediction in B-to-B marketing," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 11/733, Ghent University, Faculty of Economics and Business Administration.
    17. Qi Liu & Gengzhong Feng & Nengmin Wang & Giri Kumar Tayi, 0. "A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge," Information Systems Frontiers, Springer, vol. 0, pages 1-16.
    18. Marvin, Hans J.P. & Bouzembrak, Yamine, 2020. "A system approach towards prediction of food safety hazards: Impact of climate and agrichemical use on the occurrence of food safety hazards," Agricultural Systems, Elsevier, vol. 178(C).
    19. Yaxi Liu & Dayu Cheng & Tao Pei & Hua Shu & Xianhui Ge & Ting Ma & Yunyan Du & Yang Ou & Meng Wang & Lianming Xu, 2020. "Inferring gender and age of customers in shopping malls via indoor positioning data," Environment and Planning B, vol. 47(9), pages 1672-1689, November.
    20. Becker, Sascha O. & Pascali, Luigi, 2016. "Religion, Division of Labor and Conflict: Anti-Semitism in German Regions over 600 Years," CAGE Online Working Paper Series 288, Competitive Advantage in the Global Economy (CAGE).
