Printed from https://ideas.repec.org/a/gam/jdataj/v6y2021i2p11-d484845.html

The Effect of Preprocessing Techniques, Applied to Numeric Features, on Classification Algorithms’ Performance

Author

Listed:
  • Esra’a Alshdaifat

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

  • Doa’a Alshdaifat

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

  • Ayoub Alsarhan

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

  • Fairouz Hussein

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

  • Subhieh Moh’d Faraj S. El-Salhi

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

Abstract

The performance of any prediction model depends on several factors, and one of the most significant is the choice of preprocessing techniques. Preprocessing is therefore essential to building an effective and efficient classification model. This paper investigates the impact of the most widely used preprocessing techniques for numerical features on the performance of classification algorithms. Combinations of normalization techniques and missing-value handling strategies are assessed on eighteen benchmark datasets using two well-known classification algorithms, several performance evaluation metrics, and statistical significance tests. According to the reported experimental results, the impact of the adopted preprocessing techniques varies from one classification algorithm to another. In addition, a statistically significant difference among the considered data preprocessing techniques is demonstrated.
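
As a rough illustration of what combining a missing-value strategy with a normalization technique can look like for a single numeric feature, here is a minimal Python sketch. The abstract does not name the specific techniques studied, so mean imputation, min-max scaling, and z-score standardization are assumptions chosen as common examples, and the function names and the sample column are hypothetical.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values.

    Assumed strategy for illustration; raises StatisticsError if every
    entry is missing.
    """
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def min_max(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def z_score(column):
    """Standardize values to zero mean and unit population variance."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def preprocess(column, normalizer):
    """Apply imputation first, then the chosen normalization."""
    return normalizer(impute_mean(column))


# Hypothetical numeric feature with one missing value.
ages = [20, None, 40, 60]
scaled = preprocess(ages, min_max)
standardized = preprocess(ages, z_score)
```

Passing a different `normalizer` to `preprocess` swaps one half of the combination while keeping the other fixed, which is the kind of pairing the abstract describes evaluating across datasets and classifiers.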

Suggested Citation

  • Esra’a Alshdaifat & Doa’a Alshdaifat & Ayoub Alsarhan & Fairouz Hussein & Subhieh Moh’d Faraj S. El-Salhi, 2021. "The Effect of Preprocessing Techniques, Applied to Numeric Features, on Classification Algorithms’ Performance," Data, MDPI, vol. 6(2), pages 1-23, January.
  • Handle: RePEc:gam:jdataj:v:6:y:2021:i:2:p:11-:d:484845

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/6/2/11/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/6/2/11/
    Download Restriction: no


    Citations

    Cited by:

    1. Fairouz Hussein & Ayat Al-Ahmad & Subhieh El-Salhi & Esra’a Alshdaifat & Mo’taz Al-Hami, 2022. "Advances in Contextual Action Recognition: Automatic Cheating Detection Using Machine Learning Techniques," Data, MDPI, vol. 7(9), pages 1-13, August.
    2. Samuka Mohanty & Rajashree Dash, 2023. "A New Dual Normalization for Enhancing the Bitcoin Pricing Capability of an Optimized Low Complexity Neural Net with TOPSIS Evaluation," Mathematics, MDPI, vol. 11(5), pages 1-28, February.

