Cyber Security Data Science: Machine Learning Methods and Their Performance on Imbalanced Datasets

In: Digital Management and Artificial Intelligence

Author

Listed:
  • Mateo Lopez-Ledezma

    (Universidad Privada Boliviana)

  • Gissel Velarde

    (IU International University of Applied Sciences)

Abstract

Cybersecurity has become essential worldwide and at all levels, concerning individuals, institutions, and governments. A basic principle in cybersecurity is to always be alert. Therefore, automation is imperative in processes where the volume of daily operations is large. Several cybersecurity applications can be addressed as binary classification problems, including anomaly detection, fraud detection, intrusion detection, spam detection, and malware detection. In many cases, the positive class samples, those that represent a problem, occur at a much lower frequency than negative samples. This poses a challenge for machine learning algorithms, since learning patterns from under-represented samples is hard. This is known in machine learning as imbalance learning. In this study, we systematically evaluate various machine learning methods using two representative financial datasets containing numerical and categorical features. The Credit Card dataset contains 283726 samples and 31 features, and 0.2 percent of the transactions are fraudulent (imbalance ratio of 598.84:1). The PaySim dataset contains 6362620 samples and 11 features, and 0.13 percent of the transactions are fraudulent (imbalance ratio of 773.70:1). We present three experiments. In the first experiment, we evaluate single classifiers including Random Forests, Light Gradient Boosting Machine, eXtreme Gradient Boosting, Logistic Regression, Decision Tree, and Gradient Boosting Decision Tree. In the second experiment, we test different sampling techniques including over-sampling, under-sampling, Synthetic Minority Over-sampling Technique, and Self-Paced Ensembling. In the last experiment, we evaluate Self-Paced Ensembling and its number of base classifiers. We found that imbalance learning techniques had positive and negative effects, as reported in related studies. Thus, these techniques should be applied with caution. Besides, we found different best performers for each dataset. Therefore, we recommend testing single classifiers and imbalance learning techniques for each new dataset and application involving imbalanced datasets, as is the case in several cybersecurity applications. We provide the code with all experiments as open source (available at https://github.com/MateoLopez00/Imbalanced-Learning-Empirical-Evaluation).
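
To make the imbalance figures and the simplest sampling techniques named in the abstract concrete, the following is a minimal Python sketch. It is not taken from the authors' repository. The function names, the toy label list, and the choice to balance classes exactly 1:1 are all assumptions made for illustration; the chapter's own experiments may configure sampling differently. The first function only checks arithmetic: an imbalance ratio of r:1 (negatives to positives) implies a positive fraction of 1 / (r + 1).

```python
import random
from collections import Counter


def positive_fraction(imbalance_ratio):
    """Fraction of positive samples implied by a negatives:positives ratio of r:1."""
    return 1.0 / (imbalance_ratio + 1.0)


def imbalance_ratio(labels):
    """Number of negative samples (label 0) per positive sample (label 1)."""
    counts = Counter(labels)
    return counts[0] / counts[1]


def random_under_sample(labels, rng):
    """Indices of all positives plus an equally sized random subset of negatives."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return sorted(pos + rng.sample(neg, len(pos)))


def random_over_sample(labels, rng):
    """Indices of all samples plus positives redrawn with replacement until balanced."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    extra = rng.choices(pos, k=len(neg) - len(pos))
    return sorted(neg + pos + extra)


# Ratios quoted in the abstract, converted to percentages of positive samples.
credit_card_pct = 100 * positive_fraction(598.84)
paysim_pct = 100 * positive_fraction(773.70)

# Hypothetical toy labels (not from either dataset): 995 negatives, 5 positives.
toy_labels = [0] * 995 + [1] * 5
rng = random.Random(0)  # fixed seed so the sketch is reproducible
under_idx = random_under_sample(toy_labels, rng)
over_idx = random_over_sample(toy_labels, rng)

print(f"Credit Card: {credit_card_pct:.2f}% positive, PaySim: {paysim_pct:.2f}% positive")
print("toy ratio:", imbalance_ratio(toy_labels))
print("under-sampled size:", len(under_idx), "over-sampled size:", len(over_idx))
```

The computed percentages (about 0.17 and 0.13) agree with the rounded values of 0.2 and 0.13 percent stated in the abstract. In practice, resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class distribution.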

Suggested Citation

  • Mateo Lopez-Ledezma & Gissel Velarde, 2025. "Cyber Security Data Science: Machine Learning Methods and Their Performance on Imbalanced Datasets," Springer Proceedings in Business and Economics, in: Richard C. Geibel & Shalva Machavariani (ed.), Digital Management and Artificial Intelligence, pages 569-578, Springer.
  • Handle: RePEc:spr:prbchp:978-3-031-88052-0_45
    DOI: 10.1007/978-3-031-88052-0_45