
Detecting Cyber Threats in UWF-ZeekDataFall22 Using K-Means Clustering in the Big Data Environment

Author

Listed:
  • Sikha S. Bagui

    (Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA)

  • Germano Correa Silva De Carvalho

    (Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA)

  • Asmi Mishra

    (Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA)

  • Dustin Mink

    (Department of Cybersecurity, The University of West Florida, Pensacola, FL 32514, USA)

  • Subhash C. Bagui

    (Department of Mathematics and Statistics, The University of West Florida, Pensacola, FL 32514, USA)

  • Stephanie Eager

    (Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA)

Abstract

In an era marked by the rapid growth of the Internet of Things (IoT), network security has become increasingly critical. Traditional Intrusion Detection Systems, particularly signature-based methods, struggle to identify evolving cyber threats such as Advanced Persistent Threats (APTs) and zero-day attacks. Such threats also go undetected by supervised machine-learning methods. In this paper, we apply K-means clustering, an unsupervised clustering technique, to a newly created modern network attack dataset, UWF-ZeekDataFall22. Because this dataset contains labeled Zeek logs, the labels were removed before the data was used for K-means clustering. The labels were, however, used in the evaluation phase to determine the attack clusters after clustering. To identify APTs as well as zero-day attack clusters, three different labeling heuristics were evaluated. To address the challenges posed by Big Data, Big Data frameworks, namely Apache Spark and PySpark, were used as our development environment. A further distinguishing aspect of this work is its use of connection-based features. Using these features, an in-depth study is conducted to determine the effect of the number of clusters, the seed, and the features for each of the labeling heuristics. If the objective is to detect every single attack, the results indicate that 325 clusters with a seed of 200, using an optimal set of features, would correctly place 99% of attacks.
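
The workflow the abstract describes (cluster unlabeled data with a fixed seed, then use the held-back labels to decide which clusters count as attack clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses Apache Spark and PySpark, whereas the code below is plain standard-library Python so that it runs anywhere. The abstract does not specify the three labeling heuristics, so the majority-vote rule in `label_clusters` is an assumption, as are all function names and the toy data.

```python
import random
from collections import Counter


def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's algorithm on tuples of floats.

    Returns (centroids, assignments). The seed fixes the initial
    centroids, so the same seed reproduces the same clustering.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


def label_clusters(assignments, labels, k):
    """Assumed heuristic: a cluster is an attack cluster if attacks outnumber benign rows."""
    counts = [Counter() for _ in range(k)]
    for a, y in zip(assignments, labels):
        counts[a][y] += 1
    return [c["attack"] > c["benign"] for c in counts]


def attack_recall(assignments, labels, attack_clusters):
    """Fraction of attack rows that landed in a cluster flagged as an attack cluster."""
    attacks = [a for a, y in zip(assignments, labels) if y == "attack"]
    if not attacks:
        return 0.0
    return sum(1 for a in attacks if attack_clusters[a]) / len(attacks)


# Toy stand-in for connection-based features (values are made up).
points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (9.8, 10.1), (10.2, 9.9), (10.0, 10.3)]
labels = ["benign", "benign", "benign", "attack", "attack", "attack"]

_, assignments = kmeans(points, k=2, seed=200)  # labels are not used for clustering
flags = label_clusters(assignments, labels, k=2)  # labels are used only for evaluation
print(attack_recall(assignments, labels, flags))
```

Because K-means depends on its initial centroids, the recall printed here can change with the seed and the number of clusters, which is the kind of sensitivity the study examines at scale.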

Suggested Citation

  • Sikha S. Bagui & Germano Correa Silva De Carvalho & Asmi Mishra & Dustin Mink & Subhash C. Bagui & Stephanie Eager, 2025. "Detecting Cyber Threats in UWF-ZeekDataFall22 Using K-Means Clustering in the Big Data Environment," Future Internet, MDPI, vol. 17(6), pages 1-35, June.
  • Handle: RePEc:gam:jftint:v:17:y:2025:i:6:p:267-:d:1681866

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/17/6/267/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/17/6/267/
    Download Restriction: no
