EODA: A three-stage efficient outlier detection approach using Boruta-RF feature selection and enhanced KNN-based clustering algorithm

Authors
  • Sunil Kumar
  • Sudeep Varshney
  • Usha Jain
  • Prashant Johri
  • Abdulaziz S Almazyad
  • Ali Wagdy Mohamed
  • Mehdi Hosseinzadeh
  • Mohammad Shokouhifar

Abstract

Outlier detection is essential for identifying unusual patterns or observations that deviate significantly from the normal behavior of a dataset. With the rapid growth of data science, anomalies and outliers have become more prevalent; they can disrupt system modeling and parameter estimation, leading to inaccurate results. Recently, deep learning-based outlier detection methods have gained significant attention, but their performance is often limited by challenges in parameter selection and nearest neighbor search. To overcome these limitations, we propose a three-stage Efficient Outlier Detection Approach (EODA) that not only detects outliers with high accuracy but also emphasizes dataset characteristics. In the first stage, we apply a feature selection algorithm based on the Boruta method and Random Forest to reduce the data size by selecting the most relevant attributes and calculating the highest Z-score of shadow features. In the second stage, we improve the K-nearest neighbors algorithm to enhance the accuracy of nearest neighbor identification in the clustering phase. Finally, the third stage efficiently identifies the most significant outliers within the clustered datasets. We evaluate the proposed EODA algorithm on eight datasets from the UCI Machine Learning Repository. The results demonstrate the effectiveness of EODA, which achieves a Precision of 63.07%, a Recall of 82.49%, and an F1-Score of 64.53%, outperforming existing techniques in the field.
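To make the nearest-neighbor and evaluation ideas in the abstract concrete, here is a minimal sketch in plain Python. It is not the EODA algorithm: it uses a generic mean k-nearest-neighbor distance as the outlier score, and the function names (`knn_outlier_scores`, `flag_outliers`, `precision_recall_f1`), the toy data, and the choices `k=2` and `n_outliers=3` are all assumptions made for illustration only.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by its mean Euclidean distance to its k nearest neighbors.

    Generic kNN-distance score (an assumption for illustration),
    not the enhanced KNN procedure described in the paper.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def flag_outliers(scores, n_outliers):
    """Return the indices of the n_outliers points with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_outliers])


def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 for sets of flagged and true outlier indices."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: one tight cluster plus two distant points (indices 5 and 6).
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (5.0, 5.0), (-4.0, 6.0),
]
true_outliers = {5, 6}

scores = knn_outlier_scores(points, k=2)
predicted = flag_outliers(scores, n_outliers=3)  # deliberately flags one point too many
p, r, f1 = precision_recall_f1(predicted, true_outliers)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Flagging three points when only two are true outliers catches both of them (full recall) at the cost of one false positive, which is the same kind of precision/recall trade-off the reported figures reflect.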

Suggested Citation

  • Sunil Kumar & Sudeep Varshney & Usha Jain & Prashant Johri & Abdulaziz S Almazyad & Ali Wagdy Mohamed & Mehdi Hosseinzadeh & Mohammad Shokouhifar, 2025. "EODA: A three-stage efficient outlier detection approach using Boruta-RF feature selection and enhanced KNN-based clustering algorithm," PLOS ONE, Public Library of Science, vol. 20(5), pages 1-25, May.
  • Handle: RePEc:plo:pone00:0322738
    DOI: 10.1371/journal.pone.0322738

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0322738
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0322738&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0322738?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Douglas M. Hawkins, 1980. "Critical Values for Identifying Outliers," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 29(1), pages 95-96, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Karol Pilot & Alicja Ganczarek-Gamrot & Krzysztof Kania, 2024. "Dealing with Anomalies in Day-Ahead Market Prediction Using Machine Learning Hybrid Model," Energies, MDPI, vol. 17(17), pages 1-20, September.
    2. Francesca Ieva & Anna Maria Paganoni, 2020. "Component-wise outlier detection methods for robustifying multivariate functional samples," Statistical Papers, Springer, vol. 61(2), pages 595-614, April.
    3. Andrzej Chmielowiec, 2021. "Algorithm for error-free determination of the variance of all contiguous subsequences and fixed-length contiguous subsequences for a sequence of industrial measurement data," Computational Statistics, Springer, vol. 36(4), pages 2813-2840, December.
    4. Marc Chataigner & Stéphane Crépey & Jiang Pu, 2020. "Nowcasting Networks," Post-Print hal-03910123, HAL.
    5. Greco, Salvatore & Ishizaka, Alessio & Tasiou, Menelaos & Torrisi, Gianpiero, 2019. "Sigma-Mu efficiency analysis: A methodology for evaluating units through composite indicators," European Journal of Operational Research, Elsevier, vol. 278(3), pages 942-960.
    6. David Juárez-Varón & Victoria Tur-Viñes & Alejandro Rabasa-Dolado & Kristina Polotskaya, 2020. "An Adaptive Machine Learning Methodology Applied to Neuromarketing Analysis: Prediction of Consumer Behaviour Regarding the Key Elements of the Packaging Design of an Educational Toy," Social Sciences, MDPI, vol. 9(9), pages 1-23, September.
    7. Zhongqiu Wang & Guan Yuan & Haoran Pei & Yanmei Zhang & Xiao Liu, 2020. "Unsupervised learning trajectory anomaly detection algorithm based on deep representation," International Journal of Distributed Sensor Networks, vol. 16(12), pages 15501477209, December.
    8. Arata, Linda & Fabrizi, Enrico & Sckokai, Paolo, 2020. "A worldwide analysis of trend in crop yields and yield variability: Evidence from FAO data," Economic Modelling, Elsevier, vol. 90(C), pages 190-208.
    9. Wentao Yang & Huaxi He & Dongsheng Wei & Hao Chen, 2022. "Generating pseudo-absence samples of invasive species based on outlier detection in the geographical characteristic space," Journal of Geographical Systems, Springer, vol. 24(2), pages 261-279, April.
    10. Fournier, Nicholas PhD & Farid, Yashar Zeinali PhD & Patire, Anthony David PhD, 2021. "Potential Erroneous Degradation of High Occupancy Vehicle (HOV) Facilities," Institute of Transportation Studies, Research Reports, Working Papers, Proceedings qt3z76r7tj, Institute of Transportation Studies, UC Berkeley.
    11. Puteri Paramita & Zuduo Zheng & Md Mazharul Haque & Simon Washington & Paul Hyland, 2018. "User satisfaction with train fares: A comparative analysis in five Australian cities," PLOS ONE, Public Library of Science, vol. 13(6), pages 1-26, June.
    12. Liqun Diao & Grace Y. Yi, 2023. "Classification Trees with Mismeasured Responses," Journal of Classification, Springer;The Classification Society, vol. 40(1), pages 168-191, April.
    13. Gasser, Patrick, 2020. "A review on energy security indices to compare country performances," Energy Policy, Elsevier, vol. 139(C).
    14. Nirpeksh Kumar, 2019. "Exact distributions of tests of outliers for exponential samples," Statistical Papers, Springer, vol. 60(6), pages 2031-2061, December.
    15. Stanley Munamato Mbiva & Fabio Mathias Correa, 2024. "Machine Learning to Enhance the Detection of Terrorist Financing and Suspicious Transactions in Migrant Remittances," JRFM, MDPI, vol. 17(5), pages 1-19, April.
    16. Taha Yehia & Ali Wahba & Sondos Mostafa & Omar Mahmoud, 2022. "Suitability of Different Machine Learning Outlier Detection Algorithms to Improve Shale Gas Production Data for Effective Decline Curve Analysis," Energies, MDPI, vol. 15(23), pages 1-25, November.
    17. Beata Gavurova & Jaroslav Belas & Katarina Zvarikova & Martin Rigelsky & Viera Ivankova, 2021. "The Effect of Education and R&D on Tourism Spending in OECD Countries: An Empirical Study," The AMFITEATRU ECONOMIC journal, Academy of Economic Studies - Bucharest, Romania, vol. 23(58), pages 806-806, August.
    18. Antenangeli Leonardo & Cantú Francisco, 2019. "Right on Time: An Electoral Audit for the Publication of Vote Results," Statistics, Politics and Policy, De Gruyter, vol. 10(2), pages 137-186, December.
    19. Yeon-Jin Sim & Jeongmin Kim & Jaehyeon Choi & Jun-Ho Huh, 2022. "System Design for Detecting Real Estate Speculation Abusing Inside Information: For the Fair Reallocation of Land," Land, MDPI, vol. 11(4), pages 1-17, April.
    20. Marcel Clermont & Julia Schaefer, 2019. "Identification of Outliers in Data Envelopment Analysis," Schmalenbach Business Review, Springer;Schmalenbach-Gesellschaft, vol. 71(4), pages 475-496, October.
