
A comparative study of iterative and non-iterative feature selection techniques for software defect prediction

Author

Listed:
  • Taghi M. Khoshgoftaar

    (Florida Atlantic University)

  • Kehan Gao

    (Eastern Connecticut State University)

  • Amri Napolitano

    (Florida Atlantic University)

  • Randall Wald

    (Florida Atlantic University)

Abstract

Two important problems which can affect the performance of classification models are high-dimensionality (an overabundance of independent features in the dataset) and imbalanced data (a skewed class distribution which creates at least one class with many fewer instances than other classes). To resolve these problems concurrently, we propose an iterative feature selection approach, which repeatedly applies data sampling (in order to address class imbalance) followed by feature selection (in order to address high-dimensionality), and finally we perform an aggregation step which combines the ranked feature lists from the separate iterations of sampling. This approach is designed to find a ranked feature list which is particularly effective on the more balanced dataset resulting from sampling while minimizing the risk of losing data through the sampling step and missing important features. To demonstrate this technique, we employ 18 different feature selection algorithms and Random Undersampling with two post-sampling class distributions. We also investigate the use of sampling and feature selection without the iterative step (e.g., using the ranked list from a single iteration, rather than combining the lists from multiple iterations), and compare these results with those from the version which uses iteration. Our study is carried out using three groups of datasets with different levels of class balance, all of which were collected from a real-world software system. All of our experiments use four different learners and one feature subset size. We find that our proposed iterative feature selection approach outperforms the non-iterative approach.
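
The procedure described in the abstract (undersample, rank features, repeat, then aggregate the ranked lists) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the class-separation score used as the ranker, the mean-rank aggregation rule, the encoding of the minority class as label 1, and all parameter values are hypothetical; the abstract does not name the 18 rankers, the aggregation rule, or the two post-sampling class distributions.

```python
import random
from statistics import mean, pstdev


def random_undersample(X, y, minority_fraction, rng):
    """Keep every minority (label 1) row; sample majority rows without replacement."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    # Majority rows needed so minority rows make up `minority_fraction` of the sample.
    n_major = round(len(minority) * (1 - minority_fraction) / minority_fraction)
    n_major = min(n_major, len(majority))
    kept = minority + rng.sample(majority, n_major)
    return [X[i] for i in kept], [y[i] for i in kept]


def rank_features(X, y):
    """Return feature indices ordered best-first by a simple class-separation score.

    Assumption: this score stands in for any filter-based feature ranker.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) + pstdev(neg)
        scores.append(abs(mean(pos) - mean(neg)) / (spread + 1e-12))
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def iterative_feature_selection(X, y, n_iterations=10, minority_fraction=0.5,
                                k=2, seed=0):
    """Repeat sampling + ranking, then aggregate the ranked lists by mean rank."""
    rng = random.Random(seed)
    n_features = len(X[0])
    rank_sums = [0] * n_features
    for _ in range(n_iterations):
        Xs, ys = random_undersample(X, y, minority_fraction, rng)
        for position, feature in enumerate(rank_features(Xs, ys)):
            rank_sums[feature] += position
    # A lower summed position means a better average rank across iterations.
    order = sorted(range(n_features), key=lambda j: rank_sums[j])
    return order[:k]
```

In this sketch, calling `iterative_feature_selection` with `n_iterations=1` corresponds to the non-iterative variant, which uses the ranked list from a single round of sampling.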

Suggested Citation

  • Taghi M. Khoshgoftaar & Kehan Gao & Amri Napolitano & Randall Wald, 2014. "A comparative study of iterative and non-iterative feature selection techniques for software defect prediction," Information Systems Frontiers, Springer, vol. 16(5), pages 801-822, November.
  • Handle: RePEc:spr:infosf:v:16:y:2014:i:5:d:10.1007_s10796-013-9430-0
    DOI: 10.1007/s10796-013-9430-0

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10796-013-9430-0
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s10796-013-9430-0?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    Citations


    Cited by:

    1. Firuz Kamalov & Ho Hon Leung & Sherif Moussa, 2022. "Monotonicity of the $\chi^2$-statistic and Feature Selection," Annals of Data Science, Springer, vol. 9(6), pages 1223-1241, December.
    2. Justin M. Johnson & Taghi M. Khoshgoftaar, 0. "The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data," Information Systems Frontiers, Springer, vol. 0, pages 1-19.
    3. Yogita Khatri & Sandeep Kumar Singh, 2023. "An effective feature selection based cross-project defect prediction model for software quality improvement," International Journal of System Assurance Engineering and Management, Springer;The Society for Reliability, Engineering Quality and Operations Management (SREQOM),India, and Division of Operation and Maintenance, Lulea University of Technology, Sweden, vol. 14(1), pages 154-172, March.
    4. Justin M. Johnson & Taghi M. Khoshgoftaar, 2020. "The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data," Information Systems Frontiers, Springer, vol. 22(5), pages 1113-1131, October.
    5. Chengcui Zhang & Elisa Bertino & Bhavani Thuraisingham & James Joshi, 2014. "Guest editorial: Information reuse, integration, and reusable systems," Information Systems Frontiers, Springer, vol. 16(5), pages 749-752, November.
