Printed from https://ideas.repec.org/a/spr/infosf/v22y2020i5d10.1007_s10796-020-10022-7.html

The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data

Author

Listed:
  • Justin M. Johnson

    (Florida Atlantic University)

  • Taghi M. Khoshgoftaar

    (Florida Atlantic University)

Abstract

Training predictive models with class-imbalanced data has proven to be a difficult task. This problem is well studied, but the era of big data is producing more extreme levels of imbalance that are increasingly difficult to model. We use three data sets of varying complexity to evaluate data sampling strategies for treating high class imbalance with deep neural networks and big data. Sampling rates are varied to create training distributions with positive class sizes ranging from 0.025% to 90%. The area under the receiver operating characteristic (ROC) curve is used to compare performance, and thresholding is used to maximize class performance. Random over-sampling (ROS) consistently outperforms random under-sampling (RUS) and baseline methods. The majority class proves susceptible to misrepresentation when using RUS, and results suggest that each data set is uniquely sensitive to imbalance and sample size. The hybrid ROS-RUS maximizes performance and efficiency, and is our preferred method for treating high imbalance within big data problems.
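The three sampling strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the hybrid is configured, so the function names and the `neg_keep_frac` and `target_pos_frac` parameters are hypothetical, and the sketch assumes binary labels (1 = positive/minority, 0 = negative/majority) with at least one positive example and a target fraction strictly between 0 and 1. Each function returns row indices rather than copying data.

```python
import random


def split_indices(labels):
    """Return (positive_indices, negative_indices) for binary labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return pos, neg


def rus_indices(labels, target_pos_frac, rng):
    """Random under-sampling: drop negatives until positives reach the target fraction."""
    pos, neg = split_indices(labels)
    n_neg = round(len(pos) * (1 - target_pos_frac) / target_pos_frac)
    n_neg = min(n_neg, len(neg))  # cannot keep more negatives than exist
    return pos + rng.sample(neg, n_neg)  # sampled without replacement


def ros_indices(labels, target_pos_frac, rng):
    """Random over-sampling: duplicate positives until they reach the target fraction."""
    pos, neg = split_indices(labels)
    n_pos = round(len(neg) * target_pos_frac / (1 - target_pos_frac))
    n_extra = max(n_pos - len(pos), 0)
    return neg + pos + rng.choices(pos, k=n_extra)  # sampled with replacement


def hybrid_indices(labels, neg_keep_frac, target_pos_frac, rng):
    """Hybrid (assumed order): under-sample negatives first, then over-sample positives."""
    pos, neg = split_indices(labels)
    kept_neg = rng.sample(neg, round(len(neg) * neg_keep_frac))
    n_pos = round(len(kept_neg) * target_pos_frac / (1 - target_pos_frac))
    n_extra = max(n_pos - len(pos), 0)
    return kept_neg + pos + rng.choices(pos, k=n_extra)


def pos_frac(labels, idx):
    """Fraction of positive labels among the selected indices."""
    return sum(labels[i] for i in idx) / len(idx)


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [1] * 100 + [0] * 9900  # synthetic data, 1% positive
    for name, idx in [
        ("RUS", rus_indices(labels, 0.5, rng)),
        ("ROS", ros_indices(labels, 0.5, rng)),
        ("ROS-RUS", hybrid_indices(labels, 0.1, 0.5, rng)),
    ]:
        print(name, len(idx), pos_frac(labels, idx))
```

On this synthetic 1%-positive set, all three reach a 50% positive training distribution, but with very different sizes: 200 rows for RUS, 19,800 for ROS, and 1,980 for the hybrid. That size difference is one plausible reading of the efficiency trade-off the abstract mentions.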

Suggested Citation

  • Justin M. Johnson & Taghi M. Khoshgoftaar, 2020. "The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data," Information Systems Frontiers, Springer, vol. 22(5), pages 1113-1131, October.
  • Handle: RePEc:spr:infosf:v:22:y:2020:i:5:d:10.1007_s10796-020-10022-7
    DOI: 10.1007/s10796-020-10022-7

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10796-020-10022-7
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Citations

    Cited by:

    1. Lydia Bouzar-Benlabiod & Stuart H. Rubin, 2020. "Heuristic Acquisition for Data Science," Information Systems Frontiers, Springer, vol. 22(5), pages 1001-1007, October.
    2. Yoon Sang Lee & Chulhwan Chris Bang, 2022. "Framework for the Classification of Imbalanced Structured Data Using Under-sampling and Convolutional Neural Network," Information Systems Frontiers, Springer, vol. 24(6), pages 1795-1809, December.
    3. Christian Kauten & Ashish Gupta & Xiao Qin & Glenn Richey, 2022. "Predicting Blood Donors Using Machine Learning Techniques," Information Systems Frontiers, Springer, vol. 24(5), pages 1547-1562, October.
    4. Haixia Sun & Shujuan Zhang & Rui Ren & Liyang Su, 2022. "Maturity Classification of “Hupingzao” Jujubes with an Imbalanced Dataset Based on Improved MobileNet V2," Agriculture, MDPI, vol. 12(9), pages 1-16, August.
    5. Haitham Abdulmohsin Afan & Ayman Yafouz & Ahmed H. Birima & Ali Najah Ahmed & Ozgur Kisi & Barkha Chaplot & Ahmed El-Shafie, 2022. "Linear and stratified sampling-based deep learning models for improving the river streamflow forecasting to mitigate flooding disaster," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 112(2), pages 1527-1545, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Justin M. Johnson & Taghi M. Khoshgoftaar, 0. "The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data," Information Systems Frontiers, Springer, vol. 0, pages 1-19.
    2. Ashish Gupta & Amit Deokar & Lakshmi Iyer & Ramesh Sharda & Dave Schrader, 2018. "Big Data & Analytics for Societal Impact: Recent Research and Trends," Information Systems Frontiers, Springer, vol. 20(2), pages 185-194, April.
    3. Yogita Khatri & Sandeep Kumar Singh, 2023. "An effective feature selection based cross-project defect prediction model for software quality improvement," International Journal of System Assurance Engineering and Management, Springer;The Society for Reliability, Engineering Quality and Operations Management (SREQOM),India, and Division of Operation and Maintenance, Lulea University of Technology, Sweden, vol. 14(1), pages 154-172, March.
    4. Qizhi Tao & Yizhe Dong & Ziming Lin, 2017. "Who can get money? Evidence from the Chinese peer-to-peer lending platform," Information Systems Frontiers, Springer, vol. 19(3), pages 425-441, June.
    5. Saba Bashir & Usman Qamar & Farhan Hassan Khan, 0. "WebMAC: A web based clinical expert system," Information Systems Frontiers, Springer, vol. 0, pages 1-17.
    6. Prabhsimran Singh & Surleen Kaur & Abdullah M. Baabdullah & Yogesh K. Dwivedi & Sandeep Sharma & Ravinder Singh Sawhney & Ronnie Das, 2023. "Is #SDG13 Trending Online? Insights from Climate Change Discussions on Twitter," Information Systems Frontiers, Springer, vol. 25(1), pages 199-219, February.
    7. Qizhi Tao & Yizhe Dong & Ziming Lin, 0. "Who can get money? Evidence from the Chinese peer-to-peer lending platform," Information Systems Frontiers, Springer, vol. 0, pages 1-17.
    8. Chengcui Zhang & Elisa Bertino & Bhavani Thuraisingham & James Joshi, 2014. "Guest editorial: Information reuse, integration, and reusable systems," Information Systems Frontiers, Springer, vol. 16(5), pages 749-752, November.
    9. Yogesh K. Dwivedi & Marijn Janssen & Emma L. Slade & Nripendra P. Rana & Vishanth Weerakkody & Jeremy Millard & Jan Hidders & Dhoya Snijders, 0. "Driving innovation through big open linked data (BOLD): Exploring antecedents using interpretive structural modelling," Information Systems Frontiers, Springer, vol. 0, pages 1-16.
    10. Bram Klievink & Bart-Jan Romijn & Scott Cunningham & Hans Bruijn, 0. "Big data in the public sector: Uncertainties and readiness," Information Systems Frontiers, Springer, vol. 0, pages 1-17.
    11. Bendik Bygstad & Egil Øvrelid & Thomas Lie & Magnus Bergquist, 0. "Developing and Organizing an Analytics Capability for Patient Flow in a General Hospital," Information Systems Frontiers, Springer, vol. 0, pages 1-12.
    12. Firuz Kamalov & Ho Hon Leung & Sherif Moussa, 2022. "Monotonicity of the $\chi^2$-statistic and Feature Selection," Annals of Data Science, Springer, vol. 9(6), pages 1223-1241, December.
    13. Xuan Wang & Jun Sun & Ying Wang & Yi Liu, 2022. "Deepen electronic health record diffusion beyond breadth: game changers and decision drivers," Information Systems Frontiers, Springer, vol. 24(2), pages 537-548, April.
    14. Bram Klievink & Bart-Jan Romijn & Scott Cunningham & Hans Bruijn, 2017. "Big data in the public sector: Uncertainties and readiness," Information Systems Frontiers, Springer, vol. 19(2), pages 267-283, April.
    15. Venugopal Gopalakrishna-Remani & Robert Paul Jones & Kerri M. Camp, 2019. "Levels of EMR Adoption in U.S. Hospitals: An Empirical Examination of Absorptive Capacity, Institutional Pressures, Top Management Beliefs, and Participation," Information Systems Frontiers, Springer, vol. 21(6), pages 1325-1344, December.
    16. Saba Bashir & Usman Qamar & Farhan Hassan Khan, 2018. "WebMAC: A web based clinical expert system," Information Systems Frontiers, Springer, vol. 20(5), pages 1135-1151, October.
    17. Yiğit Kazançoğlu & Muhittin Sağnak & Çisem Lafcı & Sunil Luthra & Anil Kumar & Caner Taçoğlu, 2021. "Big Data-Enabled Solutions Framework to Overcoming the Barriers to Circular Economy Initiatives in Healthcare Sector," IJERPH, MDPI, vol. 18(14), pages 1-21, July.
    18. Thouraya Bouabana-Tebibel & Stuart H. Rubin & Lydia Bouzar-Benlabiod, 2019. "Guest Editorial: Recent Trends in Reuse and Integration," Information Systems Frontiers, Springer, vol. 21(1), pages 1-3, February.
    19. Yogesh K. Dwivedi & Marijn Janssen & Emma L. Slade & Nripendra P. Rana & Vishanth Weerakkody & Jeremy Millard & Jan Hidders & Dhoya Snijders, 2017. "Driving innovation through big open linked data (BOLD): Exploring antecedents using interpretive structural modelling," Information Systems Frontiers, Springer, vol. 19(2), pages 197-212, April.
    20. Hsu-Hua Ho & Jien-Jou Lin & Jia-Qiao Gong & Tzu-Yi Yu, 2022. "An Empirical Study for Senior Citizens Using a Customized Medical Informatics System for Dementia Diagnosis and Analysis," Sustainability, MDPI, vol. 14(15), pages 1-22, July.
