
An effective up-sampling approach for breast cancer prediction with imbalanced data: A machine learning model-based comparative analysis

Authors

  • Tuan Tran
  • Uyen Le
  • Yihui Shi

Abstract

Early detection of breast cancer plays a critical role in successful treatment, which saves thousands of patients' lives every year. Although healthcare organizations have collected and stored massive amounts of clinical data, only a small portion of it has been used to support treatment decisions. In this study, we proposed an engineered up-sampling method (ENUS) for handling imbalanced data to improve the predictive performance of machine learning models. Our experimental results showed that when the ratio of the minority to the majority class is less than 20%, training models with ENUS improved balanced accuracy by 3.74%, sensitivity by 8.36%, and F1 score by 3.83%. Our study also found that XGBoost Tree (XGBTree) using ENUS achieved the best performance on the validation dataset, with an average balanced accuracy of 97.47% (min = 93%, max = 100%), sensitivity of 97.88% (min = 89%, max = 100%), and F1 score of 96.20% (min = 89.5%, max = 100%). Furthermore, our ensemble algorithm identified Cell_Shape and Nuclei as the most important attributes in predicting breast cancer. This finding reaffirms, using a data-driven approach, the previously known relationship between Cell_Shape, Nuclei, and the grades of breast cancer. Finally, our experiments showed that the Random Forest and Neural Network models had the shortest training times. Our study provides a comprehensive comparison of a wide range of machine learning methods for predicting breast cancer risk. It can be used as a tool to help healthcare practitioners detect and treat breast cancer effectively.
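
To make the terms in the abstract concrete, the following is a minimal Python sketch of up-sampling a minority class and of the three reported metrics (balanced accuracy, sensitivity, F1 score). The abstract does not describe how ENUS constructs new samples, so the sketch substitutes plain random over-sampling with replacement; it is not the authors' method. The function names random_upsample and binary_metrics and the toy data are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def random_upsample(rows, labels, seed=0):
    """Plain random over-sampling (an assumption, NOT the paper's ENUS):
    duplicate rows of each smaller class, sampled with replacement,
    until every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, balanced accuracy and F1
    from the confusion-matrix counts of a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
    }


# Toy training set: 9 majority rows (label 0), 1 minority row (label 1).
rows = [[i] for i in range(10)]
labels = [0] * 9 + [1]
new_rows, new_labels = random_upsample(rows, labels)
print(Counter(new_labels))

# Toy predictions: 3 true positives, 1 false negative,
# 5 true negatives, 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

In a sketch like this, up-sampling would normally be applied to the training split only, so that duplicated minority rows do not leak into the validation data on which the metrics are computed.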

Suggested Citation

  • Tuan Tran & Uyen Le & Yihui Shi, 2022. "An effective up-sampling approach for breast cancer prediction with imbalanced data: A machine learning model-based comparative analysis," PLOS ONE, Public Library of Science, vol. 17(5), pages 1-30, May.
  • Handle: RePEc:plo:pone00:0269135
    DOI: 10.1371/journal.pone.0269135

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0269135
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0269135&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Gigi F Stark & Gregory R Hart & Bradley J Nartowt & Jun Deng, 2019. "Predicting breast cancer risk using personal health data and machine learning models," PLOS ONE, Public Library of Science, vol. 14(12), pages 1-17, December.

    Citations

    Cited by:

    1. Raphael Kirchgaessner & Cameron Watson & Allison Creason & Kaya Keutler & Jeremy Goecks, 2025. "Imputing single-cell protein abundance in multiplex tissue imaging," Nature Communications, Nature, vol. 16(1), pages 1-14, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Ervasti, Jenni & Pentti, Jaana & Seppälä, Piia & Ropponen, Annina & Virtanen, Marianna & Elovainio, Marko & Chandola, Tarani & Kivimäki, Mika & Airaksinen, Jaakko, 2023. "Prediction of bullying at work: A data-driven analysis of the Finnish public sector cohort study," Social Science & Medicine, Elsevier, vol. 317(C).
