Printed from https://ideas.repec.org/a/spr/annopr/v348y2025i1d10.1007_s10479-023-05691-x.html

A framework to predict second primary lung cancer patients by using ensemble models

Author

Listed:
  • Yen-Chun Huang

    (Tamkang University)

  • Chieh-Wen Ho

    (Department of Biology, Texas A&M University)

  • Wen-Ru Chou

    (Fu Jen Catholic University)

  • Mingchih Chen

    (Fu Jen Catholic University)

Abstract

Machine learning (ML) prediction models, which have recently been widely used in the healthcare industry, serve as tools that help users make quick decisions. Their predictions could improve treatment outcomes and reduce medical expenses. This research proposes an ML-based decision tool to predict the probability of second primary lung cancer among lung cancer patients. The tool comprises four stages. The first stage is data processing, which selects the target patients from the National Health Insurance Research Database for the 2011 to 2016 study period. The second stage uses the synthetic minority oversampling technique (SMOTE) to balance the data. The third stage is feature selection. In the final stage, five ML algorithms, namely Logistic Regression (LGR), Decision Tree, Random Forests (RF), multivariate adaptive regression splines (MARS), and extreme gradient boosting (XGBoost), are applied with the optimal features, followed by building ensemble models. The results show that, after feature selection, the ensemble models yield an accuracy of 0.932. Different types of therapy (chemotherapy (CH), radiotherapy (RT), and tyrosine kinase inhibitor (TKI)), different clinical stages, and epidermal growth factor receptor (EGFR) status were the top five optimal features affecting the development of second primary lung cancer. This study can help physicians identify lung cancer patients who are likely to develop second primary lung cancer and make complete treatment plans for them.
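To make the balancing and ensemble stages concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours, and it assumes hard (majority) voting to combine base models; the abstract does not state how the ensemble is built. The function names, the toy data, and the parameters `k` and `seed` are all hypothetical.

```python
import math
import random
from collections import Counter


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def majority_vote(predictions):
    """Hard-voting ensemble (an assumption): the most common label wins."""
    return Counter(predictions).most_common(1)[0][0]


# Toy, made-up data with two features per record; not drawn from the study.
majority = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
            (0.2, 0.4), (0.4, 0.2), (0.1, 0.4)]
minority = [(0.8, 0.9), (0.9, 0.8), (1.0, 1.0)]

# Add synthetic minority samples until both classes have the same size.
n_needed = len(majority) - len(minority)
balanced_minority = minority + smote_oversample(minority, n_needed)
print(len(majority), len(balanced_minority))

# Hypothetical labels from three base classifiers for a single patient.
print(majority_vote([1, 0, 1]))
```

In practice, a maintained library implementation of SMOTE and of the five base classifiers would replace these helpers. The sketch only shows that oversampling adds interpolated minority records rather than duplicating existing ones, and that an ensemble reduces several model outputs to a single prediction.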

Suggested Citation

  • Yen-Chun Huang & Chieh-Wen Ho & Wen-Ru Chou & Mingchih Chen, 2025. "A framework to predict second primary lung cancer patients by using ensemble models," Annals of Operations Research, Springer, vol. 348(1), pages 373-397, May.
  • Handle: RePEc:spr:annopr:v:348:y:2025:i:1:d:10.1007_s10479-023-05691-x
    DOI: 10.1007/s10479-023-05691-x

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10479-023-05691-x
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.




