A hybrid feature extraction framework combining PCA and mutual information for gene expression based lung cancer classification

Author

Listed:
  • Syed Naseer Ahmad Shah
  • Kaartik Issar
  • Rafat Parveen

Abstract

Lung cancer remains a leading cause of cancer-related mortality worldwide, with early and accurate diagnosis posing a critical challenge for improving patient outcomes. Gene expression data provide crucial insights for lung cancer classification by revealing underlying biological mechanisms. However, the high dimensionality of such data presents challenges, including computational complexity and overfitting risks. This study proposes a hybrid feature extraction framework combining Principal Component Analysis (PCA) and Mutual Information (MI) to address these issues. PCA reduces dimensionality by capturing key variance patterns, while MI selects features highly relevant to the target class, ensuring an informative and concise feature set. Gene expression datasets from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) were integrated, focusing on common genes. The hybrid PCA-MI framework was applied to rank genes, and the selected features were used to train a Convolutional Neural Network (CNN) for lung cancer classification. The genes ranked by the hybrid model were further analysed using protein-protein interaction (PPI) networks to identify hub genes, enhancing biological interpretability. The proposed framework was benchmarked against ten other feature extraction methods, including Lasso, Random Forest, Autoencoder, and PCA alone. The CNN classifier achieved superior performance with the PCA-MI features, attaining 98% accuracy and 98% precision. Training and validation curves demonstrated stable learning behaviour, and confusion matrix analysis confirmed robust predictions. Hub gene identification through PPI analysis validated the biological significance of the ranked genes. 
This study presents a robust framework for lung cancer classification by leveraging the strengths of PCA and MI, integrating deep learning and PPI analysis to address high-dimensional data challenges, and setting a foundation for future research in multi-omics data integration and enhanced diagnostic strategies.
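
The abstract does not spell out how the PCA and MI scores are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a per-gene PCA score (absolute loadings weighted by explained-variance ratio), a histogram-based MI estimate against the class label, and a product of the two rescaled scores as the hybrid ranking. The function names (`pca_loading_scores`, `mutual_information`, `rank_genes`), the bin count, the number of components, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def pca_loading_scores(X, n_components=2):
    """Per-gene score: |PCA loadings| weighted by explained-variance ratio."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes of the centered data.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = s**2 / np.sum(s**2)
    k = min(n_components, Vt.shape[0])
    return np.abs(Vt[:k]).T @ var_ratio[:k]


def mutual_information(x, y, bins=4):
    """Histogram (plug-in) estimate of MI in nats between one gene and the labels."""
    # Quantile bin edges give roughly equal-sized expression bins.
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    xb = np.digitize(x, edges)  # bin index in 0..bins-1
    classes, yb = np.unique(y, return_inverse=True)
    joint = np.zeros((bins, classes.size))
    np.add.at(joint, (xb, yb), 1.0)  # contingency table of bin vs. class
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def rescale(v):
    """Scale a non-negative score vector so its maximum is 1."""
    top = v.max()
    return v / top if top > 0 else v


def rank_genes(X, y, n_components=2, bins=4):
    """Rank genes by the product of rescaled PCA and MI scores (assumed combination)."""
    pca = pca_loading_scores(X, n_components)
    mi = np.array([mutual_information(X[:, j], y, bins) for j in range(X.shape[1])])
    score = rescale(pca) * rescale(mi)
    return np.argsort(score)[::-1], score


# Synthetic stand-in for an expression matrix: 200 samples, 6 genes.
# Gene 0 is shifted by class label; the other genes are pure noise.
rng = np.random.default_rng(0)
y = np.arange(200) % 2
X = rng.normal(size=(200, 6))
X[:, 0] += 4 * y

order, score = rank_genes(X, y)
print("genes ranked by hybrid score:", order)
```

A product was chosen here so that a gene must score well on both variance contribution and class relevance to rank highly; a weighted sum would be an equally reasonable assumption. The top-ranked columns could then be passed to any downstream classifier, such as the CNN the abstract describes.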

Suggested Citation

  • Syed Naseer Ahmad Shah & Kaartik Issar & Rafat Parveen, 2026. "A hybrid feature extraction framework combining PCA and mutual information for gene expression based lung cancer classification," PLOS ONE, Public Library of Science, vol. 21(2), pages 1-28, February.
  • Handle: RePEc:plo:pone00:0342160
    DOI: 10.1371/journal.pone.0342160

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0342160
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0342160&type=printable
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Xiaobo Yang & Zhilong Mi & Qingcai He & Binghui Guo & Zhiming Zheng, 2023. "Identification of Vital Genes for NSCLC Integrating Mutual Information and Synergy," Mathematics, MDPI, vol. 11(6), pages 1-15, March.
    2. Xiang Ge Luo & Jack Kuipers & Niko Beerenwinkel, 2023. "Joint inference of exclusivity patterns and recurrent trajectories from tumor mutation trees," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    3. Chengdi Wang & Jingwei Li & Jingyao Chen & Zhoufeng Wang & Guonian Zhu & Lujia Song & Jiayang Wu & Changshu Li & Rong Qiu & Xuelan Chen & Li Zhang & Weimin Li, 2025. "Multi-omics analyses reveal biological and clinical insights in recurrent stage I non-small cell lung cancer," Nature Communications, Nature, vol. 16(1), pages 1-19, December.
    4. Michael J. P. Crowley & Bhavneet Bhinder & Geoffrey J. Markowitz & Mitchell Martin & Akanksha Verma & Tito A. Sandoval & Chang-Suk Chae & Shira Yomtoubian & Yang Hu & Sahil Chopra & Diamile A. Tavarez, 2023. "Tumor-intrinsic IRE1α signaling controls protective immunity in lung cancer," Nature Communications, Nature, vol. 14(1), pages 1-16, December.
    5. Chen Ni & Xiaohan Lou & Xiaohan Yao & Linlin Wang & Jiajia Wan & Xixi Duan & Jialu Liang & Kaili Zhang & Yuanyuan Yang & Li Zhang & Chanjun Sun & Zhenzhen Li & Ming Wang & Linyu Zhu & Dekang Lv & Zhih, 2022. "ZIP1+ fibroblasts protect lung cancer against chemotherapy via connexin-43 mediated intercellular Zn2+ transfer," Nature Communications, Nature, vol. 13(1), pages 1-20, December.
    6. Feras E. Machour & Enas R. Abu-Zhayia & Joyce Kamar & Alma Sophia Barisaac & Itamar Simon & Nabieh Ayoub, 2024. "Harnessing DNA replication stress to target RBM10 deficiency in lung adenocarcinoma," Nature Communications, Nature, vol. 15(1), pages 1-18, December.
    7. Tatjana Sajic & Matej Vizovišek & Stephan Arni & Rodolfo Ciuffa & Martin Mehnert & Sébastien Lenglet & Walter Weder & Hector Gallart-Ayala & Julijana Ivanisevic & Marija Buljan & Aurelien Thomas & Sve, 2025. "Depletion-dependent activity-based protein profiling using SWATH/DIA-MS detects serine hydrolase lipid remodeling in lung adenocarcinoma progression," Nature Communications, Nature, vol. 16(1), pages 1-24, December.
    8. Zebin Gao & Guoxun Zhang & Hengrui Liang & Jiaxin Liu & Liangdi Ma & Tianyun Wang & Yanchen Guo & YuJia Chen & Zeping Yan & Xiangru Chen & Jianxing He & Feng Xu & Tien Yin Wong & Yuchen Guo & Qionghai, 2026. "A lung CT vision foundation model facilitating disease diagnosis and medical imaging," Nature Communications, Nature, vol. 17(1), pages 1-17, December.
    9. Meiting Yue & Zhen Qin & Shijie Tang & Xinlei Cai & Yikai Zhao & Chen Yang & Liang Chen & Luonan Chen & Hongbin Ji, 2026. "Concurrent PIK3CA mutant promotes cachexia through inflammatory signaling in EGFR mutant lung cancer," Nature Communications, Nature, vol. 17(1), pages 1-14, December.
    10. Abhijeet Das & Manas Sehgal & Ashwini Singh & Rishabh Goyal & Mallika Prabhakar & Jeremy Fricke & Isa Mambetsariev & Prakash Kulkarni & Mohit Kumar Jolly & Ravi Salgia, 2025. "DNA walk of specific fused oncogenes exhibit distinct fractal geometric characteristics in nucleotide patterns," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 662(C).
    11. Wei Jiang & Qitao Yu, 2019. "LKB1, a Key Driver Gene of Human Lung Squamous Cell Carcinoma," Biomedical Journal of Scientific & Technical Research, Biomedical Research Network+, LLC, vol. 19(3), pages 14335-14336, July.
    12. Meng Nie & Ke Yao & Xinsheng Zhu & Na Chen & Nan Xiao & Yi Wang & Bo Peng & LiAng Yao & Peng Li & Peng Zhang & Zeping Hu, 2021. "Evolutionary metabolic landscape from preneoplasia to invasive lung adenocarcinoma," Nature Communications, Nature, vol. 12(1), pages 1-13, December.
    13. Yu-Yang Bi & Qiu Chen & Ming-Yuan Yang & Lei Xing & Hu-Lin Jiang, 2024. "Nanoparticles targeting mutant p53 overcome chemoresistance and tumor recurrence in non-small cell lung cancer," Nature Communications, Nature, vol. 15(1), pages 1-18, December.
