Printed from https://ideas.repec.org/a/bpj/sagmbi/v16y2017i3p199-216n3.html

Comparing the performance of linear and nonlinear principal components in the context of high-dimensional genomic data integration

Author

Listed:
  • Islam Shofiqul

    (Population Health Research Institute, McMaster University and Hamilton Health Sciences, Hamilton, Ontario, Canada)

  • Anand Sonia

    (Population Health Research Institute, McMaster University and Hamilton Health Sciences, Hamilton, Ontario, Canada)

  • Hamid Jemila

    (Department of Medicine, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4K1, Canada)

  • Thabane Lehana

    (Population Health Research Institute, McMaster University and Hamilton Health Sciences, Hamilton, Ontario, Canada)

  • Beyene Joseph

    (Department of Medicine, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4K1, Canada)

Abstract

Linear principal component analysis (PCA) is widely used to reduce the dimension of gene or miRNA expression data sets. The method relies on a linearity assumption, which often fails to capture the patterns and relationships inherent in the data. A nonlinear approach such as kernel PCA might therefore be better suited. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel PCA for data integration and death classification. We also compare the two methods on a real data set containing gene and miRNA expression of lung cancer patients. In this setting, the first few kernel principal components perform poorly compared with the linear principal components. Reducing dimensions with linear PCA and classifying with a logistic regression model appears adequate for this purpose. Integrating information from multiple data sets using either approach improves classification accuracy for the outcome.
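
The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or their copula-based simulation: the toy data, the RBF kernel, the bandwidth rule gamma = 1/p, the number of components, and all function names below are assumptions made for illustration. Two expression blocks are reduced separately with linear PCA and with kernel PCA, the leading component scores are concatenated (one simple form of integration), and a logistic regression is fitted on the scores.

```python
import numpy as np

def linear_pca_scores(X, k):
    """Scores on the first k linear principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]

def rbf_kernel_pca_scores(X, k, gamma):
    """Scores on the first k kernel principal components, RBF kernel."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * d2)
    n = K.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix
    vals, vecs = np.linalg.eigh(J @ K @ J)     # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]           # keep the k largest
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

def training_accuracy(Z, y, lr=0.1, n_iter=2000):
    """Fit logistic regression by gradient descent; return in-sample accuracy."""
    Zs = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)
    A = np.column_stack([np.ones(len(y)), Zs])
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ w))
        w -= lr * A.T @ (p - y) / len(y)
    return float(np.mean((A @ w > 0) == (y == 1)))

# Hypothetical toy data: one latent factor drives both blocks and the outcome.
rng = np.random.default_rng(0)
n, k = 120, 3
latent = rng.normal(size=n)
y = (latent + 0.5 * rng.normal(size=n) > 0).astype(float)
genes = np.outer(latent, rng.normal(size=200)) + rng.normal(size=(n, 200))
mirna = np.outer(latent, rng.normal(size=50)) + rng.normal(size=(n, 50))

# Reduce each block separately, then integrate by concatenating the scores.
gene_lin = linear_pca_scores(genes, k)
lin_scores = np.hstack([gene_lin, linear_pca_scores(mirna, k)])
ker_scores = np.hstack([
    rbf_kernel_pca_scores(genes, k, gamma=1.0 / genes.shape[1]),
    rbf_kernel_pca_scores(mirna, k, gamma=1.0 / mirna.shape[1]),
])

acc_gene_only = training_accuracy(gene_lin, y)
acc_linear = training_accuracy(lin_scores, y)
acc_kernel = training_accuracy(ker_scores, y)
print("linear PCA, genes only:", acc_gene_only)
print("linear PCA, integrated:", acc_linear)
print("kernel PCA, integrated:", acc_kernel)
```

The accuracies printed here are in-sample and come from an artificial, purely linear toy model, so they say nothing about the paper's results. A faithful comparison would use held-out data or cross-validation and data with the dependence structure the authors simulate.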

Suggested Citation

  • Islam Shofiqul & Anand Sonia & Hamid Jemila & Thabane Lehana & Beyene Joseph, 2017. "Comparing the performance of linear and nonlinear principal components in the context of high-dimensional genomic data integration," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 16(3), pages 199-216, August.
  • Handle: RePEc:bpj:sagmbi:v:16:y:2017:i:3:p:199-216:n:3
    DOI: 10.1515/sagmb-2016-0066

    Download full text from publisher

    File URL: https://doi.org/10.1515/sagmb-2016-0066
    Download Restriction: For access to full text, subscription to the journal or payment for the individual article is required.


    References listed on IDEAS

    1. Xiaobo Guo & Ye Zhang & Wenhao Hu & Haizhu Tan & Xueqin Wang, 2014. "Inferring Nonlinear Gene Regulatory Networks from Gene Expression Data Based on Distance Correlation," PLOS ONE, Public Library of Science, vol. 9(2), pages 1-7, February.
    2. Jessica Minnier & Ming Yuan & Jun S. Liu & Tianxi Cai, 2015. "Risk Classification With an Adaptive Naive Bayes Kernel Machine Model," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(509), pages 393-404, March.
    3. Diana Chang & Alon Keinan, 2014. "Principal Component Analysis Characterizes Shared Pathogenetics from Genome-Wide Association Studies," PLOS Computational Biology, Public Library of Science, vol. 10(9), pages 1-14, September.
    4. Aguilera, Ana M. & Escabias, Manuel & Valderrama, Mariano J., 2006. "Using principal components for estimating logistic regression with high-dimensional multicollinear data," Computational Statistics & Data Analysis, Elsevier, vol. 50(8), pages 1905-1924, April.
    5. Karatzoglou, Alexandros & Smola, Alexandros & Hornik, Kurt & Zeileis, Achim, 2004. "kernlab - An S4 Package for Kernel Methods in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 11(i09).
    6. W. Gibson, 1959. "Three multivariate models: Factor analysis, latent structure analysis, and latent profile analysis," Psychometrika, Springer;The Psychometric Society, vol. 24(3), pages 229-252, September.
