IDEAS home Printed from https://ideas.repec.org/a/spr/jclass/v41y2024i1d10.1007_s00357-024-09463-5.html

Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

Author

Listed:
  • Aboubacry Gaye

    (Laboratory for Studies and Research in Statistics and Development, Gaston Berger University of Saint Louis;
    Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar)

  • Abdou Ka Diongue

    (Laboratory for Studies and Research in Statistics and Development, Gaston Berger University of Saint Louis)

  • Seydou Nourou Sylla

    (Information and Communication Technologies for Development, Alioune Diop University of Bambey)

  • Maryam Diarra

    (Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar)

  • Amadou Diallo

    (Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar)

  • Cheikh Talla

    (Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar)

  • Cheikh Loucoubar

    (Epidemiology, Clinical Research and Data Science Unit, Institut Pasteur de Dakar)

Abstract

This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates in the low-dimensional projection. Block partitioning handles the high correlation in the data by grouping variables into blocks such that the correlation within a block is maximized and the correlation between variables in different blocks is minimized. The extended PCA performs the low-dimensional projection and supervised clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), this method shows good clustering and prediction performance. SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or jointly to affect a phenotype. In this regard, studying the contribution of genetics to infectious disease phenotypes is crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest in complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. We then use supervised approaches, in particular the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the SNPs that determine malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach attains significantly high accuracy in predicting malaria episodes.
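
The general idea of the abstract (group correlated variables into blocks, reduce each block, then classify) can be illustrated with a minimal sketch. This is not the authors' method: the interval graph partitioning is replaced here by a simple greedy split on adjacent-column correlation, the extended PCA by an ordinary first principal component per block, and the classifier by a plain two-class Fisher discriminant. The function names, the `threshold` cutoff, the `ridge` term, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def contiguous_blocks(X, threshold=0.5):
    """Greedily split ordered columns into contiguous blocks.

    A new block starts whenever the absolute correlation between a column
    and its left neighbour drops below `threshold` (a hypothetical cutoff).
    """
    corr = np.corrcoef(X, rowvar=False)
    blocks, current = [], [0]
    for j in range(1, X.shape[1]):
        if abs(corr[j - 1, j]) >= threshold:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks

def block_scores(X, blocks):
    """Summarise each block by its first principal component score."""
    scores = []
    for cols in blocks:
        Z = X[:, cols] - X[:, cols].mean(axis=0)
        # First right singular vector of the centred block = first PC axis.
        _, _, vt = np.linalg.svd(Z, full_matrices=False)
        scores.append(Z @ vt[0])
    return np.column_stack(scores)

def fisher_direction(S, y, ridge=1e-6):
    """Two-class Fisher direction w = Sw^{-1} (mu1 - mu0) and class midpoint."""
    S0, S1 = S[y == 0], S[y == 1]
    mu0, mu1 = S0.mean(axis=0), S1.mean(axis=0)
    # Pooled within-class scatter, with a small ridge for numerical stability.
    Sw = (np.atleast_2d(np.cov(S0, rowvar=False)) * (len(S0) - 1)
          + np.atleast_2d(np.cov(S1, rowvar=False)) * (len(S1) - 1))
    Sw = Sw + ridge * np.eye(S.shape[1])
    return np.linalg.solve(Sw, mu1 - mu0), (mu0 + mu1) / 2

# Synthetic example: two blocks of three correlated columns each;
# only the first block carries class signal.
rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)
f1 = rng.normal(size=n) + 2.0 * y   # latent factor related to the class
f2 = rng.normal(size=n)             # latent factor unrelated to the class
X = np.column_stack(
    [f1 + 0.3 * rng.normal(size=n) for _ in range(3)]
    + [f2 + 0.3 * rng.normal(size=n) for _ in range(3)]
)

blocks = contiguous_blocks(X, threshold=0.5)
S = block_scores(X, blocks)
w, mid = fisher_direction(S, y)
pred = ((S - mid) @ w > 0).astype(int)
accuracy = (pred == y).mean()
print(blocks, round(float(accuracy), 3))
```

The accuracy printed here is measured on the training data, so it only confirms that the pipeline runs end to end; any real evaluation would need held-out samples. With hundreds of thousands of SNPs, computing the full correlation matrix as above would be impractical, and a windowed computation would be needed instead.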

Suggested Citation

  • Aboubacry Gaye & Abdou Ka Diongue & Seydou Nourou Sylla & Maryam Diarra & Amadou Diallo & Cheikh Talla & Cheikh Loucoubar, 2024. "Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data," Journal of Classification, Springer; The Classification Society, vol. 41(1), pages 158-169, March.
  • Handle: RePEc:spr:jclass:v:41:y:2024:i:1:d:10.1007_s00357-024-09463-5
    DOI: 10.1007/s00357-024-09463-5

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00357-024-09463-5
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.