
AC-PCoA: Adjustment for confounding factors using principal coordinate analysis

Author

Listed:
  • Yu Wang
  • Fengzhu Sun
  • Wei Lin
  • Shuqin Zhang

Abstract

Confounding factors exist widely in biological data owing to technical variation, population structure, and experimental conditions. Such factors may mask the true signals and lead to spurious associations, making it necessary to adjust for them. However, existing confounder correction methods were mainly developed based on the original data or the pairwise Euclidean distance, either of which is inadequate for analyzing different types of data, such as sequencing data.

In this work, we proposed a method called Adjustment for Confounding factors using Principal Coordinate Analysis, or AC-PCoA, which reduces data dimension and extracts the information from different distance measures using principal coordinate analysis, and adjusts for confounding factors across multiple datasets by minimizing the associations between lower-dimensional representations and confounding variables. Application of the proposed method was further extended to classification and prediction. We demonstrated the efficacy of AC-PCoA on three simulated datasets and five real datasets. Compared to the existing methods, AC-PCoA shows better results in visualization, statistical testing, clustering, and classification.

Author summary: With today’s unprecedented amount of data, researchers are challenged by the need to enhance meaningful signals without the interference of unwanted confounders hidden inside the data. Data visualization is an important step toward exploring and explaining data in order to intuitively identify the dominant patterns. Principal coordinate analysis (PCoA), as a visualization tool, allows flexible ways to define pairwise distances and project the samples into lower dimensions without changing the distances. However, when visualizing large-scale biological datasets, the true patterns are often obscured by unwanted confounding variation, either biological or technical in origin. To eliminate these confounding factors and recover underlying signals, we proposed a method called Adjustment for Confounding factors using Principal Coordinate Analysis, or AC-PCoA, and showed that it significantly outperforms existing methods in visualization through three simulation studies and five real datasets. We further showed that the low-dimensional representations given by AC-PCoA provide promising results in statistical testing, clustering, and classification as well.
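
The abstract describes two steps: embedding samples from a distance matrix with PCoA, then choosing low-dimensional directions that carry little association with the confounders. The following is a minimal sketch of that idea, not the authors' implementation. Classical PCoA is standard; the adjustment step assumes a linear kernel on the confounders and a penalized eigenproblem with a tuning weight `lam`. The penalty form and the names `pcoa`, `adjust_confounders`, and `lam` are illustrative assumptions and do not come from this page.

```python
import numpy as np


def pcoa(dist, n_components=2):
    """Classical PCoA: coordinates from a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances: B = -1/2 * J D^2 J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero
    return vecs * np.sqrt(np.clip(vals, 0.0, None))


def adjust_confounders(coords, confounders, lam=1.0, n_components=2):
    """Assumed adjustment step (hypothetical, for illustration only).

    Finds directions v maximizing v'X'Xv - lam * v'X'KXv, where K is a
    linear kernel on the centred confounders, then projects X onto them.
    """
    x = np.asarray(coords, dtype=float)
    y = np.asarray(confounders, dtype=float)
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    xty = x.T @ y                          # cross-products with confounders
    m = x.T @ x - lam * (xty @ xty.T)      # X'X - lam * X'KX with K = YY'
    m = (m + m.T) / 2.0                    # guard against round-off asymmetry
    vals, vecs = np.linalg.eigh(m)
    top = np.argsort(vals)[::-1][:n_components]
    return x @ vecs[:, top]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = np.tile([-1.0, -1.0, 1.0, 1.0], 10)   # group of interest
    batch = np.tile([-5.0, 5.0, -5.0, 5.0], 10)    # dominant confounder
    pts = np.column_stack([signal, batch, np.zeros(40)])
    pts += rng.normal(scale=0.1, size=pts.shape)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

    raw = pcoa(dist, n_components=3)
    adj = adjust_confounders(raw, batch.reshape(-1, 1), lam=1.0)
    print("raw axis 1 vs batch:", abs(np.corrcoef(raw[:, 0], batch)[0, 1]))
    print("adjusted axis 1 vs batch:", abs(np.corrcoef(adj[:, 0], batch)[0, 1]))
```

In this toy setup the batch shift dominates the first unadjusted axis, while the penalized directions favor the weaker group signal. Any distance matrix (for example Bray-Curtis on sequencing counts) could be passed to `pcoa` in place of the Euclidean one used here; how `lam` should be chosen is left open in this sketch.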

Suggested Citation

  • Yu Wang & Fengzhu Sun & Wei Lin & Shuqin Zhang, 2022. "AC-PCoA: Adjustment for confounding factors using principal coordinate analysis," PLOS Computational Biology, Public Library of Science, vol. 18(7), pages 1-21, July.
  • Handle: RePEc:plo:pcbi00:1010184
    DOI: 10.1371/journal.pcbi.1010184

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010184
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1010184&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pcbi.1010184?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Jeffrey T Leek & John D Storey, 2007. "Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis," PLOS Genetics, Public Library of Science, vol. 3(9), pages 1-12, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Arjun Bhattacharya & Anastasia N. Freedman & Vennela Avula & Rebeca Harris & Weifang Liu & Calvin Pan & Aldons J. Lusis & Robert M. Joseph & Lisa Smeester & Hadley J. Hartwell & Karl C. K. Kuban & Car, 2022. "Placental genomics mediates genetic associations with complex health traits and disease," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    2. repec:jss:jstsof:40:i14 is not listed on IDEAS
    3. Lily Monnier & Paul-Henry Cournède, 2024. "A novel batch-effect correction method for scRNA-seq data based on Adversarial Information Factorization," PLOS Computational Biology, Public Library of Science, vol. 20(2), pages 1-22, February.
    4. Wesley L Crouse & Gregory R Keele & Madeleine S Gastonguay & Gary A Churchill & William Valdar, 2022. "A Bayesian model selection approach to mediation analysis," PLOS Genetics, Public Library of Science, vol. 18(5), pages 1-33, May.
    5. Won Jun Lee & Sang Cheol Kim & Jung-Ho Yoon & Sang Jun Yoon & Johan Lim & You-Sun Kim & Sung Won Kwon & Jeong Hill Park, 2016. "Meta-Analysis of Tumor Stem-Like Breast Cancer Cells Using Gene Set and Network Analysis," PLOS ONE, Public Library of Science, vol. 11(2), pages 1-20, February.
    6. repec:plo:pgen00:1002078 is not listed on IDEAS
    7. Emanuele Aliverti & Kristian Lum & James E. Johndrow & David B. Dunson, 2021. "Removing the influence of group variables in high‐dimensional predictive modelling," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(3), pages 791-811, July.
    8. Marron, J.S., 2017. "Big Data in context and robustness against heterogeneity," Econometrics and Statistics, Elsevier, vol. 2(C), pages 73-80.
    9. Seungchul Baek & Yen‐Yi Ho & Yanyuan Ma, 2020. "Using sufficient direction factor model to analyze latent activities associated with breast cancer survival," Biometrics, The International Biometric Society, vol. 76(4), pages 1340-1350, December.
    10. Griffin, Maryclare & Hoff, Peter D., 2019. "Lasso ANOVA decompositions for matrix and tensor data," Computational Statistics & Data Analysis, Elsevier, vol. 137(C), pages 181-194.
    11. Yunfeng Li & Jarrett Morrow & Benjamin Raby & Kelan Tantisira & Scott T Weiss & Wei Huang & Weiliang Qiu, 2017. "Detecting disease-associated genomic outcomes using constrained mixture of Bayesian hierarchical models for paired data," PLOS ONE, Public Library of Science, vol. 12(3), pages 1-16, March.
    12. Zhaohui Qin & Ben Li & Karen N. Conneely & Hao Wu & Ming Hu & Deepak Ayyala & Yongseok Park & Victor X. Jin & Fangyuan Zhang & Han Zhang & Li Li & Shili Lin, 2016. "Statistical Challenges in Analyzing Methylation and Long-Range Chromosomal Interaction Data," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 8(2), pages 284-309, October.
    13. Delaram Pouyabahar & Tallulah Andrews & Gary D. Bader, 2025. "Interpretable single-cell factor decomposition using sciRED," Nature Communications, Nature, vol. 16(1), pages 1-16, December.
    14. Zemin Zheng & Jinchi Lv & Wei Lin, 2021. "Nonsparse Learning with Latent Variables," Operations Research, INFORMS, vol. 69(1), pages 346-359, January.
    15. Chee Ho H’ng & Shanika L. Amarasinghe & Boya Zhang & Hojin Chang & Xinli Qu & David R. Powell & Alberto Rosello-Diez, 2024. "Compensatory growth and recovery of cartilage cytoarchitecture after transient cell death in fetal mouse limbs," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    16. Mark Reimers, 2010. "Making Informed Choices about Microarray Data Analysis," PLOS Computational Biology, Public Library of Science, vol. 6(5), pages 1-7, May.
    17. Leek Jeffrey T & Storey John D., 2011. "The Joint Null Criterion for Multiple Hypothesis Tests," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 10(1), pages 1-22, June.
    18. Arezo Torang & Aleksandar B. Kirov & Veerle Lammers & Kate Cameron & Valérie M. Wouters & Rene F. Jackstadt & Tamsin R. M. Lannagan & Joan H. Jong & Jan Koster & Owen Sansom & Jan Paul Medema, 2025. "Enterocyte-like differentiation defines metabolic gene signatures of CMS3 colorectal cancers and provides therapeutic vulnerability," Nature Communications, Nature, vol. 16(1), pages 1-16, December.
    19. Christos Miliotis & Yuling Ma & Xanthi-Lida Katopodi & Dimitra Karagkouni & Eleni Kanata & Kaia Mattioli & Nikolas Kalavros & Yered H. Pita-Juárez & Felipe Batalini & Varune R. Ramnarine & Shivani Nan, 2024. "Determinants of gastric cancer immune escape identified from non-coding immune-landscape quantitative trait loci," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    20. Nicoló Fusi & Oliver Stegle & Neil D Lawrence, 2012. "Joint Modelling of Confounding Factors and Prominent Genetic Regulators Provides Increased Accuracy in Genetical Genomics Studies," PLOS Computational Biology, Public Library of Science, vol. 8(1), pages 1-9, January.
    21. Jin Hyun Ju & Sushila A Shenoy & Ronald G Crystal & Jason G Mezey, 2017. "An independent component analysis confounding factor correction framework for identifying broad impact expression quantitative trait loci," PLOS Computational Biology, Public Library of Science, vol. 13(5), pages 1-26, May.
    22. Miecznikowski, Jeffrey C. & Gold, David & Shepherd, Lori & Liu, Song, 2011. "Deriving and comparing the distribution for the number of false positives in single step methods to control k-FWER," Statistics & Probability Letters, Elsevier, vol. 81(11), pages 1695-1705, November.
