
Data reduction prior to inference: Are there consequences of comparing groups using a t‐test based on principal component scores?

Author

Listed:
  • Edward J. Bedrick

Abstract

Researchers often use a two‐step process to analyze multivariate data. First, dimensionality is reduced using a technique such as principal component analysis, followed by a group comparison using a t‐test or analysis of variance. Although this practice is often discouraged, the statistical properties of this procedure are not well understood, starting with the hypothesis being tested. We suggest that this approach might be considering two distinct hypotheses, one of which is a global test of no differences in the mean vectors, and the other being a focused test of a specific linear combination where the coefficients have been estimated from the data. We study the asymptotic properties of the two‐sample t‐statistic for these two scenarios, assuming a nonsparse setting. We show that the size of the global test agrees with the presumed level but that the test has poor power. In contrast, the size of the focused test can be arbitrarily distorted with certain mean and covariance structures. A simple method is provided to correct the size of the focused test. Data analyses and simulations are used to illustrate the results. Recommendations on the use of this two‐step method and the related use of principal components for prediction are provided.
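The two-step procedure described in the abstract can be sketched in a few lines. The block below is only an illustration under stated assumptions, not the paper's analysis: it assumes the principal components come from the covariance of the combined sample with group labels ignored, that only the first component is tested, and that the p-value uses a large-sample normal approximation. The function names (`first_pc_scores`, `pooled_t_statistic`, `two_step_test`, `rejection_rate`) and the simulated setting (identity covariance, equal means) are hypothetical choices made for this sketch.

```python
import math

import numpy as np


def first_pc_scores(X):
    """Scores on the leading principal component of the combined sample.

    Assumption: PCA is run on all rows of X with group labels ignored.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues are returned in ascending order
    leading = eigvecs[:, -1]          # eigenvector of the largest eigenvalue
    return Xc @ leading


def pooled_t_statistic(a, b):
    """Two-sample t-statistic with a pooled variance estimate."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def two_step_test(X, groups):
    """Step 1: reduce to first-PC scores. Step 2: compare the groups with a t-statistic."""
    scores = first_pc_scores(X)
    t = pooled_t_statistic(scores[groups == 0], scores[groups == 1])
    # Two-sided p-value from the standard normal (large-sample approximation).
    p_value = math.erfc(abs(t) / math.sqrt(2))
    return t, p_value


def rejection_rate(n_per_group=50, p=5, n_sim=500, alpha=0.05, seed=0):
    """Monte Carlo rejection rate when both groups share the same mean vector."""
    rng = np.random.default_rng(seed)
    groups = np.repeat([0, 1], n_per_group)
    rejections = 0
    for _ in range(n_sim):
        X = rng.standard_normal((2 * n_per_group, p))  # hypothetical null setting
        _, p_value = two_step_test(X, groups)
        rejections += p_value < alpha
    return rejections / n_sim


if __name__ == "__main__":
    print("Estimated size under equal means:", rejection_rate())
```

Because the sign of an eigenvector is arbitrary, the sign of the t-statistic carries no meaning here; only its magnitude and the two-sided p-value do. Adding a mean shift to one group inside `rejection_rate` would give a rough look at power in the same setting.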

Suggested Citation

  • Edward J. Bedrick, 2020. "Data reduction prior to inference: Are there consequences of comparing groups using a t‐test based on principal component scores?," Biometrics, The International Biometric Society, vol. 76(2), pages 508-517, June.
  • Handle: RePEc:bla:biomet:v:76:y:2020:i:2:p:508-517
    DOI: 10.1111/biom.13159

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13159
    Download Restriction: no

    References listed on IDEAS

    1. Wei‐Chien Chang, 1983. "On Using Principal Components before Separating a Mixture of Two Multivariate Normal Distributions," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 32(3), pages 267-275, November.
    2. Shelley Edwards & Bieke Vanhooydonck & Anthony Herrel & G John Measey & Krystal A Tolley, 2012. "Convergent Evolution Associated with Habitat Decouples Phenotype from Phylogeny in a Clade of Lizards," PLOS ONE, Public Library of Science, vol. 7(12), pages 1-9, December.
    3. Roger S. Zoh & Abhra Sarkar & Raymond J. Carroll & Bani K. Mallick, 2018. "A Powerful Bayesian Test for Equality of Means in High Dimensions," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 113(524), pages 1733-1741, October.
    4. Kollo, T. & Neudecker, H., 1993. "Asymptotics of Eigenvalues and Unit-Length Eigenvectors of Sample Variance and Correlation Matrices," Journal of Multivariate Analysis, Elsevier, vol. 47(2), pages 283-300, November.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Wang, Zihan & Daeipour, Mohamad & Xu, Hongyi, 2023. "Quantification and propagation of Aleatoric uncertainties in topological structures," Reliability Engineering and System Safety, Elsevier, vol. 233(C).
    2. Ahlquist, John S. & Breunig, Christian, 2009. "Country clustering in comparative political economy," MPIfG Discussion Paper 09/5, Max Planck Institute for the Study of Societies.
    3. McLachlan, G. J. & Peel, D. & Bean, R. W., 2003. "Modelling high-dimensional data by mixtures of factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 41(3-4), pages 379-388, January.
    4. Liu, Shuangzhe & Leiva, Víctor & Zhuang, Dan & Ma, Tiefeng & Figueroa-Zúñiga, Jorge I., 2022. "Matrix differential calculus with applications in the multivariate linear model and its diagnostics," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    5. Dirk Depril & Iven Mechelen & Tom Wilderjans, 2012. "Lowdimensional Additive Overlapping Clustering," Journal of Classification, Springer;The Classification Society, vol. 29(3), pages 297-320, October.
    6. Michael C. Thrun & Alfred Ultsch, 2021. "Using Projection-Based Clustering to Find Distance- and Density-Based Clusters in High-Dimensional Data," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 280-312, July.
    7. Andrews, Jeffrey L., 2018. "Addressing overfitting and underfitting in Gaussian model-based clustering," Computational Statistics & Data Analysis, Elsevier, vol. 127(C), pages 160-171.
    8. Douglas Steinley & Lawrence Hubert, 2008. "Order-Constrained Solutions in K-Means Clustering: Even Better Than Being Globally Optimal," Psychometrika, Springer;The Psychometric Society, vol. 73(4), pages 647-664, December.
    9. Bouveyron, Charles & Brunet, Camille, 2012. "Theoretical and practical considerations on the convergence properties of the Fisher-EM algorithm," Journal of Multivariate Analysis, Elsevier, vol. 109(C), pages 29-41.
    10. Floriello, Davide & Vitelli, Valeria, 2017. "Sparse clustering of functional data," Journal of Multivariate Analysis, Elsevier, vol. 154(C), pages 1-18.
    11. Bauer, Jan O. & Drabant, Bernhard, 2021. "Principal loading analysis," Journal of Multivariate Analysis, Elsevier, vol. 184(C).
    12. repec:jss:jstsof:47:i05 is not listed on IDEAS
    13. Boik, Robert J., 2013. "Model-based principal components of correlation matrices," Journal of Multivariate Analysis, Elsevier, vol. 116(C), pages 310-331.
    14. Bauer, Jan O. & Drabant, Bernhard, 2023. "Regression based thresholds in principal loading analysis," Journal of Multivariate Analysis, Elsevier, vol. 193(C).
    15. Neudecker, Heinz & Satorra, Albert, 1996. "The algebraic equality of two asymptotic tests for the hypothesis that a normal distribution has a specified correlation matrix," Statistics & Probability Letters, Elsevier, vol. 30(2), pages 99-103, October.
    16. Steland, Ansgar & von Sachs, Rainer, 2018. "Asymptotics for high-dimensional covariance matrices and quadratic forms with applications to the trace functional and shrinkage," Stochastic Processes and their Applications, Elsevier, vol. 128(8), pages 2816-2855.
    17. Yoshikazu Terada, 2015. "Strong consistency of factorial K-means clustering," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 67(2), pages 335-357, April.
    18. Yuanyuan Jiang & Xingzhong Xu, 2022. "A Two-Sample Test of High Dimensional Means Based on Posterior Bayes Factor," Mathematics, MDPI, vol. 10(10), pages 1-23, May.
    19. Marc Hallin & Davy Paindaveine & Thomas Verdebout, 2009. "Optimal rank-based testing for principal component," Working Papers ECARES 2009_013, ULB -- Universite Libre de Bruxelles.
    20. Aaron Fisher & Brian Caffo & Brian Schwartz & Vadim Zipunnikov, 2016. "Fast, Exact Bootstrap Principal Component Analysis for p > 1 Million," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(514), pages 846-860, April.
    21. Shuangge Ma & Michael R. Kosorok & Jason P. Fine, 2006. "Additive Risk Models for Survival Data with High-Dimensional Covariates," Biometrics, The International Biometric Society, vol. 62(1), pages 202-210, March.
