
Asymptotic performance of PCA for high-dimensional heteroscedastic data

Author

Listed:
  • Hong, David
  • Balzano, Laura
  • Fessler, Jeffrey A.

Abstract

Principal Component Analysis (PCA) is a classical method for reducing the dimensionality of data by projecting them onto a subspace that captures most of their variation. Effective use of PCA in modern applications requires understanding its performance for data that are both high-dimensional and heteroscedastic. This paper analyzes the statistical performance of PCA in this setting, i.e., for high-dimensional data drawn from a low-dimensional subspace and degraded by heteroscedastic noise. We provide simplified expressions for the asymptotic PCA recovery of the underlying subspace, subspace amplitudes and subspace coefficients; the expressions enable both easy and efficient calculation and reasoning about the performance of PCA. We exploit the structure of these expressions to show that, for a fixed average noise variance, the asymptotic recovery of PCA for heteroscedastic data is always worse than that for homoscedastic data (i.e., for noise variances that are equal across samples). Hence, while average noise variance is often a practically convenient measure for the overall quality of data, it gives an overly optimistic estimate of the performance of PCA for heteroscedastic data.
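
The comparison described in the abstract can be explored numerically. The following is a minimal simulation sketch, not the authors' analysis or code. It assumes a rank-one model in which each sample is a scaled copy of a single unit-norm component plus Gaussian noise whose variance depends on the sample. The function name `pca_recovery`, the dimensions, the amplitude `theta`, and the noise variances 0.2 and 1.8 are all illustrative assumptions chosen so that both settings share an average noise variance of 1.

```python
import numpy as np


def pca_recovery(noise_vars, d=200, theta=1.0, seed=0):
    """Return |<u_hat, u>|^2 for one simulated rank-one data set.

    noise_vars[i] is the noise variance of sample i. The model and all
    parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(noise_vars)

    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)           # true unit-norm component
    z = rng.standard_normal(n)       # one coefficient per sample

    # Each column (sample) gets its own noise standard deviation.
    noise = rng.standard_normal((d, n)) * np.sqrt(noise_vars)
    Y = theta * np.outer(u, z) + noise

    # The leading left singular vector of Y is the first principal
    # component (uncentered PCA, since the model is zero-mean).
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return float(np.dot(U[:, 0], u) ** 2)


n = 1000
homo = np.full(n, 1.0)                                   # equal variances
hetero = np.concatenate([np.full(n // 2, 0.2),           # low-noise half
                         np.full(n // 2, 1.8)])          # high-noise half
assert np.isclose(homo.mean(), hetero.mean())            # same average variance

r_homo = pca_recovery(homo)
r_hetero = pca_recovery(hetero)
print(f"homoscedastic recovery:   {r_homo:.3f}")
print(f"heteroscedastic recovery: {r_hetero:.3f}")
```

Both calls use the same seed, so the two runs share the same component, coefficients, and unscaled noise draws; only the per-sample noise scaling differs. A single finite-size run like this only illustrates the direction of the effect; the paper's result concerns the asymptotic limit.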

Suggested Citation

  • Hong, David & Balzano, Laura & Fessler, Jeffrey A., 2018. "Asymptotic performance of PCA for high-dimensional heteroscedastic data," Journal of Multivariate Analysis, Elsevier, vol. 167(C), pages 435-452.
  • Handle: RePEc:eee:jmvana:v:167:y:2018:i:c:p:435-452
    DOI: 10.1016/j.jmva.2018.06.002

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0047259X17304852
    Download Restriction: Full text for ScienceDirect subscribers only

    References listed on IDEAS

    1. Johnstone, Iain M. & Lu, Arthur Yu, 2009. "On Consistency and Sparsity for Principal Components Analysis in High Dimensions," Journal of the American Statistical Association, American Statistical Association, vol. 104(486), pages 682-693.
    2. Bai, Zhidong & Yao, Jianfeng, 2012. "On sample eigenvalues in a generalized spiked population model," Journal of Multivariate Analysis, Elsevier, vol. 106(C), pages 167-177.
    3. Michael E. Tipping & Christopher M. Bishop, 1999. "Probabilistic Principal Component Analysis," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 61(3), pages 611-622.
    4. Croux, Christophe & Ruiz-Gazen, Anne, 2005. "High breakdown estimators for principal components: the projection-pursuit approach revisited," Journal of Multivariate Analysis, Elsevier, vol. 95(1), pages 206-226, July.
    5. Benaych-Georges, Florent & Nadakuditi, Raj Rao, 2012. "The singular values and vectors of low rank perturbations of large rectangular random matrices," Journal of Multivariate Analysis, Elsevier, vol. 111(C), pages 120-135.
    6. Pan, Guangming, 2010. "Strong convergence of the empirical distribution of eigenvalues of sample covariance matrices with a perturbation matrix," Journal of Multivariate Analysis, Elsevier, vol. 101(6), pages 1330-1338, July.
    7. Jeffrey T. Leek, 2011. "Asymptotic Conditional Singular Value Decomposition for High-Dimensional Genomic Data," Biometrics, The International Biometric Society, vol. 67(2), pages 344-352, June.

    Citations

    Cited by:

    1. Adar, Mustapha & Najih, Youssef & Gouskir, Mohamed & Chebak, Ahmed & Mabrouki, Mustapha & Bennouna, Amin, 2020. "Three PV plants performance analysis using the principal component analysis method," Energy, Elsevier, vol. 207(C).
    2. Leeb, William, 2021. "A note on identifiability conditions in confirmatory factor analysis," Statistics & Probability Letters, Elsevier, vol. 178(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Barigozzi, Matteo & Trapani, Lorenzo, 2020. "Sequential testing for structural stability in approximate factor models," Stochastic Processes and their Applications, Elsevier, vol. 130(8), pages 5149-5187.
    2. Xinyi Zhong & Chang Su & Zhou Fan, 2022. "Empirical Bayes PCA in high dimensions," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 84(3), pages 853-878, July.
    3. Wang, Shao-Hsuan & Huang, Su-Yun, 2022. "Perturbation theory for cross data matrix-based PCA," Journal of Multivariate Analysis, Elsevier, vol. 190(C).
    4. Benaych-Georges, Florent & Nadakuditi, Raj Rao, 2012. "The singular values and vectors of low rank perturbations of large rectangular random matrices," Journal of Multivariate Analysis, Elsevier, vol. 111(C), pages 120-135.
    5. Ding, Xiucai & Ji, Hong Chang, 2023. "Spiked multiplicative random matrices and principal components," Stochastic Processes and their Applications, Elsevier, vol. 163(C), pages 25-60.
    6. Landgraf, Andrew J. & Lee, Yoonkyung, 2020. "Dimensionality reduction for binary data through the projection of natural parameters," Journal of Multivariate Analysis, Elsevier, vol. 180(C).
    7. Anna Bykhovskaya & Vadim Gorin, 2023. "High-Dimensional Canonical Correlation Analysis," Papers 2306.16393, arXiv.org, revised Aug 2023.
    8. Dey, Rounak & Lee, Seunggeun, 2019. "Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model," Journal of Multivariate Analysis, Elsevier, vol. 173(C), pages 145-164.
    9. Yi-Hao Kao & Benjamin Van Roy, 2014. "Directed Principal Component Analysis," Operations Research, INFORMS, vol. 62(4), pages 957-972, August.
    10. Puyi Fang & Zhaoxing Gao & Ruey S. Tsay, 2023. "Determination of the effective cointegration rank in high-dimensional time-series predictive regressions," Papers 2304.12134, arXiv.org, revised Apr 2023.
    11. Wang, Zihan & Daeipour, Mohamad & Xu, Hongyi, 2023. "Quantification and propagation of Aleatoric uncertainties in topological structures," Reliability Engineering and System Safety, Elsevier, vol. 233(C).
    12. Candelon, B. & Hurlin, C. & Tokpavi, S., 2012. "Sampling error and double shrinkage estimation of minimum variance portfolios," Journal of Empirical Finance, Elsevier, vol. 19(4), pages 511-527.
    13. Chen, Jiaqi & Zhang, Yangchun & Li, Weiming & Tian, Boping, 2018. "A supplement on CLT for LSS under a large dimensional generalized spiked covariance model," Statistics & Probability Letters, Elsevier, vol. 138(C), pages 57-65.
    14. Fan, Jianqing & Jiang, Bai & Sun, Qiang, 2022. "Bayesian factor-adjusted sparse regression," Journal of Econometrics, Elsevier, vol. 230(1), pages 3-19.
    15. Xin Xu & Yang Lu & Yupeng Zhou & Zhiguo Fu & Yanjie Fu & Minghao Yin, 2021. "An Information-Explainable Random Walk Based Unsupervised Network Representation Learning Framework on Node Classification Tasks," Mathematics, MDPI, vol. 9(15), pages 1-14, July.
    16. Yata, Kazuyoshi & Aoshima, Makoto, 2013. "PCA consistency for the power spiked model in high-dimensional settings," Journal of Multivariate Analysis, Elsevier, vol. 122(C), pages 334-354.
    17. Asai, Manabu & McAleer, Michael, 2015. "Forecasting co-volatilities via factor models with asymmetry and long memory in realized covariance," Journal of Econometrics, Elsevier, vol. 189(2), pages 251-262.
    18. Dorota Toczydlowska & Gareth W. Peters & Man Chung Fung & Pavel V. Shevchenko, 2017. "Stochastic Period and Cohort Effect State-Space Mortality Models Incorporating Demographic Factors via Probabilistic Robust Principal Components," Risks, MDPI, vol. 5(3), pages 1-77, July.
    19. Matteo Barigozzi & Marc Hallin, 2023. "Dynamic Factor Models: a Genealogy," Papers 2310.17278, arXiv.org, revised Jan 2024.
    20. Jiménez Recaredo, Raúl José & Elías Fernández, Antonio, 2017. "Prediction Bands for Functional Data Based on Depth Measures," DES - Working Papers. Statistics and Econometrics. WS 24606, Universidad Carlos III de Madrid. Departamento de Estadística.
