
Principal component analysis

Author

Listed:
  • Michael Greenacre
  • Patrick J. F. Groenen
  • Trevor Hastie
  • Alfonso Iodice d’Enza
  • Angelos Markos
  • Elena Tuzhilina

Abstract

Principal component analysis is a versatile statistical method for reducing a cases-by-variables data table to its essential features, called principal components. Principal components are a few linear combinations of the original variables that maximally explain the variance of all the variables. In the process, the method provides an approximation of the original data table using only these few major components. This review gives a comprehensive account of the method's definition and geometry, as well as the interpretation of its numerical and graphical results. The main graphical result is often a biplot, which uses the major components to map the cases and adds the original variables to support the distance interpretation of the cases' positions. Variants of the method are also treated, such as the analysis of grouped data and the analysis of categorical data, known as correspondence analysis. We also describe and illustrate the latest innovative applications of principal component analysis: its use for estimating missing values in huge data matrices, sparse component estimation, and the analysis of images, shapes and functions. Supplementary material includes video animations and computer scripts in the R environment.
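
As an illustration of the definition in the abstract, the following is a minimal sketch, not the authors' code (their supplementary scripts are in R). It finds the first principal component of a small made-up data table by power iteration on the covariance matrix, then uses it to approximate the table. The data values, the variable names and the choice of power iteration are all assumptions made for this example.

```python
import math

# Toy cases-by-variables table (hypothetical values, not from the paper):
# 10 cases (rows) measured on 2 variables (columns).
data = [
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
]
n, p = len(data), len(data[0])

# 1. Center each variable so that it has mean zero.
means = [sum(row[j] for row in data) / n for j in range(p)]
centered = [[row[j] - means[j] for j in range(p)] for row in data]

# 2. Sample covariance matrix of the variables (p x p).
cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
       for i in range(p)]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# 3. Power iteration: repeatedly multiply by the covariance matrix and
#    rescale to unit length. The vector converges to the coefficients of
#    the linear combination with maximal variance, provided the starting
#    vector is not orthogonal to that direction (assumed here).
direction = [1.0 / math.sqrt(p)] * p
for _ in range(200):
    w = mat_vec(cov, direction)
    norm = math.sqrt(sum(x * x for x in w))
    direction = [x / norm for x in w]

# 4. Variance of the first component and its share of the total variance.
variance_pc1 = sum(d * w for d, w in zip(direction, mat_vec(cov, direction)))
total_variance = sum(cov[j][j] for j in range(p))
explained = variance_pc1 / total_variance

# 5. Scores: the value of the first principal component for each case.
scores = [sum(c * d for c, d in zip(row, direction)) for row in centered]

# 6. Rank-one approximation of the original table from this one component.
approx = [[means[j] + s * direction[j] for j in range(p)] for s in scores]

print("first component coefficients:", direction)
print(f"share of total variance explained: {explained:.3f}")
```

In practice one would more likely obtain all components at once from a singular value decomposition of the centered table; power iteration is used here only to keep the sketch free of external libraries.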

Suggested Citation

  • Michael Greenacre & Patrick J. F. Groenen & Trevor Hastie & Alfonso Iodice d’Enza & Angelos Markos & Elena Tuzhilina, 2023. "Principal component analysis," Economics Working Papers 1856, Department of Economics and Business, Universitat Pompeu Fabra.
  • Handle: RePEc:upf:upfgen:1856

    Download full text from publisher

    File URL: https://econ-papers.upf.edu/papers/1856.pdf
    File Function: Whole Paper
    Download Restriction: no


    Citations


    Cited by:

    1. Kin Sibanda & Alungile Qoko & Dorcas Gonese, 2024. "Health Expenditure, Institutional Quality, and Under-Five Mortality in Sub-Saharan African Countries," IJERPH, MDPI, vol. 21(3), pages 1-23, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Rosember Guerra-Urzola & Niek C. Schipper & Anya Tonne & Klaas Sijtsma & Juan C. Vera & Katrijn Deun, 2023. "Sparsifying the least-squares approach to PCA: comparison of lasso and cardinality constraint," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(1), pages 269-286, March.
    2. Nerea González-García & Ana Belén Nieto-Librero & Purificación Galindo-Villardón, 2023. "CenetBiplot: a new proposal of sparse and orthogonal biplots methods by means of elastic net CSVD," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(1), pages 5-19, March.
    3. Rosember Guerra-Urzola & Katrijn Van Deun & Juan C. Vera & Klaas Sijtsma, 2021. "A Guide for Sparse PCA: Model Comparison and Applications," Psychometrika, Springer;The Psychometric Society, vol. 86(4), pages 893-919, December.
    4. Guerra Urzola, Rosember & Van Deun, Katrijn & Vera, J. C. & Sijtsma, K., 2021. "A guide for sparse PCA : Model comparison and applications," Other publications TiSEM 4d35b931-7f49-444b-b92f-a, Tilburg University, School of Economics and Management.
    5. Kim, Hyun Hak & Swanson, Norman R., 2018. "Mining big data using parsimonious factor, machine learning, variable selection and shrinkage methods," International Journal of Forecasting, Elsevier, vol. 34(2), pages 339-354.
    6. Mihee Lee & Haipeng Shen & Jianhua Z. Huang & J. S. Marron, 2010. "Biclustering via Sparse Singular Value Decomposition," Biometrics, The International Biometric Society, vol. 66(4), pages 1087-1095, December.
    7. Amir Beck & Yakov Vaisbourd, 2016. "The Sparse Principal Component Analysis Problem: Optimality Conditions and Algorithms," Journal of Optimization Theory and Applications, Springer, vol. 170(1), pages 119-143, July.
    8. Maria Iannario & Rosaria Romano & Domenico Vistocco, 2023. "Dyadic analysis for multi-block data in sport surveys analytics," Annals of Operations Research, Springer, vol. 325(1), pages 701-714, June.
    9. Ali Mahzarnia & Jun Song, 2022. "Multivariate functional group sparse regression: Functional predictor selection," PLOS ONE, Public Library of Science, vol. 17(4), pages 1-22, April.
    10. Jolliffe, Ian, 2022. "A 50-year personal journey through time with principal component analysis," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    11. Mitzi Cubilla-Montilla & Ana Belén Nieto-Librero & M. Purificación Galindo-Villardón & Carlos A. Torres-Cubilla, 2021. "Sparse HJ Biplot: A New Methodology via Elastic Net," Mathematics, MDPI, vol. 9(11), pages 1-15, June.
    12. Thomas Despois & Catherine Doz, 2023. "Identifying and interpreting the factors in factor models via sparsity: Different approaches," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 38(4), pages 533-555, June.
    13. Jin-Xing Liu & Yong Xu & Chun-Hou Zheng & Yi Wang & Jing-Yu Yang, 2012. "Characteristic Gene Selection via Weighting Principal Components by Singular Values," PLOS ONE, Public Library of Science, vol. 7(7), pages 1-10, July.
    14. Paolo Fornaro & Henri Luomaranta, 2020. "Nowcasting Finnish real economic activity: a machine learning approach," Empirical Economics, Springer, vol. 58(1), pages 55-71, January.
    15. Seán Schmitz & Sophia Becker & Laura Weiand & Norman Niehoff & Frank Schwartzbach & Erika von Schneidemesser, 2019. "Determinants of Public Acceptance for Traffic-Reducing Policies to Improve Urban Air Quality," Sustainability, MDPI, vol. 11(14), pages 1-16, July.
    16. Merola, Giovanni Maria & Chen, Gemai, 2019. "Projection sparse principal component analysis: An efficient least squares method," Journal of Multivariate Analysis, Elsevier, vol. 173(C), pages 366-382.
    17. Anshul Verma & Orazio Angelini & Tiziana Di Matteo, 2019. "A new set of cluster driven composite development indicators," Papers 1911.11226, arXiv.org, revised Mar 2020.
    18. Cuadras, Carles M. & Greenacre, Michael, 2022. "A short history of statistical association: From correlation to correspondence analysis to copulas," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    19. Fang, Xiaolei & Paynabar, Kamran & Gebraeel, Nagi, 2017. "Multistream sensor fusion-based prognostics model for systems with single failure modes," Reliability Engineering and System Safety, Elsevier, vol. 159(C), pages 322-331.
    20. Harold A. Hernández-Roig & M. Carmen Aguilera-Morillo & Rosa E. Lillo, 2021. "Functional Modeling of High-Dimensional Data: A Manifold Learning Approach," Mathematics, MDPI, vol. 9(4), pages 1-22, February.

    More about this item

    JEL classification:

    • C19 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Other
    • C88 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Other Computer Software
