
Graph-theoretic multisample tests of equality in distribution for high dimensional data

Author

Listed:
  • Petrie, Adam

Abstract

Testing whether two or more independent samples arise from a common distribution is a classic problem in statistics. Several multivariate two-sample tests of equality are based on graphs such as the minimum spanning tree, nearest neighbor, and optimal nonbipartite perfect matching. Here, the samples are pooled and the test statistic is the number of edges in the graph that connect points with different sample identities. These tests are typically unbiased and perform well when estimates of underlying probability densities are poor. However, these tests have not been thoroughly studied when data is very high dimensional or in the multisample case. We introduce the use of orthogonal perfect matchings for testing equality in distribution. A suite of Monte Carlo simulations on artificial and real data shows that orthogonal perfect matchings and spanning trees typically have higher power than other graphs and are also more effective at discerning when samples have differences in their covariance structure compared to other nonparametric tests such as the energy and triangle tests.
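The pooled-sample edge count described in the abstract can be illustrated with a minimal sketch. It assumes Euclidean distance, a minimum spanning tree built with Prim's algorithm, and a permutation p-value. It is not the paper's orthogonal perfect matching procedure, and the function names (`mst_edges`, `cross_edge_count`, `permutation_test`) are hypothetical, chosen only for this illustration.

```python
import math
import random


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns n-1 edges (i, j)."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    best_from = [-1] * n         # tree vertex realising that connection
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_edge_count(edges, labels):
    """Number of edges joining points with different sample identities."""
    return sum(labels[i] != labels[j] for i, j in edges)


def permutation_test(samples, n_perm=999, seed=0):
    """Pool the samples, build the graph once, and permute sample labels.

    Few cross-sample edges suggest the samples differ, so the p-value is
    the share of permutations with a count at most the observed one.
    """
    rng = random.Random(seed)
    points = [p for s in samples for p in s]
    labels = [k for k, s in enumerate(samples) for _ in s]
    edges = mst_edges(points)    # the graph depends only on the pooled points
    observed = cross_edge_count(edges, labels)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if cross_edge_count(edges, permuted) <= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_perm)
```

Because the graph is built on the pooled points only, it is computed once and reused for every permutation; the same skeleton accepts more than two samples, since the statistic only checks whether the two endpoints of an edge carry different labels.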

Suggested Citation

  • Petrie, Adam, 2016. "Graph-theoretic multisample tests of equality in distribution for high dimensional data," Computational Statistics & Data Analysis, Elsevier, vol. 96(C), pages 145-158.
  • Handle: RePEc:eee:csdana:v:96:y:2016:i:c:p:145-158
    DOI: 10.1016/j.csda.2015.11.003

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947315002716
    Download Restriction: Full text for ScienceDirect subscribers only.



    References listed on IDEAS

    1. Lu, Bo & Greevy, Robert & Xu, Xinyi & Beck, Cole, 2011. "Optimal Nonbipartite Matching and Its Statistical Applications," The American Statistician, American Statistical Association, vol. 65(1), pages 21-30.
    2. Paul R. Rosenbaum, 2005. "An exact distribution‐free test comparing two multivariate distributions based on adjacency," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(4), pages 515-530, September.
    3. Justel, Ana & Peña, Daniel & Zamar, Rubén, 1997. "A multivariate Kolmogorov-Smirnov test of goodness of fit," Statistics & Probability Letters, Elsevier, vol. 35(3), pages 251-259, October.
    4. Baringhaus, L. & Franz, C., 2004. "On a new multivariate two-sample test," Journal of Multivariate Analysis, Elsevier, vol. 88(1), pages 190-206, January.
    5. Dinh Pham & Joachim Möcks & Lothar Sroka, 1989. "Asymptotic normality of double-indexed linear permutation statistics," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 41(3), pages 415-427, September.
    6. Zhenyu Liu & Reza Modarres, 2011. "A triangle test for equality of distribution functions in high dimensions," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 23(3), pages 605-615.
    7. Dale L. Zimmerman, 1993. "A Bivariate Cramér–Von Mises Type of Test for Spatial Randomness," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 42(1), pages 43-54, March.
    8. Nettleton, Dan & Banerjee, T., 2001. "Testing the equality of distributions of random vectors with categorical components," Computational Statistics & Data Analysis, Elsevier, vol. 37(2), pages 195-208, August.
    9. Rousson, Valentin, 2002. "On Distribution-Free Tests for the Multivariate Two-Sample Location-Scale Model," Journal of Multivariate Analysis, Elsevier, vol. 80(1), pages 43-57, January.
    10. Anderson, N. H. & Hall, P. & Titterington, D. M., 1994. "Two-Sample Test Statistics for Measuring Discrepancies Between Two Multivariate Probability Density Functions Using Kernel-Based Density Estimates," Journal of Multivariate Analysis, Elsevier, vol. 50(1), pages 41-54, July.

    Citations



    Cited by:

    1. Luai Al-Labadi & Forough Fazeli Asl & Zahra Saberi, 2022. "A Bayesian nonparametric multi-sample test in any dimension," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 106(2), pages 217-242, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Biswas, Munmun & Ghosh, Anil K., 2014. "A nonparametric two-sample test applicable to high dimensional data," Journal of Multivariate Analysis, Elsevier, vol. 123(C), pages 160-171.
    2. Modarres, Reza, 2014. "On the interpoint distances of Bernoulli vectors," Statistics & Probability Letters, Elsevier, vol. 84(C), pages 215-222.
    3. Shin-ichi Tsukada, 2019. "High dimensional two-sample test based on the inter-point distance," Computational Statistics, Springer, vol. 34(2), pages 599-615, June.
    4. Mondal, Pronoy K. & Biswas, Munmun & Ghosh, Anil K., 2015. "On high dimensional two-sample tests based on nearest neighbors," Journal of Multivariate Analysis, Elsevier, vol. 141(C), pages 168-178.
    5. Paul, Biplab & De, Shyamal K. & Ghosh, Anil K., 2022. "Some clustering-based exact distribution-free k-sample tests applicable to high dimension, low sample size data," Journal of Multivariate Analysis, Elsevier, vol. 190(C).
    6. Anil K. Ghosh & Munmun Biswas, 2016. "Distribution-free high-dimensional two-sample tests based on discriminating hyperplanes," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 25(3), pages 525-547, September.
    7. Reza Modarres, 2020. "Graphical Comparison of High‐Dimensional Distributions," International Statistical Review, International Statistical Institute, vol. 88(3), pages 698-714, December.
    8. Petrie, Adam & Willemain, Thomas R., 2013. "An empirical study of tests for uniformity in multidimensional data," Computational Statistics & Data Analysis, Elsevier, vol. 64(C), pages 253-268.
    9. Lovato, Ilenia & Pini, Alessia & Stamm, Aymeric & Vantini, Simone, 2020. "Model-free two-sample test for network-valued data," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    10. Carole Bernard & Oleg Bondarenko & Steven Vanduffel, 2021. "A model-free approach to multivariate option pricing," Review of Derivatives Research, Springer, vol. 24(2), pages 135-155, July.
    11. Jie Shi & Arno P. J. M. Siebes & Siamak Mehrkanoon, 2023. "TransCORALNet: A Two-Stream Transformer CORAL Networks for Supply Chain Credit Assessment Cold Start," Papers 2311.18749, arXiv.org.
    12. Jean-David Fermanian & Dominique Guégan, 2021. "Fair learning with bagging," Documents de travail du Centre d'Economie de la Sorbonne 21034, Université Panthéon-Sorbonne (Paris 1), Centre d'Economie de la Sorbonne.
    13. Martin L. Hazelton & Tilman M. Davies, 2022. "Pointwise comparison of two multivariate density functions," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 49(4), pages 1791-1810, December.
    14. Masayuki Hirukawa & Mari Sakudo, 2016. "Testing Symmetry of Unknown Densities via Smoothing with the Generalized Gamma Kernels," Econometrics, MDPI, vol. 4(2), pages 1-27, June.
    15. Masato Okamoto, 2009. "Decomposition of gini and multivariate gini indices," The Journal of Economic Inequality, Springer;Society for the Study of Economic Inequality, vol. 7(2), pages 153-177, June.
    16. Heinrich Lothar & Klein Stella, 2011. "Central limit theorem for the integrated squared error of the empirical second-order product density and goodness-of-fit tests for stationary point processes," Statistics & Risk Modeling, De Gruyter, vol. 28(4), pages 359-387, December.
    17. M. D. Jiménez-Gamero & M. Cousido-Rocha & M. V. Alba-Fernández & F. Jiménez-Jiménez, 2022. "Testing the equality of a large number of populations," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 31(1), pages 1-21, March.
    18. M. D. Jiménez-Gamero & J. L. Moreno-Rebollo & J. A. Mayor-Gallego, 2018. "On the estimation of the characteristic function in finite populations with applications," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 27(1), pages 95-121, March.
    19. Caiya Zhang & Zhengyan Lin & Jianjun Wu, 2009. "Nonparametric tests for the general multivariate multi-sample problem," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 21(7), pages 877-888.
    20. Marcelo Fernandes & Eduardo Mendes & Olivier Scaillet, 2015. "Testing for symmetry and conditional symmetry using asymmetric kernels," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 67(4), pages 649-671, August.
