IDEAS home Printed from https://ideas.repec.org/a/gam/jstats/v8y2025i3p78-d1734074.html

On Synthetic Interval Data with Predetermined Subject Partitioning and Partial Control of the Variables’ Marginal Correlation Structure

Author

Listed:
  • Michail Papathomas

    (School of Mathematics and Statistics, University of St. Andrews, St. Andrews KY16 9AJ, UK
    Current address: The Observatory, Buchanan Gardens, St. Andrews KY16 9LZ, UK.)

Abstract

A standard approach for assessing the performance of partition models is to create synthetic datasets with a prespecified clustering structure and assess how well the model reveals this structure. A common format involves subjects being assigned to different clusters, with observations simulated so that subjects within the same cluster have similar profiles, allowing for some variability. In this manuscript, we consider observations from interval variables. Interval data are commonly observed in cohort and Genome-Wide Association studies, and our focus is on Single-Nucleotide Polymorphisms. Theoretical and empirical results are utilized to explore the dependence structure between the variables in relation to the clustering structure for the subjects. A novel algorithm is proposed that allows control over the marginal stratified correlation structure of the variables, specifying exact correlation values within groups of variables. Practical examples are shown, and a synthetic dataset is compared to a real one, to demonstrate similarities and differences.
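The setup described in the abstract can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper. It only shows, under stated assumptions, how a predetermined subject partition induces marginal correlation between variables that are independent within each cluster. The function names (`simulate_snps`, `pearson`), the cluster sizes, and the allele frequencies are hypothetical, and coding each SNP as the sum of two independent allele draws (values 0, 1, 2) is an assumption made for illustration.

```python
import random
from statistics import mean


def simulate_snps(cluster_sizes, allele_freqs, seed=0):
    """Simulate SNP genotypes in {0, 1, 2} for subjects in predetermined clusters.

    cluster_sizes[c] is the number of subjects in cluster c.
    allele_freqs[c][j] is the (hypothetical) minor-allele frequency of SNP j in cluster c.
    Returns (labels, data), where data[i][j] is the genotype of subject i at SNP j.
    """
    rng = random.Random(seed)
    labels, data = [], []
    for c, size in enumerate(cluster_sizes):
        for _ in range(size):
            labels.append(c)
            # Assumption: genotype = number of minor alleles in two independent draws.
            row = [(rng.random() < p) + (rng.random() < p) for p in allele_freqs[c]]
            data.append(row)
    return labels, data


def pearson(x, y):
    """Sample Pearson correlation; assumes both inputs have non-zero variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def column(rows, j):
    """Extract variable j across the given subjects."""
    return [r[j] for r in rows]


# Two clusters of 500 subjects; SNPs 0 and 1 differ between clusters, SNP 2 does not.
cluster_sizes = [500, 500]
allele_freqs = [[0.1, 0.1, 0.5], [0.6, 0.6, 0.5]]
labels, data = simulate_snps(cluster_sizes, allele_freqs, seed=1)

# Correlation over all subjects (marginal) versus within the first cluster only.
marginal = pearson(column(data, 0), column(data, 1))
within = pearson(column(data[:500], 0), column(data[:500], 1))
print(f"marginal: {marginal:.2f}, within cluster 0: {within:.2f}")
```

In this sketch SNPs 0 and 1 are drawn independently within each cluster, so their within-cluster correlation is close to zero, while their marginal correlation is clearly positive because both shift together across clusters. The sketch leaves that marginal correlation as a by-product of the chosen frequencies rather than setting it to a specified value.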

Suggested Citation

  • Michail Papathomas, 2025. "On Synthetic Interval Data with Predetermined Subject Partitioning and Partial Control of the Variables’ Marginal Correlation Structure," Stats, MDPI, vol. 8(3), pages 1-18, August.
  • Handle: RePEc:gam:jstats:v:8:y:2025:i:3:p:78-:d:1734074

    Download full text from publisher

    File URL: https://www.mdpi.com/2571-905X/8/3/78/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2571-905X/8/3/78/
    Download Restriction: no

    References listed on IDEAS

    1. Dunson, David B. & Xing, Chuanhua, 2009. "Nonparametric Bayes Modeling of Multivariate Categorical Data," Journal of the American Statistical Association, American Statistical Association, vol. 104(487), pages 1042-1051.
    2. Rayjean J. Hung & James D. McKay & Valerie Gaborieau & Paolo Boffetta & Mia Hashibe & David Zaridze & Anush Mukeria & Neonilia Szeszenia-Dabrowska & Jolanta Lissowska & Peter Rudnai & Eleonora Fabiano, 2008. "A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25," Nature, Nature, vol. 452(7187), pages 633-637, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kunihama, T. & Herring, A.H. & Halpern, C.T. & Dunson, D.B., 2016. "Nonparametric Bayes modeling with sample survey weights," Statistics & Probability Letters, Elsevier, vol. 113(C), pages 41-48.
    2. Mahsa Samsami & Ralf Wagner, 2021. "Investment Decisions with Endogeneity: A Dirichlet Tree Analysis," JRFM, MDPI, vol. 14(7), pages 1-19, July.
    3. Guanhua Fang & Jingchen Liu & Zhiliang Ying, 2019. "On the Identifiability of Diagnostic Classification Models," Psychometrika, Springer;The Psychometric Society, vol. 84(1), pages 19-40, March.
    4. Durante, Daniele, 2017. "A note on the multiplicative gamma process," Statistics & Probability Letters, Elsevier, vol. 122(C), pages 198-204.
    5. Yajuan Si & Jerome P. Reiter, 2013. "Nonparametric Bayesian Multiple Imputation for Incomplete Categorical Variables in Large-Scale Assessment Surveys," Journal of Educational and Behavioral Statistics, vol. 38(5), pages 499-521, October.
    6. Hiroyuki Kasahara & Katsumi Shimotsu, 2014. "Non-parametric identification and estimation of the number of components in multivariate mixtures," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 76(1), pages 97-111, January.
    7. Jianxin Shi & Kouya Shiraishi & Jiyeon Choi & Keitaro Matsuo & Tzu-Yu Chen & Juncheng Dai & Rayjean J. Hung & Kexin Chen & Xiao-Ou Shu & Young Tae Kim & Maria Teresa Landi & Dongxin Lin & Wei Zheng & , 2023. "Genome-wide association study of lung adenocarcinoma in East Asia and comparison with a European population," Nature Communications, Nature, vol. 14(1), pages 1-17, December.
    8. Planas Christophe & Rossi Alessandro, 2024. "The slice sampler and centrally symmetric distributions," Monte Carlo Methods and Applications, De Gruyter, vol. 30(3), pages 299-313.
    9. Alfò, Marco & Rocchetti, Irene, 2013. "A flexible approach to finite mixture regression models for multivariate mixed responses," Statistics & Probability Letters, Elsevier, vol. 83(7), pages 1754-1758.
    10. James Jackson & Robin Mitra & Brian Francis & Iain Dove, 2022. "Using saturated count models for user‐friendly synthesis of large confidential administrative databases," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 185(4), pages 1613-1643, October.
    11. repec:plo:pone00:0010858 is not listed on IDEAS
    12. Humera Razzak & Christian Heumann, 2019. "Hybrid Multiple Imputation In A Large Scale Complex Survey," Statistics in Transition New Series, Polish Statistical Association, vol. 20(4), pages 33-58, December.
    13. Razzak Humera & Heumann Christian, 2019. "Hybrid Multiple Imputation In A Large Scale Complex Survey," Statistics in Transition New Series, Statistics Poland, vol. 20(4), pages 33-58, December.
    14. repec:plo:pone00:0016981 is not listed on IDEAS
    15. Jing Zhou & Anirban Bhattacharya & Amy H. Herring & David B. Dunson, 2015. "Bayesian Factorizations of Big Sparse Tensors," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(512), pages 1562-1576, December.
    16. Krzysztof Chmielowiec & Jolanta Chmielowiec & Aleksandra Strońska-Pluta & Grzegorz Trybek & Małgorzata Śmiarowska & Aleksandra Suchanecka & Grzegorz Woźniak & Aleksandra Jaroń & Anna Grzywacz, 2022. "Association of Polymorphism CHRNA5 and CHRNA3 Gene in People Addicted to Nicotine," IJERPH, MDPI, vol. 19(17), pages 1-12, August.
    17. Hang J. Kim & Jörg Drechsler & Katherine J. Thompson, 2021. "Synthetic microdata for establishment surveys under informative sampling," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(1), pages 255-281, January.
    18. Tsuyoshi Kunihama & David B. Dunson, 2013. "Bayesian Modeling of Temporal Dependence in Large Sparse Contingency Tables," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(504), pages 1324-1338, December.
    19. Congcong Chen & Yang Li & Yayun Gu & Qiqi Zhai & Songwei Guo & Jun Xiang & Yuan Xie & Mingxing An & Chenmeijie Li & Na Qin & Yanan Shi & Liu Yang & Jun Zhou & Xianfeng Xu & Ziye Xu & Kai Wang & Meng Z, 2025. "Massively parallel variant-to-function mapping determines functional regulatory variants of non-small cell lung cancer," Nature Communications, Nature, vol. 16(1), pages 1-16, December.
    20. Zhenke Wu & Livia Casciola‐Rosen & Antony Rosen & Scott L. Zeger, 2021. "A Bayesian approach to restricted latent class models for scientifically structured clustering of multivariate binary outcomes," Biometrics, The International Biometric Society, vol. 77(4), pages 1431-1444, December.
    21. Paiva Thais & Reiter Jerome P., 2017. "Stop or Continue Data Collection: A Nonignorable Missing Data Approach for Continuous Variables," Journal of Official Statistics, Sciendo, vol. 33(3), pages 579-599, September.
    22. Daniel Manrique‐Vallier & Jingchen Hu, 2018. "Bayesian non‐parametric generation of fully synthetic multivariate categorical data in the presence of structural zeros," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(3), pages 635-647, June.

