Generating correlated data for omics simulation

Author

Jianing Yang
Gregory R Grant
Thomas G Brooks

Abstract

Simulation of realistic omics data is a key input for benchmarking studies that help users obtain optimal computational pipelines. Omics data involves large numbers of measured features on each sample, and these measures are generally correlated with each other. However, simulation too often ignores these correlations, perhaps due to the computational and statistical hurdles of including them. To alleviate this, we describe three approaches for generating omics-scale data with correlated measures which mimic real datasets. These approaches are all based on a Gaussian copula with a covariance matrix that decomposes into a diagonal part and a low-rank part. This decomposition allows for extremely efficient simulation, overcoming a hurdle for adoption of past methods. We use these approaches to demonstrate the importance of including correlation in two benchmarking applications. First, we show that variance of results from the popular DESeq2 method increases when dependence is included. Second, we demonstrate that CYCLOPS, a method for inferring circadian time of collection from transcriptomics, improves in performance when given gene-gene dependencies in some circumstances. We provide an R package, dependentsimr, that has efficient implementations of these methods and can generate dependent data with arbitrary marginal distributions, including discrete (binary, ordered categorical, Poisson, negative binomial), continuous (normal), or with an empirical distribution.

Author summary: Modern techniques, including high-throughput sequencing, produce more data than ever before. To determine the optimal computational analysis methods for these data, benchmarks are often performed using simulated data. This simulated data needs to closely match realistic data in order for benchmarking to be meaningful. An often neglected aspect of this is that measurements of different values are often correlated or dependent upon each other. Two possible reasons for this neglect are a lack of guidelines on how to produce such data and the computational expense of the methods that produce it. We describe here three related methods that are both conceptually relatively simple and highly computationally efficient. We demonstrate these on two applications which show how inclusion of these dependencies can affect the results of benchmarking. Lastly, we provide a software package to act as a reference implementation of these methods.
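The abstract's central idea, a Gaussian copula whose covariance is a diagonal part plus a low-rank part, can be illustrated with a minimal sketch. This is not the dependentsimr implementation and is not claimed to match any of the paper's three approaches: the function names (`fit_low_rank_copula`, `simulate`), the PCA-based estimate of the low-rank part, the rank-based normal scores, and the choice of empirical marginals are all assumptions made for illustration. It also assumes no feature is constant and that `rank` is smaller than both the number of reference samples and the number of features.

```python
from statistics import NormalDist

import numpy as np

# Elementwise standard normal CDF and quantile function (stdlib, vectorized).
_STD_NORMAL = NormalDist()
_norm_cdf = np.vectorize(_STD_NORMAL.cdf)
_norm_inv_cdf = np.vectorize(_STD_NORMAL.inv_cdf)


def fit_low_rank_copula(reference, rank):
    """Hypothetical helper: return W (p x rank) and d (p,) such that
    W @ W.T + diag(d) has unit diagonal, estimated from `reference`."""
    n_ref, _ = reference.shape
    # Rank-based normal scores per feature (ties are broken arbitrarily).
    ranks = reference.argsort(axis=0).argsort(axis=0) + 1
    scores = _norm_inv_cdf(ranks / (n_ref + 1))
    scores = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
    # Top principal components of the scores supply the low-rank part.
    _, s, vt = np.linalg.svd(scores / np.sqrt(n_ref - 1), full_matrices=False)
    w = vt[:rank].T * s[:rank]
    # The diagonal part absorbs the variance the low-rank part leaves over.
    d = np.clip(1.0 - (w**2).sum(axis=1), 0.0, None)
    return w, d


def simulate(reference, n_samples, rank=2, seed=0):
    """Hypothetical helper: draw samples with empirical marginals taken from
    `reference` and dependence from the fitted low-rank Gaussian copula."""
    rng = np.random.default_rng(seed)
    w, d = fit_low_rank_copula(reference, rank)
    p = reference.shape[1]
    # z has covariance W W^T + diag(d), yet the p x p matrix is never formed:
    # the cost is O(n_samples * p * rank).
    factors = rng.standard_normal((n_samples, rank))
    noise = rng.standard_normal((n_samples, p))
    z = factors @ w.T + noise * np.sqrt(d)
    u = _norm_cdf(z)
    # Map each uniform column through that feature's empirical quantiles.
    return np.column_stack(
        [np.quantile(reference[:, j], u[:, j]) for j in range(p)]
    )
```

Calling `simulate(reference, n_samples=500, rank=2)` on a samples-by-features array returns a 500-row array with the same number of features. The efficiency comes from sampling only `rank` shared factors plus independent per-feature noise, which avoids factorizing a full features-by-features covariance matrix.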

Suggested Citation

Jianing Yang & Gregory R Grant & Thomas G Brooks, 2025. "Generating correlated data for omics simulation," PLOS Computational Biology, Public Library of Science, vol. 21(9), pages 1-16, September.

Handle: RePEc:plo:pcbi00:1013392
DOI: 10.1371/journal.pcbi.1013392

References listed on IDEAS

Opgen-Rhein Rainer & Strimmer Korbinian, 2007. "Accurate Ranking of Differentially Expressed Genes by a Distribution-Free Shrinkage Approach," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 6(1), pages 1-20, February.
Schäfer Juliane & Strimmer Korbinian, 2005. "A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 4(1), pages 1-32, November.
Cai, Tony & Liu, Weidong, 2011. "Adaptive Thresholding for Sparse Covariance Matrix Estimation," Journal of the American Statistical Association, American Statistical Association, vol. 106(494), pages 672-684.

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Jianqing Fan & Xu Han, 2017. "Estimation of the false discovery proportion with unknown dependence," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 79(4), pages 1143-1164, September.
Seunghwan Lee & Sang Cheol Kim & Donghyeon Yu, 2023. "An efficient GPU-parallel coordinate descent algorithm for sparse precision matrix estimation via scaled lasso," Computational Statistics, Springer, vol. 38(1), pages 217-242, March.
Jorge A Chan-Lau, 2017. "Variance Decomposition Networks: Potential Pitfalls and a Simple Solution," IMF Working Papers 2017/107, International Monetary Fund.
Korbinian Strimmer, 2008. "Comments on: Augmenting the bootstrap to analyze high dimensional genomic data," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 17(1), pages 25-27, May.
Martin Bod’a, 2017. "Stochastic sensitivity analysis of concentration measures," Central European Journal of Operations Research, Springer;Slovak Society for Operations Research;Hungarian Operational Research Society;Czech Society for Operations Research;Österr. Gesellschaft für Operations Research (ÖGOR);Slovenian Society Informatika - Section for Operational Research;Croatian Operational Research Society, vol. 25(2), pages 441-471, June.
Sumanjay Dutta & Shashi Jain, 2023. "Precision versus Shrinkage: A Comparative Analysis of Covariance Estimation Methods for Portfolio Allocation," Papers 2305.11298, arXiv.org.
Lam, Clifford, 2020. "High-dimensional covariance matrix estimation," LSE Research Online Documents on Economics 101667, London School of Economics and Political Science, LSE Library.
Shen, Yanfeng & Lin, Zhengyan, 2015. "An adaptive test for the mean vector in large-p-small-n problems," Computational Statistics & Data Analysis, Elsevier, vol. 89(C), pages 25-38.
Gautam Sabnis & Debdeep Pati & Anirban Bhattacharya, 2019. "Compressed Covariance Estimation with Automated Dimension Learning," Sankhya A: The Indian Journal of Statistics, Springer;Indian Statistical Institute, vol. 81(2), pages 466-481, December.
Huang, Na & Fryzlewicz, Piotr, 2018. "NOVELIST estimator of large correlation and covariance matrices and their inverses," LSE Research Online Documents on Economics 89055, London School of Economics and Political Science, LSE Library.
Arnab Chakrabarti & Rituparna Sen, 2018. "Some Statistical Problems with High Dimensional Financial data," Papers 1808.02953, arXiv.org.
Bailey, Natalia & Pesaran, M. Hashem & Smith, L. Vanessa, 2019. "A multiple testing approach to the regularisation of large sample correlation matrices," Journal of Econometrics, Elsevier, vol. 208(2), pages 507-534.
- Natalia Bailey & Vanessa Smith & M. Hashem Pesaran, 2014. "A multiple testing approach to the regularisation of large sample correlation matrices," Cambridge Working Papers in Economics 1413, Faculty of Economics, University of Cambridge.
- Natalia Bailey & M. Hashem Pesaran & L. Vanessa Smith, 2015. "A Multiple Testing Approach to the Regularisation of Large Sample Correlation Matrices," Working Papers 764, Queen Mary University of London, School of Economics and Finance.
- Natalia Bailey & M. Hashem Pesaran & L. Vanessa Smith, 2014. "A Multiple Testing Approach to the Regularisation of Large Sample Correlation Matrices," CESifo Working Paper Series 4834, CESifo.
Chen, Shuo & Kang, Jian & Xing, Yishi & Zhao, Yunpeng & Milton, Donald K., 2018. "Estimating large covariance matrix with network topology for high-dimensional biomedical data," Computational Statistics & Data Analysis, Elsevier, vol. 127(C), pages 82-95.
Lim Johan & Kim Jayoun & Kim Sang-cheol & Yu Donghyeon & Kim Kyunga & Kim Byung Soo, 2012. "Detection of Differentially Expressed Gene Sets in a Partially Paired Microarray Data Set," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(3), pages 1-30, February.
Luger, Richard, 2025. "Regularizing stock return covariance matrices via multiple testing of correlations," Journal of Econometrics, Elsevier, vol. 248(C).
- Richard Luger, 2024. "Regularizing stock return covariance matrices via multiple testing of correlations," Papers 2407.09696, arXiv.org.
Ikeda, Yuki & Kubokawa, Tatsuya & Srivastava, Muni S., 2016. "Comparison of linear shrinkage estimators of a large covariance matrix in normal and non-normal distributions," Computational Statistics & Data Analysis, Elsevier, vol. 95(C), pages 95-108.
Huiqin Xin & Sihai Dave Zhao, 2023. "A compound decision approach to covariance matrix estimation," Biometrics, The International Biometric Society, vol. 79(2), pages 1201-1212, June.
Vera Djordjilović & Monica Chiogna & Chiara Romualdi, 2020. "Simulating gene silencing through intervention analysis," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 69(4), pages 887-907, August.
Na Huang & Piotr Fryzlewicz, 2019. "NOVELIST estimator of large correlation and covariance matrices and their inverses," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 28(3), pages 694-727, September.
Hannart, Alexis & Naveau, Philippe, 2014. "Estimating high dimensional covariance matrices: A new look at the Gaussian conjugate framework," Journal of Multivariate Analysis, Elsevier, vol. 131(C), pages 149-162.
