Evaluating Statistical Methods Using Plasmode Data Sets in the Age of Massive Public Databases: An Illustration Using False Discovery Rates

Author

Listed:

Gary L Gadbury
Qinfang Xiang
Lin Yang
Stephen Barnes
Grier P Page
David B Allison

Abstract

Plasmode is a term coined several years ago to describe data sets that are derived from real data but for which some truth is known. Omic techniques, most especially microarray and genomewide association studies, have catalyzed a new zeitgeist of data sharing that is making data and data sets publicly available on an unprecedented scale. Coupling such data resources with a science of plasmode use would allow statistical methodologists to vet proposed techniques empirically (as opposed to only theoretically) and with data that are by definition realistic and representative. We illustrate the technique of empirical statistics by consideration of a common task when analyzing high dimensional data: the simultaneous testing of hundreds or thousands of hypotheses to determine which, if any, show statistical significance warranting follow-on research. The now-common practice of multiple testing in high dimensional experiment (HDE) settings has generated new methods for detecting statistically significant results. Although such methods have heretofore been subject to comparative performance analysis using simulated data, simulating data that realistically reflect data from an actual HDE remains a challenge. We describe a simulation procedure using actual data from an HDE where some truth regarding parameters of interest is known. We use the procedure to compare estimates for the proportion of true null hypotheses, the false discovery rate (FDR), and a local version of FDR obtained from 15 different statistical methods.

Author Summary: Plasmode is a term used to describe a data set that has been derived from real data but for which some truth is known. Statistical methods that analyze data from high dimensional experiments (HDEs) seek to estimate quantities that are of interest to scientists, such as mean differences in gene expression levels and false discovery rates. The ability of statistical methods to accurately estimate these quantities depends on theoretical derivations or computer simulations. In computer simulations, data for which the true value of a quantity is known are often simulated from statistical models, and the ability of a statistical method to estimate this quantity is evaluated on the simulated data. However, in HDEs there are many possible statistical models to use, and which models appropriately produce data that reflect properties of real data is an open question. We propose the use of plasmodes as one answer to this question. If done carefully, plasmodes can produce data that reflect reality while maintaining the benefits of simulated data. We show one method of generating plasmodes and illustrate their use by comparing the performance of 15 statistical methods for estimating the false discovery rate in data from an HDE.
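The general idea of a plasmode can be illustrated with a small sketch. The abstract does not spell out the authors' procedure, so everything below is an assumption made for illustration: the constant-shift effect model, the random split of samples into two groups, the normal approximation for p-values, the Storey-type estimator of the proportion of true nulls, and all function names. The `null_data` matrix is a randomly generated placeholder standing in for real measurements in which no group effect is believed to exist.

```python
import math
import random
import statistics


def make_plasmode(null_data, n_effects, delta, rng):
    """Build one plasmode from a genes x samples matrix assumed to hold no group effect.

    Samples are split at random into two groups, and a constant shift `delta`
    is added to group B for `n_effects` randomly chosen genes (assumed effect model).
    Returns (group_a, group_b, is_null); is_null[g] records the known truth.
    """
    n_genes, n_samples = len(null_data), len(null_data[0])
    cols = list(range(n_samples))
    rng.shuffle(cols)
    half = n_samples // 2
    cols_a, cols_b = cols[:half], cols[half:]
    spiked = set(rng.sample(range(n_genes), n_effects))
    group_a, group_b, is_null = [], [], []
    for g, row in enumerate(null_data):
        shift = delta if g in spiked else 0.0
        group_a.append([row[c] for c in cols_a])
        group_b.append([row[c] + shift for c in cols_b])
        is_null.append(g not in spiked)
    return group_a, group_b, is_null


def two_sided_p(a, b):
    """Two-sided p-value from a Welch-type statistic, using a normal approximation."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(b) - statistics.fmean(a)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def storey_pi0(pvals, lam=0.5):
    """Storey-type estimate of the proportion of true nulls, capped at 1."""
    return min(1.0, sum(p > lam for p in pvals) / (len(pvals) * (1.0 - lam)))


def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns indices of rejected hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank  # largest rank whose p-value falls under the BH line
    return order[:k]


rng = random.Random(0)
# Placeholder only: in practice this would be real data with no expected group effect.
null_data = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(500)]

group_a, group_b, is_null = make_plasmode(null_data, n_effects=100, delta=1.5, rng=rng)
pvals = [two_sided_p(a, b) for a, b in zip(group_a, group_b)]

true_pi0 = sum(is_null) / len(is_null)          # known by construction
pi0_hat = storey_pi0(pvals)                     # what a method would estimate
rejected = bh_reject(pvals)
fdp = sum(is_null[i] for i in rejected) / max(1, len(rejected))  # realized false discovery proportion

print(f"true pi0 = {true_pi0:.2f}, estimated pi0 = {pi0_hat:.2f}")
print(f"rejections = {len(rejected)}, realized FDP = {fdp:.3f}")
```

Because the spiked genes are recorded in `is_null`, an estimate such as `pi0_hat` or a nominal FDR level can be checked directly against the truth, which is the property that makes plasmodes useful for comparing methods. Repeating the construction over many random splits and effect sizes would give a distribution of estimation errors for each method under comparison.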

Suggested Citation

Gary L Gadbury & Qinfang Xiang & Lin Yang & Stephen Barnes & Grier P Page & David B Allison, 2008. "Evaluating Statistical Methods Using Plasmode Data Sets in the Age of Massive Public Databases: An Illustration Using False Discovery Rates," PLOS Genetics, Public Library of Science, vol. 4(6), pages 1-8, June.

Handle: RePEc:plo:pgen00:1000098
DOI: 10.1371/journal.pgen.1000098

Citations

Cited by:

Franklin, Jessica M. & Schneeweiss, Sebastian & Polinski, Jennifer M. & Rassen, Jeremy A., 2014. "Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases," Computational Statistics & Data Analysis, Elsevier, vol. 72(C), pages 219-226.

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Pounds Stanley B. & Gao Cuilan L. & Zhang Hui, 2012. "Empirical Bayesian Selection of Hypothesis Testing Procedures for Analysis of Sequence Count Expression Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(5), pages 1-32, October.
Shigeyuki Matsui & Hisashi Noma, 2011. "Estimating Effect Sizes of Differentially Expressed Genes for Power and Sample-Size Assessments in Microarray Experiments," Biometrics, The International Biometric Society, vol. 67(4), pages 1225-1235, December.
Won, Joong-Ho & Lim, Johan & Yu, Donghyeon & Kim, Byung Soo & Kim, Kyunga, 2014. "Monotone false discovery rate," Statistics & Probability Letters, Elsevier, vol. 87(C), pages 86-93.
van Wieringen, Wessel N. & Stam, Koen A. & Peeters, Carel F.W. & van de Wiel, Mark A., 2020. "Updating of the Gaussian graphical model through targeted penalized estimation," Journal of Multivariate Analysis, Elsevier, vol. 178(C).
Ian W. McKeague & Min Qian, 2015. "An Adaptive Resampling Test for Detecting the Presence of Significant Predictors," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(512), pages 1422-1433, December.
Angela Schörgendorfer & Adam J. Branscum & Timothy E. Hanson, 2013. "A Bayesian Goodness of Fit Test and Semiparametric Generalization of Logistic Regression with Measurement Data," Biometrics, The International Biometric Society, vol. 69(2), pages 508-519, June.
Han, Bing & Dalal, Siddhartha R., 2012. "A Bernstein-type estimator for decreasing density with application to p-value adjustments," Computational Statistics & Data Analysis, Elsevier, vol. 56(2), pages 427-437.
Dalia Valencia & Rosa E. Lillo & Juan Romo, 2019. "A Kendall correlation coefficient between functional data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(4), pages 1083-1103, December.
Kline, Patrick & Walters, Christopher, 2019. "Audits as Evidence: Experiments, Ensembles, and Enforcement," Institute for Research on Labor and Employment, Working Paper Series qt3z72m9kn, Institute of Industrial Relations, UC Berkeley.
- Patrick Kline & Christopher Walters, 2019. "Audits as Evidence: Experiments, Ensembles, and Enforcement," Papers 1907.06622, arXiv.org, revised Jul 2019.
He, Yi & Pan, Wei & Lin, Jizhen, 2006. "Cluster analysis using multivariate normal mixture models to detect differential gene expression with microarray data," Computational Statistics & Data Analysis, Elsevier, vol. 51(2), pages 641-658, November.
Cheng, Cheng, 2009. "Internal validation inferences of significant genomic features in genome-wide screening," Computational Statistics & Data Analysis, Elsevier, vol. 53(3), pages 788-800, January.
Sinjini Sikdar & Somnath Datta & Susmita Datta, 2017. "EAMA: Empirically adjusted meta-analysis for large-scale simultaneous hypothesis testing in genomic experiments," PLOS ONE, Public Library of Science, vol. 12(10), pages 1-19, October.
Tianwei Yu, 2018. "A new dynamic correlation algorithm reveals novel functional aspects in single cell and bulk RNA-seq data," PLOS Computational Biology, Public Library of Science, vol. 14(8), pages 1-22, August.
Xiang, Qinfang & Edwards, Jode & Gadbury, Gary L., 2006. "Interval estimation in a finite mixture model: Modeling P-values in multiple testing applications," Computational Statistics & Data Analysis, Elsevier, vol. 51(2), pages 570-586, November.
Gordon, Alexander & Chen, Linlin & Glazko, Galina & Yakovlev, Andrei, 2009. "Balancing type one and two errors in multiple testing for differential expression of genes," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1622-1629, March.
Ruggieri, Eric & Lawrence, Charles E., 2012. "On efficient calculations for Bayesian variable selection," Computational Statistics & Data Analysis, Elsevier, vol. 56(6), pages 1319-1332.
Montazeri Zahra & Yanofsky Corey M. & Bickel David R., 2010. "Shrinkage Estimation of Effect Sizes as an Alternative to Hypothesis Testing Followed by Estimation in High-Dimensional Biology: Applications to Differential Gene Expression," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 9(1), pages 1-33, June.
T. Tony Cai & Wenguang Sun, 2017. "Optimal screening and discovery of sparse signals with applications to multistage high throughput studies," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 79(1), pages 197-223, January.
Yi-Hui Zhou & Paul Brooks & Xiaoshan Wang, 2018. "A Two-Stage Hidden Markov Model Design for Biomarker Detection, with Application to Microbiome Research," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 10(1), pages 41-58, April.
Woo, Chi-Keung & Horowitz, Ira & Olson, Arne & Horii, Brian & Baskette, Carmen, 2006. "Efficient frontiers for electricity procurement by an LDC with multiple purchase options," Omega, Elsevier, vol. 34(1), pages 70-80, January.
