
Evaluating Statistical Methods Using Plasmode Data Sets in the Age of Massive Public Databases: An Illustration Using False Discovery Rates

Author

Listed:
  • Gary L Gadbury
  • Qinfang Xiang
  • Lin Yang
  • Stephen Barnes
  • Grier P Page
  • David B Allison

Abstract

Plasmode is a term coined several years ago to describe data sets that are derived from real data but for which some truth is known. Omic techniques, most especially microarray and genomewide association studies, have catalyzed a new zeitgeist of data sharing that is making data and data sets publicly available on an unprecedented scale. Coupling such data resources with a science of plasmode use would allow statistical methodologists to vet proposed techniques empirically (as opposed to only theoretically) and with data that are by definition realistic and representative. We illustrate the technique of empirical statistics by consideration of a common task when analyzing high dimensional data: the simultaneous testing of hundreds or thousands of hypotheses to determine which, if any, show statistical significance warranting follow-on research. The now-common practice of multiple testing in high dimensional experiment (HDE) settings has generated new methods for detecting statistically significant results. Although such methods have heretofore been subject to comparative performance analysis using simulated data, simulating data that realistically reflect data from an actual HDE remains a challenge. We describe a simulation procedure using actual data from an HDE where some truth regarding parameters of interest is known. We use the procedure to compare estimates for the proportion of true null hypotheses, the false discovery rate (FDR), and a local version of FDR obtained from 15 different statistical methods.

Author Summary: Plasmode is a term used to describe a data set that has been derived from real data but for which some truth is known. Statistical methods that analyze data from high dimensional experiments (HDEs) seek to estimate quantities that are of interest to scientists, such as mean differences in gene expression levels and false discovery rates. The ability of statistical methods to accurately estimate these quantities depends on theoretical derivations or computer simulations. In computer simulations, data for which the true value of a quantity is known are often simulated from statistical models, and the ability of a statistical method to estimate this quantity is evaluated on the simulated data. However, in HDEs there are many possible statistical models to use, and which models appropriately produce data that reflect properties of real data is an open question. We propose the use of plasmodes as one answer to this question. If done carefully, plasmodes can produce data that reflect reality while maintaining the benefits of simulated data. We show one method of generating plasmodes and illustrate their use by comparing the performance of 15 statistical methods for estimating the false discovery rate in data from an HDE.
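
The general idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the article: the abstract does not give the details of the authors' plasmode construction or name the 15 methods compared. Here a single-condition data matrix is split at random into two pseudo-groups, a constant shift is added to a known subset of genes, and two widely used estimators (a Storey-style estimate of the proportion of true nulls and the Benjamini-Hochberg procedure) are checked against the known truth. The function names, the effect size, the normal approximation for p-values, and the Gaussian stand-in for the real data matrix are all hypothetical choices made for this example.

```python
import math
import random
import statistics


def make_plasmode(control, n_nonnull, effect, rng):
    """Split real single-group data into two pseudo-groups and spike in effects.

    control: list of rows (one per gene), each a list of sample values.
    Returns (group_a, group_b, truth); truth[i] is True when gene i received
    a spiked-in shift, i.e. its null hypothesis is known to be false.
    """
    n_genes, n_samples = len(control), len(control[0])
    # One shared column shuffle preserves the gene-gene correlation of the data.
    cols = list(range(n_samples))
    rng.shuffle(cols)
    half = n_samples // 2
    cols_a, cols_b = cols[:half], cols[half:2 * half]
    nonnull = set(rng.sample(range(n_genes), n_nonnull))
    group_a, group_b, truth = [], [], []
    for i, row in enumerate(control):
        shift = effect if i in nonnull else 0.0
        group_a.append([row[c] for c in cols_a])
        group_b.append([row[c] + shift for c in cols_b])
        truth.append(i in nonnull)
    return group_a, group_b, truth


def two_sided_p(a, b):
    """Welch-type statistic; p-value from a normal approximation (assumption)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(b) - statistics.fmean(a)) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def storey_pi0(pvals, lam=0.5):
    """Storey-style estimate of the proportion of true nulls, capped at 1."""
    return min(1.0, sum(p > lam for p in pvals) / ((1.0 - lam) * len(pvals)))


def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns indices of rejected tests."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank  # largest rank whose p-value is under its threshold
    return order[:k]


rng = random.Random(0)
# Stand-in for a real control-group matrix (2000 genes x 80 samples).
# In an actual plasmode this would be measured data, not random draws.
control = [[rng.gauss(0.0, 1.0) for _ in range(80)] for _ in range(2000)]

group_a, group_b, truth = make_plasmode(control, n_nonnull=200, effect=1.0, rng=rng)
pvals = [two_sided_p(a, b) for a, b in zip(group_a, group_b)]

rejected = bh_reject(pvals, q=0.05)
fdp = sum(not truth[i] for i in rejected) / max(1, len(rejected))
pi0_hat = storey_pi0(pvals)

print(f"true pi0 = {1 - sum(truth) / len(truth):.2f}, estimated pi0 = {pi0_hat:.2f}")
print(f"rejections = {len(rejected)}, realized false discovery proportion = {fdp:.3f}")
```

Because the spiked-in genes are known, the realized false discovery proportion and the true proportion of nulls can be computed exactly and compared with each method's estimate. Repeating the split with different seeds would give a distribution of such comparisons for each method under evaluation.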

Suggested Citation

  • Gary L Gadbury & Qinfang Xiang & Lin Yang & Stephen Barnes & Grier P Page & David B Allison, 2008. "Evaluating Statistical Methods Using Plasmode Data Sets in the Age of Massive Public Databases: An Illustration Using False Discovery Rates," PLOS Genetics, Public Library of Science, vol. 4(6), pages 1-8, June.
  • Handle: RePEc:plo:pgen00:1000098
    DOI: 10.1371/journal.pgen.1000098

    Download full text from publisher

    File URL: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000098
    Download Restriction: no

    File URL: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000098&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Efron, Bradley, 2004. "Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis," Journal of the American Statistical Association, American Statistical Association, vol. 99, pages 96-104, January.

    Citations

    Cited by:

    1. Franklin, Jessica M. & Schneeweiss, Sebastian & Polinski, Jennifer M. & Rassen, Jeremy A., 2014. "Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases," Computational Statistics & Data Analysis, Elsevier, vol. 72(C), pages 219-226.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Pounds Stanley B. & Gao Cuilan L. & Zhang Hui, 2012. "Empirical Bayesian Selection of Hypothesis Testing Procedures for Analysis of Sequence Count Expression Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(5), pages 1-32, October.
    2. Hai Shu & Bin Nan & Robert Koeppe, 2015. "Multiple testing for neuroimaging via hidden Markov random field," Biometrics, The International Biometric Society, vol. 71(3), pages 741-750, September.
    3. Yong Wang, 2009. "The constrained Fisher scoring method for maximum likelihood computation of a nonparametric mixing distribution," Computational Statistics, Springer, vol. 24(1), pages 67-81, February.
    4. Bilgrau, Anders Ellern & Eriksen, Poul Svante & Rasmussen, Jakob Gulddahl & Johnsen, Hans Erik & Dybkaer, Karen & Boegsted, Martin, 2016. "GMCM: Unsupervised Clustering and Meta-Analysis Using Gaussian Mixture Copula Models," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 70(i02).
    5. Campbell R. Harvey & Yan Liu & Heqing Zhu, 2014. ". . . and the Cross-Section of Expected Returns," NBER Working Papers 20592, National Bureau of Economic Research, Inc.
    6. Shigeyuki Matsui & Hisashi Noma, 2011. "Estimating Effect Sizes of Differentially Expressed Genes for Power and Sample-Size Assessments in Microarray Experiments," Biometrics, The International Biometric Society, vol. 67(4), pages 1225-1235, December.
    7. Patrick Kline & Christopher Walters, 2019. "Audits as Evidence: Experiments, Ensembles, and Enforcement," Papers 1907.06622, arXiv.org, revised Jul 2019.
    8. Raphael Gottardo & Wei Li & W. Evan Johnson & X. Shirley Liu, 2008. "A Flexible and Powerful Bayesian Hierarchical Model for ChIP–Chip Experiments," Biometrics, The International Biometric Society, vol. 64(2), pages 468-478, June.
    9. Sairam Rayaprolu & Zhiyi Chi, 2021. "False Discovery Variance Reduction in Large Scale Simultaneous Hypothesis Tests," Methodology and Computing in Applied Probability, Springer, vol. 23(3), pages 711-733, September.
    10. Won, Joong-Ho & Lim, Johan & Yu, Donghyeon & Kim, Byung Soo & Kim, Kyunga, 2014. "Monotone false discovery rate," Statistics & Probability Letters, Elsevier, vol. 87(C), pages 86-93.
    11. David R. Bickel, 2014. "Small-scale Inference: Empirical Bayes and Confidence Methods for as Few as a Single Comparison," International Statistical Review, International Statistical Institute, vol. 82(3), pages 457-476, December.
    12. Pan, Lanfeng & Li, Yehua & He, Kevin & Li, Yanming & Li, Yi, 2020. "Generalized linear mixed models with Gaussian mixture random effects: Inference and application," Journal of Multivariate Analysis, Elsevier, vol. 175(C).
    13. van Wieringen, Wessel N. & Stam, Koen A. & Peeters, Carel F.W. & van de Wiel, Mark A., 2020. "Updating of the Gaussian graphical model through targeted penalized estimation," Journal of Multivariate Analysis, Elsevier, vol. 178(C).
    14. Ian W. McKeague & Min Qian, 2015. "An Adaptive Resampling Test for Detecting the Presence of Significant Predictors," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(512), pages 1422-1433, December.
    15. Hefei Zhang & Xuhang Li & Dongyuan Song & Onur Yukselen & Shivani Nanda & Alper Kucukural & Jingyi Jessica Li & Manuel Garber & Albertha J. M. Walhout, 2025. "Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq," Nature Communications, Nature, vol. 16(1), pages 1-21, December.
    16. Angela Schörgendorfer & Adam J. Branscum & Timothy E. Hanson, 2013. "A Bayesian Goodness of Fit Test and Semiparametric Generalization of Logistic Regression with Measurement Data," Biometrics, The International Biometric Society, vol. 69(2), pages 508-519, June.
    17. Zhao, Haibing & Fung, Wing Kam, 2016. "A powerful FDR control procedure for multiple hypotheses," Computational Statistics & Data Analysis, Elsevier, vol. 98(C), pages 60-70.
    18. T. Tony Cai & Wenguang Sun & Weinan Wang, 2019. "Covariate‐assisted ranking and screening for large‐scale two‐sample inference," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 81(2), pages 187-234, April.
    19. Hong, Zhaoping & Lian, Heng, 2012. "BOPA: A Bayesian hierarchical model for outlier expression detection," Computational Statistics & Data Analysis, Elsevier, vol. 56(12), pages 4146-4156.
    20. Marot Guillemette & Mayer Claus-Dieter, 2009. "Sequential Analysis for Microarray Data Based on Sensitivity and Meta-Analysis," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 8(1), pages 1-35, January.
