
Tracking Cross-Validated Estimates of Prediction Error as Studies Accumulate

Author

Listed:
  • Lo-Bin Chang
  • Donald Geman

Abstract

In recent years, "reproducibility" has emerged as a key factor in evaluating applications of statistics to the biomedical sciences, for example, learning predictors of disease phenotypes from high-throughput "omics" data. In particular, "validation" is undermined when error rates on newly acquired data are sharply higher than those originally reported. More precisely, when data are collected from m "studies" representing possibly different subphenotypes, more generally different mixtures of subphenotypes, the error rates in cross-study validation (CSV) are observed to be larger than those obtained in ordinary randomized cross-validation (RCV), although the "gap" seems to close as m increases. Whereas these findings are hardly surprising for a heterogeneous underlying population, this discrepancy is then seen as a barrier to translational research. We provide a statistical formulation in the large-sample limit: studies themselves are modeled as components of a mixture and all error rates are optimal (Bayes) for a two-class problem. Our results cohere with the trends observed in practice and suggest what is likely to be observed with large samples and consistent density estimators, namely, that the CSV error rate exceeds the RCV error rates for any m, the latter (appropriately averaged) increases with m, and both converge to the optimal rate for the whole population.
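The distinction between the two validation schemes in the abstract can be made concrete with a toy simulation. The sketch below is not the authors' formulation (which concerns Bayes-optimal error rates in the large-sample limit); it is a minimal finite-sample illustration under assumptions of our own: each "study" is a one-dimensional two-class Gaussian sample with a study-specific location shift, and the classifier is a single threshold at the midpoint of the class means. All function names (`make_study`, `fit_threshold`, `csv_error`, `rcv_error`) are hypothetical. CSV holds out one whole study at a time; RCV pools the studies and holds out random folds.

```python
import random
import statistics


def make_study(shift, n, rng):
    """Simulate one 'study' (assumed model): class means at shift-1 and shift+1, unit variance."""
    data = []
    for _ in range(n):
        y = rng.randrange(2)
        x = rng.gauss(shift + (1.0 if y else -1.0), 1.0)
        data.append((x, y))
    return data


def fit_threshold(train):
    """Fit a threshold rule: cut at the midpoint of the two class means."""
    m0 = statistics.fmean(x for x, y in train if y == 0)
    m1 = statistics.fmean(x for x, y in train if y == 1)
    return (m0 + m1) / 2.0, m1 >= m0


def error_rate(model, test):
    """Fraction of test points misclassified by the threshold rule."""
    thr, class1_above = model
    wrong = 0
    for x, y in test:
        pred = 1 if (x > thr) == class1_above else 0
        wrong += pred != y
    return wrong / len(test)


def csv_error(studies):
    """Cross-study validation: train on all studies but one, test on the held-out study."""
    errs = []
    for i, held_out in enumerate(studies):
        train = [p for j, s in enumerate(studies) if j != i for p in s]
        errs.append(error_rate(fit_threshold(train), held_out))
    return statistics.fmean(errs)


def rcv_error(studies, k, rng):
    """Randomized k-fold cross-validation on the pooled samples, ignoring study labels."""
    pooled = [p for s in studies for p in s]
    rng.shuffle(pooled)
    folds = [pooled[i::k] for i in range(k)]
    errs = []
    for i, test in enumerate(folds):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        errs.append(error_rate(fit_threshold(train), test))
    return statistics.fmean(errs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Study-specific shifts play the role of heterogeneous subphenotype mixtures.
    shifts = [rng.gauss(0.0, 1.5) for _ in range(8)]
    studies = [make_study(s, 300, rng) for s in shifts]
    for m in (2, 4, 8):
        subset = studies[:m]
        print(m, round(csv_error(subset), 3), round(rcv_error(subset, 5, rng), 3))
```

Running the script prints, for each m, the CSV and RCV error estimates on the first m simulated studies. Because the held-out study in CSV has a shift the classifier never saw, its error estimate tends to sit above the RCV estimate; how the gap behaves as m grows in this toy depends on the random shifts drawn and should not be read as a reproduction of the paper's results.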

Suggested Citation

  • Lo-Bin Chang & Donald Geman, 2015. "Tracking Cross-Validated Estimates of Prediction Error as Studies Accumulate," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(511), pages 1239-1247, September.
  • Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1239-1247
    DOI: 10.1080/01621459.2014.1002926

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1080/01621459.2014.1002926
    Download Restriction: Access to full text is restricted to subscribers.

