
Genomic feature selection by coverage design optimization

Author

Listed:
  • Stephen Reid
  • Aaron M. Newman
  • Maximilian Diehn
  • Ash A. Alizadeh
  • Robert Tibshirani

Abstract

We introduce a novel data reduction technique whereby we select a subset of tiles to ‘cover’ maximally events of interest in large-scale biological datasets (e.g. genetic mutations), while minimizing the number of tiles. A tile is a genomic unit capturing one or more biological events, such as a sequence of base pairs that can be sequenced and observed simultaneously. The goal is to reduce significantly the number of tiles considered to those with areas of dense events in a cohort, thus saving on cost and enhancing interpretability. However, the reduction should not come at the cost of too much information, allowing for sensible statistical analysis after its application. We envisage application of our methods to a variety of high throughput data types, particularly those produced by next-generation sequencing (NGS) experiments. The procedure is cast as a convex optimization problem, which is presented, along with methods of its solution. The method is demonstrated on a large dataset of somatic mutations spanning 5000+ patients, each having one of 29 cancer types. Applied to these data, our method dramatically reduces the number of gene locations required for broad coverage of patients and their mutations, giving subject specialists a more easily interpretable snapshot of recurrent mutational profiles in these cancers. The locations identified coincide with previously identified cancer genes. Finally, despite considerable data reduction, we show that our covering designs preserve the cancer discrimination ability of multinomial logistic regression models trained on all of the locations ($>1M$).
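
The coverage idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' convex formulation; it swaps in a plain greedy set-cover heuristic, which repeatedly picks the tile covering the most not-yet-covered events. The function name `greedy_cover`, the `target_fraction` parameter, and the tile and mutation labels are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def greedy_cover(tiles: Dict[str, Set[str]], target_fraction: float = 1.0) -> List[str]:
    """Greedy set-cover heuristic (an illustrative stand-in, NOT the paper's method).

    tiles maps a hypothetical tile name to the set of events (e.g. mutations)
    that the tile captures. Tiles are added until at least target_fraction of
    all events is covered, or until no tile adds anything new.
    """
    all_events: Set[str] = set().union(*tiles.values())
    needed = target_fraction * len(all_events)
    covered: Set[str] = set()
    chosen: List[str] = []
    while len(covered) < needed:
        # Iterate in sorted order so ties are broken deterministically by name.
        best = max(sorted(tiles), key=lambda t: len(tiles[t] - covered))
        gain = tiles[best] - covered
        if not gain:
            break  # no remaining tile covers a new event
        covered |= gain
        chosen.append(best)
    return chosen


# Toy data: four hypothetical tiles covering five hypothetical mutations.
tiles = {
    "tile_A": {"m1", "m2", "m3"},
    "tile_B": {"m3", "m4"},
    "tile_C": {"m4", "m5"},
    "tile_D": {"m1"},
}
selected = greedy_cover(tiles)
print(selected)
```

On this toy input two of the four tiles suffice to cover every mutation, which is the kind of reduction the abstract describes, although the paper obtains its designs by convex optimization rather than by this heuristic.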

Suggested Citation

  • Stephen Reid & Aaron M. Newman & Maximilian Diehn & Ash A. Alizadeh & Robert Tibshirani, 2018. "Genomic feature selection by coverage design optimization," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(14), pages 2658-2676, October.
  • Handle: RePEc:taf:japsta:v:45:y:2018:i:14:p:2658-2676
    DOI: 10.1080/02664763.2018.1432577

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1080/02664763.2018.1432577
    Download Restriction: Access to full text is restricted to subscribers.

