
Mixture models reveal multiple positional bias types in RNA-Seq data and lead to accurate transcript concentration estimates

Authors

Listed:
  • Andreas Tuerk
  • Gregor Wiktorin
  • Serhat Güler

Abstract

Accuracy of transcript quantification with RNA-Seq is negatively affected by positional fragment bias. This article introduces Mix2 (read “mixquare”), a transcript quantification method which uses a mixture of probability distributions to model and thereby neutralize the effects of positional fragment bias. The parameters of Mix2 are trained by Expectation Maximization, resulting in simultaneous transcript abundance and bias estimates. We compare Mix2 to Cufflinks, RSEM, eXpress and PennSeq, state-of-the-art quantification methods that implement some form of bias correction. On four synthetic biases we show that the accuracy of Mix2 overall exceeds the accuracy of the other methods and that its bias estimates converge to the correct solution. We further evaluate Mix2 on real RNA-Seq data from the Microarray and Sequencing Quality Control (MAQC, SEQC) Consortia. On MAQC data, Mix2 achieves improved correlation to qPCR measurements with a relative increase in R² between 4% and 50%. Mix2 also yields repeatable concentration estimates across technical replicates with a relative increase in R² between 8% and 47% and reduced standard deviation across the full concentration range. We further observe more accurate detection of differential expression with a relative increase in true positives between 74% and 378% for 5% false positives. In addition, Mix2 reveals 5 dominant biases in MAQC data that deviate from the common assumption of a uniform fragment distribution. On SEQC data, Mix2 yields higher consistency between measured and predicted concentration ratios. A relative error of 20% or less is obtained for 51% of transcripts by Mix2, 40% of transcripts by Cufflinks and RSEM and 30% by eXpress. Titration order consistency is correct for 47% of transcripts for Mix2, 41% for Cufflinks and RSEM and 34% for eXpress. We further observe improved repeatability across laboratory sites with a relative increase in R² between 8% and 44% and reduced standard deviation.

Author summary: RNA-Seq is a powerful tool for detecting and quantifying genes and gene isoforms. However, accurate quantification in genomic loci with multiple isoforms has proven difficult, because the transcript generating an RNA-Seq fragment cannot be identified if multiple transcripts share the fragment sequence. Due to this ambiguity, transcript concentration is usually determined in a statistical framework by calculating the probability that a transcript generates an RNA-Seq fragment. Accurate estimation of this probability requires an accurate model of the transcript-specific distributions of RNA-Seq fragments. However, fragment distributions in statistical models of RNA-Seq data are usually oversimplified. This article introduces the Mix2 (read “mixquare”) model, which uses mixtures of probability distributions to model the transcript-specific positional fragment distributions. Mix2 learns the mixture weights and thereby approximates the fragment bias in RNA-Seq data. We compare Mix2 on artificial and real RNA-Seq data to four state-of-the-art quantification methods. Our experiments show that Mix2 yields more accurate and repeatable quantification estimates and that it leads to more accurate detection of differential expression. We further show that the biases detected by Mix2 contradict the common assumption of a uniform fragment distribution.
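
The abstract says that Mix2 models positional fragment bias as a mixture of probability distributions whose weights are learned by Expectation Maximization. The sketch below illustrates that general idea only and is not the authors' implementation. It assumes a single transcript, relative fragment positions along that transcript, and three hypothetical fixed Gaussian components (`COMPONENTS`). It estimates only the mixture weights, whereas Mix2 is described as estimating transcript abundances and bias simultaneously. All function names and parameter values are illustrative assumptions.

```python
import math
import random

# Hypothetical fixed components: (mean, standard deviation) of Gaussian bumps
# over the relative fragment position along a transcript (0 = 5' end, 1 = 3' end).
# Assumption for illustration; the abstract does not specify the component family.
COMPONENTS = [(0.2, 0.1), (0.5, 0.1), (0.8, 0.1)]


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def simulate_positions(true_weights, components, n, seed=0):
    """Draw n positions from the mixture (values are not clipped to [0, 1])."""
    rng = random.Random(seed)
    picks = rng.choices(range(len(components)), weights=true_weights, k=n)
    return [rng.gauss(*components[i]) for i in picks]


def em_mixture_weights(positions, components, n_iter=100):
    """Estimate mixture weights by EM while keeping the components fixed.

    Returns the final weights and the log-likelihood evaluated at the
    weights used in each iteration's E-step.
    """
    k = len(components)
    weights = [1.0 / k] * k  # start from equal weights
    # Component densities do not change, so compute them once.
    dens = [[gaussian_pdf(x, m, s) for (m, s) in components] for x in positions]
    log_liks = []
    for _ in range(n_iter):
        totals = [0.0] * k
        log_lik = 0.0
        # E-step: responsibility of each component for each fragment.
        for row in dens:
            joint = [w * d for w, d in zip(weights, row)]
            norm = sum(joint)
            log_lik += math.log(norm)
            for j in range(k):
                totals[j] += joint[j] / norm
        log_liks.append(log_lik)
        # M-step: new weight = average responsibility.
        weights = [t / len(positions) for t in totals]
    return weights, log_liks


true_weights = [0.6, 0.1, 0.3]
positions = simulate_positions(true_weights, COMPONENTS, n=5000)
estimated, log_liks = em_mixture_weights(positions, COMPONENTS)
print([round(w, 3) for w in estimated])
```

In this toy setting the estimated weights should approach the simulated ones, and the log-likelihood should not decrease from one iteration to the next, which is the standard monotonicity property of EM. A uniform fragment distribution, the common assumption that the abstract says Mix2 contradicts, would correspond to a single flat component in place of the bumps used here.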

Suggested Citation

  • Andreas Tuerk & Gregor Wiktorin & Serhat Güler, 2017. "Mixture models reveal multiple positional bias types in RNA-Seq data and lead to accurate transcript concentration estimates," PLOS Computational Biology, Public Library of Science, vol. 13(5), pages 1-25, May.
  • Handle: RePEc:plo:pcbi00:1005515
    DOI: 10.1371/journal.pcbi.1005515

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005515
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005515&type=printable
    Download Restriction: no

