
Improving population scale statistical phasing with whole-genome sequencing data

Author

Listed:
  • Rick Wertenbroek
  • Robin J Hofmeister
  • Ioannis Xenarios
  • Yann Thoma
  • Olivier Delaneau

Abstract

Haplotype estimation, or phasing, has gained significant traction in large-scale projects because of its contributions to population genetics, variant analysis, and the creation of reference panels for imputation and phasing of new samples. To scale with the growing number of samples, haplotype estimation methods designed for population scale rely on highly optimized statistical models to phase genotype data and usually ignore read-level information. Statistical methods excel at resolving common variants; however, they still struggle with rare variants because of the lack of statistical information. In this study we introduce SAPPHIRE, a new method that leverages whole-genome sequencing data to enhance the precision of haplotype calls produced by statistical phasing. SAPPHIRE achieves this by refining haplotype estimates through the realignment of sequencing reads, particularly targeting low-confidence phase calls. Our findings demonstrate that SAPPHIRE significantly enhances the accuracy of haplotypes obtained from state-of-the-art methods and also provides the subset of phase calls that are validated by sequencing reads. Finally, we show that our method scales to large data sets by applying it successfully to the 3.6 petabytes of sequencing data in the latest UK Biobank release of 200,031 samples.

Author summary: Haplotype estimation, also known as phasing, is now applied to population-scale projects, typically comprising hundreds of thousands to millions of samples. Phasing generally relies on statistical methods because they provide very accurate results for common variants. However, for rare and very rare variants, the lack of statistical power often results in poor phasing. The large number of rare variants discovered with whole-genome sequencing, together with the number of samples, makes the sequencing data expensive to process. We have developed the SAPPHIRE method, which leverages whole-genome sequencing data to verify and correct the phase at poorly phased variant loci. It does so by finding sequencing reads that contain both the poorly phased variant and an accurately phased common variant. SAPPHIRE scales to large data sets by specifically targeting variants where statistical phasing performed poorly; it therefore reduces the quantity of sequencing data to be processed and combines the advantages of both read-based and statistical approaches. We show the efficiency of SAPPHIRE by improving the estimated haplotypes for 200,031 samples in the UK Biobank. SAPPHIRE is free and available as open-source software.
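
The read-based check described in the author summary can be illustrated with a toy sketch. This is not SAPPHIRE's implementation: the `HetSite` type, the `read_votes` and `rephase` functions, the dictionary representation of reads, and the simple majority vote are all assumptions made for illustration, and the sketch handles only biallelic heterozygous sites. It shows the general idea of using reads that span both a poorly phased variant and a confidently phased anchor variant to confirm or flip a phase call.

```python
from dataclasses import dataclass


@dataclass
class HetSite:
    """A heterozygous biallelic site (hypothetical type for this sketch)."""
    pos: int          # genomic position of the variant
    hap0_allele: int  # allele (0 = ref, 1 = alt) currently placed on haplotype 0
    confident: bool   # True if the phase call is trusted


def read_votes(rare, anchor, reads):
    """Count reads that support or contradict the phase of `rare` relative
    to `anchor`. Each read is a dict mapping position -> observed allele."""
    support = contradict = 0
    for read in reads:
        if rare.pos not in read or anchor.pos not in read:
            continue  # the read does not span both sites, so it is uninformative
        # Which haplotype does the read match at each site under the current phase?
        hap_at_rare = 0 if read[rare.pos] == rare.hap0_allele else 1
        hap_at_anchor = 0 if read[anchor.pos] == anchor.hap0_allele else 1
        if hap_at_rare == hap_at_anchor:
            support += 1
        else:
            contradict += 1
    return support, contradict


def rephase(rare, anchor, reads):
    """Return (site, validated). The phase is flipped when more spanning reads
    contradict it than support it; ties and missing evidence leave the
    statistical call untouched and unvalidated."""
    support, contradict = read_votes(rare, anchor, reads)
    if support == contradict:
        return rare, False
    if contradict > support:
        # Biallelic assumption: flipping the phase swaps allele 0 and allele 1.
        return HetSite(rare.pos, 1 - rare.hap0_allele, True), True
    return HetSite(rare.pos, rare.hap0_allele, True), True


# Toy example: a confidently phased common variant and a poorly phased rare one.
anchor = HetSite(pos=100, hap0_allele=0, confident=True)
rare = HetSite(pos=150, hap0_allele=1, confident=False)
reads = [{100: 1, 150: 1}, {100: 1, 150: 1}, {100: 0, 150: 0}, {100: 1}]
fixed, validated = rephase(rare, anchor, reads)
```

In this toy input, three reads span both sites and all three place the alternate alleles on the same haplotype, which contradicts the statistical call, so the rare variant's phase is flipped and marked as validated. The fourth read covers only the anchor and is ignored.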

Suggested Citation

  • Rick Wertenbroek & Robin J Hofmeister & Ioannis Xenarios & Yann Thoma & Olivier Delaneau, 2024. "Improving population scale statistical phasing with whole-genome sequencing data," PLOS Genetics, Public Library of Science, vol. 20(7), pages 1-22, July.
  • Handle: RePEc:plo:pgen00:1011092
    DOI: 10.1371/journal.pgen.1011092

    Download full text from publisher

    File URL: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1011092
    Download Restriction: no

    File URL: https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1011092&type=printable
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Shiyu Zhang & Zheng Wang & Yijing Wang & Yixiao Zhu & Qiao Zhou & Xingxing Jian & Guihu Zhao & Jian Qiu & Kun Xia & Beisha Tang & Julian Mutz & Jinchen Li & Bin Li, 2024. "A metabolomic profile of biological aging in 250,341 individuals from the UK Biobank," Nature Communications, Nature, vol. 15(1), pages 1-19, December.
    2. Aimee M. Deaton & Aditi Dubey & Lucas D. Ward & Peter Dornbos & Jason Flannick & Elaine Yee & Simina Ticau & Leila Noetzli & Margaret M. Parker & Rachel A. Hoffing & Carissa Willis & Mollie E. Plekan, 2022. "Rare loss of function variants in the hepatokine gene INHBE protect from abdominal obesity," Nature Communications, Nature, vol. 13(1), pages 1-12, December.
    3. Katherine A. Kentistou & Brandon E. M. Lim & Lena R. Kaisinger & Valgerdur Steinthorsdottir & Luke N. Sharp & Kashyap A. Patel & Vinicius Tragante & Gareth Hawkes & Eugene J. Gardner & Thorhildur Olaf, 2025. "Rare variant associations with birth weight identify genes involved in adipose tissue regulation, placental function and insulin-like growth factor signalling," Nature Communications, Nature, vol. 16(1), pages 1-12, December.
    4. Saedis Saevarsdottir & Kristbjörg Bjarnadottir & Thorsteinn Markusson & Jonas Berglund & Thorunn A. Olafsdottir & Gisli H. Halldorsson & Gudrun Rutsdottir & Kristbjorg Gunnarsdottir & Asgeir Orn Arnth, 2024. "Start codon variant in LAG3 is associated with decreased LAG-3 expression and increased risk of autoimmune thyroid disease," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    5. Alexander L. Han & Chloe F. Sands & Dorota Matelska & Jessica C. Butts & Vida Ravanmehr & Fengyuan Hu & Esmeralda Villavicencio Gonzalez & Nicholas Katsanis & Carlos D. Bustamante & Quanli Wang & Slav, 2025. "Diverse ancestral representation improves genetic intolerance metrics," Nature Communications, Nature, vol. 16(1), pages 1-9, December.
    6. Scott D. Findlay & Lindsay Romo & Christopher B. Burge, 2024. "Quantifying negative selection in human 3ʹ UTRs uncovers constrained targets of RNA-binding proteins," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    7. Margaret Sunitha Selvaraj & Xihao Li & Zilin Li & Akhil Pampana & David Y. Zhang & Joseph Park & Stella Aslibekyan & Joshua C. Bis & Jennifer A. Brody & Brian E. Cade & Lee-Ming Chuang & Ren-Hua Chung, 2022. "Whole genome sequence analysis of blood lipid levels in >66,000 individuals," Nature Communications, Nature, vol. 13(1), pages 1-18, December.
    8. De-Min Duan & Chinyi Cheng & Yu-Shu Huang & An-ko Chung & Pin-Xuan Chen & Yu-An Chen & Jacob Shujui Hsu & Pei-Lung Chen, 2025. "Comparisons of performances of structural variants detection algorithms in solitary or combination strategy," PLOS ONE, Public Library of Science, vol. 20(2), pages 1-25, February.
    9. Andrea B. Jonsdottir & Gardar Sveinbjornsson & Rosa B. Thorolfsdottir & Max Tamlander & Vinicius Tragante & Thorhildur Olafsdottir & Solvi Rognvaldsson & Asgeir Sigurdsson & Hannes P. Eggertsson & Hil, 2025. "Missense variants in FRS3 affect body mass index in populations of diverse ancestries," Nature Communications, Nature, vol. 16(1), pages 1-16, December.
    10. Gareth Hawkes & Robin N. Beaumont & Zilin Li & Ravi Mandla & Xihao Li & Christine M. Albert & Donna K. Arnett & Allison E. Ashley-Koch & Aneel A. Ashrani & Kathleen C. Barnes & Eric Boerwinkle & Jenni, 2024. "Whole-genome sequencing in 333,100 individuals reveals rare non-coding single variant and aggregate associations with height," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
    11. Benjamin M. Jacobs & Daniel Stow & Sam Hodgson & Julia Zöllner & Miriam Samuel & Stavroula Kanoni & Saeed Bidi & Klaudia Walter & Claudia Langenberg & Ruth Dobson & Sarah Finer & Caroline Morton & Mon, 2024. "Genetic architecture of routinely acquired blood tests in a British South Asian cohort," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    12. Gudmundur Einarsson & Gudmar Thorleifsson & Valgerdur Steinthorsdottir & Florian Zink & Hannes Helgason & Thorhildur Olafsdottir & Solvi Rognvaldsson & Vinicius Tragante & Magnus O. Ulfarsson & Gardar, 2024. "Sequence variants associated with BMI affect disease risk through BMI itself," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    13. Alexander T. Williams & Jing Chen & Kayesha Coley & Chiara Batini & Abril Izquierdo & Richard Packer & Erik Abner & Stavroula Kanoni & David J. Shepherd & Robert C. Free & Edward J. Hollox & Nigel J., 2023. "Genome-wide association study of thyroid-stimulating hormone highlights new genes, pathways and associations with thyroid disease," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
