
Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes

Author

Listed:
  • Jerome Kelleher
  • Alison M Etheridge
  • Gilean McVean

Abstract

A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.

Author Summary

Our understanding of the distribution of genetic variation in natural populations has been driven by mathematical models of the underlying biological and demographic processes. A key strength of such coalescent models is that they enable efficient simulation of data we might see under a variety of evolutionary scenarios. However, current methods are not well suited to simulating genome-scale data sets on hundreds of thousands of samples, which is essential if we are to understand the data generated by population-scale sequencing projects. Similarly, processing the results of large simulations also presents researchers with a major challenge, as it can take many days just to read the data files. In this paper we solve these problems by introducing a new way to represent information about the ancestral process. This new representation leads to huge gains in simulation speed and storage efficiency so that large simulations complete in minutes and the output files can be processed in seconds.
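To make the idea of coalescence records and shared structure in correlated trees concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the field names, the toy records, and the helper `parent_arrays` are all hypothetical, and the tree reconstruction is a deliberately naive scan rather than the efficient algorithm the paper describes. It assumes a record states that a set of children coalesce into a parent node at some time over a genomic interval.

```python
from typing import Iterator, List, NamedTuple, Tuple


class CoalescenceRecord(NamedTuple):
    """Hypothetical record layout, chosen for illustration."""
    left: int                  # start of the genomic interval (inclusive)
    right: int                 # end of the genomic interval (exclusive)
    node: int                  # parent node created by the coalescence
    children: Tuple[int, ...]  # nodes that coalesce into `node`
    time: float                # time of the coalescence event


# Toy data: samples 0, 1, 2 on the interval [0, 10).
# The first record spans both trees, so it is stored only once.
RECORDS = [
    CoalescenceRecord(0, 10, 3, (0, 1), 0.5),
    CoalescenceRecord(0, 5, 4, (2, 3), 1.0),
    CoalescenceRecord(5, 10, 5, (2, 3), 1.4),
]


def parent_arrays(
    records: List[CoalescenceRecord], num_nodes: int
) -> Iterator[Tuple[int, int, List[int]]]:
    """Yield (left, right, parent) for each distinct tree along the sequence.

    parent[u] is the parent of node u in that tree, or -1 if u has no
    parent there. Naive approach: rescan every record for every interval.
    """
    breakpoints = sorted({r.left for r in records} | {r.right for r in records})
    for left, right in zip(breakpoints, breakpoints[1:]):
        parent = [-1] * num_nodes
        for rec in records:
            # A record applies if it covers the whole interval.
            if rec.left <= left and right <= rec.right:
                for child in rec.children:
                    parent[child] = rec.node
        yield left, right, parent


if __name__ == "__main__":
    for left, right, parent in parent_arrays(RECORDS, num_nodes=6):
        print(f"[{left}, {right}): parent = {parent}")
```

In this toy case three records describe two trees, because the coalescence of samples 0 and 1 is common to both intervals. Storing each tree separately would repeat that information once per tree.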

Suggested Citation

  • Jerome Kelleher & Alison M Etheridge & Gilean McVean, 2016. "Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes," PLOS Computational Biology, Public Library of Science, vol. 12(5), pages 1-22, May.
  • Handle: RePEc:plo:pcbi00:1004842
    DOI: 10.1371/journal.pcbi.1004842

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004842
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1004842&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Kirk E. Lohmueller & Amit R. Indap & Steffen Schmidt & Adam R. Boyko & Ryan D. Hernandez & Melissa J. Hubisz & John J. Sninsky & Thomas J. White & Shamil R. Sunyaev & Rasmus Nielsen & Andrew G. Clark, 2008. "Proportionally more deleterious genetic variation in European than in African populations," Nature, Nature, vol. 451(7181), pages 994-997, February.
    2. Heng Li & Richard Durbin, 2011. "Inference of human population history from individual whole-genome sequences," Nature, Nature, vol. 475(7357), pages 493-496, July.
    3. Eriksson, A. & Mahjani, B. & Mehlig, B., 2009. "Sequential Markov coalescent algorithms for population models with demographic structure," Theoretical Population Biology, Elsevier, vol. 76(2), pages 84-91.
    4. John Novembre & Toby Johnson & Katarzyna Bryc & Zoltán Kutalik & Adam R. Boyko & Adam Auton & Amit Indap & Karen S. King & Sven Bergmann & Matthew R. Nelson & Matthew Stephens & Carlos D. Bustamante, 2008. "Genes mirror geography within Europe," Nature, Nature, vol. 456(7219), pages 274-274, November.
    5. John Novembre & Toby Johnson & Katarzyna Bryc & Zoltán Kutalik & Adam R. Boyko & Adam Auton & Amit Indap & Karen S. King & Sven Bergmann & Matthew R. Nelson & Matthew Stephens & Carlos D. Bustamante, 2008. "Genes mirror geography within Europe," Nature, Nature, vol. 456(7218), pages 98-101, November.
    6. Michael Eisenstein, 2015. "Big data: The power of petabytes," Nature, Nature, vol. 527(7576), pages 2-4, November.
    7. Haipeng Li & Thomas Wiehe, 2013. "Coalescent Tree Imbalance and a Simple Test for Selective Sweeps Based on Microsatellite Variation," PLOS Computational Biology, Public Library of Science, vol. 9(5), pages 1-14, May.
    8. Matthew D Rasmussen & Melissa J Hubisz & Ilan Gronau & Adam Siepel, 2014. "Genome-Wide Inference of Ancestral Recombination Graphs," PLOS Genetics, Public Library of Science, vol. 10(5), pages 1-27, May.
    9. Barton, N.H. & Etheridge, A.M. & Kelleher, J. & Véber, A., 2013. "Inference in two dimensions: Allele frequencies versus lengths of shared sequence blocks," Theoretical Population Biology, Elsevier, vol. 87(C), pages 105-119.
    10. Daniel John Lawson & Garrett Hellenthal & Simon Myers & Daniel Falush, 2012. "Inference of Population Structure using Dense Haplotype Data," PLOS Genetics, Public Library of Science, vol. 8(1), pages 1-16, January.
    11. Kelleher, J. & Etheridge, A.M. & Barton, N.H., 2014. "Coalescent simulation in continuous space: Algorithms for large neighbourhood size," Theoretical Population Biology, Elsevier, vol. 95(C), pages 13-23.

    Citations


    Cited by:

    1. Sam Tallman & Maria das Dores Sungo & Sílvio Saranga & Sandra Beleza, 2023. "Whole genomes from Angola and Mozambique inform about the origins and dispersals of major African migrations," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    2. Victoria L. Sork & Shawn J. Cokus & Sorel T. Fitz-Gibbon & Aleksey V. Zimin & Daniela Puiu & Jesse A. Garcia & Paul F. Gugger & Claudia L. Henriquez & Ying Zhen & Kirk E. Lohmueller & Matteo Pellegrin, 2022. "High-quality genome and methylomes illustrate features underlying evolutionary success of oaks," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    3. Michael DeGiorgio & Zachary A Szpiech, 2022. "A spatially aware likelihood test to detect sweeps from haplotype distributions," PLOS Genetics, Public Library of Science, vol. 18(4), pages 1-37, April.
    4. Ali Mahmoudi & Jere Koskela & Jerome Kelleher & Yao-ban Chan & David Balding, 2022. "Bayesian inference of ancestral recombination graphs," PLOS Computational Biology, Public Library of Science, vol. 18(3), pages 1-15, March.
    5. Parul Johri & Wolfgang Stephan & Jeffrey D Jensen, 2022. "Soft selective sweeps: Addressing new definitions, evaluating competing models, and interpreting empirical outliers," PLOS Genetics, Public Library of Science, vol. 18(2), pages 1-12, February.
    6. Sergio F. Nigenda-Morales & Meixi Lin & Paulina G. Nuñez-Valencia & Christopher C. Kyriazis & Annabel C. Beichman & Jacqueline A. Robinson & Aaron P. Ragsdale & Jorge Urbán R. & Frederick I. Archer & , 2023. "The genomic footprint of whaling and isolation in fin whale populations," Nature Communications, Nature, vol. 14(1), pages 1-18, December.
    7. Brieuc Lehmann & Maxine Mackintosh & Gil McVean & Chris Holmes, 2023. "Optimal strategies for learning multi-ancestry polygenic scores vary across traits," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    8. Ralph, Peter L., 2019. "An empirical approach to demographic inference with genomic data," Theoretical Population Biology, Elsevier, vol. 127(C), pages 91-101.
    9. Kerdoncuff, Elise & Lambert, Amaury & Achaz, Guillaume, 2020. "Testing for population decline using maximal linkage disequilibrium blocks," Theoretical Population Biology, Elsevier, vol. 134(C), pages 171-181.
    10. Jerome Kelleher & Kevin R Thornton & Jaime Ashander & Peter L Ralph, 2018. "Efficient pedigree recording for fast population genetics simulation," PLOS Computational Biology, Public Library of Science, vol. 14(11), pages 1-21, November.
    11. Max Lundberg & Alexander Mackintosh & Anna Petri & Staffan Bensch, 2023. "Inversions maintain differences between migratory phenotypes of a songbird," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    12. Deng, Yun & Song, Yun S. & Nielsen, Rasmus, 2021. "The distribution of waiting distances in ancestral recombination graphs," Theoretical Population Biology, Elsevier, vol. 141(C), pages 34-43.
    13. Zihao Wang & Wenxi Wang & Xiaoming Xie & Yongfa Wang & Zhengzhao Yang & Huiru Peng & Mingming Xin & Yingyin Yao & Zhaorong Hu & Jie Liu & Zhenqi Su & Chaojie Xie & Baoyun Li & Zhongfu Ni & Qixin Sun &, 2022. "Dispersed emergence and protracted domestication of polyploid wheat uncovered by mosaic ancestral haploblock inference," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    14. Simone Rubinacci & Olivier Delaneau & Jonathan Marchini, 2020. "Genotype imputation using the Positional Burrows Wheeler Transform," PLOS Genetics, Public Library of Science, vol. 16(11), pages 1-19, November.
    15. Andrea Fulgione & Célia Neto & Ahmed F. Elfarargi & Emmanuel Tergemina & Shifa Ansari & Mehmet Göktay & Herculano Dinis & Nina Döring & Pádraic J. Flood & Sofia Rodriguez-Pacheco & Nora Walden & Marcu, 2022. "Parallel reduction in flowering time from de novo mutations enable evolutionary rescue in colonizing lineages," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    16. Vasili Pankratov & Milyausha Yunusbaeva & Sergei Ryakhovsky & Maksym Zarodniuk & Bayazit Yunusbayev, 2022. "Prioritizing autoimmunity risk variants for functional analyses by fine-mapping mutations under natural selection," Nature Communications, Nature, vol. 13(1), pages 1-13, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mateus H. Gouveia & Amy R. Bentley & Thiago P. Leal & Eduardo Tarazona-Santos & Carlos D. Bustamante & Adebowale A. Adeyemo & Charles N. Rotimi & Daniel Shriner, 2023. "Unappreciated subcontinental admixture in Europeans and European Americans and implications for genetic epidemiology studies," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    2. Oscar Lao & Fan Liu & Andreas Wollstein & Manfred Kayser, 2014. "GAGA: A New Algorithm for Genomic Inference of Geographic Ancestry Reveals Fine Level Population Substructure in Europeans," PLOS Computational Biology, Public Library of Science, vol. 10(2), pages 1-11, February.
    3. Guindon, Stéphane & Guo, Hongbin & Welch, David, 2016. "Demographic inference under the coalescent in a spatial continuum," Theoretical Population Biology, Elsevier, vol. 111(C), pages 43-50.
    4. Gideon S Bradburd & Peter L Ralph & Graham M Coop, 2016. "A Spatial Framework for Understanding Population Structure and Admixture," PLOS Genetics, Public Library of Science, vol. 12(1), pages 1-38, January.
    5. Marco Lopez-Cruz & Fernando M. Aguate & Jacob D. Washburn & Natalia Leon & Shawn M. Kaeppler & Dayane Cristina Lima & Ruijuan Tan & Addie Thompson & Laurence Willard Bretonne & Gustavo los Campos, 2023. "Leveraging data from the Genomes-to-Fields Initiative to investigate genotype-by-environment interactions in maize in North America," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    6. Beatrix Eugster & Rafael Lalive & Andreas Steinhauer & Josef Zweimüller, 2011. "The Demand for Social Insurance: Does Culture Matter?," Economic Journal, Royal Economic Society, vol. 121(556), pages 413-448, November.
    7. Filippini, Massimo & Wekhof, Tobias, 2021. "The effect of culture on energy efficient vehicle ownership," Journal of Environmental Economics and Management, Elsevier, vol. 105(C).
    8. Steinrücken, Matthias & Paul, Joshua S. & Song, Yun S., 2013. "A sequentially Markov conditional sampling distribution for structured populations with migration and recombination," Theoretical Population Biology, Elsevier, vol. 87(C), pages 51-61.
    9. Andrey V Khrunin & Denis V Khokhrin & Irina N Filippova & Tõnu Esko & Mari Nelis & Natalia A Bebyakova & Natalia L Bolotova & Janis Klovins & Liene Nikitina-Zake & Karola Rehnström & Samuli Ripatti & , 2013. "A Genome-Wide Analysis of Populations from European Russia Reveals a New Pole of Genetic Diversity in Northern Europe," PLOS ONE, Public Library of Science, vol. 8(3), pages 1-9, March.
    10. Wenhan Chen & Yang Wu & Zhili Zheng & Ting Qi & Peter M. Visscher & Zhihong Zhu & Jian Yang, 2021. "Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors," Nature Communications, Nature, vol. 12(1), pages 1-10, December.
    11. Pierre Luisi & Angelina García & Juan Manuel Berros & Josefina M B Motti & Darío A Demarchi & Emma Alfaro & Eliana Aquilano & Carina Argüelles & Sergio Avena & Graciela Bailliet & Julieta Beltramo & C, 2020. "Fine-scale genomic analyses of admixed individuals reveal unrecognized genetic ancestry components in Argentina," PLOS ONE, Public Library of Science, vol. 15(7), pages 1-30, July.
    12. Brielin C Brown & Nicolas L Bray & Lior Pachter, 2018. "Expression reflects population structure," PLOS Genetics, Public Library of Science, vol. 14(12), pages 1-15, December.
    13. Gad Abraham & Michael Inouye, 2014. "Fast Principal Component Analysis of Large-Scale Genome-Wide Data," PLOS ONE, Public Library of Science, vol. 9(4), pages 1-5, April.
    14. Beatrix Brügger & Rafael Lalive & Josef Zweimüller, 2009. "Does Culture Affect Unemployment? Evidence from the Röstigraben," NRN working papers 2009-10, The Austrian Center for Labor Economics and the Analysis of the Welfare State, Johannes Kepler University Linz, Austria.
    15. Diana Chang & Alon Keinan, 2014. "Principal Component Analysis Characterizes Shared Pathogenetics from Genome-Wide Association Studies," PLOS Computational Biology, Public Library of Science, vol. 10(9), pages 1-14, September.
    16. Alejandro Ochoa & John D Storey, 2021. "Estimating FST and kinship for arbitrary population structures," PLOS Genetics, Public Library of Science, vol. 17(1), pages 1-36, January.
    17. Victor Ronda & Esben Agerbo & Dorthe Bleses & Preben Bo Mortensen & Anders Børglum & Ole Mors & Michael Rosholm & David M. Hougaard & Merete Nordentoft & Thomas Werge, 2022. "Family disadvantage, gender, and the returns to genetic human capital," Scandinavian Journal of Economics, Wiley Blackwell, vol. 124(2), pages 550-578, April.
    18. Feldman, Michael J., 2023. "Spiked singular values and vectors under extreme aspect ratios," Journal of Multivariate Analysis, Elsevier, vol. 196(C).
    19. Nicola Barban & Elisabetta De Cao & Sonia Oreffice & Climent Quintana-Domeque, 2016. "Assortative Mating on Education: A Genetic Assessment," Working Papers 2016-034, Human Capital and Economic Opportunity Working Group.
    20. Bryc, Katarzyna & Bryc, Wlodek & Silverstein, Jack W., 2013. "Separation of the largest eigenvalues in eigenanalysis of genotype data from discrete subpopulations," Theoretical Population Biology, Elsevier, vol. 89(C), pages 34-43.
