Printed from https://ideas.repec.org/a/eee/thpobi/v163y2025icp62-79.html

A matrix-analytical sampling formula for time-homogeneous coalescent processes under the infinite sites mutation model

Author

Listed:
  • Hobolth, Asger
  • Boitard, Simon
  • Futschik, Andreas
  • Leblois, Raphael

Abstract

In this paper we develop a general framework for calculating the probability of a genetic sample under a time-homogeneous coalescent process and the infinite sites mutation model. The evolutionary model that we consider can be characterized as a two-step procedure: a coalescent process that describes the ancestral relatedness of the samples, followed by a sprinkling of mutations at separate sites on the ancestral tree according to a Poisson process. The coalescent process is defined using multivariate phase-type theory. The requirements are a rate matrix that determines the transition rates between the ancestral states, an initial state probability vector, and a reward matrix that describes the characteristics of the ancestral states. For example, the reward matrix could contain the number of singleton, doubleton or higher-order lineages in each ancestral state. We analyze the probability generating function for the evolutionary model as a function of the initial state probability vector, the transition rate matrix, the reward matrix, and the mutation rate. The matrix-analytical expression of the probability generating function allows us to develop a general method for calculating the probability of a population genetic data set. We demonstrate that the method is computationally attractive for a small number of mutations and provide a simple and easy-to-implement algorithm for determining the probability of a sample from the evolutionary model. The method is computationally stable and only involves a single matrix inversion, matrix multiplications and matrix additions. We provide a comprehensive understanding of the procedure through detailed calculations and discussions of several elementary examples. These examples include different sample representations (labeled samples and the site frequency spectrum) and different demographic and genetic models (the structured coalescent and the Beta-coalescent).
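The matrix-analytic idea can be illustrated in a much simpler setting than the paper's multivariate sampling formula. The following is a minimal sketch, not the authors' algorithm: it computes the distribution of the total number of mutations (segregating sites) for the Kingman coalescent, assuming mutations fall at rate θ/2 per unit of total branch length. It relies on the standard result that a Poisson number of mutations on a phase-type distributed branch length follows a discrete phase-type distribution. The function names and the state ordering (n, n-1, ..., 2 lineages) are assumptions made for this illustration. As in the abstract, only one matrix inversion, matrix multiplications and additions are needed.

```python
import numpy as np


def kingman_total_length_subintensity(n):
    """Sub-intensity matrix for the total branch length of n samples.

    States are k = n, n-1, ..., 2 lineages. The coalescence rate in state k
    is k(k-1)/2 and the reward (number of branches) is k, so the
    reward-transformed rate is (k-1)/2.
    """
    ks = np.arange(n, 1, -1)
    rates = (ks - 1) / 2.0
    S = -np.diag(rates)
    for i in range(len(ks) - 1):
        S[i, i + 1] = rates[i]  # move from k lineages to k-1 lineages
    return S


def mutation_count_pmf(alpha, S, theta, m_max):
    """P(M = m) for m = 0..m_max, where M is the total number of mutations.

    Assumes mutations occur at rate theta/2 along a total branch length
    that is phase-type distributed with initial vector alpha and
    sub-intensity matrix S.
    """
    lam = theta / 2.0
    d = S.shape[0]
    P = np.linalg.inv(np.eye(d) - S / lam)  # the single matrix inversion
    e = np.ones(d)
    exit_vec = e - P @ e
    row = np.asarray(alpha, dtype=float)
    probs = []
    for _ in range(m_max + 1):
        probs.append(float(row @ exit_vec))  # alpha P^m (e - P e)
        row = row @ P
    return np.array(probs)


# Example: n = 4 samples, theta = 1, starting with 4 lineages.
S4 = kingman_total_length_subintensity(4)
pmf = mutation_count_pmf([1.0, 0.0, 0.0], S4, theta=1.0, m_max=5)
print(pmf)
```

For n = 2 the sketch reduces to the familiar geometric distribution with success probability 1/(1+θ), which gives a quick sanity check.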
We apply the sampling formula to calculate probabilities of spectra for the Kingman coalescent and the Beta-coalescent. Even for a small number of samples and mutations we find that the probabilities of spectra vary by many orders of magnitude. We compare the probabilities of the spectra to the values of Tajima’s D-statistic, and find that the D-statistic is a poor predictor of the probability of a spectrum. Finally, we investigate how the probabilities of the spectra vary with the parametrization of the Beta-coalescent.
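For readers unfamiliar with the statistic mentioned above, the following minimal sketch computes Tajima's D from an unfolded site frequency spectrum using the standard formula from Tajima (1989). The function name and the input convention (entry i-1 holds the number of sites where the mutation is carried by i of the n samples) are assumptions made for this illustration and are not taken from the paper. The statistic depends on the spectrum only through the number of segregating sites and the mean pairwise difference, so distinct spectra can share the same value of D.

```python
import math


def tajimas_d(sfs):
    """Tajima's D from an unfolded site frequency spectrum.

    sfs[i-1] is the number of segregating sites at which the mutation is
    carried by i of the n samples, for i = 1, ..., n-1.
    """
    n = len(sfs) + 1
    if n < 4:
        # For n <= 3 the numerator and the variance term are both zero.
        raise ValueError("Tajima's D needs at least 4 samples")
    s = sum(sfs)
    if s == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")

    # Mean number of pairwise differences.
    pairs = n * (n - 1) / 2
    pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / pairs

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Two spectra with n = 4 samples and 3 segregating sites each.
print(tajimas_d([3, 0, 0]))  # only singletons: negative D
print(tajimas_d([0, 3, 0]))  # only doubletons: positive D
```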

Suggested Citation

  • Hobolth, Asger & Boitard, Simon & Futschik, Andreas & Leblois, Raphael, 2025. "A matrix-analytical sampling formula for time-homogeneous coalescent processes under the infinite sites mutation model," Theoretical Population Biology, Elsevier, vol. 163(C), pages 62-79.
  • Handle: RePEc:eee:thpobi:v:163:y:2025:i:c:p:62-79
    DOI: 10.1016/j.tpb.2025.03.002

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S004058092500019X
    Download Restriction: Full text for ScienceDirect subscribers only


    References listed on IDEAS

    1. Hobolth, Asger & Siri-Jégousse, Arno & Bladt, Mogens, 2019. "Phase-type distributions in population genetics," Theoretical Population Biology, Elsevier, vol. 127(C), pages 16-32.
    2. Birkner, Matthias & Blath, Jochen & Steinrücken, Matthias, 2011. "Importance sampling for Lambda-coalescents in the infinitely many sites model," Theoretical Population Biology, Elsevier, vol. 79(4), pages 155-173.
    3. Kulkarni, V. G., 1989. "A New Class of Multivariate Phase Type Distributions," Operations Research, INFORMS, vol. 37(1), pages 151-158, February.
    4. Kumagai, Seiji & Uyenoyama, Marcy K., 2015. "Genealogical histories in structured populations," Theoretical Population Biology, Elsevier, vol. 102(C), pages 3-15.
    5. Hobolth, Asger & Rivas-González, Iker & Bladt, Mogens & Futschik, Andreas, 2024. "Phase-type distributions in mathematical population genetics: An emerging framework," Theoretical Population Biology, Elsevier, vol. 157(C), pages 14-32.
    6. repec:plo:pgen00:1003905 is not listed on IDEAS
    7. Hobolth, Asger & Uyenoyama, Marcy K. & Wiuf, Carsten, 2008. "Importance Sampling for the Infinite Sites Model," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 7(1), pages 1-26, October.
    8. Uyenoyama, Marcy K. & Takebayashi, Naoki & Kumagai, Seiji, 2019. "Inductive determination of allele frequency spectrum probabilities in structured populations," Theoretical Population Biology, Elsevier, vol. 129(C), pages 148-159.
    9. Griffiths, Robert C. & Tavaré, Simon, 2018. "Ancestral inference from haplotypes and mutations," Theoretical Population Biology, Elsevier, vol. 122(C), pages 12-21.
    10. Bisschop, Gertjan, 2022. "Graph-based algorithms for Laplace transformed coalescence time distributions," PLOS Computational Biology, Public Library of Science, vol. 18(9), pages 1-13, September.
    11. Chen, Hua, 2012. "The joint allele frequency spectrum of multiple populations: A coalescent theory approach," Theoretical Population Biology, Elsevier, vol. 81(2), pages 179-195.
    12. repec:plo:pgen00:1000695 is not listed on IDEAS
    13. Ganapathy, Ganeshkumar & Uyenoyama, Marcy K., 2009. "Site frequency spectra from genomic SNP surveys," Theoretical Population Biology, Elsevier, vol. 75(4), pages 346-354.
    14. Hankin, Robin K. S. & West, Luke J., 2007. "Set Partitions in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 23(c02).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Hobolth, Asger & Rivas-González, Iker & Bladt, Mogens & Futschik, Andreas, 2024. "Phase-type distributions in mathematical population genetics: An emerging framework," Theoretical Population Biology, Elsevier, vol. 157(C), pages 14-32.
    2. Arredondo, Armando & Corujo, Josué & Noûs, Camille & Boitard, Simon & Chikhi, Lounès & Mazet, Olivier, 2025. "Exact calculation of the expected SFS in structured populations," Theoretical Population Biology, Elsevier, vol. 163(C), pages 50-61.
    3. Costa, Rui J. & Wilkinson-Herbots, Hilde M., 2021. "Inference of gene flow in the process of speciation: Efficient maximum-likelihood implementation of a generalised isolation-with-migration model," Theoretical Population Biology, Elsevier, vol. 140(C), pages 1-15.
    4. Uyenoyama, Marcy K. & Takebayashi, Naoki & Kumagai, Seiji, 2020. "Allele frequency spectra in structured populations: Novel-allele probabilities under the labelled coalescent," Theoretical Population Biology, Elsevier, vol. 133(C), pages 130-140.
    5. Attias, Laurent & Siess, Vincent & Labbé, Stéphane, 2025. "An agile modeling framework for population dynamics," Mathematics and Computers in Simulation (MATCOM), Elsevier, vol. 234(C), pages 113-134.
    6. Uyenoyama, Marcy K., 2024. "Joint identity among loci under mutation and regular inbreeding," Theoretical Population Biology, Elsevier, vol. 159(C), pages 74-90.
    7. Nielsen, Bo Friis, 2022. "Characterisation of multivariate phase type distributions," Queueing Systems: Theory and Applications, Springer, vol. 100(3), pages 229-231, April.
    8. Li, Haijun, 2003. "Association of multivariate phase-type distributions, with applications to shock models," Statistics & Probability Letters, Elsevier, vol. 64(4), pages 381-392, October.
    9. De Bin, Riccardo & Stikbakke, Vegard Grødem, 2023. "A boosting first-hitting-time model for survival analysis in high-dimensional settings," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 29(2), pages 420-440, April.
    10. Mikula, Lynette Caitlin & Vogl, Claus, 2024. "The expected sample allele frequencies from populations of changing size via orthogonal polynomials," Theoretical Population Biology, Elsevier, vol. 157(C), pages 55-85.
    11. Sainudiin, Raazesh & Véber, Amandine, 2018. "Full likelihood inference from the site frequency spectrum based on the optimal tree resolution," Theoretical Population Biology, Elsevier, vol. 124(C), pages 1-15.
    12. He, Qi-Ming & Ren, Jiandong, 2016. "Analysis of a Multivariate Claim Process," Methodology and Computing in Applied Probability, Springer, vol. 18(1), pages 257-273, March.
    13. Albrecher, Hansjörg & Bladt, Martin & Bladt, Mogens, 2021. "Multivariate matrix Mittag–Leffler distributions," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 73(2), pages 369-394, April.
    14. Legried, Brandon & Terhorst, Jonathan, 2022. "Rates of convergence in the two-island and isolation-with-migration models," Theoretical Population Biology, Elsevier, vol. 147(C), pages 16-27.
    15. Cheung, Eric C.K. & Peralta, Oscar & Woo, Jae-Kyung, 2022. "Multivariate matrix-exponential affine mixtures and their applications in risk theory," Insurance: Mathematics and Economics, Elsevier, vol. 106(C), pages 364-389.
    16. Badila, E.S. & Boxma, O.J. & Resing, J.A.C., 2015. "Two parallel insurance lines with simultaneous arrivals and risks correlated with inter-arrival times," Insurance: Mathematics and Economics, Elsevier, vol. 61(C), pages 48-61.
    17. Blath, Jochen & Buzzoni, Eugenio & Koskela, Jere & Wilke Berenguer, Maite, 2020. "Statistical tools for seed bank detection," Theoretical Population Biology, Elsevier, vol. 132(C), pages 1-15.
    18. Albrecher, Hansjörg & Bladt, Mogens & Yslas, Jorge, 2022. "Fitting inhomogeneous phase‐type distributions to data: the univariate and the multivariate case," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 49(1), pages 44-77, March.
    19. Berdel, Jasmin & Hipp, Christian, 2011. "Convolutions of multivariate phase-type distributions," Insurance: Mathematics and Economics, Elsevier, vol. 48(3), pages 374-377, May.
    20. Li, Haijun & Xu, Susan H., 2001. "Directionally Convex Comparison of Correlated First Passage Times," Methodology and Computing in Applied Probability, Springer, vol. 3(4), pages 365-378, December.
