Printed from https://ideas.repec.org/a/jss/jstsof/v072i03.html

Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package

Author

Listed:
  • Gabadinho, Alexis
  • Ritschard, Gilbert

Abstract

This article presents the PST R package for categorical sequence analysis with probabilistic suffix trees (PSTs), i.e., structures that store variable-length Markov chains (VLMCs). VLMCs make it possible to model high-order dependencies in categorical sequences with parsimonious models based on simple estimation procedures. The package is specifically adapted to the field of social sciences, as it allows VLMC models to be learned from sets of individual sequences possibly containing missing values; in addition, the package is extended to account for case weights. This article describes how a VLMC model is learned from one or more categorical sequences and stored in a PST. The PST can then be used for sequence prediction, i.e., to assign a probability to whole observed or artificial sequences. This feature supports data mining applications such as the extraction of typical patterns and outliers. This article also introduces original visualization tools for both the model and the outcomes of sequence prediction. Other features such as functions for pattern mining and artificial sequence generation are described as well. The PST package also allows for the computation of probabilistic divergence between two models and the fitting of segmented VLMCs, where sub-models fitted to distinct strata of the learning sample are stored in a single PST.
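
The idea of learning a VLMC from sequences, storing it in a suffix tree, and using it to assign a probability to a whole sequence can be illustrated with a minimal Python sketch. This is not the PST package's implementation or API: the function names `learn_pst` and `sequence_probability`, the parameters `max_depth` and `nmin`, and the dictionary-based tree are assumptions made for illustration, and the sketch omits pruning, missing values, and case weights. It assumes at least one non-empty learning sequence.

```python
from collections import defaultdict


def learn_pst(sequences, max_depth=2, nmin=1):
    """Estimate next-symbol distributions for every context up to max_depth.

    Hypothetical helper: the tree is a dict mapping a context (tuple of
    preceding symbols, oldest first) to a dict {symbol: probability}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            # Register the symbol under every suffix of its history,
            # from the empty context (root) up to max_depth symbols.
            for k in range(min(max_depth, i) + 1):
                context = tuple(seq[i - k:i])
                counts[context][symbol] += 1

    tree = {}
    for context, nxt in counts.items():
        total = sum(nxt.values())
        # Keep the root unconditionally; keep other contexts only if they
        # were observed at least nmin times.
        if context == () or total >= nmin:
            tree[context] = {s: c / total for s, c in nxt.items()}
    return tree


def sequence_probability(tree, seq):
    """Multiply conditional probabilities, using the longest stored context."""
    max_depth = max(len(context) for context in tree)
    prob = 1.0
    for i, symbol in enumerate(seq):
        k = min(max_depth, i)
        # Back off to shorter contexts until one is stored in the tree;
        # the root () is always present, so this terminates at k == 0.
        while tuple(seq[i - k:i]) not in tree:
            k -= 1
        prob *= tree[tuple(seq[i - k:i])].get(symbol, 0.0)
    return prob


tree = learn_pst(["abab"], max_depth=1)
print(sequence_probability(tree, "ab"))  # 0.5
```

In this toy example the root stores P(a) = P(b) = 0.5 and the context `a` stores P(b | a) = 1, so the sequence `ab` receives probability 0.5. Unseen transitions receive probability zero here; a practical model would typically smooth these estimates.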

Suggested Citation

  • Gabadinho, Alexis & Ritschard, Gilbert, 2016. "Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 72(i03).
  • Handle: RePEc:jss:jstsof:v:072:i03
    DOI: http://hdl.handle.net/10.18637/jss.v072.i03

    Download full text from publisher

    File URL: https://www.jstatsoft.org/index.php/jss/article/view/v072i03/v72i03.pdf
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v072i03/PST_0.90.tar.gz
    Download Restriction: no

    File URL: https://www.jstatsoft.org/index.php/jss/article/downloadSuppFile/v072i03/v72i03.R
    Download Restriction: no


    References listed on IDEAS

    1. Joel H. Levine, 2000. "But What Have You Done for Us Lately?," Sociological Methods & Research, vol. 29(1), pages 34-40, August.
    2. P. J. Avery & D. A. Henderson, 1999. "Fitting Markov chain models to discrete state series such as DNA sequences," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 48(1), pages 53-61.
    3. Smith, Anthony M. A. & Shelley, Julia M. & Dennerstein, Lorraine, 1994. "Self-rated health: Biological continuum or social discontinuity?," Social Science & Medicine, Elsevier, vol. 39(1), pages 77-83, July.

    Citations

    Cited by:

    1. Ioannis Kontoyiannis & Lambros Mertzanis & Athina Panotopoulou & Ioannis Papageorgiou & Maria Skoularidou, 2022. "Bayesian context trees: Modelling and exact inference for discrete time series," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 84(4), pages 1287-1323, September.
    2. Donald E. K. Martin, 2020. "Distributions of pattern statistics in sparse Markov models," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 72(4), pages 895-913, August.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. García-Muñoz, Teresa & Neuman, Shoshana & Neuman, Tzahi, 2014. "Health Risk Factors among the Older European Populations: Personal and Country Effects," IZA Discussion Papers 8529, Institute of Labor Economics (IZA).
    2. Prus, Steven G., 2011. "Comparing social determinants of self-rated health across the United States and Canada," Social Science & Medicine, Elsevier, vol. 73(1), pages 50-59, July.
    3. Raffaella Piccarreta & Francesco C. Billari, 2007. "Clustering work and family trajectories by using a divisive algorithm," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 170(4), pages 1061-1078, October.
    4. Jonathan Houdmont & Liza Jachens & Raymond Randall & Sadie Hopson & Sean Nuttall & Stamatia Pamia, 2019. "What Does a Single-Item Measure of Job Stressfulness Assess?," IJERPH, MDPI, vol. 16(9), pages 1-15, April.
    5. Anastasios N. Arapis & Frosso S. Makri & Zaharias M. Psillakis, 2017. "Joint distribution of k-tuple statistics in zero-one sequences of Markov-dependent trials," Journal of Statistical Distributions and Applications, Springer, vol. 4(1), pages 1-13, December.
    6. Jylhä, Marja, 2009. "What is self-rated health and why does it predict mortality? Towards a unified conceptual model," Social Science & Medicine, Elsevier, vol. 69(3), pages 307-316, August.
    7. Laurent Lesnard, 2006. "Optimal Matching and Social Sciences," Working Papers 2006-01, Center for Research in Economics and Statistics.
    8. Anyadike-Danes, Michael & McVicar, Duncan, 2005. "You'll never walk alone: Childhood influences and male career path clusters," Labour Economics, Elsevier, vol. 12(4), pages 511-530, August.
    9. Joanna Jurewicz & Dorota Kaleta, 2020. "Correlates of Poor Self-Assessed Health Status among Socially Disadvantaged Populations in Poland," IJERPH, MDPI, vol. 17(4), pages 1-18, February.
    10. Duncan Thomas & Elizabeth Frankenberg, 2001. "The Measurement and Interpretation of Health in Social Surveys," Working Papers 01-06, RAND Corporation.
    11. J. Besag & D. Mondal, 2013. "Exact Goodness-of-Fit Tests for Markov Chains," Biometrics, The International Biometric Society, vol. 69(2), pages 488-496, June.
    12. repec:aaa:journl:v:3:y:1999:i:1:p:87-100 is not listed on IDEAS
    13. Teresa García-Muñoz & Shoshana Neuman & Tzahi Neuman, 2014. "Subjective Health Status of the Older Population: Is It Related to Country-Specific Economic Development Measures?," Working Papers 2014-02, Bar-Ilan University, Department of Economics.
    14. Rainer Reile & Mall Leinsalu, 2013. "Differentiating positive and negative self-rated health: results from a cross-sectional study in Estonia," International Journal of Public Health, Springer;Swiss School of Public Health (SSPH+), vol. 58(4), pages 555-564, August.
    15. Jonsson, Robert, 2011. "A Markov Chain Model for Analysing the Progression of Patient’s Health States," Research Reports 2011:6, University of Gothenburg, Statistical Research Unit, School of Business, Economics and Law.
    16. Hanly, Mark & Clarke, Paul & Steele, Fiona, 2016. "Sequence analysis of call record data: exploring the role of different cost settings," LSE Research Online Documents on Economics 64896, London School of Economics and Political Science, LSE Library.
    17. Catherine Gaumé & Guillaume Wunsch, 2010. "Self-Rated Health in the Baltic Countries, 1994–1999," European Journal of Population, Springer;European Association for Population Studies, vol. 26(4), pages 435-457, November.
    18. M. L. Menéndez & L. Pardo & M. C. Pardo & K. Zografos, 2011. "Testing the Order of Markov Dependence in DNA Sequences," Methodology and Computing in Applied Probability, Springer, vol. 13(1), pages 59-74, March.
    19. Varin, Cristiano & Vidoni, Paolo, 2006. "Pairwise likelihood inference for ordinal categorical time series," Computational Statistics & Data Analysis, Elsevier, vol. 51(4), pages 2365-2373, December.
    20. Jonsson, Robert, 2011. "Tests of Markov Order and Homogeneity in a Markov Chain," Research Reports 2011:7, University of Gothenburg, Statistical Research Unit, School of Business, Economics and Law.
