Mining categorical sequences from data using a hybrid clustering method

Author

Listed:
  • De Angelis, Luca
  • Dias, José G.

Abstract

The identification of different dynamics in sequential data has become an everyday need in scientific fields such as marketing, bioinformatics, finance, and the social sciences. Unlike cross-sectional or static data, observations of this type (also known as stream data, temporal data, longitudinal data, or repeated measures) are more challenging to cluster because the dependency between observations has to be incorporated into the clustering process. This research focuses on clustering categorical sequences. The proposed method combines model-based and heuristic clustering. In the first step, an extension of the hidden Markov model transforms the categorical sequences into a probabilistic space in which a symmetric Kullback–Leibler distance can operate. In the second step, hierarchical clustering is applied to the resulting matrix of distances to cluster the sequences. The paper illustrates the potential of this type of hybrid approach using a synthetic data set, the well-known Microsoft data set of website users' search patterns, and a survey on job career dynamics.
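
The two-step idea in the abstract can be sketched roughly as follows. This is not the authors' method: the extended hidden Markov model is replaced here by a much simpler stand-in, a smoothed first-order transition matrix estimated separately for each sequence, and the hierarchical step uses average linkage as one possible choice. The function names, the smoothing constant alpha, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def transition_probs(seq, states, alpha=1.0):
    """Step 1 (stand-in): map a sequence to smoothed transition probabilities.

    Assumption: a plain first-order Markov chain replaces the paper's
    extended hidden Markov model. alpha is additive smoothing so that
    no probability is zero and the KL terms stay finite.
    """
    counts = Counter(zip(seq, seq[1:]))
    rows = []
    for a in states:
        total = sum(counts[(a, b)] for b in states) + alpha * len(states)
        rows.append([(counts[(a, b)] + alpha) / total for b in states])
    return rows

def symmetric_kl(p_rows, q_rows):
    """Symmetric Kullback-Leibler distance, summed over matrix rows.

    Uses KL(p||q) + KL(q||p) = sum_i (p_i - q_i) * log(p_i / q_i).
    """
    return sum(
        (pi - qi) * math.log(pi / qi)
        for p, q in zip(p_rows, q_rows)
        for pi, qi in zip(p, q)
    )

def average_linkage(dist, n_clusters):
    """Step 2: naive agglomerative clustering on a distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            d = sum(dist[i][j] for i, j in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a (a < b)
        del clusters[b]
    return clusters

# Hypothetical toy data: two "sticky" sequences and two alternating ones.
states = "AB"
sequences = [
    "AAAAABBBBBAAAAA",
    "BBBBBBAAAAAABBBB",
    "ABABABABABABAB",
    "BABABABABABABA",
]

models = [transition_probs(s, states) for s in sequences]
n = len(models)
dist = [[0.0] * n for _ in range(n)]
for i, j in combinations(range(n), 2):
    dist[i][j] = dist[j][i] = symmetric_kl(models[i], models[j])

clusters = average_linkage(dist, n_clusters=2)
print(clusters)
```

On this toy input the sticky sequences end up in one cluster and the alternating sequences in the other. The naive clustering loop recomputes every pairwise cluster distance at each merge, which is acceptable for a handful of sequences but would need a proper linkage implementation for data sets of realistic size.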

Suggested Citation

  • De Angelis, Luca & Dias, José G., 2014. "Mining categorical sequences from data using a hybrid clustering method," European Journal of Operational Research, Elsevier, vol. 234(3), pages 720-730.
  • Handle: RePEc:eee:ejores:v:234:y:2014:i:3:p:720-730
    DOI: 10.1016/j.ejor.2013.11.002

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0377221713009016
    Download Restriction: Full text for ScienceDirect subscribers only


    Citations

    Cited by:

    1. Marco Guerra & Francesca Bassi & José G. Dias, 2020. "A Multiple-Indicator Latent Growth Mixture Model to Track Courses with Low-Quality Teaching," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 147(2), pages 361-381, January.
    2. Trindade, Graça & Dias, José G. & Ambrósio, Jorge, 2017. "Extracting clusters from aggregate panel data: A market segmentation study," Applied Mathematics and Computation, Elsevier, vol. 296(C), pages 277-288.
    3. Huang, Yan & Kou, Gang & Peng, Yi, 2017. "Nonlinear manifold learning for early warnings in financial markets," European Journal of Operational Research, Elsevier, vol. 258(2), pages 692-702.
    4. Rota Bulò, Samuel & Pelillo, Marcello, 2017. "Dominant-set clustering: A review," European Journal of Operational Research, Elsevier, vol. 262(1), pages 1-13.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Dias, José G. & Vermunt, Jeroen K. & Ramos, Sofia, 2015. "Clustering financial time series: New insights from an extended hidden Markov model," European Journal of Operational Research, Elsevier, vol. 243(3), pages 852-864.
    2. Tyler Roick & Dimitris Karlis & Paul D. McNicholas, 2021. "Clustering discrete-valued time series," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 15(1), pages 209-229, March.
    3. Luca De Angelis, 2013. "Latent class models for financial data analysis: some statistical developments," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 22(2), pages 227-242, June.
    4. Zhang, Zhengxin & Si, Xiaosheng & Hu, Changhua & Lei, Yaguo, 2018. "Degradation data analysis and remaining useful life estimation: A review on Wiener-process-based methods," European Journal of Operational Research, Elsevier, vol. 271(3), pages 775-796.
    5. Robert Darkins & Emma J Cooke & Zoubin Ghahramani & Paul D W Kirk & David L Wild & Richard S Savage, 2013. "Accelerating Bayesian Hierarchical Clustering of Time Series Data with a Randomised Algorithm," PLOS ONE, Public Library of Science, vol. 8(4), pages 1-9, April.
    6. Naumzik, Christof & Feuerriegel, Stefan & Nielsen, Anne Molgaard, 2023. "Data-driven dynamic treatment planning for chronic diseases," European Journal of Operational Research, Elsevier, vol. 305(2), pages 853-867.
    7. Jiang, P. & Liu, X., 2016. "Hidden Markov model for municipal waste generation forecasting under uncertainties," European Journal of Operational Research, Elsevier, vol. 250(2), pages 639-651.
    8. Luca De Angelis & Leonard J. Paas, 2013. "A dynamic analysis of stock markets using a hidden Markov model," Journal of Applied Statistics, Taylor & Francis Journals, vol. 40(8), pages 1682-1700, August.
    9. Edoardo Otranto & Massimo Mucciardi, 2019. "Clustering space-time series: FSTAR as a flexible STAR approach," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 175-199, March.
    10. Antonis A. Michis, 2021. "Wavelet Multidimensional Scaling Analysis of European Economic Sentiment Indicators," Journal of Classification, Springer;The Classification Society, vol. 38(3), pages 443-480, October.
    11. Maharaj, Elizabeth Ann & D’Urso, Pierpaolo, 2010. "A coherence-based approach for the pattern recognition of time series," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 389(17), pages 3516-3537.
    12. E. Otranto & M. Mucciardi, 2017. "Clustering Space-Time Series: A Flexible STAR Approach," Working Paper CRENoS 201707, Centre for North South Economic Research, University of Cagliari and Sassari, Sardinia.
    13. Lu, Emiao & Handl, Julia & Xu, Dong-ling, 2018. "Determining analogies based on the integration of multiple information sources," International Journal of Forecasting, Elsevier, vol. 34(3), pages 507-528.
    14. Si, Xiao-Sheng & Wang, Wenbin & Hu, Chang-Hua & Zhou, Dong-Hua, 2011. "Remaining useful life estimation - A review on the statistical data driven approaches," European Journal of Operational Research, Elsevier, vol. 213(1), pages 1-14, August.
    15. Wu, Han-Ming & Tien, Yin-Jing & Chen, Chun-houh, 2010. "GAP: A graphical environment for matrix visualization and cluster analysis," Computational Statistics & Data Analysis, Elsevier, vol. 54(3), pages 767-778, March.
    16. José E. Chacón, 2021. "Explicit Agreement Extremes for a 2 × 2 Table with Given Marginals," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 257-263, July.
    17. Trindade, Graça & Dias, José G. & Ambrósio, Jorge, 2017. "Extracting clusters from aggregate panel data: A market segmentation study," Applied Mathematics and Computation, Elsevier, vol. 296(C), pages 277-288.
    18. Roberto Rocci & Stefano Antonio Gattone & Roberto Di Mari, 2018. "A data driven equivariant approach to constrained Gaussian mixture modeling," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(2), pages 235-260, June.
    19. Redivo, Edoardo & Nguyen, Hien D. & Gupta, Mayetri, 2020. "Bayesian clustering of skewed and multimodal data using geometric skewed normal distributions," Computational Statistics & Data Analysis, Elsevier, vol. 152(C).
    20. Charles Bouveyron & Julien Jacques, 2011. "Model-based clustering of time series in group-specific functional subspaces," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 5(4), pages 281-300, December.
