Advanced Search
MyIDEAS: Login to save this article or follow this journal

A fast and recursive algorithm for clustering large datasets with k-medians

Contents:

Author Info

  • Cardot, Hervé
  • Cénac, Peggy
  • Monnez, Jean-Marie
Registered author(s):

    Abstract

    Clustering with fast algorithms large samples of high dimensional data is an important challenge in computational statistics. A new class of recursive stochastic gradient algorithms designed for the k-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to deal with large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. A particular attention is paid to the averaged versions which are known to have better performances. A data-driven procedure that permits a fully automatic selection of the value of the descent step is also proposed. The performance of the averaged sequential estimator is compared on a simulation study, both in terms of computation speed and accuracy of the estimations, with more classical partitioning techniques such as k-means, trimmed k-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated on determining television audience profiles with a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.

    Download Info

    If you experience problems downloading a file, check if you have the proper application to view it first. In case of further problems read the IDEAS help page. Note that these files are not on the IDEAS site. Please be patient as the files may be large.
    File URL: http://www.sciencedirect.com/science/article/pii/S0167947311004178
    Download Restriction: Full text for ScienceDirect subscribers only.

    As the access to this document is restricted, you may want to look for a different version under "Related research" (further below) or search for a different version of it.

    Bibliographic Info

    Article provided by Elsevier in its journal Computational Statistics & Data Analysis.

    Volume (Year): 56 (2012)
    Issue (Month): 6 ()
    Pages: 1434-1449

    as in new window
    Handle: RePEc:eee:csdana:v:56:y:2012:i:6:p:1434-1449

    Contact details of provider:
    Web page: http://www.elsevier.com/locate/csda

    Related research

    Keywords: Averaging; High dimensional data; k-medoids; Online clustering; Partitioning around medoids; Recursive estimators; Robbins–Monro; Stochastic approximation; Stochastic gradient;

    References

    References listed on IDEAS
    Please report citation or reference errors to , or , if you are the registered author of the cited work, log in to your RePEc Author Service profile, click on "citations" and make appropriate adjustments.:
    as in new window
    1. Croux, Christophe & Gallopoulos, Efstratios & Van Aelst, Stefan & Zha, Hongyuan, 2007. "Machine Learning and Robust Data Mining," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 151-154, September.
    2. Luis García-Escudero & Alfonso Gordaliza & Carlos Matrán & Agustín Mayo-Iscar, 2010. "A review of robust clustering methods," Advances in Data Analysis and Classification, Springer, vol. 4(2), pages 89-109, September.
    3. Monnez, Jean-Marie, 2006. "Almost sure convergence of stochastic gradient processes with matrix step sizes," Statistics & Probability Letters, Elsevier, vol. 76(5), pages 531-536, March.
    4. García-Treviño, E.S. & Barria, J.A., 2012. "Online wavelet-based density estimation for non-stationary streaming data," Computational Statistics & Data Analysis, Elsevier, vol. 56(2), pages 327-344.
    Full references (including those not matched with items on IDEAS)

    Citations

    Lists

    This item is not listed on Wikipedia, on a reading list or among the top items on IDEAS.

    Statistics

    Access and download statistics

    Corrections

    When requesting a correction, please mention this item's handle: RePEc:eee:csdana:v:56:y:2012:i:6:p:1434-1449. See general information about how to correct material in RePEc.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: (Zhang, Lei).

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If references are entirely missing, you can add them using this form.

    If the full references list an item that is present in RePEc, but the system did not link to it, you can help with this form.

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your profile, as there may be some citations waiting for confirmation.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.