High‐speed rough clustering for very large document collections

Author

Listed:
  • Kazuaki Kishida

Abstract

Document clustering is an important tool, but it is not yet widely used in practice, probably because of its high computational complexity. This article explores techniques for high‐speed rough clustering of documents, on the assumption that it is sometimes necessary to obtain a clustering result quickly, even if the result is only an approximate outline of the document clusters. A promising approach to such clustering is to reduce the number of documents that are checked when generating cluster vectors in the leader–follower clustering algorithm. Based on this idea, the article proposes a modified Crouch algorithm and an incomplete single‐pass leader–follower algorithm. It also develops a two‐stage grouping technique, in which the first stage applies a quick merging technique to decrease the number of documents to be processed in the second stage. An experiment using part of the Reuters corpus RCV1 showed empirically that both the modified Crouch algorithm and the incomplete single‐pass leader–follower algorithm produce clustering results more efficiently than the original methods and also improve the effectiveness of those results. The two‐stage grouping technique, however, did not reduce the processing time in this experiment.
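
For readers unfamiliar with the leader–follower approach named in the abstract, the following is a minimal sketch of a plain single‐pass leader–follower clustering loop. It is not the article's modified Crouch algorithm or its incomplete single‐pass variant, whose details are not given in this record. The function names `cosine` and `leader_follower`, the whitespace tokenization, the raw term-count weighting, and the default `threshold` of 0.3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    if not a or not b:
        return 0.0
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


def leader_follower(docs, threshold=0.3):
    """Assign each document, in one pass, to its most similar cluster.

    A document whose best similarity falls below `threshold` (an
    illustrative value, not taken from the article) starts a new cluster.
    Returns (assignments, cluster_vectors).
    """
    cluster_vectors = []  # summed term counts of each cluster's members
    assignments = []
    for doc in docs:
        vec = Counter(doc.split())  # assumed: whitespace tokens, raw counts
        best, best_sim = -1, 0.0
        for i, cluster_vec in enumerate(cluster_vectors):
            sim = cosine(vec, cluster_vec)
            if sim > best_sim:
                best, best_sim = i, sim
        if best == -1 or best_sim < threshold:
            # No sufficiently similar cluster: the document becomes a leader.
            cluster_vectors.append(Counter(vec))
            assignments.append(len(cluster_vectors) - 1)
        else:
            # The document follows an existing cluster and updates its vector.
            cluster_vectors[best].update(vec)
            assignments.append(best)
    return assignments, cluster_vectors


docs = [
    "oil price rises",
    "oil price falls",
    "football match tonight",
    "football match result",
]
assignments, cluster_vectors = leader_follower(docs)
print(assignments)
```

In this baseline, every document is compared with every existing cluster vector and every assigned document updates one. The abstract's stated idea is to cut the number of documents checked when generating those vectors, which this sketch does not attempt.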

Suggested Citation

  • Kazuaki Kishida, 2010. "High‐speed rough clustering for very large document collections," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 61(6), pages 1092-1104, June.
  • Handle: RePEc:bla:jamist:v:61:y:2010:i:6:p:1092-1104
    DOI: 10.1002/asi.21311
