IDEAS home Printed from https://ideas.repec.org/a/gam/jdataj/v6y2021i7p73-d590141.html

A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines

Authors

Listed:
  • Salah Taamneh

    (Department of Computer Science, The Hashemite University, Zarqa 13133, Jordan
    These authors contributed equally to this work.)

  • Mo’taz Al-Hami

    (Department of Computer Information Systems, The Hashemite University, Zarqa 13133, Jordan
    These authors contributed equally to this work.)

  • Hani Bani-Salameh

    (Department of Software Engineering, The Hashemite University, Zarqa 13133, Jordan
    These authors contributed equally to this work.)

  • Alaa E. Abdallah

    (Department of Computer Science, The Hashemite University, Zarqa 13133, Jordan
    These authors contributed equally to this work.)

Abstract

Distributed clustering algorithms have proven effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure: nodes can easily become unreachable, and messages are not guaranteed to reach their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed for a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol, with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault-tolerant mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead of these techniques was negligible. Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance.
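As a rough illustration of why a k-means iteration can tolerate lost messages, the sketch below simulates the leader/worker split in a single process. It is not the authors' implementation: the gossip protocol, the actor model, and both replication techniques are omitted, and every name here (`worker_step`, `leader_step`, `kmeans`, `loss_rate`) is a hypothetical choice made for this example. Each worker summarizes its partition as per-cluster sums and counts, and the leader averages whichever summaries arrive.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def worker_step(partition, centroids):
    """One worker's message: per-cluster coordinate sums and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        i = nearest(point, centroids)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]
    return sums, counts

def leader_step(messages, centroids):
    """Merge the messages that arrived; a cluster with no points keeps its old centroid."""
    dim = len(centroids[0])
    total = [[0.0] * dim for _ in centroids]
    n = [0] * len(centroids)
    for sums, counts in messages:
        for i in range(len(centroids)):
            n[i] += counts[i]
            for d in range(dim):
                total[i][d] += sums[i][d]
    return [[t / n[i] for t in total[i]] if n[i] else list(centroids[i])
            for i in range(len(centroids))]

def kmeans(partitions, centroids, iterations=10, loss_rate=0.0, seed=0):
    """Run k-means over partitions, dropping each worker message with probability loss_rate."""
    rng = random.Random(seed)  # assumption: independent, random message loss
    for _ in range(iterations):
        messages = [worker_step(p, centroids) for p in partitions]
        delivered = [m for m in messages if rng.random() >= loss_rate]
        centroids = leader_step(delivered, centroids)
    return centroids

# Toy data: four partitions, each holding one point from each of two clusters.
partitions = [[(0, 0), (10, 10)], [(1, 1), (11, 11)],
              [(0, 1), (10, 11)], [(1, 0), (11, 10)]]
start = [[0, 0], [10, 10]]
exact = kmeans(partitions, start)
lossy = kmeans(partitions, start, loss_rate=0.1)
```

Because every message is a summary of one partition, a dropped message only removes that partition's points from the current average; the next iteration recomputes from scratch, so errors do not accumulate in this toy setting.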

Suggested Citation

  • Salah Taamneh & Mo’taz Al-Hami & Hani Bani-Salameh & Alaa E. Abdallah, 2021. "A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines," Data, MDPI, vol. 6(7), pages 1-23, July.
  • Handle: RePEc:gam:jdataj:v:6:y:2021:i:7:p:73-:d:590141

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/6/7/73/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/6/7/73/
    Download Restriction: no
