
A comparison of techniques to find mirrored hosts on the WWW

Authors

  • Krishna Bharat
  • Andrei Broder
  • Jeffrey Dean
  • Monika R. Henzinger

Abstract

We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information about Web pages easily available from Web proxies and crawlers. Identification of mirrored hosts can improve Web‐based information retrieval in several ways: first, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents—mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of Web browsers and proxies. We evaluated four classes of “top‐down” algorithms for detecting mirrored host pairs (that is, algorithms that are based on page attributes such as URL, IP address, and hyperlinks between pages, and not on the page content) on a collection of 140 million URLs (on 230,000 hosts) and their associated connectivity information. Our best approach is one which combines five algorithms and achieved a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs.
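As an illustration of how figures such as "a precision of 0.57 for a recall of 0.86 considering 100,000 ranked host pairs" can be read, the sketch below computes precision and recall over the top k entries of a ranked list of candidate host pairs. It is not the authors' evaluation code: the function names `host_pair` and `precision_recall_at_k`, the example host names, and the assumption that a complete set of known mirror pairs is available are all hypothetical. The abstract does not say how the reference set of correct pairs was built, so recall here is simply measured against whatever reference set is supplied.

```python
from typing import FrozenSet, Sequence, Set, Tuple

# A host pair is unordered: (a, b) and (b, a) denote the same candidate mirror.
HostPair = FrozenSet[str]


def host_pair(a: str, b: str) -> HostPair:
    """Build an unordered host pair (hypothetical helper for this sketch)."""
    return frozenset((a, b))


def precision_recall_at_k(
    ranked: Sequence[HostPair],
    known_mirrors: Set[HostPair],
    k: int,
) -> Tuple[float, float]:
    """Return (precision, recall) over the first k ranked candidate pairs.

    Assumes `ranked` holds no duplicate pairs and that `known_mirrors`
    is the reference set of correct mirror pairs (an assumption of this
    sketch, not something stated in the abstract).
    """
    top = ranked[:k]
    if not top or not known_mirrors:
        return 0.0, 0.0
    hits = sum(1 for p in top if p in known_mirrors)
    precision = hits / len(top)          # fraction of returned pairs that are correct
    recall = hits / len(known_mirrors)   # fraction of correct pairs that were returned
    return precision, recall


# Toy data with made-up host names.
ranked_pairs = [
    host_pair("a.example", "b.example"),
    host_pair("c.example", "d.example"),
    host_pair("a.example", "e.example"),
    host_pair("f.example", "g.example"),
]
reference = {
    host_pair("b.example", "a.example"),
    host_pair("a.example", "e.example"),
    host_pair("x.example", "y.example"),
}

p, r = precision_recall_at_k(ranked_pairs, reference, k=3)
print(f"precision={p:.2f} recall={r:.2f}")
```

Raising k generally trades precision for recall, which is why the abstract quotes both values at a fixed cutoff of ranked pairs.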

Suggested Citation

  • Krishna Bharat & Andrei Broder & Jeffrey Dean & Monika R. Henzinger, 2000. "A comparison of techniques to find mirrored hosts on the WWW," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 51(12), pages 1114-1122.
  • Handle: RePEc:bla:jamest:v:51:y:2000:i:12:p:1114-1122
    DOI: 10.1002/1097-4571(2000)9999:99993.0.CO;2-0

    Download full text from publisher

    File URL: https://doi.org/10.1002/1097-4571(2000)9999:99993.0.CO;2-0
    Download Restriction: no

