
Improving the freshness of the search engines by a probabilistic approach based incremental crawler

Author

Listed:
  • G. Pavai (CEG, Anna University)
  • T. V. Geetha (CEG, Anna University)

Abstract

The web is flooded with data. While the crawler is responsible for accessing web pages and passing them to the indexer so that they can be made available to search engine users, the rate at which these pages change makes it necessary for the crawler to employ refresh strategies that serve updated/modified content. Furthermore, the deep web is the part of the web that holds abundant amounts of quality data (compared with the normal/surface web) but is not technically accessible to a search engine’s crawler. Existing deep web crawling methods access deep web data from the result pages generated by filling forms with a set of queries and reaching the web databases through them. However, these methods are unable to maintain the freshness of the local databases. Both the surface web and the deep web therefore need an incremental crawl associated with the normal crawl architecture to overcome this problem. Crawling the deep web requires selecting an appropriate set of queries that covers almost all the records in the data source while keeping the overlap between retrieved records low, so that network utilization is reduced. Since an incremental crawl increases network utilization with every increment, such a reduced query set should be used to keep the network cost to a minimum. Our contributions in this work are: the design of a probabilistic approach based incremental crawler to handle the dynamic changes of surface web pages; an adaptation of this method, with a modification, to handle the dynamic changes in deep web databases; a new evaluation measure, the ‘Crawl-hit rate’, which evaluates the efficiency of the incremental crawler in terms of the number of times a crawl at the predicted time is actually necessary; and a semantic weighted set covering algorithm that reduces the query set so that the network cost of every increment of the crawl is lowered without compromising the number of records retrieved. The evaluation of the incremental crawler shows a good improvement in the freshness of the databases and a good Crawl-hit rate (83 % for web pages and 81 % for deep web databases) with lower overhead than the baseline.
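The abstract does not give formal definitions, but two of its ideas can be illustrated with a short sketch: the ‘Crawl-hit rate’ (read here as the fraction of scheduled re-crawls that actually find changed content) and a greedy weighted set cover for query selection. The sketch below is a minimal reading based only on the abstract; it is not the authors’ published algorithm, the semantic weighting of queries is not modelled, and all names (crawl_hit_rate, select_queries, cost) are hypothetical.

    # Illustrative Python sketch; assumptions as noted above.

    def crawl_hit_rate(crawl_log):
        # crawl_log: one boolean per re-crawl performed at a predicted change time,
        # True when the fetched page or deep web database had actually changed.
        if not crawl_log:
            return 0.0
        return sum(crawl_log) / len(crawl_log)

    def select_queries(queries, cost):
        # Greedy weighted set cover: 'queries' maps each candidate query to the set
        # of record ids it retrieves from the web database; 'cost' maps a query to
        # its network cost. Queries are chosen until every record is covered,
        # preferring those that cover many new records per unit of cost, which
        # keeps overlap (and hence network utilization) low.
        universe = set().union(*queries.values()) if queries else set()
        covered, chosen = set(), []
        while covered != universe:
            best = max(
                (q for q in queries if q not in chosen),
                key=lambda q: len(queries[q] - covered) / max(cost(q), 1e-9),
            )
            chosen.append(best)
            covered |= queries[best]
        return chosen

    # Example: 4 of 5 scheduled re-crawls found changes -> Crawl-hit rate 0.8;
    # three overlapping queries covering six records reduce to two queries.
    print(crawl_hit_rate([True, True, False, True, True]))            # 0.8
    print(select_queries({"q1": {1, 2, 3}, "q2": {3, 4}, "q3": {4, 5, 6}},
                         cost=lambda q: 1.0))                         # ['q1', 'q3']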

Suggested Citation

  • G. Pavai & T. V. Geetha, 2017. "Improving the freshness of the search engines by a probabilistic approach based incremental crawler," Information Systems Frontiers, Springer, vol. 19(5), pages 1013-1028, October.
  • Handle: RePEc:spr:infosf:v::y::i::d:10.1007_s10796-016-9701-7
    DOI: 10.1007/s10796-016-9701-7

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10796-016-9701-7
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s10796-016-9701-7?utm_source=ideas
    LibKey link: if access is restricted and your library uses this service, LibKey will redirect you to a page where you can use your library subscription to access this item
    ---><---

    As access to this document is restricted, you may want to search for a different version of it.


    Citations

    Citations are extracted by the CitEc Project.


    Cited by:

    1. Vijayan Sugumaran & T. V. Geetha & D. Manjula & Hema Gopal, 2017. "Guest Editorial: Computational Intelligence and Applications," Information Systems Frontiers, Springer, vol. 19(5), pages 969-974, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. G. Pavai & T. V. Geetha, 2017. "Improving the freshness of the search engines by a probabilistic approach based incremental crawler," Information Systems Frontiers, Springer, vol. 19(5), pages 1013-1028, October.


    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.