
An Efficient Algorithm for Data Cleaning


  • Payal Pahwa

    (Guru Gobind Singh IndraPrastha University, India)

  • Rajiv Arora

    (Guru Gobind Singh IndraPrastha University, India)

  • Garima Thakur

    (Guru Gobind Singh IndraPrastha University, India)


The quality of real-world data fed into a data warehouse is a major concern. Because the data comes from a variety of sources, it must be checked for errors and anomalies before it is loaded into the data warehouse. The source data may contain exact duplicate records or approximate duplicate records. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. This paper addresses issues related to the detection and correction of such duplicate records. It also analyzes data quality and the various factors that degrade it. A brief analysis of existing work is presented, pointing out its major limitations. A new framework is then proposed that improves on the existing technique.
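
To make the distinction between exact and approximate duplicate records concrete, here is a minimal Python sketch. It is not the framework proposed in the paper, whose details are not given in this abstract. The function names, the sample records, the normalization step, and the 0.85 similarity threshold are all assumptions chosen for illustration. Field similarity is computed with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase each field and collapse runs of whitespace."""
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    """Average per-field similarity ratio, a value in [0, 1]."""
    ratios = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def find_duplicates(records, threshold=0.85):
    """Return (exact, approximate) lists of index pairs (i, j) with i < j.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    cleaned = [normalize(r) for r in records]
    exact, approximate = [], []
    for i in range(len(cleaned)):
        for j in range(i + 1, len(cleaned)):
            if cleaned[i] == cleaned[j]:
                # Identical after normalization: exact duplicate.
                exact.append((i, j))
            elif similarity(cleaned[i], cleaned[j]) >= threshold:
                # Close but not identical: approximate duplicate.
                approximate.append((i, j))
    return exact, approximate


# Hypothetical sample records (name, address, city).
records = [
    ("John Smith", "12 Park Street", "Delhi"),
    ("john  smith", "12 park street", "delhi"),
    ("Jon Smith", "12 Park Street", "Delhi"),
    ("Anita Rao", "4 Lake Road", "Mumbai"),
]
exact, approximate = find_duplicates(records)
print("exact:", exact)
print("approximate:", approximate)
```

This sketch compares every pair of records, so its cost grows quadratically with the number of records. It is meant only to illustrate the two kinds of duplicates, not to serve as an efficient cleaning procedure.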

Suggested Citation

  • Payal Pahwa & Rajiv Arora & Garima Thakur, 2011. "An Efficient Algorithm for Data Cleaning," International Journal of Knowledge-Based Organizations (IJKBO), IGI Global, vol. 1(4), pages 56-71, October.
  • Handle: RePEc:igg:jkbo00:v:1:y:2011:i:4:p:56-71
