Authors:
- Pedro Martins
(Research Center in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal)
- Filipe Cardoso
(Polytechnic Institute of Santarém, Escola Superior de Gestão e Tecnologia de Santarém, 2001-904 Santarém, Portugal)
- Paulo Váz
(Research Center in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal)
- José Silva
(Research Center in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal)
- Maryam Abbasi
(Applied Research Institute, Polytechnic of Coimbra, 3045-093 Coimbra, Portugal)
Abstract
Data cleaning remains one of the most time-consuming and critical steps in modern data science, directly influencing the reliability and accuracy of downstream analytics. In this paper, we present a comprehensive evaluation of five widely used data cleaning tools—OpenRefine, Dedupe, Great Expectations, TidyData (PyJanitor), and a baseline Pandas pipeline—applied to large-scale, messy datasets spanning three domains (healthcare, finance, and industrial telemetry). We benchmark each tool on dataset sizes ranging from 1 million to 100 million records, measuring execution time, memory usage, error detection accuracy, and scalability under increasing data volumes. Additionally, we assess qualitative aspects such as usability and ease of integration, reflecting real-world adoption concerns. We incorporate recent findings on parallelized data cleaning and highlight how domain-specific anomalies (e.g., negative amounts in finance, sensor corruption in industrial telemetry) can significantly impact tool choice. Our findings reveal that no single solution excels across all metrics; while Dedupe provides robust duplicate detection and Great Expectations offers in-depth rule-based validation, tools like TidyData and baseline Pandas pipelines demonstrate strong scalability and flexibility under chunk-based ingestion. The choice of tool ultimately depends on domain-specific requirements (e.g., approximate matching in finance and strict auditing in healthcare) and the magnitude of available computational resources. By highlighting each framework’s strengths and limitations, this study offers data practitioners clear, evidence-driven guidance for selecting and combining tools to tackle large-scale data cleaning challenges.
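As an illustration of the chunk-based ingestion the abstract credits the baseline Pandas pipeline with, the following minimal sketch streams a large CSV in bounded-memory chunks and applies simple domain rules. The file names, chunk size, and column names (amount, merchant) are illustrative assumptions, not the paper's actual benchmark code.

    import pandas as pd

    CHUNK_SIZE = 1_000_000  # rows per chunk; tune to available memory

    def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
        # Drop exact duplicate rows within the chunk.
        chunk = chunk.drop_duplicates()
        # Remove negative transaction amounts, a finance-domain anomaly noted in the abstract.
        chunk = chunk[chunk["amount"] >= 0].copy()
        # Standardize missing values in a free-text column.
        chunk["merchant"] = chunk["merchant"].fillna("UNKNOWN")
        return chunk

    # Stream the input in chunks so peak memory stays bounded regardless of file size.
    cleaned_parts = []
    for chunk in pd.read_csv("transactions.csv", chunksize=CHUNK_SIZE):
        cleaned_parts.append(clean_chunk(chunk))

    pd.concat(cleaned_parts, ignore_index=True).to_csv("transactions_clean.csv", index=False)

Note that per-chunk deduplication misses duplicates that span chunk boundaries; catching those requires a second pass over the concatenated result or a dedicated tool such as Dedupe, which is one reason the study finds no single tool excels on every metric.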
Suggested Citation
Pedro Martins & Filipe Cardoso & Paulo Váz & José Silva & Maryam Abbasi, 2025.
"Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets,"
Data, MDPI, vol. 10(5), pages 1-22, May.
Handle:
RePEc:gam:jdataj:v:10:y:2025:i:5:p:68-:d:1649421