
A framework of irregularity enlightenment for data pre-processing in data mining

Authors

Listed:
  • Siu-Tong Au
  • Rong Duan
  • Siamak Hesar
  • Wei Jiang

Abstract

Irregularities are widespread in large databases and often lead to erroneous conclusions in data mining and statistical analysis. For example, many parameter estimation procedures yield considerable bias when significant irregularities are not handled properly. Most data cleaning tools assume one known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for situations in which multiple irregularities are hidden in large volumes of data in general, and in cross-sectional time series in particular. It develops an automatic data mining platform that captures key irregularities and classifies them by their importance in a database. By decomposing time series data into basic components, we propose to optimize a penalized least squares loss function to aid the selection of key irregularities in consecutive steps and to cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis. Copyright Springer Science+Business Media, LLC 2010
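
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea of selecting irregularities step by step under a penalized least squares loss. It is not the authors' IE framework. It assumes two hypothetical irregularity types (a one-point outlier and a level shift), a loss of the form RSS + penalty × (number of selected irregularities), and a greedy forward search; the names `candidate_columns`, `select_irregularities` and `penalty` are invented for this example.

```python
import numpy as np

def candidate_columns(n):
    """Build assumed irregularity regressors: one-point spikes and level shifts."""
    cols, labels = [], []
    for t in range(n):
        spike = np.zeros(n)
        spike[t] = 1.0                      # additive outlier at time t
        cols.append(spike)
        labels.append(("outlier", t))
    for t in range(1, n):
        step = np.zeros(n)
        step[t:] = 1.0                      # permanent level shift starting at t
        cols.append(step)
        labels.append(("level_shift", t))
    return np.column_stack(cols), labels

def rss(design, y):
    """Residual sum of squares of the least squares fit of y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def select_irregularities(y, penalty, max_steps=10):
    """Greedy forward selection minimizing RSS + penalty * (number selected)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X, labels = candidate_columns(n)
    design = np.ones((n, 1))                # start from an intercept-only model
    chosen = []
    best_loss = rss(design, y)
    for _ in range(max_steps):
        trials = [
            (rss(np.column_stack([design, X[:, j]]), y)
             + penalty * (len(chosen) + 1), j)
            for j in range(X.shape[1]) if j not in chosen
        ]
        loss, j = min(trials)
        if loss >= best_loss:               # penalized loss stopped improving
            break
        best_loss = loss
        chosen.append(j)
        design = np.column_stack([design, X[:, j]])
    return [labels[j] for j in chosen]

# Toy series: flat at 0, a spike at t=8, and a level shift of +5 from t=20.
y = np.zeros(40)
y[20:] += 5.0
y[8] += 10.0
print(select_irregularities(y, penalty=5.0))
```

On this noise-free toy series the search picks the level shift first, because it removes the most variation, then the outlier, and stops once a further term no longer pays for its penalty. A larger `penalty` keeps fewer irregularities; the clustering and visualization stages mentioned in the abstract are not covered by this sketch.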

Suggested Citation

  • Siu-Tong Au & Rong Duan & Siamak Hesar & Wei Jiang, 2010. "A framework of irregularity enlightenment for data pre-processing in data mining," Annals of Operations Research, Springer, vol. 174(1), pages 47-66, February.
  • Handle: RePEc:spr:annopr:v:174:y:2010:i:1:p:47-66:10.1007/s10479-008-0494-z
    DOI: 10.1007/s10479-008-0494-z

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1007/s10479-008-0494-z
    Download Restriction: Access to full text is restricted to subscribers.


    References listed on IDEAS

    1. P. M. Lerman, 1980. "Fitting Segmented Regression Models by Grid Search," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 29(1), pages 77-84, March.
    2. Bianco, Ana Maria, et al, 2001. "Outlier Detection in Regression Models with ARIMA Errors Using Robust Estimates," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 20(8), pages 565-579, December.
    3. Lavielle, Marc, 1999. "Detection of multiple changes in a sequence of dependent variables," Stochastic Processes and their Applications, Elsevier, vol. 83(1), pages 79-102, September.
    4. Hawkins, Douglas M., 2001. "Fitting multiple change-point models to data," Computational Statistics & Data Analysis, Elsevier, vol. 37(3), pages 323-341, September.
    5. Alwan, Layth C & Roberts, Harry V, 1988. "Time-Series Modeling for Statistical Process Control," Journal of Business & Economic Statistics, American Statistical Association, vol. 6(1), pages 87-95, January.

    Citations



    Cited by:

    1. Rania Jammazi & Duc Khuong Nguyen, 2017. "Estimating and forecasting portfolio’s Value-at-Risk with wavelet-based extreme value theory: Evidence from crude oil prices and US exchange rates," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 68(11), pages 1352-1362, November.
    2. Jianxiong Gao & Zongwen An & Xuezong Bai, 2022. "A new representation method for probability distributions of multimodal and irregular data based on uniform mixture model," Annals of Operations Research, Springer, vol. 311(1), pages 81-97, April.
    3. Mark Gilchrist & Deana Lehmann Mooers & Glenn Skrubbeltrang & Francine Vachon, 2012. "Knowledge Discovery in Databases for Competitive Advantage," Journal of Management and Strategy, Sciedu Press, vol. 3(2), pages 2-15, April.
    4. George Chalamandaris & Nikos E. Vlachogiannakis, 2018. "Are financial ratios relevant for trading credit risk? Evidence from the CDS market," Annals of Operations Research, Springer, vol. 266(1), pages 395-440, July.
    5. Jammazi, Rania & Aloui, Chaker, 2015. "On the interplay between energy consumption, economic growth and CO2 emission nexus in the GCC countries: A comparative analysis through wavelet approaches," Renewable and Sustainable Energy Reviews, Elsevier, vol. 51(C), pages 1737-1751.
    6. Aloui, Chaker & Jammazi, Rania, 2015. "Dependence and risk assessment for oil prices and exchange rate portfolios: A wavelet based approach," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 436(C), pages 62-86.
    7. Sun, Edward W. & Meinl, Thomas, 2012. "A new wavelet-based denoising algorithm for high-frequency financial data mining," European Journal of Operational Research, Elsevier, vol. 217(3), pages 589-599.
    8. Jammazi, Rania & Reboredo, Juan C., 2016. "Dependence and risk management in oil and stock markets. A wavelet-copula analysis," Energy, Elsevier, vol. 107(C), pages 866-888.
    9. Mahdi Massahi & Masoud Mahootchi & Alireza Arshadi Khamseh, 2020. "Development of an efficient cluster-based portfolio optimization model under realistic market conditions," Empirical Economics, Springer, vol. 59(5), pages 2423-2442, November.
    10. Sun, Edward W. & Chen, Yi-Ting & Yu, Min-Teh, 2015. "Generalized optimal wavelet decomposing algorithm for big financial data," International Journal of Production Economics, Elsevier, vol. 165(C), pages 194-214.
    11. Asil Oztekin, 2018. "Creating a marketing strategy in healthcare industry: a holistic data analytic approach," Annals of Operations Research, Springer, vol. 270(1), pages 361-382, November.
    12. Jammazi, Rania & Aloui, Chaker, 2015. "Environment degradation, economic growth and energy consumption nexus: A wavelet-windowed cross correlation approach," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 436(C), pages 110-125.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Salvatore Fasola & Vito M. R. Muggeo & Helmut Küchenhoff, 2018. "A heuristic, iterative algorithm for change-point detection in abrupt change models," Computational Statistics, Springer, vol. 33(2), pages 997-1015, June.
    2. Bill Russell & Dooruj Rambaccussing, 2019. "Breaks and the statistical process of inflation: the case of estimating the ‘modern’ long-run Phillips curve," Empirical Economics, Springer, vol. 56(5), pages 1455-1475, May.
    3. Roberts, Leigh, 2014. "Consistent estimation of breakpoints in time series, with application to wavelet analysis of Citigroup returns," Working Paper Series 18815, Victoria University of Wellington, School of Economics and Finance.
    4. Samari, Goleen & Catalano, Ralph & Alcalá, Héctor E. & Gemmill, Alison, 2020. "The Muslim Ban and preterm birth: Analysis of U.S. vital statistics data from 2009 to 2018," Social Science & Medicine, Elsevier, vol. 265(C).
    5. Paul Fogel & Yann Gaston-Mathé & Douglas Hawkins & Fajwel Fogel & George Luta & S. Stanley Young, 2016. "Applications of a Novel Clustering Approach Using Non-Negative Matrix Factorization to Environmental Research in Public Health," IJERPH, MDPI, vol. 13(5), pages 1-14, May.
    6. Amira Dridi & Mohamed El Ghourabi & Mohamed Limam, 2012. "On monitoring financial stress index with extreme value theory," Quantitative Finance, Taylor & Francis Journals, vol. 12(3), pages 329-339, March.
    7. Weihs, Claus & Theis, Winfried & Messaoud, Amor & Hering, Franz, 2004. "Monitoring of the BTA Deep Hole Drilling Process Using Residual Control Charts," Technical Reports 2004,60, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    8. Surgailis, Donatas & Teyssière, Gilles & Vaiciulis, Marijus, 2008. "The increment ratio statistic," Journal of Multivariate Analysis, Elsevier, vol. 99(3), pages 510-541, March.
    9. Marta Benková & Dagmar Bednárová & Gabriela Bogdanovská & Marcela Pavlíčková, 2023. "Use of Statistical Process Control for Coking Time Monitoring," Mathematics, MDPI, vol. 11(16), pages 1-30, August.
    10. Fan, Xudong & Wang, Xiaowei & Zhang, Xijin & Yu, Xiong (Bill), 2022. "Machine learning based water pipe failure prediction: The effects of engineering, geology, climate and socio-economic factors," Reliability Engineering and System Safety, Elsevier, vol. 219(C).
    11. Johannes Freiesleben & Nicolas Guérin, 2015. "Homogenization and Clustering as a Non-Statistical Methodology to Assess Multi-Parametrical Chain Problems," Papers 1505.03874, arXiv.org, revised Dec 2017.
    12. Adham Alsharkawi & Mohammad Al-Fetyani & Maha Dawas & Heba Saadeh & Musa Alyaman, 2021. "Poverty Classification Using Machine Learning: The Case of Jordan," Sustainability, MDPI, vol. 13(3), pages 1-16, January.
    13. Yann Guédon, 2013. "Exploring the latent segmentation space for the assessment of multiple change-point models," Computational Statistics, Springer, vol. 28(6), pages 2641-2678, December.
    14. Ben Q. Liu & Dale L. Goodhue, 2012. "Two Worlds of Trust for Potential E-Commerce Users: Humans as Cognitive Misers," Information Systems Research, INFORMS, vol. 23(4), pages 1246-1262, December.
    15. Davis, Richard A. & Hancock, Stacey A. & Yao, Yi-Ching, 2016. "On consistency of minimum description length model selection for piecewise autoregressions," Journal of Econometrics, Elsevier, vol. 194(2), pages 360-368.
    16. Miguel Flores & Salvador Naya & Rubén Fernández-Casal & Sonia Zaragoza & Paula Raña & Javier Tarrío-Saavedra, 2020. "Constructing a Control Chart Using Functional Data," Mathematics, MDPI, vol. 8(1), pages 1-26, January.
    17. Timothy M. Young & Ampalavanar Nanthakumar & Hari Nanthakumar, 2021. "On the Use of Copula for Quality Control Based on an AR(1) Model," Mathematics, MDPI, vol. 9(18), pages 1-13, September.
    18. Shi, Xiaoping & Wu, Yuehua & Miao, Baiqi, 2009. "Strong convergence rate of estimators of change point and its application," Computational Statistics & Data Analysis, Elsevier, vol. 53(4), pages 990-998, February.
    19. Thaga K. & Kgosi P. M. & Gabaitiri L., 2007. "Max-Chart for Autocorrelated Processes," Stochastics and Quality Control, De Gruyter, vol. 22(1), pages 87-105, January.
    20. Fryzlewicz, Piotr, 2020. "Detecting possibly frequent change-points: Wild Binary Segmentation 2 and steepest-drop model selection," LSE Research Online Documents on Economics 103430, London School of Economics and Political Science, LSE Library.
