
A framework of irregularity enlightenment for data pre-processing in data mining

Author

Listed:
  • Siu-Tong Au
  • Rong Duan
  • Siamak Hesar
  • Wei Jiang

Abstract

Irregularities are widespread in large databases and often lead to erroneous conclusions in data mining and statistical analysis. For example, many parameter estimation procedures produce considerable bias when significant irregularities are not properly handled. Most data cleaning tools assume one known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for situations in which multiple irregularities are hidden in large volumes of data in general, and in cross-sectional time series in particular. It develops an automatic data mining platform that captures key irregularities and classifies them by their importance in a database. By decomposing time series data into basic components, we propose to optimize a penalized least-squares loss function to aid the selection of key irregularities in consecutive steps and to cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis. Copyright Springer Science+Business Media, LLC 2010
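
The abstract does not give the loss function or the selection procedure, so the following is only a minimal illustrative sketch of the general idea of stepwise selection under a penalized least-squares loss, not the authors' IE algorithm. It assumes two hypothetical irregularity types (additive outliers modeled as unit pulses and level shifts modeled as unit steps), a loss of the form RSS + penalty × (number of selected components), and a stopping rule based on the fraction of variation removed. The function names, the `penalty` and `target_reduction` parameters, and the toy series are all assumptions made for illustration.

```python
import numpy as np


def candidate_components(n):
    """Hypothetical candidate irregularities for a series of length n:
    a unit pulse at t (additive outlier) and a unit step from t on (level shift)."""
    comps = []
    for t in range(n):
        pulse = np.zeros(n)
        pulse[t] = 1.0
        comps.append((("outlier", t), pulse))
    for t in range(1, n):
        step = np.zeros(n)
        step[t:] = 1.0
        comps.append((("shift", t), step))
    return comps


def residual_ss(X, y):
    """Residual sum of squares of the ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def select_irregularities(y, penalty, target_reduction=0.95, max_steps=10):
    """Greedy forward selection: at each step add the component that lowers
    the RSS most, and keep it only if RSS + penalty * k decreases."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    comps = candidate_components(n)
    X = np.ones((n, 1))            # start from an intercept-only model
    base = residual_ss(X, y)       # total variation around the mean
    current = base
    chosen = []
    for _ in range(max_steps):
        # Stop once enough of the variation has been removed.
        if base == 0.0 or 1.0 - current / base >= target_reduction:
            break
        best = None
        for label, vec in comps:
            if label in chosen:
                continue
            X_try = np.column_stack([X, vec])
            rss = residual_ss(X_try, y)
            if best is None or rss < best[0]:
                best = (rss, label, X_try)
        # One more component costs `penalty`; require a larger RSS drop.
        if best is None or current - best[0] <= penalty:
            break
        current, label, X = best
        chosen.append(label)
    reduction = 0.0 if base == 0.0 else 1.0 - current / base
    return chosen, reduction


# Toy series (assumed data): a level shift at t=10 and an outlier at t=4.
y = np.zeros(20)
y[10:] = 5.0
y[4] = 8.0
chosen, reduction = select_irregularities(y, penalty=1.0)
print(chosen, round(reduction, 3))
```

In this sketch the order in which components are selected serves as a rough ranking of their importance, since each step picks the component that removes the most remaining variation. A larger `penalty` yields fewer selected irregularities.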

Suggested Citation

  • Siu-Tong Au & Rong Duan & Siamak Hesar & Wei Jiang, 2010. "A framework of irregularity enlightenment for data pre-processing in data mining," Annals of Operations Research, Springer, vol. 174(1), pages 47-66, February.
  • Handle: RePEc:spr:annopr:v:174:y:2010:i:1:p:47-66:10.1007/s10479-008-0494-z
    DOI: 10.1007/s10479-008-0494-z

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1007/s10479-008-0494-z
    Download Restriction: Access to full text is restricted to subscribers.


    References listed on IDEAS

    1. Bianco, Ana Maria, et al, 2001. "Outlier Detection in Regression Models with ARIMA Errors Using Robust Estimates," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 20(8), pages 565-579, December.
    2. P. M. Lerman, 1980. "Fitting Segmented Regression Models by Grid Search," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 29(1), pages 77-84, March.
    3. Lavielle, Marc, 1999. "Detection of multiple changes in a sequence of dependent variables," Stochastic Processes and their Applications, Elsevier, vol. 83(1), pages 79-102, September.
    4. Hawkins, Douglas M., 2001. "Fitting multiple change-point models to data," Computational Statistics & Data Analysis, Elsevier, vol. 37(3), pages 323-341, September.
    5. Alwan, Layth C & Roberts, Harry V, 1988. "Time-Series Modeling for Statistical Process Control," Journal of Business & Economic Statistics, American Statistical Association, vol. 6(1), pages 87-95, January.

    Citations


    Cited by:

    1. Rania Jammazi & Duc Khuong Nguyen, 2017. "Estimating and forecasting portfolio’s Value-at-Risk with wavelet-based extreme value theory: Evidence from crude oil prices and US exchange rates," Journal of the Operational Research Society, Palgrave Macmillan;The OR Society, vol. 68(11), pages 1352-1362, November.
    2. Jianxiong Gao & Zongwen An & Xuezong Bai, 2022. "A new representation method for probability distributions of multimodal and irregular data based on uniform mixture model," Annals of Operations Research, Springer, vol. 311(1), pages 81-97, April.
    3. Mark Gilchrist & Deana Lehmann Mooers & Glenn Skrubbeltrang & Francine Vachon, 2012. "Knowledge Discovery in Databases for Competitive Advantage," Journal of Management and Strategy, Journal of Management and Strategy, Sciedu Press, vol. 3(2), pages 2-15, April.
    4. George Chalamandaris & Nikos E. Vlachogiannakis, 2018. "Are financial ratios relevant for trading credit risk? Evidence from the CDS market," Annals of Operations Research, Springer, vol. 266(1), pages 395-440, July.
    5. Jammazi, Rania & Aloui, Chaker, 2015. "On the interplay between energy consumption, economic growth and CO2 emission nexus in the GCC countries: A comparative analysis through wavelet approaches," Renewable and Sustainable Energy Reviews, Elsevier, vol. 51(C), pages 1737-1751.
    6. Aloui, Chaker & Jammazi, Rania, 2015. "Dependence and risk assessment for oil prices and exchange rate portfolios: A wavelet based approach," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 436(C), pages 62-86.
    7. Sun, Edward W. & Meinl, Thomas, 2012. "A new wavelet-based denoising algorithm for high-frequency financial data mining," European Journal of Operational Research, Elsevier, vol. 217(3), pages 589-599.
    8. Jammazi, Rania & Reboredo, Juan C., 2016. "Dependence and risk management in oil and stock markets. A wavelet-copula analysis," Energy, Elsevier, vol. 107(C), pages 866-888.
    9. Mahdi Massahi & Masoud Mahootchi & Alireza Arshadi Khamseh, 2020. "Development of an efficient cluster-based portfolio optimization model under realistic market conditions," Empirical Economics, Springer, vol. 59(5), pages 2423-2442, November.
    10. Sun, Edward W. & Chen, Yi-Ting & Yu, Min-Teh, 2015. "Generalized optimal wavelet decomposing algorithm for big financial data," International Journal of Production Economics, Elsevier, vol. 165(C), pages 194-214.
    11. Asil Oztekin, 2018. "Creating a marketing strategy in healthcare industry: a holistic data analytic approach," Annals of Operations Research, Springer, vol. 270(1), pages 361-382, November.
    12. Jammazi, Rania & Aloui, Chaker, 2015. "Environment degradation, economic growth and energy consumption nexus: A wavelet-windowed cross correlation approach," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 436(C), pages 110-125.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Salvatore Fasola & Vito M. R. Muggeo & Helmut Küchenhoff, 2018. "A heuristic, iterative algorithm for change-point detection in abrupt change models," Computational Statistics, Springer, vol. 33(2), pages 997-1015, June.
    2. Samari, Goleen & Catalano, Ralph & Alcalá, Héctor E. & Gemmill, Alison, 2020. "The Muslim Ban and preterm birth: Analysis of U.S. vital statistics data from 2009 to 2018," Social Science & Medicine, Elsevier, vol. 265(C).
    3. Paul Fogel & Yann Gaston-Mathé & Douglas Hawkins & Fajwel Fogel & George Luta & S. Stanley Young, 2016. "Applications of a Novel Clustering Approach Using Non-Negative Matrix Factorization to Environmental Research in Public Health," IJERPH, MDPI, vol. 13(5), pages 1-14, May.
    4. Surgailis, Donatas & Teyssière, Gilles & Vaiciulis, Marijus, 2008. "The increment ratio statistic," Journal of Multivariate Analysis, Elsevier, vol. 99(3), pages 510-541, March.
    5. Marta Benková & Dagmar Bednárová & Gabriela Bogdanovská & Marcela Pavlíčková, 2023. "Use of Statistical Process Control for Coking Time Monitoring," Mathematics, MDPI, vol. 11(16), pages 1-30, August.
    6. Johannes Freiesleben & Nicolas Guérin, 2015. "Homogenization and Clustering as a Non-Statistical Methodology to Assess Multi-Parametrical Chain Problems," Papers 1505.03874, arXiv.org, revised Dec 2017.
    7. Adham Alsharkawi & Mohammad Al-Fetyani & Maha Dawas & Heba Saadeh & Musa Alyaman, 2021. "Poverty Classification Using Machine Learning: The Case of Jordan," Sustainability, MDPI, vol. 13(3), pages 1-16, January.
    8. Yann Guédon, 2013. "Exploring the latent segmentation space for the assessment of multiple change-point models," Computational Statistics, Springer, vol. 28(6), pages 2641-2678, December.
    9. Miguel Flores & Salvador Naya & Rubén Fernández-Casal & Sonia Zaragoza & Paula Raña & Javier Tarrío-Saavedra, 2020. "Constructing a Control Chart Using Functional Data," Mathematics, MDPI, vol. 8(1), pages 1-26, January.
    10. Timothy M. Young & Ampalavanar Nanthakumar & Hari Nanthakumar, 2021. "On the Use of Copula for Quality Control Based on an AR(1) Model," Mathematics, MDPI, vol. 9(18), pages 1-13, September.
    11. Thaga K. & Kgosi P. M. & Gabaitiri L., 2007. "Max-Chart for Autocorrelated Processes," Stochastics and Quality Control, De Gruyter, vol. 22(1), pages 87-105, January.
    12. Kang-Ping Lu & Shao-Tung Chang, 2021. "Robust Algorithms for Change-Point Regressions Using the t-Distribution," Mathematics, MDPI, vol. 9(19), pages 1-28, September.
    13. Fontaine, Charles & Frostig, Ron D. & Ombao, Hernando, 2020. "Modeling non-linear spectral domain dependence using copulas with applications to rat local field potentials," Econometrics and Statistics, Elsevier, vol. 15(C), pages 85-103.
    14. Tan, Xiujie & Xiao, Ziwei & Liu, Yishuang & Taghizadeh-Hesary, Farhad & Wang, Banban & Dong, Hanmin, 2022. "The effect of green credit policy on energy efficiency: Evidence from China," Technological Forecasting and Social Change, Elsevier, vol. 183(C).
    15. Suwon Song & Chun Gun Park, 2019. "Alternative Algorithm for Automatically Driving Best-Fit Building Energy Baseline Models Using a Data-Driven Grid Search," Sustainability, MDPI, vol. 11(24), pages 1-11, December.
    16. Mondher Bellalah & Marc Lavielle, 2002. "A Decomposition of Empirical Distributions with Applications to the Valuation of Derivative Assets," Multinational Finance Journal, Multinational Finance Journal, vol. 6(2), pages 99-130, June.
    17. Marczak, Martyna & Proietti, Tommaso & Grassi, Stefano, 2018. "A data-cleaning augmented Kalman filter for robust estimation of state space models," Econometrics and Statistics, Elsevier, vol. 5(C), pages 107-123.
    18. Tahira Kootbodien & Nisha Naicker & Kerry S. Wilson & Raj Ramesar & Leslie London, 2020. "Trends in Suicide Mortality in South Africa, 1997 to 2016," IJERPH, MDPI, vol. 17(6), pages 1-16, March.
    19. Venkata Jandhyala & Stergios Fotopoulos & Ian MacNeill & Pengyu Liu, 2013. "Inference for single and multiple change-points in time series," Journal of Time Series Analysis, Wiley Blackwell, vol. 34(4), pages 423-446, July.
    20. Jonathan Readshaw & Stefano Giani, 2020. "Using Company Specific Headlines and Convolutional Neural Networks to Predict Stock Fluctuations," Papers 2006.12426, arXiv.org.
