
The Forward Search for Very Large Datasets

Author

Listed:
  • Riani, Marco
  • Perrotta, Domenico
  • Cerioli, Andrea

Abstract

The identification of atypical observations and the immunization of data analysis against both outliers and failures of modeling are important aspects of modern statistics. The forward search is a graphics-rich approach that leads to the formal detection of outliers and to the detection of model inadequacy, combined with suggestions for model enhancement. The key idea is to monitor quantities of interest, such as parameter estimates and test statistics, as the model is fitted to data subsets of increasing size. In this paper we propose some computational improvements to the forward search algorithm and provide a recursive implementation of the procedure that exploits the information from the previous step. The output is a set of efficient routines for fast updating of the model parameter estimates, which do not require any data sorting, and for fast computation of likelihood contributions, which do not require matrix inversion or QR decomposition. It is shown that the new algorithms reduce the computation time by more than 80%. Furthermore, the running time now increases almost linearly with the sample size. All the routines described in this paper are included in the FSDA toolbox for MATLAB, which is freely downloadable from the internet.
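To make the key idea concrete, the following is a minimal sketch of a naive forward search for simple linear regression. It is not the authors' algorithm and is not taken from the FSDA toolbox: it refits from scratch and sorts all residuals at every step, which are exactly the costs the paper's recursive routines are designed to avoid. The function names (`fit_line`, `lms_start`, `forward_search`), the exhaustive least-median-of-squares start, the use of raw rather than deletion residuals, and the toy data are all assumptions made for illustration.

```python
import math
from itertools import combinations
from statistics import median


def fit_line(x, y, idx):
    """Least-squares fit of y = a + b*x using only the units listed in idx."""
    m = len(idx)
    mx = sum(x[i] for i in idx) / m
    my = sum(y[i] for i in idx) / m
    sxx = sum((x[i] - mx) ** 2 for i in idx)
    sxy = sum((x[i] - mx) * (y[i] - my) for i in idx)
    b = sxy / sxx
    return my - b * mx, b


def lms_start(x, y, m0):
    """Crude robust start (an assumption of this sketch): try the line
    through every pair of units, keep the one with the least median of
    squared residuals, and return the m0 units closest to it."""
    n = len(x)
    best = None
    for i, j in combinations(range(n), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as y = a + b*x
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        crit = median((y[k] - a - b * x[k]) ** 2 for k in range(n))
        if best is None or crit < best[0]:
            best = (crit, a, b)
    _, a, b = best
    order = sorted(range(n), key=lambda k: (y[k] - a - b * x[k]) ** 2)
    return order[:m0]


def forward_search(x, y, m0=3):
    """Fit on subsets of increasing size m = m0, ..., n-1 and monitor the
    smallest absolute residual among the units left outside the subset."""
    n = len(x)
    subset = lms_start(x, y, m0)
    trace = []
    for m in range(m0, n):
        a, b = fit_line(x, y, subset)
        res2 = [(y[k] - a - b * x[k]) ** 2 for k in range(n)]
        inside = set(subset)
        min_out = min(math.sqrt(res2[k]) for k in range(n) if k not in inside)
        trace.append((m, min_out, sorted(subset)))
        # Naive step: the next subset holds the m+1 smallest squared residuals.
        subset = sorted(range(n), key=res2.__getitem__)[: m + 1]
    return trace


# Toy data: a clean line with small deterministic noise and three shifted units.
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + 0.1 * math.sin(7 * i) for i, xi in enumerate(x)]
outliers = {3: 30.0, 10: -30.0, 16: 40.0}
for i, shift in outliers.items():
    y[i] += shift

trace = forward_search(x, y)
for m, min_out, _ in trace:
    print(m, round(min_out, 3))
```

In this toy setting the monitored residual stays small while clean units remain outside the subset and rises sharply once only the three shifted units are left, which is the kind of signal the forward search plots are meant to reveal. A production implementation would monitor properly scaled deletion residuals and update the fit recursively rather than refitting.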

Suggested Citation

  • Riani, Marco & Perrotta, Domenico & Cerioli, Andrea, 2015. "The Forward Search for Very Large Datasets," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 67(c01).
  • Handle: RePEc:jss:jstsof:v:067:c01
    DOI: http://hdl.handle.net/10.18637/jss.v067.c01

    Download full text from publisher

    File URL: https://www.jstatsoft.org/index.php/jss/article/view/v067c01/v67c01.pdf
    Download Restriction: no


    References listed on IDEAS

    1. Marco Riani & Anthony C. Atkinson & Andrea Cerioli, 2009. "Finding an unknown number of multivariate outliers," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 71(2), pages 447-466, April.
    2. Zani, Sergio & Riani, Marco & Corbellini, Aldo, 1998. "Robust bivariate boxplots and multiple outlier detection," Computational Statistics & Data Analysis, Elsevier, vol. 28(3), pages 257-270, September.
    3. Andrea Cerioli & Domenico Perrotta, 2014. "Robust clustering around regression lines with high density regions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 8(1), pages 5-26, March.

    Citations


    Cited by:

    1. Lisa Crosato & Luigi Grossi, 2019. "Correcting outliers in GARCH models: a weighted forward approach," Statistical Papers, Springer, vol. 60(6), pages 1939-1970, December.
    2. Marco Riani & Andrea Cerioli & Francesca Torti, 2014. "On consistency factors and efficiency of robust S-estimators," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 23(2), pages 356-387, June.
    3. Francesca Torti & Domenico Perrotta & Marco Riani & Andrea Cerioli, 2019. "Assessing trimming methodologies for clustering linear regression data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 227-257, March.
    4. Anthony C. Atkinson & Marco Riani & Andrea Cerioli, 2018. "Cluster detection and clustering with random start forward searches," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(5), pages 777-798, April.
    5. Andrea Cerioli & Marco Riani & Anthony C. Atkinson & Aldo Corbellini, 2018. "The power of monitoring: how to make the most of a contaminated multivariate sample," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 27(4), pages 559-587, December.
    6. Domenico Perrotta & Francesca Torti, 2018. "Discussion of “The power of monitoring: how to make the most of a contaminated multivariate sample”," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 27(4), pages 641-649, December.
    7. Valentin Todorov, 2018. "Discussion of “The power of monitoring: how to make the most of a contaminated multivariate sample” by Andrea Cerioli, Marco Riani, Anthony C. Atkinson and Aldo Corbellini," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 27(4), pages 631-639, December.
    8. Anthony C. Atkinson & Aldo Corbellini & Marco Riani, 2017. "Robust Bayesian regression with the forward search: theory and data analysis," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 26(4), pages 869-886, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Cerioli, Andrea & Farcomeni, Alessio & Riani, Marco, 2014. "Strong consistency and robustness of the Forward Search estimator of multivariate location and scatter," Journal of Multivariate Analysis, Elsevier, vol. 126(C), pages 167-183.
    2. Francesca Torti & Domenico Perrotta & Marco Riani & Andrea Cerioli, 2019. "Assessing trimming methodologies for clustering linear regression data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(1), pages 227-257, March.
    3. Silvia Salini & Andrea Cerioli & Fabrizio Laurini & Marco Riani, 2016. "Reliable Robust Regression Diagnostics," International Statistical Review, International Statistical Institute, vol. 84(1), pages 99-127, April.
    4. Luigi Grossi & Fabrizio Laurini, 2011. "Robust estimation of efficient mean–variance frontiers," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 5(1), pages 3-22, April.
    5. Anthony C. Atkinson & Marco Riani & Andrea Cerioli, 2018. "Cluster detection and clustering with random start forward searches," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(5), pages 777-798, April.
    6. Søren Johansen & Lukasz Gatarek, 2014. "Optimal hedging with the cointegrated vector autoregressive model," CREATES Research Papers 2014-40, Department of Economics and Business Economics, Aarhus University.
    7. Anthony C. Atkinson & Aldo Corbellini & Marco Riani, 2017. "Robust Bayesian regression with the forward search: theory and data analysis," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 26(4), pages 869-886, December.
    8. Torti, Francesca & Corbellini, Aldo & Atkinson, Anthony C., 2021. "fsdaSAS: a package for robust regression for very large datasets including the batch forward search," LSE Research Online Documents on Economics 109895, London School of Economics and Political Science, LSE Library.
    9. Pokojovy, Michael & Jobe, J. Marcus, 2022. "A robust deterministic affine-equivariant algorithm for multivariate location and scatter," Computational Statistics & Data Analysis, Elsevier, vol. 172(C).
    10. Aldo Corbellini & Marco Riani & Anthony Atkinson, 2015. "Hubert, Rousseeuw and Segaert: multivariate functional outlier detection," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 24(2), pages 257-261, July.
    11. Anthony C. Atkinson & Marco Riani & Aldo Corbellini, 2020. "The analysis of transformations for profit‐and‐loss data," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 69(2), pages 251-275, April.
    12. Arismendi, Juan C. & Broda, Simon, 2017. "Multivariate elliptical truncated moments," Journal of Multivariate Analysis, Elsevier, vol. 157(C), pages 29-44.
    13. Salvatore Ingrassia & Simona Minotti & Giorgio Vittadini, 2012. "Local Statistical Modeling via a Cluster-Weighted Approach with Elliptical Distributions," Journal of Classification, Springer;The Classification Society, vol. 29(3), pages 363-401, October.
    14. Domenico Perrotta & Marco Riani & Francesca Torti, 2009. "New robust dynamic plots for regression mixture detection," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 3(3), pages 263-279, December.
    15. Menjoge, Rajiv S. & Welsch, Roy E., 2010. "A diagnostic method for simultaneous feature selection and outlier identification in linear regression," Computational Statistics & Data Analysis, Elsevier, vol. 54(12), pages 3181-3193, December.
    16. Luca Greco & Giovanni Saraceno & Claudio Agostinelli, 2021. "Robust Fitting of a Wrapped Normal Model to Multivariate Circular Data and Outlier Detection," Stats, MDPI, vol. 4(2), pages 1-18, June.
    17. Zuppiroli, Marco & Donati, Michele & Riani, Marco & Verga, Giovanni, 2015. "The Impact of Trading Activity in Agricultural Futures Markets," 2015 Fourth Congress, June 11-12, 2015, Ancona, Italy 207848, Italian Association of Agricultural and Applied Economics (AIEAA).
    18. Kirschstein, Thomas & Liebscher, Steffen & Becker, Claudia, 2013. "Robust estimation of location and scatter by pruning the minimum spanning tree," Journal of Multivariate Analysis, Elsevier, vol. 120(C), pages 173-184.
    19. Marco Riani & Andrea Cerioli & Francesca Torti, 2014. "On consistency factors and efficiency of robust S-estimators," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 23(2), pages 356-387, June.
    20. Anthony Atkinson & Marco Riani, 2004. "The forward search and data visualisation," Computational Statistics, Springer, vol. 19(1), pages 29-54, February.
