
Robust machine learning by median-of-means: theory and practice

Author

Listed:
  • Guillaume Lecué

    (CREST; CNRS; Université Paris Saclay)

  • Mathieu Lerasle

    (CNRS, Département de mathématiques d'Orsay)

Abstract

We introduce new estimators for robust machine learning based on median-of-means (MOM) estimators of the mean of real-valued random variables. These estimators achieve optimal rates of convergence under minimal assumptions on the dataset. The dataset may also have been corrupted by outliers, on which no assumption is made. We also analyze these new estimators with standard tools from robust statistics. In particular, we revisit the concept of breakdown point. We modify the original definition by studying the number of outliers that a dataset can contain without deteriorating the estimation properties of a given estimator. This new notion of breakdown number, which takes into account the statistical performance of the estimators, is non-asymptotic in nature and adapted to machine learning purposes. We prove that the breakdown number of our estimator is of the order of (number of observations) × (rate of convergence). For instance, the breakdown number of our estimators for the problem of estimating a d-dimensional vector with noise variance σ² is σ²d, and it becomes σ²s log(ed/s) when this vector has only s non-zero components. Beyond this breakdown point, we prove that the rate of convergence achieved by our estimator is the number of outliers divided by the number of observations. Besides these theoretical guarantees, the major improvement brought by these new estimators is that they are easily computable in practice. In fact, basically any algorithm used to approximate the standard Empirical Risk Minimizer (or its regularized versions) has a robust version approximating our estimators. On top of being robust to outliers, the "MOM versions" of the algorithms are even faster than the original ones, less demanding in memory resources in some situations, and well adapted to distributed datasets, which makes them particularly attractive for large-scale data analysis. As a proof of concept, we study several algorithms for the classical LASSO estimator. It turns out that the original algorithm can be substantially improved in practice by randomizing the blocks on which "local means" are computed at each step of the descent algorithm. A byproduct of this modification is that our algorithms come with a measure of depth of data that can be used to detect outliers, which is another major issue in machine learning.
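
The two ingredients described in the abstract lend themselves to a short illustration. Below is a minimal Python sketch, assuming a linear model with squared loss: mom_mean implements the basic median-of-means estimator of a mean, and mom_lasso is a toy "MOM version" of proximal gradient descent for the LASSO in which the blocks are re-randomized at every step, in the spirit of the randomization the abstract advocates. All names, the number of blocks, the step size, and the iteration count here are illustrative choices, not the authors' implementation.

    import numpy as np

    def mom_mean(x, n_blocks, seed=0):
        # Median-of-means estimate of a mean: split the sample into
        # equal-size blocks at random, average within each block, then
        # return the median of the block means.
        x = np.asarray(x, dtype=float)
        rng = np.random.default_rng(seed)
        blocks = np.array_split(rng.permutation(len(x)), n_blocks)
        return np.median([x[b].mean() for b in blocks])

    def soft_threshold(v, t):
        # Proximal operator of t * ||.||_1 (the LASSO penalty).
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def mom_lasso(X, y, lam, n_blocks=11, n_iter=500, step=0.01, seed=0):
        # Illustrative "MOM version" of proximal gradient descent for the
        # LASSO: at each iteration, re-randomize the blocks, locate the
        # block whose empirical risk is the median of the block risks, and
        # take the proximal gradient step on that block only, so that a
        # minority of outlying observations cannot control the update.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iter):
            blocks = np.array_split(rng.permutation(n), n_blocks)
            risks = [np.mean((X[b] @ w - y[b]) ** 2) for b in blocks]
            k = np.argsort(risks)[n_blocks // 2]  # median-risk block
            b = blocks[k]
            grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w = soft_threshold(w - step * grad, step * lam)
        return w

Since each update touches only one block of roughly n/K observations, a MOM step is cheaper in time and memory than a full-sample gradient step and lends itself to distributed data, consistent with the computational claims above. Tracking how often each observation lands in the median-risk block across iterations is one plausible reading of the outlier-detecting "depth of data" measure the abstract mentions; the precise definition is in the paper itself.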

Suggested Citation

  • Guillaume Lecué & Mathieu Lerasle, 2017. "Robust machine learning by median-of-means: theory and practice," Working Papers 2017-32, Center for Research in Economics and Statistics.
  • Handle: RePEc:crs:wpaper:2017-32

    Download full text from publisher

    File URL: http://crest.science/RePEc/wpstorage/2017-32.pdf
    File Function: CREST working paper version
    Download Restriction: no


    Citations

    Citations are extracted by the CitEc Project.


    Cited by:

    1. Pengfei Liu & Mengchen Zhang & Ru Zhang & Qin Zhou, 2021. "Robust Estimation and Tests for Parameters of Some Nonlinear Regression Models," Mathematics, MDPI, vol. 9(6), pages 1-16, March.
    2. Adarsh Prasad & Arun Sai Suggala & Sivaraman Balakrishnan & Pradeep Ravikumar, 2020. "Robust estimation via robust gradient estimation," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 82(3), pages 601-627, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Han, Dongxiao & Huang, Jian & Lin, Yuanyuan & Shen, Guohao, 2022. "Robust post-selection inference of high-dimensional mean regression with heavy-tailed asymmetric or heteroskedastic errors," Journal of Econometrics, Elsevier, vol. 230(2), pages 416-431.
    2. Luo, Jiyu & Sun, Qiang & Zhou, Wen-Xin, 2022. "Distributed adaptive Huber regression," Computational Statistics & Data Analysis, Elsevier, vol. 169(C).
    3. Lecué, Guillaume & Lerasle, Matthieu, 2019. "Learning from MOM’s principles: Le Cam’s approach," Stochastic Processes and their Applications, Elsevier, vol. 129(11), pages 4385-4410.
    4. van de Geer, Sara, 2016. "Worst possible sub-directions in high-dimensional models," Journal of Multivariate Analysis, Elsevier, vol. 146(C), pages 248-260.
    5. Ciuperca, Gabriela, 2021. "Variable selection in high-dimensional linear model with possibly asymmetric errors," Computational Statistics & Data Analysis, Elsevier, vol. 155(C).
    6. Alexandre Belloni & Victor Chernozhukov & Denis Chetverikov & Christian Hansen & Kengo Kato, 2018. "High-dimensional econometrics and regularized GMM," CeMMAP working papers CWP35/18, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    7. Alexandre Belloni & Victor Chernozhukov & Kengo Kato, 2019. "Valid Post-Selection Inference in High-Dimensional Approximately Sparse Quantile Regression Models," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 114(526), pages 749-758, April.
    8. Cabana Garceran del Vall, Elisa & Laniado Rodas, Henry & Lillo Rodríguez, Rosa Elvira, 2017. "Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators," DES - Working Papers. Statistics and Econometrics. WS 24613, Universidad Carlos III de Madrid. Departamento de Estadística.
    9. Susan Athey & Guido W. Imbens & Stefan Wager, 2018. "Approximate residual balancing: debiased inference of average treatment effects in high dimensions," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 80(4), pages 597-623, September.
    10. Jelena Bradic & Weijie Ji & Yuqian Zhang, 2021. "High-dimensional Inference for Dynamic Treatment Effects," Papers 2110.04924, arXiv.org, revised May 2023.
    11. Chenchuan (Mark) Li & Ulrich K. Müller, 2021. "Linear regression with many controls of limited explanatory power," Quantitative Economics, Econometric Society, vol. 12(2), pages 405-442, May.
    12. Alexandre Belloni & Victor Chernozhukov & Christian Hansen & Damian Kozbur, 2016. "Inference in High-Dimensional Panel Models With an Application to Gun Control," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 34(4), pages 590-605, October.
    13. X. Jessie Jeng & Huimin Peng & Wenbin Lu, 2021. "Model Selection With Mixed Variables on the Lasso Path," Sankhya B: The Indian Journal of Statistics, Springer; Indian Statistical Institute, vol. 83(1), pages 170-184, May.
    14. Shengchun Kong & Zhuqing Yu & Xianyang Zhang & Guang Cheng, 2021. "High‐dimensional robust inference for Cox regression models using desparsified Lasso," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics; Finnish Statistical Society; Norwegian Statistical Association; Swedish Statistical Association, vol. 48(3), pages 1068-1095, September.
    15. Umberto Amato & Anestis Antoniadis & Italia De Feis & Irene Gijbels, 2021. "Penalised robust estimators for sparse and high-dimensional linear models," Statistical Methods & Applications, Springer; Società Italiana di Statistica, vol. 30(1), pages 1-48, March.
    16. Victor Chernozhukov & Whitney K. Newey & Victor Quintas-Martinez & Vasilis Syrgkanis, 2021. "Automatic Debiased Machine Learning via Riesz Regression," Papers 2104.14737, arXiv.org, revised Mar 2024.
    17. Guo, Xu & Li, Runze & Liu, Jingyuan & Zeng, Mudong, 2023. "Statistical inference for linear mediation models with high-dimensional mediators and application to studying stock reaction to COVID-19 pandemic," Journal of Econometrics, Elsevier, vol. 235(1), pages 166-179.
    18. Saulius Jokubaitis & Remigijus Leipus, 2022. "Asymptotic Normality in Linear Regression with Approximately Sparse Structure," Mathematics, MDPI, vol. 10(10), pages 1-28, May.
    19. Caner, Mehmet, 2023. "Generalized linear models with structured sparsity estimators," Journal of Econometrics, Elsevier, vol. 236(2).
    20. Stéphane Chrétien & Camille Giampiccolo & Wenjuan Sun & Jessica Talbott, 2021. "Fast Hyperparameter Calibration of Sparsity Enforcing Penalties in Total Generalised Variation Penalised Reconstruction Methods for XCT Using a Planted Virtual Reference Image," Mathematics, MDPI, vol. 9(22), pages 1-12, November.

    More about this item

    NEP fields

    This paper has been announced in the following NEP Reports:

    Statistics



    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.