IDEAS home Printed from https://ideas.repec.org/a/eee/ejores/v297y2022i2p782-794.html

Sparse regression for large data sets with outliers

Authors

  • Bottmer, Lea
  • Croux, Christophe
  • Wilms, Ines

Abstract

The linear regression model remains an important workhorse for data scientists. However, many data sets contain far more predictors than observations, and outliers, or anomalies, occur frequently. This paper proposes an algorithm for regression analysis, which we call “sparse shooting S”, that addresses these features typical of big data sets. The resulting regression coefficients are sparse, meaning that many of them are set to zero, thereby selecting the most relevant predictors. A distinctive feature of the method is its robustness to outliers in individual cells of the data matrix. A simulation study shows the excellent performance of this robust variable selection and prediction method, and a real data application on car fuel consumption demonstrates its usefulness.
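
To make the notion of sparsity in the abstract concrete, the following minimal sketch shows how a penalised regression can set a coefficient to exactly zero. It is an assumption-laden illustration, not the paper's method: it implements the ordinary (non-robust) lasso with cyclic coordinate descent, a scheme often called "shooting", and it has none of the cellwise outlier robustness that sparse shooting S is said to provide. The function names, the toy data, and the penalty value `lam=0.1` are all hypothetical.

```python
# Illustrative sketch only: plain lasso via coordinate descent ("shooting"),
# NOT the robust sparse shooting S estimator proposed in the paper.

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; return exactly 0.0 when |z| <= gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_shooting(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cycling over coordinates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # inner product of predictor j with the partial residual
            norm = 0.0  # squared norm of predictor j
            for i in range(n):
                # Fit from all predictors except j, using the current coefficients.
                fit_wo_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_wo_j)
                norm += X[i][j] ** 2
            # Soft-thresholding is what produces exact zeros.
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    return beta


# Hypothetical toy data: the response depends on the first predictor only.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.0, 4.0, 6.0, 8.0]
beta = lasso_shooting(X, y, lam=0.1)
print(beta)  # second coefficient is exactly 0.0; first is slightly below 2
```

On this toy data the irrelevant second predictor is dropped, while the first coefficient is shrunk slightly below its true value of 2. That shrinkage is the usual price of the lasso penalty. A single corrupted cell in `X` or `y` could distort this non-robust fit, which is the weakness the paper's method is designed to address.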

Suggested Citation

  • Bottmer, Lea & Croux, Christophe & Wilms, Ines, 2022. "Sparse regression for large data sets with outliers," European Journal of Operational Research, Elsevier, vol. 297(2), pages 782-794.
  • Handle: RePEc:eee:ejores:v:297:y:2022:i:2:p:782-794
    DOI: 10.1016/j.ejor.2021.05.049

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S037722172100477X
    Download Restriction: Full text for ScienceDirect subscribers only


    Citations

    Cited by:

    1. Hossein Tarighi & Zeynab Nourbakhsh Hosseiny & Maryam Akbari & Elaheh Mohammadhosseini, 2023. "The Moderating Effect of the COVID-19 Pandemic on the Relation between Corporate Governance and Firm Performance," JRFM, MDPI, vol. 16(7), pages 1-43, June.
    2. Fu, Saiji & Tian, Yingjie & Tang, Long, 2023. "Robust regression under the general framework of bounded loss functions," European Journal of Operational Research, Elsevier, vol. 310(3), pages 1325-1339.
    3. Barbato, Michele & Ceselli, Alberto, 2024. "Mathematical programming for simultaneous feature selection and outlier detection under l1 norm," European Journal of Operational Research, Elsevier, vol. 316(3), pages 1070-1084.
    4. Mohd Shareduwan Mohd Kasihmuddin & Siti Zulaikha Mohd Jamaludin & Mohd. Asyraf Mansor & Habibah A. Wahab & Siti Maisharah Sheikh Ghadzi, 2022. "Supervised Learning Perspective in Logic Mining," Mathematics, MDPI, vol. 10(6), pages 1-35, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. He Jiang, 2023. "Robust forecasting in spatial autoregressive model with total variation regularization," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 42(2), pages 195-211, March.
    2. He Jiang, 2022. "A novel robust structural quadratic forecasting model and applications," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 41(6), pages 1156-1180, September.
    3. Van Belle, Jente & Guns, Tias & Verbeke, Wouter, 2021. "Using shared sell-through data to forecast wholesaler demand in multi-echelon supply chains," European Journal of Operational Research, Elsevier, vol. 288(2), pages 466-479.
    4. Guillaume Coqueret & Tony Guida, 2020. "Training trees on tails with applications to portfolio choice," Post-Print hal-04144665, HAL.
    5. Fildes, Robert & Ma, Shaohui & Kolassa, Stephan, 2019. "Retail forecasting: research and practice," MPRA Paper 89356, University Library of Munich, Germany.
    6. Guillaume Coqueret & Tony Guida, 2020. "Training trees on tails with applications to portfolio choice," Annals of Operations Research, Springer, vol. 288(1), pages 181-221, May.
    7. Ma, Shaohui & Fildes, Robert, 2020. "Forecasting third-party mobile payments with implications for customer flow prediction," International Journal of Forecasting, Elsevier, vol. 36(3), pages 739-760.
    8. Achim Ahrens & Christian B. Hansen & Mark E. Schaffer, 2020. "lassopack: Model selection and prediction with regularized regression in Stata," Stata Journal, StataCorp LP, vol. 20(1), pages 176-235, March.
    9. Ma, Shaohui & Fildes, Robert, 2021. "Retail sales forecasting with meta-learning," European Journal of Operational Research, Elsevier, vol. 288(1), pages 111-128.
    10. Wang, Shixuan & Syntetos, Aris A. & Liu, Ying & Di Cairano-Gilfedder, Carla & Naim, Mohamed M., 2023. "Improving automotive garage operations by categorical forecasts using a large number of variables," European Journal of Operational Research, Elsevier, vol. 306(2), pages 893-908.
    11. Fildes, Robert & Ma, Shaohui & Kolassa, Stephan, 2022. "Retail forecasting: Research and practice," International Journal of Forecasting, Elsevier, vol. 38(4), pages 1283-1318.
    12. Ma, Shaohui & Fildes, Robert, 2017. "A retail store SKU promotions optimization model for category multi-period profit maximization," European Journal of Operational Research, Elsevier, vol. 260(2), pages 680-692.
    13. Gür Ali, Özden & Gürlek, Ragıp, 2020. "Automatic Interpretable Retail forecasting with promotional scenarios," International Journal of Forecasting, Elsevier, vol. 36(4), pages 1389-1406.
    14. Chou, Ping & Chuang, Howard Hao-Chun & Chou, Yen-Chun & Liang, Ting-Peng, 2022. "Predictive analytics for customer repurchase: Interdisciplinary integration of buy till you die modeling and machine learning," European Journal of Operational Research, Elsevier, vol. 296(2), pages 635-651.
    15. Umberto Amato & Anestis Antoniadis & Italia De Feis & Irene Gijbels, 2021. "Penalised robust estimators for sparse and high-dimensional linear models," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(1), pages 1-48, March.
    16. Huber, Jakob & Stuckenschmidt, Heiner, 2020. "Daily retail demand forecasting using machine learning with emphasis on calendric special days," International Journal of Forecasting, Elsevier, vol. 36(4), pages 1420-1438.
    17. O’Sullivan, Conall & Papavassiliou, Vassilios G. & Wafula, Ronald Wekesa & Boubaker, Sabri, 2024. "New insights into liquidity resiliency," Journal of International Financial Markets, Institutions and Money, Elsevier, vol. 90(C).
    18. Victor Chernozhukov & Denis Chetverikov & Mert Demirer & Esther Duflo & Christian Hansen & Whitney K. Newey, 2016. "Double machine learning for treatment and causal parameters," CeMMAP working papers 49/16, Institute for Fiscal Studies.
    19. Philipp Bach & Victor Chernozhukov & Malte S. Kurz & Martin Spindler & Sven Klaassen, 2021. "DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R," Papers 2103.09603, arXiv.org, revised Jun 2024.
    20. Alexander Jakob Dautel & Wolfgang Karl Härdle & Stefan Lessmann & Hsin-Vonn Seow, 2020. "Forex exchange rate forecasting using deep recurrent neural networks," Digital Finance, Springer, vol. 2(1), pages 69-96, September.
