Printed from https://ideas.repec.org/a/eee/ejores/v265y2018i3p993-1004.html

High dimensional data classification and feature selection using support vector machines

Author

Listed:
  • Ghaddar, Bissan
  • Naoum-Sawaya, Joe

Abstract

In many big-data systems, large amounts of information are recorded and stored for analytics purposes. Often, however, this vast amount of information does not offer additional benefits for optimal decision making, and may instead be complicated and too costly to collect, store, and process. For instance, tumor classification using high-throughput microarray data is challenging due to the presence of a large number of noisy features that do not contribute to the reduction of classification errors. For such problems, the general aim is to find a limited number of genes that highly differentiate among the classes. Thus, in this paper, we address a specific class of machine learning problems, namely feature selection within support vector machine classification, which deals with finding an accurate binary classifier that uses a minimal number of features. We introduce a new approach based on iteratively adjusting a bound on the l1-norm of the classifier vector in order to force the number of selected features to converge towards the desired maximum limit. We analyze two real-life classification problems with high dimensional features. The first case is the medical diagnosis of tumors based on microarray data, where we present a generic approach for cancer classification based on gene expression. The second case deals with sentiment classification of online reviews from Amazon, Yelp, and IMDb. The results show that the proposed classification and feature selection approach is simple, computationally tractable, and achieves low error rates, which are key for the construction of advanced decision-support systems.
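
The idea of iteratively adjusting a bound on the l1-norm until the number of selected features falls within a limit can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not state the update rule or the solver, so the bisection on the bound `t`, the projected subgradient training of a hinge-loss classifier, the toy data, and all function names below are assumptions made for illustration only.

```python
import random

def project_l1_ball(w, t):
    """Euclidean projection of w onto the set {v : ||v||_1 <= t}."""
    if t <= 0:
        return [0.0] * len(w)
    if sum(abs(x) for x in w) <= t:
        return list(w)
    # Find the soft-threshold theta from the sorted magnitudes.
    magnitudes = sorted((abs(x) for x in w), reverse=True)
    running, theta = 0.0, 0.0
    for j, m in enumerate(magnitudes, start=1):
        running += m
        candidate = (running - t) / j
        if m - candidate > 0:
            theta = candidate
    return [(1.0 if x > 0 else -1.0) * max(abs(x) - theta, 0.0) for x in w]

def train_l1_bounded_svm(X, y, t, epochs=100, lr=0.5):
    """Hinge-loss linear classifier with ||w||_1 <= t (projected subgradient)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for epoch in range(epochs):
        step = lr / (1 + epoch) ** 0.5
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute a subgradient
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = project_l1_ball([wj - step * gj for wj, gj in zip(w, gw)], t)
        b -= step * gb
    return w, b

def count_features(w, tol=1e-9):
    """Number of features with a non-zero weight."""
    return sum(1 for wj in w if abs(wj) > tol)

def fit_with_feature_limit(X, y, k, t_max=10.0, rounds=8):
    """Bisect on the l1 bound t (an assumed update rule) to keep at most k features."""
    lo, hi = 0.0, t_max
    best = ([0.0] * len(X[0]), 0.0, 0.0)  # the all-zero classifier is always feasible
    for _ in range(rounds):
        t = (lo + hi) / 2
        w, b = train_l1_bounded_svm(X, y, t)
        if count_features(w) > k:
            hi = t  # too many features: tighten the bound
        else:
            lo, best = t, (w, b, t)  # within the limit: try a looser bound
    return best

# Toy data (hypothetical): only features 0 and 1 determine the label.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(10)] for _ in range(60)]
y = [1 if x[0] - x[1] >= 0 else -1 for x in X]
w, b, t = fit_with_feature_limit(X, y, k=2)
print("bound:", t, "features kept:", [j for j, wj in enumerate(w) if abs(wj) > 1e-9])
```

A tighter bound generally drives more weights to exactly zero, which is why the sketch shrinks `t` when too many features survive and relaxes it otherwise. The feature count is not guaranteed to be monotone in `t`, so the bisection is a heuristic here.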

Suggested Citation

  • Ghaddar, Bissan & Naoum-Sawaya, Joe, 2018. "High dimensional data classification and feature selection using support vector machines," European Journal of Operational Research, Elsevier, vol. 265(3), pages 993-1004.
  • Handle: RePEc:eee:ejores:v:265:y:2018:i:3:p:993-1004
    DOI: 10.1016/j.ejor.2017.08.040

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0377221717307713
    Download Restriction: Full text for ScienceDirect subscribers only

    File URL: https://libkey.io/10.1016/j.ejor.2017.08.040?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    References listed on IDEAS

    1. Sanjiv R. Das & Mike Y. Chen, 2007. "Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web," Management Science, INFORMS, vol. 53(9), pages 1375-1388, September.
    2. Dunbar, Michelle & Murray, John M. & Cysique, Lucette A. & Brew, Bruce J. & Jeyakumar, Vaithilingam, 2010. "Simultaneous classification and feature selection via convex quadratic programming with application to HIV-associated neurocognitive disorder assessment," European Journal of Operational Research, Elsevier, vol. 206(2), pages 470-478, October.
    3. Bart Baesens & Rudy Setiono & Christophe Mues & Jan Vanthienen, 2003. "Using Neural Network Rule Extraction and Decision Tables for Credit-Risk Evaluation," Management Science, INFORMS, vol. 49(3), pages 312-329, March.
    4. Oded Netzer & Ronen Feldman & Jacob Goldenberg & Moshe Fresko, 2012. "Mine Your Own Business: Market-Structure Surveillance Through Text Mining," Marketing Science, INFORMS, vol. 31(3), pages 521-543, May.
    5. Aytug, Haldun, 2015. "Feature selection for support vector machines using Generalized Benders Decomposition," European Journal of Operational Research, Elsevier, vol. 244(1), pages 210-218.
    6. Hui Zou & Trevor Hastie, 2005. "Addendum: Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(5), pages 768-768, November.
    7. Bertolazzi, P. & Felici, G. & Festa, P. & Fiscon, G. & Weitschek, E., 2016. "Integer programming models for feature selection: New extensions and a randomized solution algorithm," European Journal of Operational Research, Elsevier, vol. 250(2), pages 389-399.
    8. Hui Zou & Trevor Hastie, 2005. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(2), pages 301-320, April.
    9. Geng Cui & Man Leung Wong & Hon-Kwong Lui, 2006. "Machine Learning for Direct Marketing Response Models: Bayesian Networks with Evolutionary Programming," Management Science, INFORMS, vol. 52(4), pages 597-612, April.

    Citations


    Cited by:

    1. Bottmer, Lea & Croux, Christophe & Wilms, Ines, 2022. "Sparse regression for large data sets with outliers," European Journal of Operational Research, Elsevier, vol. 297(2), pages 782-794.
    2. He Jiang, 2023. "Robust forecasting in spatial autoregressive model with total variation regularization," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 42(2), pages 195-211, March.
    3. Zhang, Yishi & Zhu, Ruilin & Chen, Zhijun & Gao, Jie & Xia, De, 2021. "Evaluating and selecting features via information theoretic lower bounds of feature inner correlations for high-dimensional data," European Journal of Operational Research, Elsevier, vol. 290(1), pages 235-247.
    4. Jimenez-Marquez, Jose Luis & Gonzalez-Carrasco, Israel & Lopez-Cuadrado, Jose Luis & Ruiz-Mezcua, Belen, 2019. "Towards a big data framework for analyzing social media content," International Journal of Information Management, Elsevier, vol. 44(C), pages 1-12.
    5. Jiang, He & Luo, Shihua & Dong, Yao, 2021. "Simultaneous feature selection and clustering based on square root optimization," European Journal of Operational Research, Elsevier, vol. 289(1), pages 214-231.
    6. Ni, Ji & Chen, Bowei & Allinson, Nigel M. & Ye, Xujiong, 2020. "A hybrid model for predicting human physical activity status from lifelogging data," European Journal of Operational Research, Elsevier, vol. 281(3), pages 532-542.
    7. Basna Mohammed Salih Hasan & Nawzat Sadiq Ahmed, 2021. "Feature selection technique applied in Medical application by Supervised algorithm: A Review," International Journal of Science and Business, IJSAB International, vol. 5(3), pages 190-203.
    8. You-Shyang Chen & Ying-Hsun Hung & Yu-Sheng Lin, 2023. "A Study to Identify Long-Term Care Insurance Using Advanced Intelligent RST Hybrid Models with Two-Stage Performance Evaluation," Mathematics, MDPI, vol. 11(13), pages 1-34, July.
    9. Subhadip Sarkar, 2023. "ABC classification using extended R-model, SVM and Lorenz curve," OPSEARCH, Springer;Operational Research Society of India, vol. 60(3), pages 1433-1455, September.
    10. Jiang, He & Tao, Changqi & Dong, Yao & Xiong, Ren, 2021. "Robust low-rank multiple kernel learning with compound regularization," European Journal of Operational Research, Elsevier, vol. 295(2), pages 634-647.
    11. Zhang, Yucheng & Xu, Shan & Zhang, Long & Yang, Mengxi, 2021. "Big data and human resource management research: An integrative review and new directions for future research," Journal of Business Research, Elsevier, vol. 133(C), pages 34-50.
    12. Jiménez-Cordero, Asunción & Morales, Juan Miguel & Pineda, Salvador, 2021. "A novel embedded min-max approach for feature selection in nonlinear Support Vector Machine classification," European Journal of Operational Research, Elsevier, vol. 293(1), pages 24-35.
    13. Víctor Blanco & Alberto Japón & Justo Puerto, 2020. "Optimal arrangements of hyperplanes for SVM-based multiclass classification," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(1), pages 175-199, March.
    14. Davila-Pena, Laura & García-Jurado, Ignacio & Casas-Méndez, Balbina, 2022. "Assessment of the influence of features on a classification problem: An application to COVID-19 patients," European Journal of Operational Research, Elsevier, vol. 299(2), pages 631-641.
    15. Li, An-Da & He, Zhen & Wang, Qing & Zhang, Yang, 2019. "Key quality characteristics selection for imbalanced production data using a two-phase bi-objective feature selection method," European Journal of Operational Research, Elsevier, vol. 274(3), pages 978-989.
    16. Díaz, Verónica & Montoya, Ricardo & Maldonado, Sebastián, 2023. "Preference estimation under bounded rationality: Identification of attribute non-attendance in stated-choice data using a support vector machines approach," European Journal of Operational Research, Elsevier, vol. 304(2), pages 797-812.
    17. He Jiang & Weihua Zheng, 2022. "Deep learning with regularized robust long‐ and short‐term memory network for probabilistic short‐term load forecasting," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 41(6), pages 1201-1216, September.
    18. Jiapeng Liu & Miłosz Kadziński & Xiuwu Liao & Xiaoxin Mao, 2021. "Data-Driven Preference Learning Methods for Value-Driven Multiple Criteria Sorting with Interacting Criteria," INFORMS Journal on Computing, INFORMS, vol. 33(2), pages 586-606, May.
    19. Gambella, Claudio & Ghaddar, Bissan & Naoum-Sawaya, Joe, 2021. "Optimization problems for machine learning: A survey," European Journal of Operational Research, Elsevier, vol. 290(3), pages 807-828.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Gambella, Claudio & Ghaddar, Bissan & Naoum-Sawaya, Joe, 2021. "Optimization problems for machine learning: A survey," European Journal of Operational Research, Elsevier, vol. 290(3), pages 807-828.
    2. Laura Toschi & Elisa Ughetto & Andrea Fronzetti Colladon, 2023. "The identity of social impact venture capitalists: exploring social linguistic positioning and linguistic distinctiveness through text mining," Small Business Economics, Springer, vol. 60(3), pages 1249-1280, March.
    3. Adam D. Nowak & Bradley S. Price & Patrick S. Smith, 2021. "Real Estate Dictionaries Across Space and Time," The Journal of Real Estate Finance and Economics, Springer, vol. 62(1), pages 139-163, January.
    4. He Jiang, 2023. "Robust forecasting in spatial autoregressive model with total variation regularization," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 42(2), pages 195-211, March.
    5. Gong, Xue & Ye, Xin & Zhang, Weiguo & Zhang, Yue, 2023. "Predicting energy futures high-frequency volatility using technical indicators: The role of interaction," Energy Economics, Elsevier, vol. 119(C).
    6. Jiang, He & Luo, Shihua & Dong, Yao, 2021. "Simultaneous feature selection and clustering based on square root optimization," European Journal of Operational Research, Elsevier, vol. 289(1), pages 214-231.
    7. Petropoulos, Fotios & Apiletti, Daniele & Assimakopoulos, Vassilios & Babai, Mohamed Zied & Barrow, Devon K. & Ben Taieb, Souhaib & Bergmeir, Christoph & Bessa, Ricardo J. & Bijak, Jakub & Boylan, Joh, 2022. "Forecasting: theory and practice," International Journal of Forecasting, Elsevier, vol. 38(3), pages 705-871.
      • Fotios Petropoulos & Daniele Apiletti & Vassilios Assimakopoulos & Mohamed Zied Babai & Devon K. Barrow & Souhaib Ben Taieb & Christoph Bergmeir & Ricardo J. Bessa & Jakub Bijak & John E. Boylan & Jet, 2020. "Forecasting: theory and practice," Papers 2012.03854, arXiv.org, revised Jan 2022.
    8. Matthew Gentzkow & Bryan T. Kelly & Matt Taddy, 2017. "Text as Data," NBER Working Papers 23276, National Bureau of Economic Research, Inc.
    9. Bommes, Elisabeth & Chen, Cathy Yi-Hsuan & Härdle, Wolfgang Karl, 2018. "Textual Sentiment and Sector specific reaction," IRTG 1792 Discussion Papers 2018-043, Humboldt University of Berlin, International Research Training Group 1792 "High Dimensional Nonstationary Time Series".
    10. Andres Algaba & David Ardia & Keven Bluteau & Samuel Borms & Kris Boudt, 2020. "Econometrics Meets Sentiment: An Overview Of Methodology And Applications," Journal of Economic Surveys, Wiley Blackwell, vol. 34(3), pages 512-547, July.
    11. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    12. Carstensen, Kai & Heinrich, Markus & Reif, Magnus & Wolters, Maik H., 2020. "Predicting ordinary and severe recessions with a three-state Markov-switching dynamic factor model," International Journal of Forecasting, Elsevier, vol. 36(3), pages 829-850.
    13. Hou-Tai Chang & Ping-Huai Wang & Wei-Fang Chen & Chen-Ju Lin, 2022. "Risk Assessment of Early Lung Cancer with LDCT and Health Examinations," IJERPH, MDPI, vol. 19(8), pages 1-12, April.
    14. Wang, Qiao & Zhou, Wei & Cheng, Yonggang & Ma, Gang & Chang, Xiaolin & Miao, Yu & Chen, E, 2018. "Regularized moving least-square method and regularized improved interpolating moving least-square method with nonsingular moment matrices," Applied Mathematics and Computation, Elsevier, vol. 325(C), pages 120-145.
    15. Mkhadri, Abdallah & Ouhourane, Mohamed, 2013. "An extended variable inclusion and shrinkage algorithm for correlated variables," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 631-644.
    16. Lucian Belascu & Alexandra Horobet & Georgiana Vrinceanu & Consuela Popescu, 2021. "Performance Dissimilarities in European Union Manufacturing: The Effect of Ownership and Technological Intensity," Sustainability, MDPI, vol. 13(18), pages 1-19, September.
    17. Candelon, B. & Hurlin, C. & Tokpavi, S., 2012. "Sampling error and double shrinkage estimation of minimum variance portfolios," Journal of Empirical Finance, Elsevier, vol. 19(4), pages 511-527.
    18. Andrea Carriero & Todd E. Clark & Massimiliano Marcellino, 2022. "Specification Choices in Quantile Regression for Empirical Macroeconomics," Working Papers 22-25, Federal Reserve Bank of Cleveland.
    19. Kim, Hyun Hak & Swanson, Norman R., 2018. "Mining big data using parsimonious factor, machine learning, variable selection and shrinkage methods," International Journal of Forecasting, Elsevier, vol. 34(2), pages 339-354.
    20. Shuichi Kawano, 2014. "Selection of tuning parameters in bridge regression models via Bayesian information criterion," Statistical Papers, Springer, vol. 55(4), pages 1207-1223, November.
