Printed from https://ideas.repec.org/a/eee/ejores/v265y2018i3p993-1004.html

High dimensional data classification and feature selection using support vector machines

Authors

  • Ghaddar, Bissan
  • Naoum-Sawaya, Joe

Abstract

In many big-data systems, large amounts of information are recorded and stored for analytics purposes. Often, however, this vast amount of information does not offer additional benefits for optimal decision making, but may rather be complicating and too costly for collection, storage, and processing. For instance, tumor classification using high-throughput microarray data is challenging due to the presence of a large number of noisy features that do not contribute to the reduction of classification errors. For such problems, the general aim is to find a limited number of genes that highly differentiate among the classes. Thus, in this paper, we address a specific class of machine learning, namely the problem of feature selection within support vector machine classification, which deals with finding an accurate binary classifier that uses a minimal number of features. We introduce a new approach based on iteratively adjusting a bound on the l1-norm of the classifier vector in order to force the number of selected features to converge towards the desired maximum limit. We analyze two real-life classification problems with high-dimensional features. The first case is the medical diagnosis of tumors based on microarray data, where we present a generic approach for cancer classification based on gene expression. The second case deals with sentiment classification of on-line reviews from Amazon, Yelp, and IMDb. The results show that the proposed classification and feature selection approach is simple, computationally tractable, and achieves low error rates, which are key for the construction of advanced decision-support systems.
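
The idea of iteratively adjusting a bound on the l1-norm of the classifier vector can be illustrated with a small sketch. The abstract does not give the paper's formulation or its update rule, so everything below is an assumption made for illustration: the classifier minimizes the mean hinge loss subject to ||w||_1 <= radius, it is trained with plain projected subgradient descent (a stand-in solver, not the authors' method), and the bound is adjusted by bisection. The function names, parameters, and synthetic data are all hypothetical.

```python
import random


def project_l1_ball(v, radius):
    """Euclidean projection of v onto the set {w : ||w||_1 <= radius}."""
    if radius <= 0:
        return [0.0] * len(v)
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    # Sort-based search for the soft-threshold level theta.
    mags = sorted((abs(x) for x in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, m in enumerate(mags, start=1):
        cumsum += m
        candidate = (cumsum - radius) / j
        if m > candidate:
            theta = candidate
    # Shrink every magnitude by theta; small entries become exactly zero.
    return [(1.0 if x > 0 else -1.0) * max(abs(x) - theta, 0.0) for x in v]


def train_l1_bounded_svm(X, y, radius, epochs=100, step=0.1):
    """Projected subgradient descent on the mean hinge loss with ||w||_1 <= radius."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for t in range(1, epochs + 1):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # this sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        eta = step / t ** 0.5  # diminishing step size
        w = project_l1_ball([wj - eta * gj for wj, gj in zip(w, grad_w)], radius)
        b -= eta * grad_b
    return w, b


def count_nonzero(w):
    return sum(1 for wj in w if wj != 0.0)


def select_features(X, y, max_features, radius_hi=10.0, rounds=10):
    """Bisect on the l1 bound, keeping the loosest bound with <= max_features nonzeros."""
    lo, hi = 0.0, radius_hi
    best = ([0.0] * len(X[0]), 0.0, 0.0)  # (w, b, radius): the zero classifier is feasible
    for _ in range(rounds):
        radius = (lo + hi) / 2
        w, b = train_l1_bounded_svm(X, y, radius)
        if count_nonzero(w) <= max_features:
            best, lo = (w, b, radius), radius  # few enough features: loosen the bound
        else:
            hi = radius  # too many features: tighten the bound
    return best


# Synthetic data (an assumption): only the first 3 of 15 features carry signal.
rng = random.Random(0)
n_samples, n_features, n_informative = 60, 15, 3
y = [rng.choice([-1, 1]) for _ in range(n_samples)]
X = [[(label if j < n_informative else 0) + rng.gauss(0, 1)
      for j in range(n_features)] for label in y]

w, b, radius = select_features(X, y, max_features=3)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("l1 bound:", radius, "selected features:", selected)
```

Bisection is only one possible way to adjust the bound. Because the number of nonzero weights need not decrease monotonically as the bound tightens, the sketch keeps the loosest bound that met the feature limit instead of assuming that the search converges.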

Suggested Citation

  • Ghaddar, Bissan & Naoum-Sawaya, Joe, 2018. "High dimensional data classification and feature selection using support vector machines," European Journal of Operational Research, Elsevier, vol. 265(3), pages 993-1004.
  • Handle: RePEc:eee:ejores:v:265:y:2018:i:3:p:993-1004
    DOI: 10.1016/j.ejor.2017.08.040

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0377221717307713
    Download Restriction: Full text for ScienceDirect subscribers only


    References listed on IDEAS

    1. Sanjiv R. Das & Mike Y. Chen, 2007. "Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web," Management Science, INFORMS, vol. 53(9), pages 1375-1388, September.
    2. Dunbar, Michelle & Murray, John M. & Cysique, Lucette A. & Brew, Bruce J. & Jeyakumar, Vaithilingam, 2010. "Simultaneous classification and feature selection via convex quadratic programming with application to HIV-associated neurocognitive disorder assessment," European Journal of Operational Research, Elsevier, vol. 206(2), pages 470-478, October.
    3. Bart Baesens & Rudy Setiono & Christophe Mues & Jan Vanthienen, 2003. "Using Neural Network Rule Extraction and Decision Tables for Credit-Risk Evaluation," Management Science, INFORMS, vol. 49(3), pages 312-329, March.
    4. Oded Netzer & Ronen Feldman & Jacob Goldenberg & Moshe Fresko, 2012. "Mine Your Own Business: Market-Structure Surveillance Through Text Mining," Marketing Science, INFORMS, vol. 31(3), pages 521-543, May.
    5. Aytug, Haldun, 2015. "Feature selection for support vector machines using Generalized Benders Decomposition," European Journal of Operational Research, Elsevier, vol. 244(1), pages 210-218.
    6. Hui Zou & Trevor Hastie, 2005. "Addendum: Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(5), pages 768-768, November.
    7. Bertolazzi, P. & Felici, G. & Festa, P. & Fiscon, G. & Weitschek, E., 2016. "Integer programming models for feature selection: New extensions and a randomized solution algorithm," European Journal of Operational Research, Elsevier, vol. 250(2), pages 389-399.
    8. Hui Zou & Trevor Hastie, 2005. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(2), pages 301-320, April.
    9. Geng Cui & Man Leung Wong & Hon-Kwong Lui, 2006. "Machine Learning for Direct Marketing Response Models: Bayesian Networks with Evolutionary Programming," Management Science, INFORMS, vol. 52(4), pages 597-612, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Gambella, Claudio & Ghaddar, Bissan & Naoum-Sawaya, Joe, 2021. "Optimization problems for machine learning: A survey," European Journal of Operational Research, Elsevier, vol. 290(3), pages 807-828.
    2. Laura Toschi & Elisa Ughetto & Andrea Fronzetti Colladon, 2023. "The identity of social impact venture capitalists: exploring social linguistic positioning and linguistic distinctiveness through text mining," Small Business Economics, Springer, vol. 60(3), pages 1249-1280, March.
    3. Adam D. Nowak & Bradley S. Price & Patrick S. Smith, 2021. "Real Estate Dictionaries Across Space and Time," The Journal of Real Estate Finance and Economics, Springer, vol. 62(1), pages 139-163, January.
    4. He Jiang, 2023. "Robust forecasting in spatial autoregressive model with total variation regularization," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 42(2), pages 195-211, March.
    5. Pan, Zhiyuan & Zhong, Hao & Wang, Yudong & Huang, Juan, 2024. "Forecasting oil futures returns with news," Energy Economics, Elsevier, vol. 134(C).
    6. Gong, Xue & Ye, Xin & Zhang, Weiguo & Zhang, Yue, 2023. "Predicting energy futures high-frequency volatility using technical indicators: The role of interaction," Energy Economics, Elsevier, vol. 119(C).
    7. Jiang, He & Luo, Shihua & Dong, Yao, 2021. "Simultaneous feature selection and clustering based on square root optimization," European Journal of Operational Research, Elsevier, vol. 289(1), pages 214-231.
    8. Petropoulos, Fotios & Apiletti, Daniele & Assimakopoulos, Vassilios & Babai, Mohamed Zied & Barrow, Devon K. & Ben Taieb, Souhaib & Bergmeir, Christoph & Bessa, Ricardo J. & Bijak, Jakub & Boylan, Joh, 2022. "Forecasting: theory and practice," International Journal of Forecasting, Elsevier, vol. 38(3), pages 705-871.
      • Fotios Petropoulos & Daniele Apiletti & Vassilios Assimakopoulos & Mohamed Zied Babai & Devon K. Barrow & Souhaib Ben Taieb & Christoph Bergmeir & Ricardo J. Bessa & Jakub Bijak & John E. Boylan & Jet, 2020. "Forecasting: theory and practice," Papers 2012.03854, arXiv.org, revised Jan 2022.
    9. Matthew Gentzkow & Bryan T. Kelly & Matt Taddy, 2017. "Text as Data," NBER Working Papers 23276, National Bureau of Economic Research, Inc.
    10. Bommes, Elisabeth & Chen, Cathy Yi-Hsuan & Härdle, Wolfgang Karl, 2018. "Textual Sentiment and Sector specific reaction," IRTG 1792 Discussion Papers 2018-043, Humboldt University of Berlin, International Research Training Group 1792 "High Dimensional Nonstationary Time Series".
    11. Andres Algaba & David Ardia & Keven Bluteau & Samuel Borms & Kris Boudt, 2020. "Econometrics Meets Sentiment: An Overview Of Methodology And Applications," Journal of Economic Surveys, Wiley Blackwell, vol. 34(3), pages 512-547, July.
    12. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    13. Hauzenberger, Niko & Huber, Florian & Klieber, Karin & Marcellino, Massimiliano, 2025. "Bayesian neural networks for macroeconomic analysis," Journal of Econometrics, Elsevier, vol. 249(PC).
    14. Carstensen, Kai & Heinrich, Markus & Reif, Magnus & Wolters, Maik H., 2020. "Predicting ordinary and severe recessions with a three-state Markov-switching dynamic factor model," International Journal of Forecasting, Elsevier, vol. 36(3), pages 829-850.
    15. Hou-Tai Chang & Ping-Huai Wang & Wei-Fang Chen & Chen-Ju Lin, 2022. "Risk Assessment of Early Lung Cancer with LDCT and Health Examinations," IJERPH, MDPI, vol. 19(8), pages 1-12, April.
    16. Hajime Shimao & Sung Joo Kim & Warut Khern-Am-Nuai & Maxime C. Cohen, 2025. "Revisiting the CEO Effect Through a Machine Learning Lens," Management Science, INFORMS, vol. 71(6), pages 5396-5408, June.
    17. Wang, Qiao & Zhou, Wei & Cheng, Yonggang & Ma, Gang & Chang, Xiaolin & Miao, Yu & Chen, E, 2018. "Regularized moving least-square method and regularized improved interpolating moving least-square method with nonsingular moment matrices," Applied Mathematics and Computation, Elsevier, vol. 325(C), pages 120-145.
    18. Mkhadri, Abdallah & Ouhourane, Mohamed, 2013. "An extended variable inclusion and shrinkage algorithm for correlated variables," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 631-644.
    19. Lucian Belascu & Alexandra Horobet & Georgiana Vrinceanu & Consuela Popescu, 2021. "Performance Dissimilarities in European Union Manufacturing: The Effect of Ownership and Technological Intensity," Sustainability, MDPI, vol. 13(18), pages 1-19, September.
    20. Andrea Carriero & Todd E. Clark & Massimiliano Marcellino, 2025. "Specification Choices in Quantile Regression for Empirical Macroeconomics," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 40(1), pages 57-73, January.

