
Supervised t-Distributed Stochastic Neighbor Embedding for Data Visualization and Classification

Author

Listed:
  • Yichen Cheng

    (Institute for Insight, Georgia State University, Atlanta, Georgia 30303)

  • Xinlei Wang

    (Department of Statistical Science, Southern Methodist University, Dallas, Texas 75275)

  • Yusen Xia

    (Institute for Insight, Georgia State University, Atlanta, Georgia 30303)

Abstract

We propose a novel supervised dimension-reduction method called supervised t-distributed stochastic neighbor embedding (St-SNE) that achieves dimension reduction by preserving the similarities of data points in both feature and outcome spaces. The proposed method can be used for both prediction and visualization tasks with the ability to handle high-dimensional data. We show through a variety of data sets that when compared with a comprehensive list of existing methods, St-SNE has superior prediction performance in the ultrahigh-dimensional setting in which the number of features p exceeds the sample size n and has competitive performance in the p ≤ n setting. We also show that St-SNE is a competitive visualization tool that is capable of capturing within-cluster variations. In addition, we propose a penalized Kullback–Leibler divergence criterion to automatically select the reduced-dimension size k for St-SNE. Summary of Contribution: With the fast development of data collection and data processing technologies, high-dimensional data have now become ubiquitous. Examples of such data include those collected from environmental sensors, personal mobile devices, and wearable electronics. High dimensionality poses great challenges for data analytics routines, both methodologically and computationally. Many machine learning algorithms may fail to work for ultrahigh-dimensional data, where the number of features p is (much) larger than the sample size n. We propose a novel method for dimension reduction that can (i) aid the understanding of high-dimensional data through visualization and (ii) create a small set of good predictors, which is especially useful for prediction using ultrahigh-dimensional data.
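The core idea described in the abstract — preserving pairwise similarities in both the feature space and the outcome space while fitting a low-dimensional Student-t embedding — can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the blending weight `alpha`, the fixed Gaussian bandwidth `sigma`, and the plain gradient-descent loop (no perplexity calibration, momentum, or early exaggeration) are all simplifying assumptions.

```python
import numpy as np

def pairwise_p(X, sigma=1.0):
    # Gaussian similarities in the input space, normalized over all pairs
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    p = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(p, 0.0)
    return p / p.sum()

def st_sne(X, y, k=2, alpha=0.5, lr=10.0, n_iter=300, seed=0):
    """Toy supervised t-SNE: blend feature- and outcome-space similarity
    matrices, then minimize KL(P || Q) over a k-dimensional embedding."""
    rng = np.random.default_rng(seed)
    p_x = pairwise_p(X)                                  # feature-space similarities
    p_y = pairwise_p(y.reshape(-1, 1).astype(float))     # outcome-space similarities
    P = (1.0 - alpha) * p_x + alpha * p_y                # blended target distribution
    Z = 1e-2 * rng.standard_normal((X.shape[0], k))      # random low-dim start
    for _ in range(n_iter):
        # Student-t (1 df) kernel in the embedding, as in standard t-SNE
        d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
        w = 1.0 / (1.0 + d2)
        np.fill_diagonal(w, 0.0)
        Q = w / w.sum()
        # t-SNE gradient: dC/dz_i = 4 * sum_j (p_ij - q_ij) w_ij (z_i - z_j)
        g = (P - Q) * w
        Z -= lr * 4.0 * (np.diag(g.sum(axis=1)) - g) @ Z
    return Z
```

Setting `alpha=0` recovers an (uncalibrated) unsupervised t-SNE; larger `alpha` pulls same-outcome points together in the embedding, which is what lets the k embedding coordinates double as predictors.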

Suggested Citation

  • Yichen Cheng & Xinlei Wang & Yusen Xia, 2021. "Supervised t-Distributed Stochastic Neighbor Embedding for Data Visualization and Classification," INFORMS Journal on Computing, INFORMS, vol. 33(2), pages 566-585, May.
  • Handle: RePEc:inm:orijoc:v:33:y:2021:i:2:p:566-585
    DOI: 10.1287/ijoc.2020.0961

    Download full text from publisher

    File URL: http://dx.doi.org/10.1287/ijoc.2020.0961
    Download Restriction: no

    File URL: https://libkey.io/10.1287/ijoc.2020.0961?utm_source=ideas
    LibKey link: if access is restricted and your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jianqing Fan & Yang Feng & Jiancheng Jiang & Xin Tong, 2016. "Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(513), pages 275-287, March.
    2. Tae-Hwy Lee & Zhou Xi & Ru Zhang, 2013. "Testing for Neglected Nonlinearity Using Regularized Artificial Neural Networks," Working Papers 201422, University of California at Riverside, Department of Economics, revised Apr 2012.
    3. Meng An & Haixiang Zhang, 2023. "High-Dimensional Mediation Analysis for Time-to-Event Outcomes with Additive Hazards Model," Mathematics, MDPI, vol. 11(24), pages 1-11, December.
    4. Tomohiro Ando & Ruey S. Tsay, 2009. "Model selection for generalized linear models with factor‐augmented predictors," Applied Stochastic Models in Business and Industry, John Wiley & Sons, vol. 25(3), pages 207-235, May.
    5. Shuichi Kawano, 2014. "Selection of tuning parameters in bridge regression models via Bayesian information criterion," Statistical Papers, Springer, vol. 55(4), pages 1207-1223, November.
    6. Shan Luo & Zehua Chen, 2014. "Sequential Lasso Cum EBIC for Feature Selection With Ultra-High Dimensional Feature Space," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 109(507), pages 1229-1240, September.
    7. Shi Chen & Wolfgang Karl Hardle & Brenda L'opez Cabrera, 2020. "Regularization Approach for Network Modeling of German Power Derivative Market," Papers 2009.09739, arXiv.org.
    8. Wang, Christina Dan & Chen, Zhao & Lian, Yimin & Chen, Min, 2022. "Asset selection based on high frequency Sharpe ratio," Journal of Econometrics, Elsevier, vol. 227(1), pages 168-188.
    9. Laurent Ferrara & Anna Simoni, 2023. "When are Google Data Useful to Nowcast GDP? An Approach via Preselection and Shrinkage," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 41(4), pages 1188-1202, October.
    10. Caroline Jardet & Baptiste Meunier, 2022. "Nowcasting world GDP growth with high‐frequency data," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 41(6), pages 1181-1200, September.
    11. Peter Bühlmann & Jacopo Mandozzi, 2014. "High-dimensional variable screening and bias in subsequent inference, with an empirical comparison," Computational Statistics, Springer, vol. 29(3), pages 407-430, June.
    12. Sangjin Kim & Jong-Min Kim, 2019. "Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data," Mathematics, MDPI, vol. 7(6), pages 1-16, May.
    13. Anders Bredahl Kock, 2012. "On the Oracle Property of the Adaptive Lasso in Stationary and Nonstationary Autoregressions," CREATES Research Papers 2012-05, Department of Economics and Business Economics, Aarhus University.
    14. Tang, Yanlin & Song, Xinyuan & Wang, Huixia Judy & Zhu, Zhongyi, 2013. "Variable selection in high-dimensional quantile varying coefficient models," Journal of Multivariate Analysis, Elsevier, vol. 122(C), pages 115-132.
    15. Loann David Denis Desboulets, 2018. "A Review on Variable Selection in Regression Analysis," Econometrics, MDPI, vol. 6(4), pages 1-27, November.
    16. Li, Xinyi & Wang, Li & Nettleton, Dan, 2019. "Sparse model identification and learning for ultra-high-dimensional additive partially linear models," Journal of Multivariate Analysis, Elsevier, vol. 173(C), pages 204-228.
    17. Li, Peili & Jiao, Yuling & Lu, Xiliang & Kang, Lican, 2022. "A data-driven line search rule for support recovery in high-dimensional data analysis," Computational Statistics & Data Analysis, Elsevier, vol. 174(C).
    18. Jingyuan Liu & Runze Li & Rongling Wu, 2014. "Feature Selection for Varying Coefficient Models With Ultrahigh-Dimensional Covariates," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 109(505), pages 266-274, March.
    19. Lee, Ji Hyung & Shi, Zhentao & Gao, Zhan, 2022. "On LASSO for predictive regression," Journal of Econometrics, Elsevier, vol. 229(2), pages 322-349.
    20. Ian W. McKeague & Min Qian, 2015. "An Adaptive Resampling Test for Detecting the Presence of Significant Predictors," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(512), pages 1422-1433, December.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:inm:orijoc:v:33:y:2021:i:2:p:566-585. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Chris Asher (email available below). General contact details of provider: https://edirc.repec.org/data/inforea.html .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.