Printed from https://ideas.repec.org/a/spr/stpapr/v64y2023i3d10.1007_s00362-022-01342-8.html

Conditional characteristic feature screening for massive imbalanced data

Author

Listed:
  • Ping Wang

    (Shandong University)

  • Lu Lin

    (Shandong University)

Abstract

Using the conditional characteristic function as a screening index, a new model-free screening procedure is proposed for variable screening in large-scale, high-dimensional imbalanced data analysis. For a binary response, our results show that the screening index under the full data is proportional to the screening index under case–control sampling, an important sampling property for imbalanced data. This implies that the screening method can be applied to imbalanced data. The most appealing feature of the screening index is that it can be expressed as a simple linear combination of two first-order moments, so it is computationally simple. In addition, we extend the method to multiple responses. The theoretical properties are established under regularity conditions. Extensive simulations are conducted to compare the performance of our method with that of its competitors, and they show that the proposed procedure performs well in both linear and nonlinear models. Finally, a real data analysis further illustrates the effectiveness of the new method.
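
The abstract does not state the screening index itself, so the following is only a minimal sketch of the general idea of characteristic-function-based marginal screening for a binary response, not the authors' estimator. It assumes a hypothetical helper `ccf_screen`, an arbitrary grid of frequencies `t_grid`, and a score defined as the average squared distance between the two class-conditional empirical characteristic functions of each standardized feature. Each empirical characteristic function is a sample mean of exp(i t X), which is why such indices are cheap to compute.

```python
import numpy as np

def ccf_screen(X, y, t_grid=None):
    """Illustrative marginal screening score per feature (hypothetical helper).

    Score_j = average over t in t_grid of |ecf_1j(t) - ecf_0j(t)|^2, where
    ecf_kj is the empirical characteristic function of feature j in class k.
    This is an assumption-laden sketch, not the index from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if t_grid is None:
        t_grid = np.linspace(0.1, 2.0, 20)  # arbitrary choice of frequencies

    # Standardize each feature with pooled mean and standard deviation.
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0
    Z = (X - mu) / sd

    Z1, Z0 = Z[y == 1], Z[y == 0]
    scores = np.zeros(X.shape[1])
    for t in t_grid:
        # Class-conditional ECFs: sample means of exp(i * t * Z).
        ecf1 = np.exp(1j * t * Z1).mean(axis=0)
        ecf0 = np.exp(1j * t * Z0).mean(axis=0)
        scores += np.abs(ecf1 - ecf0) ** 2
    return scores / len(t_grid)

# Simulated imbalanced data: exactly 5% cases, 50 features, 2 informative.
rng = np.random.default_rng(0)
n, p = 2000, 50
y = np.zeros(n, dtype=int)
y[:100] = 1
X = rng.standard_normal((n, p))
X[y == 1, 0] += 1.5   # feature 0: mean shift (linear signal)
X[y == 1, 1] *= 3.0   # feature 1: variance change (nonlinear signal)

scores = ccf_screen(X, y)
top2 = set(np.argsort(scores)[::-1][:2].tolist())
print(sorted(top2))
```

Because the score depends on the data mainly through the within-class distributions of each feature, a sketch of this kind remains usable when cases and controls are sampled at different rates, which is the setting the abstract emphasizes.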

Suggested Citation

  • Ping Wang & Lu Lin, 2023. "Conditional characteristic feature screening for massive imbalanced data," Statistical Papers, Springer, vol. 64(3), pages 807-834, June.
  • Handle: RePEc:spr:stpapr:v:64:y:2023:i:3:d:10.1007_s00362-022-01342-8
    DOI: 10.1007/s00362-022-01342-8

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00362-022-01342-8
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s00362-022-01342-8?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Hengjian Cui & Runze Li & Wei Zhong, 2015. "Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(510), pages 630-641, June.
    2. Runze Li & Wei Zhong & Liping Zhu, 2012. "Feature Screening via Distance Correlation Learning," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 107(499), pages 1129-1139, September.
    3. Kani Chen, 2001. "Parametric models for response‐biased sampling," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 63(4), pages 775-789.
    4. Xiangyu Wang & Chenlei Leng, 2016. "High dimensional ordinary least squares projection for screening variables," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 78(3), pages 589-611, June.
    5. Rui Pan & Hansheng Wang & Runze Li, 2016. "Ultrahigh-Dimensional Multiclass Linear Discriminant Analysis by Pairwise Sure Independence Screening," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(513), pages 169-179, March.
    6. Rui Song & Wenbin Lu & Shuangge Ma & X. Jessie Jeng, 2014. "Censored rank independence screening for high-dimensional survival data," Biometrika, Biometrika Trust, vol. 101(4), pages 799-814.
    7. Fan, Jianqing & Feng, Yang & Song, Rui, 2011. "Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models," Journal of the American Statistical Association, American Statistical Association, vol. 106(494), pages 544-557.
    8. Lu, Jun & Lin, Lu, 2018. "Feature screening for multi-response varying coefficient models with ultrahigh dimensional predictors," Computational Statistics & Data Analysis, Elsevier, vol. 128(C), pages 242-254.
    9. Jian Kang & Hyokyoung G Hong & Yi Li, 2017. "Partition-based ultrahigh-dimensional variable screening," Biometrika, Biometrika Trust, vol. 104(4), pages 785-800.
    10. Jianqing Fan & Jinchi Lv, 2008. "Sure independence screening for ultrahigh dimensional feature space," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 70(5), pages 849-911, November.
    11. Shan Luo & Zehua Chen, 2020. "Feature Selection by Canonical Correlation Search in High-Dimensional Multiresponse Models With Complex Group Structures," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 115(531), pages 1227-1235, July.
    12. HaiYing Wang & Rong Zhu & Ping Ma, 2018. "Optimal Subsampling for Large Sample Logistic Regression," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 113(522), pages 829-844, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lu, Jun & Lin, Lu & Wang, WenWu, 2021. "Partition-based feature screening for categorical data via RKHS embeddings," Computational Statistics & Data Analysis, Elsevier, vol. 157(C).
    2. Jing Zhang & Qihua Wang & Xuan Wang, 2022. "Surrogate-variable-based model-free feature screening for survival data under the general censoring mechanism," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 74(2), pages 379-397, April.
    3. Zhang, Jing & Wang, Qihua & Kang, Jian, 2020. "Feature screening under missing indicator imputation with non-ignorable missing response," Computational Statistics & Data Analysis, Elsevier, vol. 149(C).
    4. Jun Lu & Dan Wang & Qinqin Hu, 2022. "Interaction screening via canonical correlation," Computational Statistics, Springer, vol. 37(5), pages 2637-2670, November.
    5. He, Yong & Zhang, Liang & Ji, Jiadong & Zhang, Xinsheng, 2019. "Robust feature screening for elliptical copula regression model," Journal of Multivariate Analysis, Elsevier, vol. 173(C), pages 568-582.
    6. Zhong, Wei & Wang, Jiping & Chen, Xiaolin, 2021. "Censored mean variance sure independence screening for ultrahigh dimensional survival data," Computational Statistics & Data Analysis, Elsevier, vol. 159(C).
    7. Sheng, Ying & Wang, Qihua, 2020. "Model-free feature screening for ultrahigh dimensional classification," Journal of Multivariate Analysis, Elsevier, vol. 178(C).
    8. Fengli Song & Peng Lai & Baohua Shen, 2020. "Robust composite weighted quantile screening for ultrahigh dimensional discriminant analysis," Metrika: International Journal for Theoretical and Applied Statistics, Springer, vol. 83(7), pages 799-820, October.
    9. Dong, Yuexiao & Yu, Zhou & Zhu, Liping, 2020. "Model-free variable selection for conditional mean in regression," Computational Statistics & Data Analysis, Elsevier, vol. 152(C).
    10. Shuaishuai Chen & Jun Lu, 2023. "Quantile-Composited Feature Screening for Ultrahigh-Dimensional Data," Mathematics, MDPI, vol. 11(10), pages 1-21, May.
    11. He, Shengmei & Ma, Shuangge & Xu, Wangli, 2019. "A modified mean-variance feature-screening procedure for ultrahigh-dimensional discriminant analysis," Computational Statistics & Data Analysis, Elsevier, vol. 137(C), pages 155-169.
    12. Zhao, Bangxin & Liu, Xin & He, Wenqing & Yi, Grace Y., 2021. "Dynamic tilted current correlation for high dimensional variable screening," Journal of Multivariate Analysis, Elsevier, vol. 182(C).
    13. Xin-Bing Kong & Zhi Liu & Yuan Yao & Wang Zhou, 2017. "Sure screening by ranking the canonical correlations," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 26(1), pages 46-70, March.
    14. Yang, Baoying & Yin, Xiangrong & Zhang, Nan, 2019. "Sufficient variable selection using independence measures for continuous response," Journal of Multivariate Analysis, Elsevier, vol. 173(C), pages 480-493.
    15. Yan, Xiaodong & Tang, Niansheng & Xie, Jinhan & Ding, Xianwen & Wang, Zhiqiang, 2018. "Fused mean–variance filter for feature screening," Computational Statistics & Data Analysis, Elsevier, vol. 122(C), pages 18-32.
    16. Tang, Niansheng & Xia, Linli & Yan, Xiaodong, 2019. "Feature screening in ultrahigh-dimensional partially linear models with missing responses at random," Computational Statistics & Data Analysis, Elsevier, vol. 133(C), pages 208-227.
    17. Jing Zhang & Guosheng Yin & Yanyan Liu & Yuanshan Wu, 2018. "Censored cumulative residual independent screening for ultrahigh-dimensional survival data," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 24(2), pages 273-292, April.
    18. Zheng, Zemin & Shi, Haiyu & Li, Yang & Yuan, Hui, 2020. "Uniform joint screening for ultra-high dimensional graphical models," Journal of Multivariate Analysis, Elsevier, vol. 179(C).
    19. Baiguo An & Guozhong Feng & Jianhua Guo, 2022. "Interaction Identification and Clique Screening for Classification with Ultra-high Dimensional Discrete Features," Journal of Classification, Springer;The Classification Society, vol. 39(1), pages 122-146, March.
    20. Jing Zhang & Yanyan Liu & Hengjian Cui, 2021. "Model-free feature screening via distance correlation for ultrahigh dimensional survival data," Statistical Papers, Springer, vol. 62(6), pages 2711-2738, December.
