Semi-Supervised Learning of Classifiers from a Statistical Perspective: A Brief Review

Author

Listed:
  • Ahfock, Daniel
  • McLachlan, Geoffrey J.

Abstract

Semi-supervised learning (SSL) has received increasing attention in machine learning as an approach to forming a classifier when the training data consist of a limited number of classified observations and a much larger number of unclassified observations. The interest arises because procuring classified data can be costly: acquisition costs are high, and financial, time, and ethical issues can arise in attempts to provide the true class labels for the unclassified data that have been acquired. A review is provided of statistical SSL approaches to this problem, focussing on the recent result that a classifier formed from a partially classified sample can actually have a smaller expected error rate than if the sample were completely classified. This rather paradoxical outcome can be achieved by introducing a framework with a missingness mechanism for the missing labels of the unclassified observations. It is most relevant in situations that occur commonly in practice, where the unclassified data fall primarily in regions of relatively high entropy in the feature space, making their class labels difficult to obtain.
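
To make the idea of an entropy-dependent missingness mechanism concrete, the following is a minimal sketch and not the authors' implementation. It assumes a two-class, equal-variance normal model in one dimension and models the probability that a label is missing as a logistic function of the log entropy of the posterior class probabilities. The functional form, the function names, and the parameter values (`mu0`, `mu1`, `sigma`, `pi1`, `xi0`, `xi1`) are illustrative assumptions; none of them is given in the abstract.

```python
import math


def posterior_class1(y, mu0=-1.0, mu1=1.0, sigma=1.0, pi1=0.5):
    """Posterior probability of class 1 for two normal classes with common variance.

    For equal variances the log-odds are linear in y, which avoids
    evaluating the two densities separately.
    """
    log_odds = (math.log(pi1 / (1.0 - pi1))
                + (mu1 - mu0) / sigma ** 2 * (y - 0.5 * (mu0 + mu1)))
    return 1.0 / (1.0 + math.exp(-log_odds))


def entropy(y, **model):
    """Shannon entropy of the posterior class probabilities at y."""
    tau1 = posterior_class1(y, **model)
    total = 0.0
    for tau in (tau1, 1.0 - tau1):
        if tau > 0.0:  # treat 0 * log(0) as 0
            total -= tau * math.log(tau)
    return total


def prob_label_missing(y, xi0=2.0, xi1=1.5, **model):
    """Hypothetical missingness mechanism: logistic in the log entropy.

    With xi1 > 0, observations in high-entropy regions (near the class
    boundary) are more likely to have their labels missing.
    """
    e = entropy(y, **model)
    if e <= 0.0:
        return 0.0  # zero entropy: the label is treated as never missing
    return 1.0 / (1.0 + math.exp(-(xi0 + xi1 * math.log(e))))


if __name__ == "__main__":
    for y in (0.0, 1.0, 3.0):
        print(y, round(entropy(y), 4), round(prob_label_missing(y), 4))
```

Under these assumed settings, a point at the class boundary (`y = 0.0`) has maximal entropy and a high probability of being unlabelled, while a point far from the boundary is almost always labelled. In a setup like this the pattern of missing labels itself carries information about where the boundary lies.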

Suggested Citation

  • Ahfock, Daniel & McLachlan, Geoffrey J., 2023. "Semi-Supervised Learning of Classifiers from a Statistical Perspective: A Brief Review," Econometrics and Statistics, Elsevier, vol. 26(C), pages 124-138.
  • Handle: RePEc:eee:ecosta:v:26:y:2023:i:c:p:124-138
    DOI: 10.1016/j.ecosta.2022.03.007

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S2452306222000296
    Download Restriction: Full text for ScienceDirect subscribers only. Contains open access articles

    File URL: https://libkey.io/10.1016/j.ecosta.2022.03.007?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Utkarsh J. Dang & Michael P.B. Gallaugher & Ryan P. Browne & Paul D. McNicholas, 2023. "Model-Based Clustering and Classification Using Mixtures of Multivariate Skewed Power Exponential Distributions," Journal of Classification, Springer;The Classification Society, vol. 40(1), pages 145-167, April.
    2. Sharon M. McNicholas & Paul D. McNicholas & Daniel A. Ashlock, 2021. "An Evolutionary Algorithm with Crossover and Mutation for Model-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 264-279, July.
    3. Vernon T. Farewell & Li Su & Christopher Jackson, 2019. "Partially hidden multi-state modelling of a prolonged disease state defined by a composite outcome," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 25(4), pages 696-711, October.
    4. Paula M. Murray & Ryan P. Browne & Paul D. McNicholas, 2020. "Mixtures of Hidden Truncation Hyperbolic Factor Analyzers," Journal of Classification, Springer;The Classification Society, vol. 37(2), pages 366-379, July.
    5. Fei Wang & Yuhao Deng, 2023. "Non-Asymptotic Bounds of AIPW Estimators for Means with Missingness at Random," Mathematics, MDPI, vol. 11(4), pages 1-14, February.
    6. Robitzsch, Alexander, 2020. "About Still Nonignorable Consequences of (Partially) Ignoring Missing Item Responses in Large-scale Assessment," OSF Preprints hmy45, Center for Open Science.
    7. Jouni Kuha & Myrsini Katsikatsou & Irini Moustaki, 2018. "Latent variable modelling with non‐ignorable item non‐response: multigroup response propensity models for cross‐national analysis," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(4), pages 1169-1192, October.
    8. Chenguang Wang & Michael J. Daniels, 2011. "A Note on MAR, Identifying Restrictions, Model Comparison, and Sensitivity Analysis in Pattern Mixture Models with and without Covariates for Incomplete Data," Biometrics, The International Biometric Society, vol. 67(3), pages 810-818, September.
    9. A. R. Linero, 2017. "Bayesian nonparametric analysis of longitudinal studies in the presence of informative missingness," Biometrika, Biometrika Trust, vol. 104(2), pages 327-341.
    10. Marco Doretti & Sara Geneletti & Elena Stanghellini, 2018. "Missing Data: A Unified Taxonomy Guided by Conditional Independence," International Statistical Review, International Statistical Institute, vol. 86(2), pages 189-204, August.
    11. Douglas L. Steinley, 2019. "Editorial: Journal of Classification Vol. 36–2," Journal of Classification, Springer;The Classification Society, vol. 36(2), pages 175-176, July.
    12. Morris, Katherine & McNicholas, Paul D., 2016. "Clustering, classification, discriminant analysis, and dimension reduction via generalized hyperbolic mixtures," Computational Statistics & Data Analysis, Elsevier, vol. 97(C), pages 133-150.
    13. Florian M. Hollenbach & Iavor Bojinov & Shahryar Minhas & Nils W. Metternich & Michael D. Ward & Alexander Volfovsky, 2021. "Multiple Imputation Using Gaussian Copulas," Sociological Methods & Research, vol. 50(3), pages 1259-1283, August.
    14. Aidan G. O’Keeffe & Daniel M. Farewell & Brian D. M. Tom & Vernon T. Farewell, 2016. "Multiple Imputation of Missing Composite Outcomes in Longitudinal Data," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 8(2), pages 310-332, October.
    15. D. M. Farewell & C. Huang & V. Didelez, 2017. "Ignorability for general longitudinal data," Biometrika, Biometrika Trust, vol. 104(2), pages 317-326.
    16. Hossein Baloochian & Hamid Reza Ghaffary, 2019. "Multiclass Classification Based on Multi-criteria Decision-making," Journal of Classification, Springer;The Classification Society, vol. 36(1), pages 140-151, April.
    17. Wei, Yuhong & Tang, Yang & McNicholas, Paul D., 2019. "Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data," Computational Statistics & Data Analysis, Elsevier, vol. 130(C), pages 18-41.
    18. Cristina Tortora & Brian C. Franczak & Ryan P. Browne & Paul D. McNicholas, 2019. "A Mixture of Coalesced Generalized Hyperbolic Distributions," Journal of Classification, Springer;The Classification Society, vol. 36(1), pages 26-57, April.
    19. Paul D. McNicholas, 2016. "Model-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 33(3), pages 331-373, October.
    20. Michael P. B. Gallaugher & Paul D. McNicholas, 2019. "On Fractionally-Supervised Classification: Weight Selection and Extension to the Multivariate t-Distribution," Journal of Classification, Springer;The Classification Society, vol. 36(2), pages 232-265, July.
