
Reducing the overfitting in the gROC curve estimation

Author

Listed:
  • Pablo Martínez-Camblor

    (Geisel School of Medicine at Dartmouth; Universidad Autonoma de Chile)

  • Susana Díaz-Coto

    (Geisel School of Medicine at Dartmouth)

Abstract

The generalized receiver-operating characteristic (gROC) curve considers the classification ability of diagnostic tests when both higher and lower values of the marker are associated with higher probabilities of being positive. Its empirical estimation requires selecting the best classification subsets among those satisfying a particular condition. Both strong and weak consistency have already been proved. However, using the same data both to select the classification subsets and to calculate the gROC curve leads to an over-optimistic estimate of the real performance of the diagnostic criteria on future samples. In this work, the bias of the empirical gROC curve estimator is explored through Monte Carlo simulations. In addition, two cross-validation-based algorithms are proposed for reducing the overfitting. The practical application of the proposed algorithms is illustrated through the analysis of a real-world dataset. Simulation results suggest that the empirical gROC curve estimator returns optimistic approximations, especially when the diagnostic capacity of the marker is poor and the sample size is small. The newly proposed algorithms improve the estimation of the actual diagnostic test accuracy and yield almost unbiased gAUCs in most of the considered scenarios. However, the cross-validation-based algorithms reported larger $L_1$-errors than the standard empirical estimators, and they increase the computational cost of the procedures. As online supplementary material, this manuscript includes an R function which wraps up the implemented routines.
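The abstract does not spell out the two proposed algorithms, so the following is only a minimal Python sketch of the general idea it describes, not the authors' method or their R function. It assumes classification subsets of the form "positive if x <= lo or x > hi", selects for each target false-positive rate the rule with the highest empirical sensitivity, and then compares the apparent gAUC (rules selected and evaluated on the same data) with a k-fold version in which rules are selected on the training folds and evaluated on the held-out fold. All function names (`rates`, `select_rules`, `step_area`, `apparent_gauc`, `cv_gauc`), the fold scheme, and the step-function area are illustrative assumptions.

```python
import numpy as np

def rates(x, lo, hi):
    """Share of sample x called positive by each rule 'x <= lo or x > hi'."""
    x = np.asarray(x, dtype=float)[:, None]
    return np.mean((x <= lo) | (x > hi), axis=0)

def select_rules(neg, pos, grid):
    """For each target FPR t in grid, pick the two-sided rule with the
    highest empirical TPR among the rules whose empirical FPR is <= t."""
    cuts = np.concatenate(([-np.inf], np.unique(np.concatenate((neg, pos))), [np.inf]))
    lo, hi = np.meshgrid(cuts, cuts, indexing="ij")
    keep = lo <= hi                       # only valid (lo, hi) pairs
    lo, hi = lo[keep], hi[keep]
    fpr, tpr = rates(neg, lo, hi), rates(pos, lo, hi)
    best = [int(np.argmax(np.where(fpr <= t + 1e-12, tpr, -1.0))) for t in grid]
    return lo[best], hi[best]

def step_area(fpr, tpr):
    """Area under the right-continuous step function through the points."""
    order = np.lexsort((tpr, fpr))        # sort by FPR, ties broken by TPR
    f = np.concatenate((fpr[order], [1.0]))
    return float(np.sum(tpr[order] * np.diff(f)))

def apparent_gauc(neg, pos, grid):
    """Select and evaluate the rules on the same data (optimistic)."""
    neg, pos = np.asarray(neg, dtype=float), np.asarray(pos, dtype=float)
    lo, hi = select_rules(neg, pos, grid)
    return step_area(rates(neg, lo, hi), rates(pos, lo, hi))

def cv_gauc(neg, pos, grid, k=5, seed=0):
    """Select rules on k-1 folds, evaluate them on the held-out fold, and
    pool the held-out classifications (an assumed scheme, for illustration)."""
    neg, pos = np.asarray(neg, dtype=float), np.asarray(pos, dtype=float)
    rng = np.random.default_rng(seed)
    fold_neg = rng.permutation(len(neg)) % k
    fold_pos = rng.permutation(len(pos)) % k
    fp, tp = np.zeros(len(grid)), np.zeros(len(grid))
    for j in range(k):
        lo, hi = select_rules(neg[fold_neg != j], pos[fold_pos != j], grid)
        # rates() returns proportions; multiply by fold size to get counts
        fp += rates(neg[fold_neg == j], lo, hi) * np.sum(fold_neg == j)
        tp += rates(pos[fold_pos == j], lo, hi) * np.sum(fold_pos == j)
    return step_area(fp / len(neg), tp / len(pos))

if __name__ == "__main__":
    rng = np.random.default_rng(2024)
    neg, pos = rng.normal(size=40), rng.normal(size=40)   # uninformative marker
    grid = np.linspace(0.0, 1.0, 101)
    print("apparent gAUC:", round(apparent_gauc(neg, pos, grid), 3))
    print("5-fold CV gAUC:", round(cv_gauc(neg, pos, grid), 3))
```

In the demo both groups are drawn from the same distribution, so the true gAUC is 0.5. The apparent estimate cannot fall below 0.5 by construction and tends to sit well above it in small samples, which is the optimism the abstract refers to. The sketch enumerates all threshold pairs, so its cost grows quadratically with the number of distinct marker values, and the cross-validated version repeats that work once per fold.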

Suggested Citation

  • Pablo Martínez-Camblor & Susana Díaz-Coto, 2024. "Reducing the overfitting in the gROC curve estimation," Computational Statistics, Springer, vol. 39(2), pages 1005-1022, April.
  • Handle: RePEc:spr:compst:v:39:y:2024:i:2:d:10.1007_s00180-023-01344-6
    DOI: 10.1007/s00180-023-01344-6

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00180-023-01344-6
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Margaret Sullivan Pepe & Tianxi Cai & Gary Longton, 2006. "Combining Predictors for Classification Using the Area under the Receiver Operating Characteristic Curve," Biometrics, The International Biometric Society, vol. 62(1), pages 221-229, March.
    2. Pablo Martínez-Camblor & Sonia Pérez-Fernández & Susana Díaz-Coto, 2021. "Optimal classification scores based on multivariate marker transformations," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 105(4), pages 581-599, December.
    3. Margaret Pepe & Tianxi Cai & Zheng Zhang, 2004. "Combining Predictors for Classification Using the Area Under the ROC Curve," UW Biostatistics Working Paper Series 1021, Berkeley Electronic Press.
    4. Daniel J. Luckett & Eric B. Laber & Samer S. El‐Kamary & Cheng Fan & Ravi Jhaveri & Charles M. Perou & Fatma M. Shebl & Michael R. Kosorok, 2021. "Receiver operating characteristic curves and confidence bands for support vector machines," Biometrics, The International Biometric Society, vol. 77(4), pages 1422-1430, December.
    5. Ming-Yueh Huang & Chin-Tsang Chiang, 2017. "Estimation and Inference Procedures for Semiparametric Distribution Models with Varying Linear-Index," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 44(2), pages 396-424, June.
    6. Chin-Tsang Chiang & Shr-Yan Huang, 2009. "Estimation for the Optimal Combination of Markers without Modeling the Censoring Distribution," Biometrics, The International Biometric Society, vol. 65(1), pages 152-158, March.
    7. Jin, Hua & Lu, Ying, 2009. "Permutation test for non-inferiority of the linear to the optimal combination of multiple tests," Statistics & Probability Letters, Elsevier, vol. 79(5), pages 664-669, March.
    8. Debashis Ghosh, 2004. "Semiparametric methods for the binormal model with multiple biomarkers," The University of Michigan Department of Biostatistics Working Paper Series 1046, Berkeley Electronic Press.
    9. Coolen-Maturi, Tahani & Elkhafifi, Faiza F. & Coolen, Frank P.A., 2014. "Three-group ROC analysis: A nonparametric predictive approach," Computational Statistics & Data Analysis, Elsevier, vol. 78(C), pages 69-81.
    10. Holly Janes & Margaret S. Pepe, 2008. "Matching in Studies of Classification Accuracy: Implications for Analysis, Efficiency, and Assessment of Incremental Value," Biometrics, The International Biometric Society, vol. 64(1), pages 1-9, March.
    11. Xin Huang & Gengsheng Qin & Yixin Fang, 2011. "Optimal Combinations of Diagnostic Tests Based on AUC," Biometrics, The International Biometric Society, vol. 67(2), pages 568-576, June.
    12. Dat Huynh & Oliver Laeyendecker & Ron Brookmeyer, 2014. "A serial risk score approach to disease classification that accounts for accuracy and cost," Biometrics, The International Biometric Society, vol. 70(4), pages 1042-1051, December.
    13. Carol Y. Lin & Lance A. Waller & Robert H. Lyles, 2012. "The likelihood approach for the comparison of medical diagnostic system with multiple binary tests," Journal of Applied Statistics, Taylor & Francis Journals, vol. 39(7), pages 1437-1454, December.
    14. Kajal Lahiri & Liu Yang, 2023. "Predicting binary outcomes based on the pair-copula construction," Empirical Economics, Springer, vol. 64(6), pages 3089-3119, June.
    15. Yu-Wei Roy Chen & Janice M Leung & Don D Sin, 2016. "A Systematic Review of Diagnostic Biomarkers of COPD Exacerbation," PLOS ONE, Public Library of Science, vol. 11(7), pages 1-16, July.
    16. Chen, Xiwei & Vexler, Albert & Markatou, Marianthi, 2015. "Empirical likelihood ratio confidence interval estimation of best linear combinations of biomarkers," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 186-198.
    17. Sonia Pérez-Fernández & Pablo Martínez-Camblor & Peter Filzmoser & Norberto Corral, 2021. "Visualizing the decision rules behind the ROC curves: understanding the classification process," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 105(1), pages 135-161, March.
    18. Osamu Komori, 2011. "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 63(5), pages 961-979, October.
    19. Debashis Ghosh & Moulinath Banerjee & Pinaki Biswas, 2004. "Binary isotonic regression procedures, with application to cancer biomarkers," The University of Michigan Department of Biostatistics Working Paper Series 1037, Berkeley Electronic Press.
    20. Holly Janes & Gary Longton & Margaret S. Pepe, 2009. "Accommodating covariates in receiver operating characteristic analysis," Stata Journal, StataCorp LLC, vol. 9(1), pages 17-39, March.
