
Comparative Evaluation of Classifiers in the Presence of Statistical Interactions between Features in High Dimensional Data Settings

Author

Listed:
  • Guo Yu

    (BG Medicine, Inc.)

  • Balasubramanian Raji

    (University of Massachusetts – Amherst)

Abstract

Background: A central challenge in high dimensional data settings in biomedical investigations involves the estimation of an optimal prediction algorithm to distinguish between different disease phenotypes. A significant complicating aspect in these analyses can be attributed to the presence of features that exhibit statistical interactions. Indeed, in several clinical investigations such as genetic studies of complex diseases, it is of interest to specifically identify such features. In this paper, we compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in settings involving high dimensional datasets including statistically interacting feature subsets. We evaluate the performance of these classifiers under conditions of varying sample size, levels of signal-to-noise ratio and strength of statistical interactions among features. We summarize two datasets from studies in diabetes and cardiovascular disease involving gene expression, metabolomics and proteomics measurements and compare results obtained using the four classifiers.

Results: Simulation studies revealed that the classifier Prediction Analysis for Microarrays had the highest classification accuracy in the absence of noise and statistical interactions, and when feature distributions were multivariate Gaussian within each class. In the presence of statistical interactions, modest effect sizes and the absence of noise, Support Vector Machines achieved the best performance, followed closely by Random Forests. Random Forests was optimal in settings that included both significant levels of high dimensional noise features and statistical interactions between biomarker pairs. The data applications revealed similar trends in the relative performances of each classifier.

Conclusion: Random Forests had the highest classification accuracy among the four classifiers and was successful in incorporating interaction effects between features in the presence of noise in high dimensional datasets.
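
To make the notion of a statistical interaction between features concrete, the following is a minimal sketch and not the authors' simulation design. It assumes a toy setting in which the class label depends only on the sign of the product of two Gaussian features, so neither feature carries a marginal effect, with optional pure-noise features appended. It compares a plain nearest-centroid rule (used here as a simplified stand-in for Prediction Analysis for Microarrays, without its shrinkage step) against K-Nearest Neighbors. All function names, sample sizes, the value of k and the seed are illustrative assumptions.

```python
import random


def simulate(n, n_noise, rng):
    """Draw n samples whose class depends only on the sign of x1 * x2."""
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]  # uninformative features
        X.append([x1, x2] + noise)
        y.append(1 if x1 * x2 > 0 else 0)  # pure interaction, no marginal effect
    return X, y


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test point to the class with the closest mean vector."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return [min((0, 1), key=lambda c: sq_dist(x, centroids[c])) for x in X_test]


def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points (k assumed odd)."""
    preds = []
    for x in X_test:
        order = sorted(range(len(X_train)), key=lambda i: sq_dist(x, X_train[i]))
        votes = sum(y_train[i] for i in order[:k])
        preds.append(1 if 2 * votes > k else 0)
    return preds


def accuracy(y_true, y_pred):
    """Fraction of test labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def run(n_noise, seed=0, n_train=200, n_test=200):
    """Simulate train and test sets and return test accuracy per classifier."""
    rng = random.Random(seed)
    X_train, y_train = simulate(n_train, n_noise, rng)
    X_test, y_test = simulate(n_test, n_noise, rng)
    return {
        "nearest_centroid": accuracy(
            y_test, nearest_centroid_predict(X_train, y_train, X_test)
        ),
        "knn": accuracy(y_test, knn_predict(X_train, y_train, X_test)),
    }


if __name__ == "__main__":
    for n_noise in (0, 50):
        print(n_noise, run(n_noise))
```

In this toy setup the two class means nearly coincide, so a centroid-based rule has little to separate the classes with, while a local method such as K-Nearest Neighbors can exploit the interaction until noise features dominate the distance computation. The sketch does not include Random Forests or Support Vector Machines and makes no claim about the magnitudes reported in the paper.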

Suggested Citation

  • Guo Yu & Balasubramanian Raji, 2012. "Comparative Evaluation of Classifiers in the Presence of Statistical Interactions between Features in High Dimensional Data Settings," The International Journal of Biostatistics, De Gruyter, vol. 8(1), pages 1-32, June.
  • Handle: RePEc:bpj:ijbist:v:8:y:2012:i:1:n:17
    DOI: 10.1515/1557-4679.1373

    Download full text from publisher

    File URL: https://doi.org/10.1515/1557-4679.1373
    Download Restriction: For access to full text, subscription to the journal or payment for the individual article is required.


    References listed on IDEAS

    1. Dudoit S. & Fridlyand J. & Speed T. P, 2002. "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, American Statistical Association, vol. 97, pages 77-87, March.
    2. Constantin F Aliferis & Alexander Statnikov & Ioannis Tsamardinos & Jonathan S Schildcrout & Bryan E Shepherd & Frank E Harrell Jr., 2009. "Factors Influencing the Statistical Power of Complex Data Analysis Protocols for Molecular Signature Development from Microarray Data," PLOS ONE, Public Library of Science, vol. 4(3), pages 1-7, March.
    3. John D. Storey, 2002. "A direct approach to false discovery rates," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 64(3), pages 479-498, August.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Won, Joong-Ho & Lim, Johan & Yu, Donghyeon & Kim, Byung Soo & Kim, Kyunga, 2014. "Monotone false discovery rate," Statistics & Probability Letters, Elsevier, vol. 87(C), pages 86-93.
    2. M. Kathleen Kerr, 2003. "Design Considerations for Efficient and Effective Microarray Studies," Biometrics, The International Biometric Society, vol. 59(4), pages 822-828, December.
    3. Giuseppe Jurman & Samantha Riccadonna & Roberto Visintainer & Cesare Furlanello, 2012. "Algebraic Comparison of Partial Lists in Bioinformatics," PLOS ONE, Public Library of Science, vol. 7(5), pages 1-20, May.
    4. Yu-Min Yen, 2013. "Testing Jumps via False Discovery Rate Control," PLOS ONE, Public Library of Science, vol. 8(4), pages 1-15, April.
    5. Boulesteix, Anne-Laure & Tutz, Gerhard, 2006. "Identification of interaction patterns and classification with applications to microarray data," Computational Statistics & Data Analysis, Elsevier, vol. 50(3), pages 783-802, February.
    6. Youngchao Ge & Sandrine Dudoit & Terence Speed, 2003. "Resampling-based multiple testing for microarray data analysis," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 12(1), pages 1-77, June.
    7. Mark Rempel, 2016. "Improving Overnight Loan Identification in Payments Systems," Journal of Money, Credit and Banking, Blackwell Publishing, vol. 48(2-3), pages 549-564, March.
    8. John A. Dawson & Christina Kendziorski, 2012. "An Empirical Bayesian Approach for Identifying Differential Coexpression in High-Throughput Experiments," Biometrics, The International Biometric Society, vol. 68(2), pages 455-465, June.
    9. Novoselova Natalia & Tom Igor & Borisov Arkady & Polaka Inese, 2013. "Feature Ranking by Classification Accuracy Estimation of Multiple Data Samples," Information Technology and Management Science, Sciendo, vol. 16(1), pages 95-100, December.
    10. Matthias Bogaert & Michel Ballings & Martijn Hosten & Dirk Van den Poel, 2017. "Identifying Soccer Players on Facebook Through Predictive Analytics," Decision Analysis, INFORMS, vol. 14(4), pages 274-297, December.
    11. Wang, Tao & Zhu, Lixing, 2013. "Sparse sufficient dimension reduction using optimal scoring," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 223-232.
    12. Timothy B. Armstrong & Michal Kolesár & Mikkel Plagborg‐Møller, 2022. "Robust Empirical Bayes Confidence Intervals," Econometrica, Econometric Society, vol. 90(6), pages 2567-2602, November.
    13. Cuthbertson, Keith & Nitzsche, Dirk & O'Sullivan, Niall, 2008. "UK mutual fund performance: Skill or luck?," Journal of Empirical Finance, Elsevier, vol. 15(4), pages 613-634, September.
    14. Michael Hankin & Jay Bartroff, 2026. "Sequential FDR and pFDR Control Under Arbitrary Dependence, with Application to Pharmacovigilance Database Monitoring," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 18(1), pages 150-175, March.
    15. Guan-Hua Huang & Su-Mei Wang & Chung-Chu Hsu, 2011. "Optimization-Based Model Fitting for Latent Class and Latent Profile Analyses," Psychometrika, Springer;The Psychometric Society, vol. 76(4), pages 584-611, October.
    16. Bajgrowicz, Pierre & Scaillet, Olivier, 2012. "Technical trading revisited: False discoveries, persistence tests, and transaction costs," Journal of Financial Economics, Elsevier, vol. 106(3), pages 473-491.
    17. Kubokawa, Tatsuya & Srivastava, Muni S., 2008. "Estimation of the precision matrix of a singular Wishart distribution and its application in high-dimensional data," Journal of Multivariate Analysis, Elsevier, vol. 99(9), pages 1906-1928, October.
    18. Lee, Jae Won & Lee, Jung Bok & Park, Mira & Song, Seuck Heun, 2005. "An extensive comparison of recent classification tools applied to microarray data," Computational Statistics & Data Analysis, Elsevier, vol. 48(4), pages 869-885, April.
    19. Liao Zhu & Sumanta Basu & Robert A. Jarrow & Martin T. Wells, 2020. "High-Dimensional Estimation, Basis Assets, and the Adaptive Multi-Factor Model," Quarterly Journal of Finance (QJF), World Scientific Publishing Co. Pte. Ltd., vol. 10(04), pages 1-52, December.
    20. Jansen, Nora & Hinz, Oliver & Deusser, Clemens & Strufe, Thorsten, 2021. "Is the Buzz on? – A Buzz Detection System for Viral Posts in Social Media," Journal of Interactive Marketing, Elsevier, vol. 56(C), pages 1-17.
