
Classification of Cotton Genotypes with Mixed Continuous and Categorical Variables: Application of Machine Learning Models

Author

Listed:
  • Sudha Bishnoi

    (Department of Mathematics and Statistics, Chaudhary Charan Singh Haryana Agricultural University, Hisar 125004, Haryana, India)

  • Nadhir Al-Ansari

    (Department of Civil, Environmental and Natural Resources Engineering, Lulea University of Technology, 97187 Lulea, Sweden)

  • Mujahid Khan

    (Agricultural Research Station, Sri Karan Narendra Agriculture University, Jobner 332301, Rajasthan, India)

  • Salim Heddam

    (Agronomy Department, Faculty of Science, Hydraulics Division University, 20 Août 1955, Route El Hadaik, BP 26, Skikda 21024, Algeria)

  • Anurag Malik

    (Regional Research Station, Punjab Agricultural University, Bathinda 151001, Punjab, India)

Abstract

Mixed data, which combine continuous and categorical variables, occur frequently in fields such as agriculture, remote sensing, biology, medical science, and marketing, but only limited work has been done with this type of data. In this study, data on continuous and categorical characters of 452 genotypes of cotton (Gossypium hirsutum) were obtained from an experiment conducted by the Central Institute of Cotton Research (CICR), Sirsa, Haryana (India) during the Kharif season of 2018–2019. The machine learning (ML) classifiers k-nearest neighbor (KNN), Classification and Regression Tree (CART), C4.5, Naïve Bayes, random forest (RF), bagging, and boosting were considered for classifying the cotton genotypes. Their performance was compared with one another and with linear discriminant analysis (LDA) and logistic regression. The holdout method was used for cross-validation, with an 80:20 split between training and testing data. Under this holdout cross-validation, RF and AdaBoost performed best, each with only two misclassifications, an accuracy of 97.26%, and an error rate of 2.74%. The LDA classifier had the lowest accuracy, with nine misclassifications. The other performance measures, namely sensitivity, specificity, precision, F1 score, and G-mean, were considered together to identify the best ML classifier among those studied. RF and AdaBoost also had the highest values of all these measures, with 96.97% sensitivity and 97.50% specificity. Thus, these models were found to be the best at classifying low- and high-yielding cotton genotypes.
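
As an illustration of how the performance measures named in the abstract relate to one another, the following Python sketch computes them from a binary confusion matrix. The counts used here are an assumption, not figures taken from the article: 32 true positives, 1 false negative, 39 true negatives, and 1 false positive are simply one set of counts consistent with the reported two misclassifications, 97.26% accuracy, 96.97% sensitivity, and 97.50% specificity. Which yield class is treated as "positive" is also not stated in the abstract, and the function name `classification_metrics` is hypothetical.

```python
from math import sqrt


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute standard binary-classification measures from confusion-matrix counts."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        # F1 is the harmonic mean of precision and sensitivity.
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        # G-mean is the geometric mean of sensitivity and specificity.
        "g_mean": sqrt(sensitivity * specificity),
    }


# Assumed counts (not from the article), chosen to match the reported percentages.
metrics = classification_metrics(tp=32, fn=1, tn=39, fp=1)

for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

Under these assumed counts the test set would contain 73 genotypes, which is what the reported 2.74% error rate with two misclassifications implies.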

Suggested Citation

  • Sudha Bishnoi & Nadhir Al-Ansari & Mujahid Khan & Salim Heddam & Anurag Malik, 2022. "Classification of Cotton Genotypes with Mixed Continuous and Categorical Variables: Application of Machine Learning Models," Sustainability, MDPI, vol. 14(20), pages 1-17, October.
  • Handle: RePEc:gam:jsusta:v:14:y:2022:i:20:p:13685-:d:950039

    Download full text from publisher

    File URL: https://www.mdpi.com/2071-1050/14/20/13685/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2071-1050/14/20/13685/
    Download Restriction: no

    References listed on IDEAS

    1. W. Krzanowski, 1993. "The location model for mixtures of categorical and continuous variables," Journal of Classification, Springer;The Classification Society, vol. 10(1), pages 25-49, January.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. A. R. de Leon & A. Soo & T. Williamson, 2011. "Classification with discrete and continuous variables via general mixed-data models," Journal of Applied Statistics, Taylor & Francis Journals, vol. 38(5), pages 1021-1032, February.
    2. Colin O. Wu & Gang Zheng & Minjung Kwak, 2013. "A Joint Regression Analysis for Genetic Association Studies with Outcome Stratified Samples," Biometrics, The International Biometric Society, vol. 69(2), pages 417-426, June.
    3. Leung, Chi-Ying, 2003. "The effect of across-location heteroscedasticity on the classification of mixed categorical and continuous data," Journal of Multivariate Analysis, Elsevier, vol. 84(2), pages 369-386, February.
    4. Layal Christine Lettry, 2023. "Clustering the Swiss Pension Register," FSES Working Papers 529, Faculty of Economics and Social Sciences, University of Freiburg/Fribourg Switzerland.
    5. Nor Mahat & W.J. Krzanowski & A. Hernandez, 2009. "Strategies for Non-Parametric Smoothing of the Location Model in Mixed-Variable Discriminant Analysis," Modern Applied Science, Canadian Center of Science and Education, vol. 3(1), pages 151-151, January.
    6. Chi-Ying Leung, 2001. "Error rates in classification consisting of discrete and continuous variables in the presence of covariates," Statistical Papers, Springer, vol. 42(2), pages 265-273, April.
    7. Merbouha, A. & Mkhadri, A., 2004. "Regularization of the location model in discrimination with mixed discrete and continuous variables," Computational Statistics & Data Analysis, Elsevier, vol. 45(3), pages 563-576, April.
    8. Thomas Bittmann & Jens‐Peter Loy & Sven Anders, 2020. "Product differentiation and cost pass‐through: industry‐wide versus firm‐specific cost shocks," Australian Journal of Agricultural and Resource Economics, Australian Agricultural and Resource Economics Society, vol. 64(4), pages 1184-1209, October.
    9. Leung, Chi-Ying, 2005. "Regularized classification for mixed continuous and categorical variables under across-location heteroscedasticity," Journal of Multivariate Analysis, Elsevier, vol. 93(2), pages 358-374, April.
