Printed from https://ideas.repec.org/a/gam/jdataj/v11y2026i5p103-d1934834.html

NSCH-Flourishing-ML: A Curated Dataset and Reproducible Pipeline for Machine Learning Analysis of Child Flourishing

Author

Listed:
  • Miguel Arcos-Argudo

    (Department of Advanced Computing and Data Research Group, Universidad Politécnica Salesiana, Cuenca 010102, Ecuador
    Current address: Department of Advanced Computing and Data Research Group, Universidad Politécnica Salesiana, Turuhuayco Ave. 3-69, Cuenca 010210, Ecuador.)

  • Rodolfo Bojorque

    (Department of Advanced Computing and Data Research Group, Universidad Politécnica Salesiana, Cuenca 010102, Ecuador)

  • Fernando Pesántez

    (Department of Artificial Intelligence and Assistive Technologies, Universidad Politécnica Salesiana, Cuenca 010102, Ecuador)

  • Kely Nieto-Andrade

    (Early Childhood Education Career, Universidad Técnica Particular de Loja, Loja 1101608, Ecuador)

Abstract

Large-scale population surveys provide valuable information for studying child well-being, yet their structure often limits the direct application of machine-learning methods. The National Survey of Children’s Health (NSCH) is one of the most comprehensive datasets for monitoring children’s health and development in the United States, but the raw survey files contain logical skip patterns, categorical variables, and complex survey-design elements that require substantial preprocessing before predictive analysis can be performed. This study presents a curated machine-learning-ready benchmark dataset derived from the 2023 NSCH together with a fully reproducible computational pipeline for studying school-age child flourishing. The workflow constructs a binary flourishing outcome from four survey items related to curiosity, task persistence, emotional self-regulation, and interest in doing well in school. After restricting the sample to children aged 6–17 years and retaining only records with valid responses in all four outcome items, the final analytical dataset contained 32,934 observations. Feature selection based on mutual information computed on the training partition, combined with cross-validated subset-size selection, yielded a final benchmark subset of 150 predictors. Baseline experiments using logistic regression and random forest showed stable and reasonably strong predictive performance, with held-out ROC-AUC values around 0.84–0.85 and closely aligned cross-validation results. An exploratory comparison between weighted and unweighted learning further showed that survey weighting did not improve discriminative performance in this benchmark setting, although the magnitude of the effect was modest and model-dependent. By releasing both the curated benchmark dataset and the reproducible pipeline, this study provides a reusable resource for machine-learning research on child well-being and survey-based computational benchmarking.
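The sample restriction and outcome construction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the NSCH variable names, the response coding, or the rule that turns four items into a binary outcome, so everything specific here is an assumption made for illustration: the item names in `ITEMS`, the valid codes 1 to 4, and the rule that a child counts as flourishing only when all four items are answered 1 or 2. It is not the authors' pipeline.

```python
from typing import Optional

# Hypothetical item names: the real NSCH variable names are not in the abstract.
ITEMS = ("curiosity", "persistence", "self_regulation", "school_interest")
# Assumed response coding: 1..4, with lower codes meaning "more often".
VALID_CODES = {1, 2, 3, 4}
# Age range stated in the abstract (school-age children, 6-17 years).
MIN_AGE, MAX_AGE = 6, 17


def flourishing_label(record: dict) -> Optional[int]:
    """Return 1 or 0 for an eligible record, or None if it is excluded."""
    age = record.get("age")
    if age is None or not (MIN_AGE <= age <= MAX_AGE):
        return None  # outside the 6-17 age restriction
    answers = [record.get(item) for item in ITEMS]
    if any(a not in VALID_CODES for a in answers):
        return None  # missing or invalid response on an outcome item
    # Assumed rule (not from the source): all four items answered 1 or 2.
    return int(all(a <= 2 for a in answers))


def build_dataset(records: list) -> list:
    """Keep eligible records and attach the binary 'flourishing' outcome."""
    rows = []
    for record in records:
        label = flourishing_label(record)
        if label is not None:
            rows.append({**record, "flourishing": label})
    return rows


if __name__ == "__main__":
    sample = [
        {"age": 10, "curiosity": 1, "persistence": 2,
         "self_regulation": 1, "school_interest": 1},
        {"age": 4, "curiosity": 1, "persistence": 1,
         "self_regulation": 1, "school_interest": 1},
    ]
    for row in build_dataset(sample):
        print(row["age"], row["flourishing"])
```

Returning `None` for excluded records keeps the age filter and the valid-response filter in one place, so the retained rows are exactly those with a defined outcome, mirroring the restriction that yields the final analytical sample.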

Suggested Citation

  • Miguel Arcos-Argudo & Rodolfo Bojorque & Fernando Pesántez & Kely Nieto-Andrade, 2026. "NSCH-Flourishing-ML: A Curated Dataset and Reproducible Pipeline for Machine Learning Analysis of Child Flourishing," Data, MDPI, vol. 11(5), pages 1-18, May.
  • Handle: RePEc:gam:jdataj:v:11:y:2026:i:5:p:103-:d:1934834

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/11/5/103/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/11/5/103/
    Download Restriction: no
