Authors
- Dennis Khodasevich
- Nina Holland
- Lars van der Laan
- Andres Cardenas
Abstract
Background: DNA methylation (DNAm) provides a window to characterize the impacts of environmental exposures and the biological aging process. Epigenetic clocks are often trained on DNAm using penalized regression of CpG sites, but recent evidence suggests potential benefits of training epigenetic predictors on principal components.
Methodology/findings: We developed a pipeline to simultaneously train three epigenetic predictors: a traditional CpG Clock, a PCA Clock, and a SuperLearner PCA Clock (SL PCA). We gathered publicly available DNAm datasets to generate i) a novel childhood epigenetic clock, ii) a reconstructed Hannum adult blood clock, and iii) as a proof of concept, a predictor of polybrominated biphenyl exposure using the three developmental methodologies. We used correlation coefficients and median absolute error to assess fit between predicted and observed measures, as well as agreement between duplicates. The SL PCA clocks improved fit with observed phenotypes relative to the PCA clocks or CpG clocks across several datasets. We found evidence for higher agreement between duplicate samples run on alternate DNAm arrays when using SL PCA clocks relative to traditional methods. Analyses examining associations between relevant exposures and epigenetic age acceleration (EAA) produced more precise effect estimates when using predictions derived from SL PCA clocks.
Conclusions: We introduce a novel method for the development of DNAm-based predictors that combines the improved reliability conferred by training on principal components with advanced ensemble-based machine learning. Coupling SuperLearner with PCA in the predictor development process may be especially relevant for studies with longitudinal designs utilizing multiple array types, as well as for the development of predictors of more complex phenotypic traits.
Author summary: DNA methylation functions as a vital interface between genes and environment. A wide range of epigenetic predictors have harnessed DNA methylation data to address a variety of research questions, including improving our understanding of the biological aging process and characterizing past exposure to environmental toxins. However, the methodology used to develop most existing epigenetic predictors is subject to several limitations, including the influence of technical variables, batch effects, and difficulty modeling complex relationships between the variable of interest and DNA methylation. Here, we introduce a novel method for the development of epigenetic predictors that combines the improved reliability conferred by training on principal components with advanced ensemble-based machine learning. We demonstrate the potential benefits of this procedure by developing a novel childhood epigenetic clock, reconstructing the Hannum clock, and producing a predictor of polybrominated biphenyl exposure. This training methodology may be especially relevant for the development of epigenetic predictors of complex phenotypic traits, which often suffer from poor performance using the traditional development methodology, and for the improvement of the reliability of epigenetic clocks for studies with longitudinal designs utilizing multiple array types.
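The abstract names two fit measures, correlation coefficients and median absolute error, and applies them both to predicted versus observed values and to duplicate samples. The following is a minimal sketch of how such measures might be computed, not the authors' code: the function names, the choice of Pearson correlation, and all numeric values are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, median


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def median_absolute_error(observed, predicted):
    """Median of |observed - predicted| across samples."""
    return median(abs(o - p) for o, p in zip(observed, predicted))


# Hypothetical chronological ages and clock predictions (years);
# these numbers are made up for illustration only.
observed = [2.0, 5.0, 7.0, 9.0, 12.0]
predicted = [2.5, 4.0, 7.5, 9.0, 13.5]

# Fit between predicted and observed values.
fit_r = pearson_r(observed, predicted)
fit_mae = median_absolute_error(observed, predicted)

# Agreement between duplicates: the same hypothetical samples
# predicted from two different array types.
array_a = predicted
array_b = [2.75, 4.5, 7.25, 9.25, 13.0]
dup_r = pearson_r(array_a, array_b)
dup_mae = median_absolute_error(array_a, array_b)

print(f"fit: r={fit_r:.3f}, median abs. error={fit_mae:.2f}")
print(f"duplicates: r={dup_r:.3f}, median abs. difference={dup_mae:.2f}")
```

Under this reading, a predictor with better fit shows a higher correlation and a lower median absolute error against observed values, and a more reliable predictor shows a smaller median absolute difference between duplicate samples.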
Suggested Citation
Dennis Khodasevich & Nina Holland & Lars van der Laan & Andres Cardenas, 2025.
"A SuperLearner-based pipeline for the development of DNA methylation-derived predictors of phenotypic traits,"
PLOS Computational Biology, Public Library of Science, vol. 21(2), pages 1-20, February.
Handle:
RePEc:plo:pcbi00:1012768
DOI: 10.1371/journal.pcbi.1012768