Printed from https://ideas.repec.org/a/nat/natcom/v16y2025i1d10.1038_s41467-025-62237-4.html

High performance data integration for large-scale analyses of incomplete Omic profiles using Batch-Effect Reduction Trees (BERT)

Authors
  • Yannis Schumann

    (Deutsches Elektronen-Synchrotron DESY)

  • Simon Schlumbohm

    (Helmut-Schmidt-University Hamburg)

  • Julia E. Neumann

    (University Medical Center Hamburg-Eppendorf (UKE))

  • Philipp Neumann

    (Deutsches Elektronen-Synchrotron DESY
    University of Hamburg)

Abstract

Data from high-throughput technologies that assess global patterns of biomolecules (omic data) are often afflicted with missing values and measurement-specific biases (batch effects) that hinder the quantitative comparison of independently acquired datasets. This work introduces batch-effect reduction trees (BERT), a high-performance method for data integration of incomplete omic profiles. We characterize BERT on large-scale data integration tasks with up to 5000 datasets from simulated and experimental data of different quantification techniques and omic types (proteomics, transcriptomics, metabolomics), as well as other data types such as clinical data, emphasizing the broad scope of the algorithm. Compared to HarmonizR, the only available method for integration of incomplete omic data, our method (1) retains up to five orders of magnitude more numeric values, (2) leverages multi-core and distributed-memory systems for up to 11× runtime improvement, and (3) considers covariates and reference measurements to account for severely imbalanced or sparsely distributed conditions (up to 2× improvement of average silhouette width).
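
The abstract reports integration quality as average silhouette width (ASW). As an illustration only, the following minimal Python sketch computes the standard ASW with Euclidean distances. It is not taken from the BERT implementation: the function name `average_silhouette_width`, the toy data, and the assumption of a complete matrix without missing values are all hypothetical choices made for this example. The abstract does not state which grouping (biological condition or batch) the reported ASW refers to, so the sketch accepts any label vector.

```python
import numpy as np


def average_silhouette_width(X, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance).

    Illustrative helper, not part of BERT. Assumes X has no missing values.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    if groups.size < 2:
        raise ValueError("ASW needs at least two distinct labels")

    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # common convention: a singleton group scores 0
        # a: mean distance to the other members of the sample's own group
        # (the self-distance is 0, so dividing by n_same - 1 excludes it).
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other group.
        b = min(dist[i, labels == g].mean() for g in groups if g != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return float(scores.mean())


# Toy data: two tight groups that are far apart along the first axis.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
well_separated = average_silhouette_width(X, [0, 0, 1, 1])
mixed = average_silhouette_width(X, [0, 1, 0, 1])
```

In this toy case the labels that follow the two clusters give an ASW close to 1, while labels that cut across the clusters give a negative value. In batch-effect studies the metric is commonly evaluated once with biological labels (higher is better) and once with batch labels (values near or below zero indicate well-mixed batches).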

Suggested Citation

  • Yannis Schumann & Simon Schlumbohm & Julia E. Neumann & Philipp Neumann, 2025. "High performance data integration for large-scale analyses of incomplete Omic profiles using Batch-Effect Reduction Trees (BERT)," Nature Communications, Nature, vol. 16(1), pages 1-13, December.
  • Handle: RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-62237-4
    DOI: 10.1038/s41467-025-62237-4

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-025-62237-4
    File Function: Abstract
    Download Restriction: no

    File URL: https://libkey.io/10.1038/s41467-025-62237-4?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


