Author
Listed:
- Jacob Gould Ellen
- Chrystinne Fernandes
- Martin Viola
- Keagan Yap
- Arinda Jordan
- Mutesi Flavia Kirabo
- João Matos
- Pedro Moreira
- Leo Anthony Celi
Abstract
Clinical research studies routinely apply exclusion criteria and data preprocessing steps that can substantially alter dataset composition, potentially introducing hidden biases that affect validity and generalizability. This is particularly important in artificial intelligence/machine learning (AI/ML) studies where models learn patterns directly from training data. We developed Equiflow, an open-source Python package that automates creation of enhanced participant flow diagrams tracking both sample size and composition changes throughout studies. Equiflow quantifies distributional shifts at each exclusion step and generates visualizations showing how key clinical and demographic variables evolve during participant selection. In a case study of sepsis patients from the eICU database, sequential exclusions reduced the sample from 126,750 to 1,094 patients. Requiring non-missing troponin measurements in the final step of data processing caused substantial demographic shifts that would typically remain invisible in traditional reporting. By making compositional biases visible during cohort construction before modeling begins, Equiflow enables researchers to make informed decisions about analyses and acknowledge limitations in generalizability to their readers. This standardized, open-source approach promotes transparency in clinical research and supports development of more equitable clinical AI systems, addressing a critical need as healthcare increasingly relies on data-driven decision making.
Author summary: Medical research studies filter participants through multiple steps, often removing those with missing data, applying clinical criteria, or excluding based on demographic factors. While each step may seem routine, the cumulative effect can dramatically reshape who remains in the final dataset, introducing hidden biases that undermine study validity and generalizability.
This problem is particularly concerning in AI applications, where algorithms learn directly from training data and can perpetuate healthcare disparities. We developed Equiflow, a free, open-source Python tool that automatically generates visual diagrams tracking how a study population changes at each filtering step. Unlike traditional reporting methods that show only participant counts, Equiflow reveals compositional shifts, such as whether excluding patients with missing lab values disproportionately removes certain demographic groups. We describe two case studies using real-world ICU data showing how routine exclusion criteria can alter fundamental characteristics of a cohort. These shifts, invisible in standard reporting, could affect which patients benefit from resulting clinical tools. By making such biases visible early in the research process, Equiflow enables researchers to make informed decisions and transparently acknowledge limitations in their findings.
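The idea of tracking composition as well as counts at each exclusion step can be illustrated with a minimal sketch. This is not Equiflow's API: the function names `composition` and `track_flow`, the record fields, and the toy data are all hypothetical and invented for illustration, and the code uses only the Python standard library.

```python
from collections import Counter


def composition(rows, key):
    """Return the share of each category of `key` among `rows`."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def track_flow(rows, steps, key):
    """Apply exclusion steps in order, recording size and composition after each."""
    report = [("initial cohort", len(rows), composition(rows, key))]
    for label, keep in steps:
        rows = [row for row in rows if keep(row)]
        report.append((label, len(rows), composition(rows, key)))
    return report


# Toy records, invented for illustration only (not eICU data).
cohort = [
    {"age": 70, "sex": "F", "troponin": 0.04},
    {"age": 55, "sex": "F", "troponin": None},
    {"age": 16, "sex": "M", "troponin": 0.10},
    {"age": 63, "sex": "M", "troponin": 0.02},
    {"age": 48, "sex": "F", "troponin": None},
    {"age": 81, "sex": "M", "troponin": 0.30},
]

# Hypothetical exclusion criteria: each step is a label plus a "keep" predicate.
steps = [
    ("adults only", lambda r: r["age"] >= 18),
    ("non-missing troponin", lambda r: r["troponin"] is not None),
]

flow = track_flow(cohort, steps, key="sex")
for label, n, shares in flow:
    print(label, n, {k: round(v, 2) for k, v in sorted(shares.items())})
```

In this toy cohort, a count-only flow diagram would report 6, then 5, then 3 participants. Tracking composition additionally shows the share of female patients moving from one half to three fifths to one third, a shift driven entirely by the missing-troponin exclusion.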
Suggested Citation
Jacob Gould Ellen & Chrystinne Fernandes & Martin Viola & Keagan Yap & Arinda Jordan & Mutesi Flavia Kirabo & João Matos & Pedro Moreira & Leo Anthony Celi, 2026.
"Equiflow: An open-source software package for evaluating changes in cohort composition,"
PLOS Digital Health, Public Library of Science, vol. 5(4), pages 1-15, April.
Handle:
RePEc:plo:pdig00:0001342
DOI: 10.1371/journal.pdig.0001342