
Removing the influence of group variables in high‐dimensional predictive modelling

Authors

Listed:
  • Emanuele Aliverti
  • Kristian Lum
  • James E. Johndrow
  • David B. Dunson

Abstract

In many application areas, predictive models are used to support or make important decisions. There is increasing awareness that these models may contain spurious or otherwise undesirable correlations. Such correlations may arise from a variety of sources, including batch effects, systematic measurement errors or sampling bias. Without explicit adjustment, machine learning algorithms trained using these data can produce out‐of‐sample predictions which propagate these undesirable correlations. We propose a method to pre‐process the training data, producing an adjusted dataset that is statistically independent of the nuisance variables with minimum information loss. We develop a conceptually simple approach for creating an adjusted dataset in high‐dimensional settings based on a constrained form of matrix decomposition. The resulting dataset can then be used in any predictive algorithm with the guarantee that predictions will be statistically independent of the nuisance variables. We develop a scalable algorithm for implementing the method, along with theory support in the form of independence guarantees and optimality. The method is illustrated on some simulation examples and applied to two case studies: removing machine‐specific correlations from brain scan data, and removing ethnicity information from a dataset used to predict recidivism. That the motivation for removing undesirable correlations is quite different in the two applications illustrates the broad applicability of our approach.
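As a rough illustration of the kind of pre-processing the abstract describes, the sketch below builds a low-rank version of a data matrix whose columns are orthogonal to a group variable. It is not the authors' algorithm: it simply projects the data onto the orthogonal complement of the group variables and then takes a truncated SVD, which is one simple way to obtain a constrained low-rank decomposition. The names `remove_group_influence`, `X`, `Z` and `rank` are hypothetical, and the sketch only guarantees zero sample covariance with `Z` (linear orthogonality), which is weaker than full statistical independence in general.

```python
import numpy as np


def remove_group_influence(X, Z, rank):
    """Return a rank-`rank` approximation of X whose columns are
    orthogonal to the group variable(s) Z and to the intercept.

    Illustrative sketch only; names and approach are assumptions,
    not the method published in the paper.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Add an intercept column so the adjusted columns are also mean-centred.
    Z1 = np.column_stack([np.ones(n), np.asarray(Z, dtype=float)])
    # Orthonormal basis Q whose span contains the column space of Z1.
    Q, _ = np.linalg.qr(Z1)
    # Project every column of X onto the orthogonal complement of span(Q).
    X_perp = X - Q @ (Q.T @ X)
    # A truncated SVD of the projected data gives the best rank-`rank`
    # approximation (in Frobenius norm) of X_perp; its left singular
    # vectors stay orthogonal to Z1.
    U, s, Vt = np.linalg.svd(X_perp, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


# Usage on synthetic data with a binary group label.
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=200).astype(float)
X = rng.normal(size=(200, 50)) + np.outer(Z, rng.normal(size=50))
X_adj = remove_group_influence(X, Z, rank=5)
# Sample cross-products with Z are zero up to floating-point error.
print(np.abs(Z @ X_adj).max())
```

Any downstream model that is a function of `X_adj` alone then works with features that carry no linear trace of `Z`; nonlinear dependence would need stronger assumptions or a different construction.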

Suggested Citation

  • Emanuele Aliverti & Kristian Lum & James E. Johndrow & David B. Dunson, 2021. "Removing the influence of group variables in high‐dimensional predictive modelling," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(3), pages 791-811, July.
  • Handle: RePEc:bla:jorssa:v:184:y:2021:i:3:p:791-811
    DOI: 10.1111/rssa.12613

    Download full text from publisher

    File URL: https://doi.org/10.1111/rssa.12613
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yuto Hasegawa & Juhyun Kim & Gianluca Ursini & Yan Jouroukhin & Xiaolei Zhu & Yu Miyahara & Feiyi Xiong & Samskruthi Madireddy & Mizuho Obayashi & Beat Lutz & Akira Sawa & Solange P. Brown & Mikhail V, 2023. "Microglial cannabinoid receptor type 1 mediates social memory deficits in mice produced by adolescent THC exposure and 16p11.2 duplication," Nature Communications, Nature, vol. 14(1), pages 1-19, December.
    2. Christophe Hurlin & Christophe Perignon & Sébastien Saurin, 2021. "The Fairness of Credit Scoring Models," Working Papers hal-03501452, HAL.
    3. Yoan Hermstrüwer & Pascal Langenbach, 2022. "Fair Governance with Humans and Machines," Discussion Paper Series of the Max Planck Institute for Research on Collective Goods 2022_04, Max Planck Institute for Research on Collective Goods, revised 01 Mar 2023.
    4. Paola Perchinunno & Massimo Bilancia & Domenico Vitale, 2021. "A Statistical Analysis of Factors Affecting Higher Education Dropouts," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 156(2), pages 341-362, August.
    5. Chakraborty, Tanujit & Chakraborty, Ashis Kumar & Murthy, C.A., 2019. "A nonparametric ensemble binary classifier and its statistical properties," Statistics & Probability Letters, Elsevier, vol. 149(C), pages 16-23.
    6. Lu, Xuefei & Borgonovo, Emanuele, 2023. "Global sensitivity analysis in epidemiological modeling," European Journal of Operational Research, Elsevier, vol. 304(1), pages 9-24.
    7. Charlotte Soneson & Sarah Gerster & Mauro Delorenzi, 2014. "Batch Effect Confounding Leads to Strong Bias in Performance Estimates Obtained by Cross-Validation," PLOS ONE, Public Library of Science, vol. 9(6), pages 1-13, June.
    8. Arjun Bhattacharya & Anastasia N. Freedman & Vennela Avula & Rebeca Harris & Weifang Liu & Calvin Pan & Aldons J. Lusis & Robert M. Joseph & Lisa Smeester & Hadley J. Hartwell & Karl C. K. Kuban & Car, 2022. "Placental genomics mediates genetic associations with complex health traits and disease," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    9. Sudhir Varma, 2020. "Blind estimation and correction of microarray batch effect," PLOS ONE, Public Library of Science, vol. 15(4), pages 1-15, April.
    10. Blum Yuna & Houée-Bigot Magalie & Causeur David, 2016. "Sparse factor model for co-expression networks with an application using prior biological knowledge," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 15(3), pages 253-272, June.
    11. Zachary R. McCaw & Sheila M. Gaynor & Ryan Sun & Xihong Lin, 2023. "Leveraging a surrogate outcome to improve inference on a partially missing target outcome," Biometrics, The International Biometric Society, vol. 79(2), pages 1472-1484, June.
    12. Anna Langenberg & Shih-Chi Ma & Tatiana Ermakova & Benjamin Fabian, 2023. "Formal Group Fairness and Accuracy in Automated Decision Making," Mathematics, MDPI, vol. 11(8), pages 1-25, April.
    13. Angela Tung & Megan M. Sperry & Wesley Clawson & Ananya Pavuluri & Sydney Bulatao & Michelle Yue & Ramses Martinez Flores & Vaibhav P. Pai & Patrick McMillen & Franz Kuchling & Michael Levin, 2024. "Embryos assist morphogenesis of others through calcium and ATP signaling mechanisms in collective teratogen resistance," Nature Communications, Nature, vol. 15(1), pages 1-22, December.
    14. Friguet, Chloé & Causeur, David, 2011. "Estimation of the proportion of true null hypotheses in high-dimensional data under dependence," Computational Statistics & Data Analysis, Elsevier, vol. 55(9), pages 2665-2676, September.
    15. Kozodoi, Nikita & Jacob, Johannes & Lessmann, Stefan, 2022. "Fairness in credit scoring: Assessment, implementation and profit implications," European Journal of Operational Research, Elsevier, vol. 297(3), pages 1083-1094.
    16. repec:plo:pcbi00:1008366 is not listed on IDEAS
    17. Jonathan M. Dreyfuss & Yixing Yuchi & Xuehong Dong & Vissarion Efthymiou & Hui Pan & Donald C. Simonson & Ashley Vernon & Florencia Halperin & Pratik Aryal & Anish Konkar & Yinong Sebastian & Brandon , 2021. "High-throughput mediation analysis of human proteome and metabolome identifies mediators of post-bariatric surgical diabetes control," Nature Communications, Nature, vol. 12(1), pages 1-13, December.
    18. repec:jss:jstsof:40:i14 is not listed on IDEAS
    19. Xuemeng Zhou & Tsz Wing Sam & Ah Young Lee & Danny Leung, 2021. "Mouse strain-specific polymorphic provirus functions as cis-regulatory element leading to epigenomic and transcriptomic variations," Nature Communications, Nature, vol. 12(1), pages 1-18, December.
    20. Michael W Nagle & Jeanne C Latourelle & Adam Labadorf & Alexandra Dumitriu & Tiffany C Hadzi & Thomas G Beach & Richard H Myers, 2016. "The 4p16.3 Parkinson Disease Risk Locus Is Associated with GAK Expression and Genes Involved with the Synaptic Vesicle Membrane," PLOS ONE, Public Library of Science, vol. 11(8), pages 1-14, August.
    21. Lily Monnier & Paul-Henry Cournède, 2024. "A novel batch-effect correction method for scRNA-seq data based on Adversarial Information Factorization," PLOS Computational Biology, Public Library of Science, vol. 20(2), pages 1-22, February.
    22. Kigerl, Alex & Hamilton, Zachary & Kowalski, Melissa & Mei, Xiaohan, 2022. "The great methods bake-off: Comparing performance of machine learning algorithms," Journal of Criminal Justice, Elsevier, vol. 82(C).

