
Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis

Author

Listed:
  • Shang-Ming Zhou
  • Fabiola Fernandez-Gutierrez
  • Jonathan Kennedy
  • Roxanne Cooksey
  • Mark Atkinson
  • Spiros Denaxas
  • Stefan Siebert
  • William G Dixon
  • Terence W O’Neill
  • Ernest Choy
  • Cathie Sudlow
  • UK Biobank Follow-up and Outcomes Group
  • Sinead Brophy

Abstract

Objectives: 1) To use a data-driven method to examine which clinical codes (risk factors) for a medical condition in primary care electronic health records (EHRs) can accurately predict a diagnosis of that condition in secondary care EHRs. 2) To develop and validate a disease phenotyping algorithm for rheumatoid arthritis (RA) using primary care EHRs.
Methods: This study linked routine primary and secondary care EHRs in Wales, UK. A machine learning based scheme was used to identify patients with rheumatoid arthritis from primary care EHRs via the following steps: i) selection of variables by comparing the relative frequencies of Read codes in the primary care dataset between disease cases and non-disease controls (disease status being defined by the secondary care diagnosis); ii) reduction of predictors/associated variables using a Random Forest method; iii) induction of decision rules from a decision tree model. The proposed method was then extensively validated on an independent dataset, and its performance was compared with that of two existing deterministic algorithms for RA which had been developed using expert clinical knowledge.
Results: Primary care EHRs were available for 2,238,360 patients over the age of 16, and of these 20,667 were also linked in the secondary care rheumatology clinical system. In the linked dataset, 900 predictors (out of a total of 43,100 variables) in the primary care record were found more frequently in those with RA than in those without. These variables were reduced to 37 groups of related clinical codes, which were used to develop a decision tree model. The final algorithm identified 8 predictors related to diagnostic codes for RA, medication codes (such as those for disease modifying anti-rheumatic drugs), and the absence of alternative diagnoses such as psoriatic arthritis. The proposed data-driven method performed as well as the methods based on expert clinical knowledge.
Conclusion: Data-driven schemes, such as ensemble machine learning methods, have the potential to identify the most informative predictors in a cost-effective and rapid way, and so to classify rheumatoid arthritis or other complex medical conditions in primary care EHRs accurately and reliably.
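
Step i) of the methods (comparing relative code frequencies between cases and controls) can be illustrated with a small sketch. This is not the authors' implementation: the function name `screen_codes`, the thresholds `min_ratio` and `min_case_freq`, the add-one smoothing, and the toy code labels are all assumptions made for illustration, and the labels are not real Read codes. Steps ii) and iii), the Random Forest reduction and decision tree rule induction, are not covered here.

```python
from collections import Counter


def screen_codes(records, labels, min_ratio=2.0, min_case_freq=0.05):
    """Keep codes that are relatively more frequent in cases than in controls.

    records: one set of clinical codes per patient (primary care record).
    labels:  one bool per patient, True = case (per secondary care diagnosis).
    The thresholds are hypothetical; the abstract does not state any.
    """
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("need at least one case and one control")

    case_counts, ctrl_counts = Counter(), Counter()
    for codes, is_case in zip(records, labels):
        # Count each code at most once per patient.
        (case_counts if is_case else ctrl_counts).update(set(codes))

    selected = {}
    for code, k in case_counts.items():
        case_freq = k / n_case
        # Add-one smoothing (an assumption) avoids division by zero for
        # codes that never occur in controls.
        ctrl_freq = (ctrl_counts[code] + 1) / (n_ctrl + 1)
        ratio = case_freq / ctrl_freq
        if case_freq >= min_case_freq and ratio >= min_ratio:
            selected[code] = ratio
    # Most case-enriched codes first.
    return dict(sorted(selected.items(), key=lambda kv: kv[1], reverse=True))


# Toy data with made-up code labels: three cases, then four controls.
records = [
    {"RA_DIAGNOSIS", "DMARD_RX", "BP_CHECK"},
    {"RA_DIAGNOSIS", "BP_CHECK"},
    {"DMARD_RX", "BP_CHECK"},
    {"BP_CHECK"},
    {"BP_CHECK", "PSORIATIC_ARTHRITIS"},
    {"BP_CHECK"},
    {"BP_CHECK"},
]
labels = [True, True, True, False, False, False, False]

selected = screen_codes(records, labels)
print(sorted(selected))
```

On this toy data the routine code `BP_CHECK` is dropped because it is equally common in both groups, while the two case-enriched codes are kept. In a pipeline like the one described, the retained codes would then be passed to the variable reduction step.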

Suggested Citation

  • Shang-Ming Zhou & Fabiola Fernandez-Gutierrez & Jonathan Kennedy & Roxanne Cooksey & Mark Atkinson & Spiros Denaxas & Stefan Siebert & William G Dixon & Terence W O’Neill & Ernest Choy & Cathie Sudlow, 2016. "Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis," PLOS ONE, Public Library of Science, vol. 11(5), pages 1-14, May.
  • Handle: RePEc:plo:pone00:0154515
    DOI: 10.1371/journal.pone.0154515

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0154515
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0154515&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Ishwaran, Hemant & Kogalur, Udaya B. & Gorodeski, Eiran Z. & Minn, Andy J. & Lauer, Michael S., 2010. "High-Dimensional Variable Selection for Survival Data," Journal of the American Statistical Association, American Statistical Association, vol. 105(489), pages 205-217.
    2. Jeffrey S. Racine, 2012. "RStudio: A Platform‐Independent IDE for R and Sweave," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 27(1), pages 167-172, January.
    3. Shang-Ming Zhou & Ronan A Lyons & Owen G Bodger & Ann John & Huw Brunt & Kerina Jones & Mike B Gravenor & Sinead Brophy, 2014. "Local Modelling Techniques for Assessing Micro-Level Impacts of Risk Factors in Complex Data: Understanding Health and Socioeconomic Inequalities in Childhood Educational Attainments," PLOS ONE, Public Library of Science, vol. 9(11), pages 1-14, November.
    4. Shang-Ming Zhou & Ronan A Lyons & Sinead Brophy & Mike B Gravenor, 2012. "Constructing Compact Takagi-Sugeno Rule Systems: Identification of Complex Interactions in Epidemiological Data," PLOS ONE, Public Library of Science, vol. 7(12), pages 1-14, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zemin Zheng & Jie Zhang & Yang Li, 2022. "L0-Regularized Learning for High-Dimensional Additive Hazards Regression," INFORMS Journal on Computing, INFORMS, vol. 34(5), pages 2762-2775, September.
    2. Riza, Lala Septem & Bergmeir, Christoph & Herrera, Francisco & Benítez, José M., 2015. "frbs: Fuzzy Rule-Based Systems for Classification and Regression in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 65(i06).
    3. Jung-sik Hong & Hyeongyu Yeo & Nam-Wook Cho & Taeuk Ahn, 2018. "Identification of Core Suppliers Based on E-Invoice Data Using Supervised Machine Learning," JRFM, MDPI, vol. 11(4), pages 1-13, October.
    4. Tommaso Orusa & Annalisa Viani & Enrico Borgogno-Mondino, 2024. "Earth Observation Data and Geospatial Deep Learning AI to Assign Contributions to European Municipalities Sen4MUN: An Empirical Application in Aosta Valley (NW Italy)," Land, MDPI, vol. 13(1), pages 1-21, January.
    5. Beatriz Talavera-Velasco & Lourdes Luceño-Moreno & Jesús Martín García & Daniel Vázquez-Estévez, 2018. "DECORE-21: Assessment of occupational stress in police. Confirmatory factor analysis of the original model," PLOS ONE, Public Library of Science, vol. 13(10), pages 1-11, October.
    6. Shang-Ming Zhou & Ronan A Lyons & Owen G Bodger & Ann John & Huw Brunt & Kerina Jones & Mike B Gravenor & Sinead Brophy, 2014. "Local Modelling Techniques for Assessing Micro-Level Impacts of Risk Factors in Complex Data: Understanding Health and Socioeconomic Inequalities in Childhood Educational Attainments," PLOS ONE, Public Library of Science, vol. 9(11), pages 1-14, November.
    7. Youngjoo Cho & Debashis Ghosh, 2021. "Quantile-Based Subgroup Identification for Randomized Clinical Trials," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 13(1), pages 90-128, April.
    8. Ismael Ahrazem Dfuf & José Manuel Mira McWilliams & María Camino González Fernández, 2019. "Multi-Output Conditional Inference Trees Applied to the Electricity Market: Variable Importance Analysis," Energies, MDPI, vol. 12(6), pages 1-24, March.
    9. Han, Dongxiao & Huang, Jian & Lin, Yuanyuan & Shen, Guohao, 2022. "Robust post-selection inference of high-dimensional mean regression with heavy-tailed asymmetric or heteroskedastic errors," Journal of Econometrics, Elsevier, vol. 230(2), pages 416-431.
    10. Makariou, Despoina & Barrieu, Pauline & Chen, Yining, 2021. "A random forest based approach for predicting spreads in the primary catastrophe bond market," Insurance: Mathematics and Economics, Elsevier, vol. 101(PB), pages 140-162.
    11. Aikaterini Lyra & Athanasios Loukas, 2023. "Simulation and Evaluation of Water Resources Management Scenarios Under Climate Change for Adaptive Management of Coastal Agricultural Watersheds," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 37(6), pages 2625-2642, May.
    12. Back, Paula Regina, 2019. "On the relationship between corporate social responsibility and competitive performance in Brazilian Small and Medium Enterprises - empirical evidence from a stakeholders’ perspective," Thesis Commons bdumw, Center for Open Science.
    13. Christine Porzelius & Martin Schumacher & Harald Binder, 2011. "The benefit of data-based model complexity selection via prediction error curves in time-to-event data," Computational Statistics, Springer, vol. 26(2), pages 293-302, June.
    14. Foucher Yohann & Danger Richard, 2012. "Time Dependent ROC Curves for the Estimation of True Prognostic Capacity of Microarray Data," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(6), pages 1-22, November.
    15. J. Choi & S. Ye & K. H. Eng & K. Korthauer & W. H. Bradley & J. S. Rader & C. Kendziorski, 2017. "IPI59: An Actionable Biomarker to Improve Treatment Response in Serous Ovarian Carcinoma Patients," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 9(1), pages 1-12, June.
    16. Rossella Tatoli & Luisa Lampignano & Rossella Donghia & Alfredo Niro & Fabio Castellana & Ilaria Bortone & Roberta Zupo & Sarah Tirelli & Madia Lozupone & Francesco Panza & Giovanni Alessio & Francesc, 2023. "Retinal Microvasculature and Neural Changes and Dietary Patterns in an Older Population in Southern Italy," IJERPH, MDPI, vol. 20(6), pages 1-17, March.
    17. Peter Calhoun & Melodie J. Hallett & Xiaogang Su & Guy Cafri & Richard A. Levine & Juanjuan Fan, 2020. "Random forest with acceptance–rejection trees," Computational Statistics, Springer, vol. 35(3), pages 983-999, September.
    18. Hoora Moradian & Denis Larocque & François Bellavance, 2017. "$L_1$ splitting rules in survival forests," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 23(4), pages 671-691, October.
    19. Yiwei Fan & Gang Wang & Xiaoling Lu & Gaobin Wang, 2019. "Distributed forecasting and ant colony optimization for the bike-sharing rebalancing problem with unserved demands," PLOS ONE, Public Library of Science, vol. 14(12), pages 1-26, December.
    20. Bimba, Andrew Thomas & Idris, Norisma & Al-Hunaiyyan, Ahmed & Mahmud, Rohana Binti & Abdelaziz, Ahmed & Khan, Suleman & Chang, Victor, 2016. "Towards knowledge modeling and manipulation technologies: A survey," International Journal of Information Management, Elsevier, vol. 36(6), pages 857-871.
