
Statistical inference for natural language processing algorithms with a demonstration using type 2 diabetes prediction from electronic health record notes

Authors

  • Brian L. Egleston
  • Tian Bai
  • Richard J. Bleicher
  • Stanford J. Taylor
  • Michael H. Lutz
  • Slobodan Vucetic

Abstract

The pointwise mutual information statistic (PMI), which measures how often two words occur together in a document corpus, is a cornerstone of recently proposed popular natural language processing algorithms such as word2vec. PMI and word2vec reveal semantic relationships between words and can be helpful in a range of applications such as document indexing, topic analysis, or document categorization. We use probability theory to demonstrate the relationship between PMI and word2vec. We use the theoretical results to demonstrate how the PMI can be modeled and estimated in a simple and straightforward manner. We further describe how one can obtain standard error estimates that account for within‐patient clustering that arises from patterns of repeated words within a patient's health record due to a unique health history. We then demonstrate the usefulness of PMI on the problem of predictive identification of disease from free text notes of electronic health records. Specifically, we use our methods to distinguish those with and without type 2 diabetes mellitus in electronic health record free text data using over 400 000 clinical notes from an academic medical center.
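
As a rough illustration of the statistic the abstract refers to, the sketch below computes PMI under its standard textbook definition, PMI(a, b) = log[ p(a, b) / (p(a) p(b)) ], with probabilities estimated as the share of documents containing each word or word pair. This is not necessarily the estimator used in the article, which also models PMI and adjusts standard errors for within-patient clustering. The function name `document_pmi` and the toy notes are hypothetical and made up for this example.

```python
import math
from collections import Counter
from itertools import combinations


def document_pmi(docs):
    """Return {(word_a, word_b): PMI} from document-level co-occurrence.

    docs is a list of token lists. Probabilities are estimated as the
    fraction of documents containing a word (or both words of a pair).
    Pairs that never co-occur are omitted, since their PMI is -infinity.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for tokens in docs:
        vocab = sorted(set(tokens))  # count each word once per document
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))  # pairs in alphabetical order

    pmi = {}
    for (a, b), n_ab in pair_df.items():
        p_ab = n_ab / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi


# Toy, made-up "notes" (hypothetical data, not from the article).
notes = [
    ["insulin", "glucose", "diabetes"],
    ["glucose", "diabetes"],
    ["fracture", "cast"],
    ["insulin", "diabetes"],
]
scores = document_pmi(notes)
```

A positive score means the two words appear together in more documents than independence would predict. In this toy corpus, `scores[("diabetes", "glucose")]` equals log(4/3), because the pair appears in 2 of 4 notes while the individual words appear in 3 and 2 notes respectively.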

Suggested Citation

  • Brian L. Egleston & Tian Bai & Richard J. Bleicher & Stanford J. Taylor & Michael H. Lutz & Slobodan Vucetic, 2021. "Statistical inference for natural language processing algorithms with a demonstration using type 2 diabetes prediction from electronic health record notes," Biometrics, The International Biometric Society, vol. 77(3), pages 1089-1100, September.
  • Handle: RePEc:bla:biomet:v:77:y:2021:i:3:p:1089-1100
    DOI: 10.1111/biom.13338

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13338
    Download Restriction: no


    References listed on IDEAS

    1. Feinerer, Ingo & Hornik, Kurt & Meyer, David, 2008. "Text Mining Infrastructure in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 25(i05).
    2. Rick L. Williams, 2000. "A Note on Robust Variance Estimation for Cluster-Correlated Data," Biometrics, The International Biometric Society, vol. 56(2), pages 645-646, June.
    3. Leo Egghe & Loet Leydesdorff, 2009. "The relation between Pearson's correlation coefficient r and Salton's cosine measure," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 60(5), pages 1027-1036, May.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. David H Chae & Sean Clouston & Mark L Hatzenbuehler & Michael R Kramer & Hannah L F Cooper & Sacoby M Wilson & Seth I Stephens-Davidowitz & Robert S Gold & Bruce G Link, 2015. "Association between an Internet-Based Measure of Area Racism and Black Mortality," PLOS ONE, Public Library of Science, vol. 10(4), pages 1-12, April.
    2. Krause, Werner & Giebler, Heiko, 2020. "Shifting Welfare Policy Positions: The Impact of Radical Right Populist Party Success Beyond Migration Politics," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 56(3), pages 331-348.
    3. Gerben ter Riet & Paula Chesley & Alan G Gross & Lara Siebeling & Patrick Muggensturm & Nadine Heller & Martin Umbehr & Daniela Vollenweider & Tsung Yu & Elie A Akl & Lizzy Brewster & Olaf M Dekkers &, 2013. "All That Glitters Isn't Gold: A Survey on Acknowledgment of Limitations in Biomedical Studies," PLOS ONE, Public Library of Science, vol. 8(11), pages 1-6, November.
    4. Doidge, Craig & Andrew Karolyi, G. & Stulz, Rene M., 2007. "Why do countries matter so much for corporate governance?," Journal of Financial Economics, Elsevier, vol. 86(1), pages 1-39, October.
    5. Grinis, Inna, 2017. "The STEM requirements of "non-STEM" jobs: evidence from UK online vacancy postings and implications for skills & knowledge shortages," LSE Research Online Documents on Economics 85123, London School of Economics and Political Science, LSE Library.
    6. Dave, Dhaval, 2008. "Illicit drug use among arrestees, prices and policy," Journal of Urban Economics, Elsevier, vol. 63(2), pages 694-714, March.
    7. Miozzo, Marcela & Desyllas, Panos & Lee, Hsing-fen & Miles, Ian, 2016. "Innovation collaboration and appropriability by knowledge-intensive business services firms," Research Policy, Elsevier, vol. 45(7), pages 1337-1351.
    8. Gagnon, Louis & Karolyi, G. Andrew, 2009. "Information, Trading Volume, and International Stock Return Comovements: Evidence from Cross-Listed Stocks," Journal of Financial and Quantitative Analysis, Cambridge University Press, vol. 44(4), pages 953-986, August.
    9. Julia Bachtrögler & Christoph Hammer & Wolf Heinrich Reuter & Florian Schwendinger, 2019. "Guide to the galaxy of EU regional funds recipients: evidence from new data," Empirica, Springer;Austrian Institute for Economic Research;Austrian Economic Association, vol. 46(1), pages 103-150, February.
    10. Nicolas Jacquemet & Adam Zylbersztejn, 2014. "What drives failure to maximize payoffs in the lab? A test of the inequality aversion hypothesis," Review of Economic Design, Springer;Society for Economic Design, vol. 18(4), pages 243-264, December.
    11. Eva-Lotta Nilsson & Anna-Karin Ivert & Marie Torstensson Levander, 2021. "Adolescents´ Perceptions, Neighbourhood Characteristics and Parental Monitoring -Are they Related, and Do they Interact in the Explanation of Adolescent Offending?," Child Indicators Research, Springer;The International Society of Child Indicators (ISCI), vol. 14(3), pages 1075-1087, June.
    12. Liebig, Stefan & Schupp, Jürgen, 2008. "Leistungs- oder Bedarfsgerechtigkeit? Über einen normativen Zielkonflikt des Wohlfahrtsstaats und seiner Bedeutung für die Bewertung des eigenen Erwerbseinkommens," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 59(1), pages 7-30.
    13. Shuyue Huang & Lena Jingen Liang & Hwansuk Chris Choi, 2022. "How We Failed in Context: A Text-Mining Approach to Understanding Hotel Service Failures," Sustainability, MDPI, vol. 14(5), pages 1-18, February.
    14. Laura Anderlucci & Cinzia Viroli, 2020. "Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(4), pages 759-770, December.
    15. Daniel, Vanessa E. & Florax, Raymond J.G.M. & Rietveld, Piet, 2009. "Flooding risk and housing values: An economic assessment of environmental hazard," Ecological Economics, Elsevier, vol. 69(2), pages 355-365, December.
    16. Stefano Sbalchiero & Maciej Eder, 2020. "Topic modeling, long texts and the best number of topics. Some Problems and solutions," Quality & Quantity: International Journal of Methodology, Springer, vol. 54(4), pages 1095-1108, August.
    17. Timo-Kolja Pförtner & Bart Clercq & Michela Lenzi & Alessio Vieno & Katharina Rathmann & Irene Moor & Anne Hublet & Michal Molcho & Anton Kunst & Matthias Richter, 2015. "Does the association between different dimension of social capital and adolescent smoking vary by socioeconomic status? a pooled cross-national analysis," International Journal of Public Health, Springer;Swiss School of Public Health (SSPH+), vol. 60(8), pages 901-910, December.
    18. Church, A. & Mitchell, R. & Ravenscroft, N. & Stapleton, L.M., 2015. "‘Growing your own’: A multi-level modelling approach to understanding personal food growing trends and motivations in Europe," Ecological Economics, Elsevier, vol. 110(C), pages 71-80.
    19. Jacquemet Nicolas & Zylbersztejn Adam, 2013. "Learning, Words and Actions: Experimental Evidence on Coordination-Improving Information," The B.E. Journal of Theoretical Economics, De Gruyter, vol. 13(1), pages 1-33, July.
    20. Julia R. Henly & Susan J. Lambert, 2014. "Unpredictable Work Timing in Retail Jobs," ILR Review, Cornell University, ILR School, vol. 67(3), pages 986-1016, July.
