Printed from https://ideas.repec.org/a/inm/orijoc/v35y2023i3p675-691.html

A Bayesian Semisupervised Approach to Keyword Extraction with Only Positive and Unlabeled Data

Author

Listed:
  • Guanshen Wang

    (Department of Statistical Science, Southern Methodist University, Dallas, Texas 75205)

  • Yichen Cheng

    (Institute for Insight, Georgia State University, Atlanta, Georgia 30303)

  • Yusen Xia

    (Institute for Insight, Georgia State University, Atlanta, Georgia 30303)

  • Qiang Ling

    (Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026, China)

  • Xinlei Wang

    (Department of Statistical Science, Southern Methodist University, Dallas, Texas 75205; Department of Mathematics, University of Texas at Arlington, Arlington, Texas 76019; Center for Data Science Research and Education, College of Science, University of Texas at Arlington, Arlington, Texas 76019)

Abstract

In the era of big data, people benefit from the existence of tremendous amounts of information. However, availability of said information may pose great challenges. For instance, one big challenge is how to extract useful yet succinct information in an automated fashion. As one of the first few efforts, keyword extraction methods summarize an article by identifying a list of keywords. Many existing keyword extraction methods focus on the unsupervised setting, with all keywords assumed unknown. In reality, a (small) subset of the keywords may be available for a particular article. To use such information, we propose a rigorous probabilistic model based on a semisupervised setup. Our method incorporates the graph-based information of an article into a Bayesian framework via an informative prior so that our model facilitates formal statistical inference, which is often absent from existing methods. To overcome the difficulty arising from high-dimensional posterior sampling, we develop two Markov chain Monte Carlo algorithms based on Gibbs samplers and compare their performance using benchmark data. We use a false discovery rate (FDR)-based approach for selecting the number of keywords, whereas the existing methods use ad hoc threshold values. Our numerical results show that the proposed method compares favorably with state-of-the-art methods for keyword extraction.
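
The abstract does not spell out the FDR rule the authors use, so the following is only a minimal sketch of one common Bayesian variant, not the paper's procedure. It assumes each candidate word already has a posterior probability of being a keyword (for example, the fraction of MCMC draws in which its indicator equals 1). Words are ranked by that probability, and the list is extended for as long as the average of (1 - probability) over the selected words, an estimate of the expected FDR, stays at or below a target level. The function name `select_keywords_by_fdr`, the parameter `alpha`, and the example probabilities are hypothetical.

```python
from typing import Dict, List


def select_keywords_by_fdr(post_prob: Dict[str, float], alpha: float = 0.1) -> List[str]:
    """Pick the longest top-ranked list whose estimated Bayesian FDR is <= alpha.

    post_prob maps each candidate word to its (assumed, precomputed)
    posterior probability of being a keyword.
    """
    # Rank candidates from most to least probable keyword.
    ranked = sorted(post_prob.items(), key=lambda kv: kv[1], reverse=True)

    selected: List[str] = []
    cum_error = 0.0  # running sum of (1 - p): expected number of false discoveries
    for k, (word, p) in enumerate(ranked, start=1):
        cum_error += 1.0 - p
        # Estimated FDR of the top-k list is the mean of (1 - p) over it.
        # Because p is sorted in decreasing order, this mean never decreases,
        # so the first violation marks the end of the admissible list.
        if cum_error / k > alpha:
            break
        selected.append(word)
    return selected


# Hypothetical posterior probabilities, for illustration only.
example_probs = {
    "bayesian": 0.98,
    "keyword": 0.95,
    "gibbs": 0.90,
    "data": 0.40,
    "era": 0.05,
}
chosen = select_keywords_by_fdr(example_probs, alpha=0.1)
print(chosen)
```

In this sketch the number of keywords is determined by the target level `alpha` rather than by a fixed cutoff on the score, which is the contrast with ad hoc thresholds that the abstract draws.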

Suggested Citation

  • Guanshen Wang & Yichen Cheng & Yusen Xia & Qiang Ling & Xinlei Wang, 2023. "A Bayesian Semisupervised Approach to Keyword Extraction with Only Positive and Unlabeled Data," INFORMS Journal on Computing, INFORMS, vol. 35(3), pages 675-691, May.
  • Handle: RePEc:inm:orijoc:v:35:y:2023:i:3:p:675-691
    DOI: 10.1287/ijoc.2023.1283

    Download full text from publisher

    File URL: http://dx.doi.org/10.1287/ijoc.2023.1283
    Download Restriction: no

    File URL: https://libkey.io/10.1287/ijoc.2023.1283?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item