Printed from https://ideas.repec.org/a/gam/jftint/v12y2020i9p144-d404427.html

Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment

Author

Listed:
  • Svetlana S. Bodrunova

    (School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia. These authors contributed equally to this work.)

  • Andrey V. Orekhov

    (School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia. These authors contributed equally to this work.)

  • Ivan S. Blekanov

    (School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia. These authors contributed equally to this work.)

  • Nikolay S. Lyudkevich

    (School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia)

  • Nikita A. Tarasov

    (School of Journalism and Mass Communications, Saint Petersburg State University, 7-9 Universitetskaya embankment, 199034 Saint Petersburg, Russia)

Abstract

This paper addresses the problem of optimal text classification in the automated detection of text typology. In conventional approaches to topicality-based text classification (including topic modeling), the researcher must set the number of clusters in advance; both the optimal number of clusters and the quality of the model that measures the proximity of texts to each other remain unresolved questions. We propose a novel approach to automatically determining the optimal number of clusters that also incorporates an assessment of word proximity of texts, combined with a text encoding model based on sentence embeddings. Our approach combines data pre-processing with the Universal Sentence Encoder (USE), agglomerative hierarchical clustering by Ward’s method, and the Markov stopping moment for optimal clustering. The preferred number of clusters is determined based on the “e-2” hypothesis. We run an experiment on two real-world labeled datasets: News20 and BBC. The proposed model is tested against more traditional text representation methods, such as bag-of-words and word2vec, and is shown to yield much better quality than the baseline DBSCAN and OPTICS models with different encoding methods. We use three quality metrics to demonstrate that clustering quality does not drop as the number of clusters grows. Thus, we get close to the convergence of text clustering and text classification.
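
The pipeline named in the abstract (embed texts, merge clusters by Ward’s method, stop merging at a data-driven point) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy 2-D vectors stand in for USE sentence embeddings, the function names are hypothetical, and the stopping rule is a simple largest-gap heuristic on the merge costs. It is not the paper's Markov stopping moment or its “e-2” hypothesis, whose definitions are not given on this page.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_merges(vectors):
    """Agglomerate down to one cluster, always taking the cheapest Ward merge.

    Returns a list of (cost, clusters_before_merge) in merge order, where
    each cluster is a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    history = []
    while len(clusters) > 1:
        def cost(pair):
            i, j = pair
            return ward_cost([vectors[k] for k in clusters[i]],
                             [vectors[k] for k in clusters[j]])

        i, j = min(combinations(range(len(clusters)), 2), key=cost)
        history.append((cost((i, j)), [list(c) for c in clusters]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


def cut_at_largest_gap(history):
    """Stand-in stopping rule (assumption, not the paper's Markov moment):
    stop just before the biggest jump between consecutive merge costs.
    Needs at least two merges, i.e. at least three input vectors.
    """
    costs = [c for c, _ in history]
    gaps = [costs[k + 1] - costs[k] for k in range(len(costs) - 1)]
    stop = gaps.index(max(gaps)) + 1  # index of the first merge NOT applied
    return history[stop][1]


# Toy 2-D vectors standing in for sentence embeddings (three obvious groups).
toy_vectors = [(0, 0), (0, 1), (1, 0),
               (10, 10), (10, 11), (11, 10),
               (20, 0), (20, 1), (21, 0)]
toy_clusters = cut_at_largest_gap(ward_merges(toy_vectors))
print(len(toy_clusters), sorted(sorted(c) for c in toy_clusters))
```

The naive pairwise search is only practical for small inputs; on real corpora a library implementation of Ward linkage would be the usual choice, with the stopping rule applied to the sequence of merge distances it returns.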

Suggested Citation

  • Svetlana S. Bodrunova & Andrey V. Orekhov & Ivan S. Blekanov & Nikolay S. Lyudkevich & Nikita A. Tarasov, 2020. "Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment," Future Internet, MDPI, vol. 12(9), pages 1-17, August.
  • Handle: RePEc:gam:jftint:v:12:y:2020:i:9:p:144-:d:404427

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/12/9/144/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/12/9/144/
    Download Restriction: no

    References listed on IDEAS

    1. Svetlana S. Bodrunova & Ivan Blekanov & Anna Smoliarova & Anna Litvinenko, 2019. "Beyond Left and Right: Real-World Political Polarization in Twitter Discussions on Inter-Ethnic Conflicts," Media and Communication, Cogitatio Press, vol. 7(3), pages 119-132.

    Citations

    Cited by:

    1. Svetlana S. Bodrunova, 2022. "Editorial for the Special Issue “Selected Papers from the 9th Annual Conference ‘Comparative Media Studies in Today’s World’ (CMSTW’2021)”," Future Internet, MDPI, vol. 14(11), pages 1-3, November.
    2. Ivan Blekanov & Svetlana S. Bodrunova & Askar Akhmetov, 2021. "Detection of Hidden Communities in Twitter Discussions of Varying Volumes," Future Internet, MDPI, vol. 13(11), pages 1-17, November.
    3. Ivan S. Blekanov & Nikita Tarasov & Svetlana S. Bodrunova, 2022. "Transformer-Based Abstractive Summarization for Reddit and Twitter: Single Posts vs. Comment Pools in Three Languages," Future Internet, MDPI, vol. 14(3), pages 1-25, February.
    4. Andrey V. Orekhov, 2021. "Quasi-Deterministic Processes with Monotonic Trajectories and Unsupervised Machine Learning," Mathematics, MDPI, vol. 9(18), pages 1-26, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Iandoli, Luca & Primario, Simonetta & Zollo, Giuseppe, 2021. "The impact of group polarization on the quality of online debate in social media: A systematic literature review," Technological Forecasting and Social Change, Elsevier, vol. 170(C).
    2. Svetlana S. Bodrunova & Anna Litvinenko & Ivan Blekanov & Dmitry Nepiyushchikh, 2021. "Constructive Aggression? Multiple Roles of Aggressive Content in Political Discourse on Russian YouTube," Media and Communication, Cogitatio Press, vol. 9(1), pages 181-194.
    3. Olessia Koltsova & Svetlana S. Bodrunova, 2019. "Public Discussion in Russian Social Media: An Introduction," Media and Communication, Cogitatio Press, vol. 7(3), pages 114-118.
    4. Arora, Swapan Deep & Singh, Guninder Pal & Chakraborty, Anirban & Maity, Moutusy, 2022. "Polarization and social media: A systematic review and research agenda," Technological Forecasting and Social Change, Elsevier, vol. 183(C).
    5. Svetlana S. Bodrunova, 2022. "Editorial for the Special Issue “Selected Papers from the 9th Annual Conference ‘Comparative Media Studies in Today’s World’ (CMSTW’2021)”," Future Internet, MDPI, vol. 14(11), pages 1-3, November.
    6. Ivan Blekanov & Svetlana S. Bodrunova & Askar Akhmetov, 2021. "Detection of Hidden Communities in Twitter Discussions of Varying Volumes," Future Internet, MDPI, vol. 13(11), pages 1-17, November.
