Topic modeling, long texts and the best number of topics. Some Problems and solutions

Author

Listed:

Stefano Sbalchiero
(University of Padova)
Maciej Eder
(Polish Academy of Sciences and Pedagogical University of Kraków)

Abstract

The main aim of this article is to present the results of several experiments on the model-fitting process in topic modeling and its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that facilitate the automation of data coding and analysis. Within topic modeling, various procedures have been developed to analyze ever larger collections of texts (corpora), but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. Through a series of experiments, this article starts from the following consideration: in Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), model fitting is a crucial problem because the LDA algorithm requires the number of topics to be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter that affects the results of the analysis. Since experiments on long texts are lacking, our article tries to shed new light on the complex relationship between text length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it as a mathematical model.
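
To make the idea of a power-law relation concrete, the following is a minimal sketch of how a relation of the form k = a * n^b (k being the best number of topics, n the sample size) could be estimated by a straight-line fit on log-transformed values. The function name `fit_power_law`, the constants `A_PLACEHOLDER` and `B_PLACEHOLDER`, and all input values are hypothetical; they are not taken from the article and say nothing about the sign or size of the coefficients the authors report.

```python
import math


def fit_power_law(sample_sizes, topic_counts):
    """Estimate a and b in k = a * n**b by least squares on log(n), log(k)."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(k) for k in topic_counts]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Slope of the log-log regression line is the exponent b.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    # Intercept of the log-log line is log(a).
    a = math.exp(y_mean - b * x_mean)
    return a, b


# Placeholder coefficients, chosen only to exercise the function
# (assumption: NOT the values from the article).
A_PLACEHOLDER, B_PLACEHOLDER = 3.0, 0.4

# Synthetic sample sizes (in words) and noise-free synthetic topic counts.
sizes = [500, 1000, 2000, 5000, 10000]
counts = [A_PLACEHOLDER * n ** B_PLACEHOLDER for n in sizes]

a_hat, b_hat = fit_power_law(sizes, counts)
print(f"estimated a = {a_hat:.3f}, estimated b = {b_hat:.3f}")
```

With real data, the `counts` list would hold the best number of topics found for each sample size (for example, the value that maximizes the model log-likelihood), and the fit would be approximate rather than exact.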

Suggested Citation

Stefano Sbalchiero & Maciej Eder, 2020. "Topic modeling, long texts and the best number of topics. Some Problems and solutions," Quality & Quantity: International Journal of Methodology, Springer, vol. 54(4), pages 1095-1108, August.

Handle: RePEc:spr:qualqt:v:54:y:2020:i:4:d:10.1007_s11135-020-00976-w
DOI: 10.1007/s11135-020-00976-w

References listed on IDEAS

Grün, Bettina & Hornik, Kurt, 2011. "topicmodels: An R Package for Fitting Topic Models," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 40(i13).
Feinerer, Ingo & Hornik, Kurt & Meyer, David, 2008. "Text Mining Infrastructure in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 25(i05).

Citations

Cited by:

Maria Stella Righettini & Elisa Bordin, 2023. "Exploring food security as a multidimensional topic: twenty years of scientific publications and recent developments," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(3), pages 2739-2758, June.
Javier De la Hoz-M & Mª José Fernández-Gómez & Susana Mendes, 2021. "LDAShiny: An R Package for Exploratory Review of Scientific Literature Based on a Bayesian Probabilistic Model and Machine Learning Tools," Mathematics, MDPI, vol. 9(14), pages 1-21, July.
Arina Wischnewsky & David‐Jan Jansen & Matthias Neuenkirch, 2021. "Financial stability and the Fed: Evidence from congressional hearings," Economic Inquiry, Western Economic Association International, vol. 59(3), pages 1192-1214, July.
- Arina Wischnewsky & David-Jan Jansen & Matthias Neuenkirch, 2019. "Financial stability and the Fed: evidence from congressional hearings," CESifo Working Paper Series 7657, CESifo.
- Wischnewsky, Arina & Jansen, David-Jan & Neuenkirch, Matthias, 2020. "Financial Stability and the Fed: Evidence from Congressional Hearings," VfS Annual Conference 2020 (Virtual Conference): Gender Economics 224527, Verein für Socialpolitik / German Economic Association.
- Arina Wischnewsky & David-Jan Jansen & Matthias Neuenkirch, 2019. "Financial Stability and the Fed: Evidence from Congressional Hearings," Working Paper Series 2019-05, University of Trier, Research Group Quantitative Finance and Risk Analysis.
- Arina Wischnewsky & David-Jan Jansen & Matthias Neuenkirch, 2019. "Financial Stability and the Fed: Evidence from Congressional Hearings," Research Papers in Economics 2019-08, University of Trier, Department of Economics.
Mohamed M. Mostafa, 2023. "A one-hundred-year structural topic modeling analysis of the knowledge structure of international management research," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3905-3935, August.
Liangchao Huang & Zhengmeng Hou & Yanli Fang & Jianhua Liu & Tianle Shi, 2023. "Evolution of CCUS Technologies Using LDA Topic Model and Derwent Patent Data," Energies, MDPI, vol. 16(6), pages 1-14, March.
Weiss, Daniel & Nemeczek, Fabian, 2021. "A text-based monitoring tool for the legitimacy and guidance of technological innovation systems," Technology in Society, Elsevier, vol. 66(C).
Jessica Birkholz & Jutta Günther & Mariia Shkolnykova, 2021. "Using Topic Modeling in Innovation Studies: The Case of a Small Innovation System under Conditions of Pandemic Related Change," Bremen Papers on Economics & Innovation 2101, University of Bremen, Faculty of Business Studies and Economics.

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Daoud, Adel & Kohl, Sebastian, 2016. "How much do sociologists write about economic topics? Using big data to test some conventional views in economic sociology, 1890 to 2014," MPIfG Discussion Paper 16/7, Max Planck Institute for the Study of Societies.
Cho, Yung-Jan & Fu, Pei-Wen & Wu, Chi-Cheng, 2017. "Popular Research Topics in Marketing Journals, 1995–2014," Journal of Interactive Marketing, Elsevier, vol. 40(C), pages 52-72.
João Guerreiro & Paulo Rita & Duarte Trigueiros, 2016. "A Text Mining-Based Review of Cause-Related Marketing Literature," Journal of Business Ethics, Springer, vol. 139(1), pages 111-128, November.
Abhinav Khare & Qing He & Rajan Batta, 2020. "Predicting gasoline shortage during disasters using social media," OR Spectrum: Quantitative Approaches in Management, Springer;Gesellschaft für Operations Research e.V., vol. 42(3), pages 693-726, September.
Lehotský, Lukáš & Černoch, Filip & Osička, Jan & Ocelík, Petr, 2019. "When climate change is missing: Media discourse on coal mining in the Czech Republic," Energy Policy, Elsevier, vol. 129(C), pages 774-786.
Doblinger, Claudia & Surana, Kavita & Li, Deyu & Hultman, Nathan & Anadón, Laura Díaz, 2022. "How do global manufacturing shifts affect long-term clean energy innovation? A study of wind energy suppliers," Research Policy, Elsevier, vol. 51(7).
Andres, Maximilian & Bruttel, Lisa & Friedrichsen, Jana, 2023. "How communication makes the difference between a cartel and tacit collusion: A machine learning approach," European Economic Review, Elsevier, vol. 152(C).
- Andres, Maximilian & Bruttel, Lisa & Friedrichsen, Jana, 2023. "How communication makes the difference between a cartel and tacit collusion: A machine learning approach," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 152, pages 1-1.
- Maximilian Andres & Lisa Bruttel & Jana Friedrichsen, 2022. "How Communication Makes the Difference between a Cartel and Tacit Collusion: A Machine Learning Approach," CESifo Working Paper Series 10024, CESifo.
- Maximilian Andres & Lisa Bruttel & Jana Friedrichsen, 2022. "How communication makes the difference between a cartel and tacit collusion: a machine learning approach," CEPA Discussion Papers 53, Center for Economic Policy Analysis.
- Maximilian Andres & Lisa Bruttel & Jana Friedrichsen, 2022. "How Communication Makes the Difference between a Cartel and Tacit Collusion: A Machine Learning Approach," Discussion Papers of DIW Berlin 2000, DIW Berlin, German Institute for Economic Research.
Hudson Golino & Alexander P. Christensen & Robert Moulder & Seohyun Kim & Steven M. Boker, 2022. "Modeling Latent Topics in Social Media using Dynamic Exploratory Graph Analysis: The Case of the Right-wing and Left-wing Trolls in the 2016 US Elections," Psychometrika, Springer;The Psychometric Society, vol. 87(1), pages 156-187, March.
Sun, Katherine Qianwen & Slepian, Michael L., 2020. "The conversations we seek to avoid," Organizational Behavior and Human Decision Processes, Elsevier, vol. 160(C), pages 87-105.
Rieger, Jonas & von Nordheim, Gerret, 2021. "corona100d: German-language Twitter dataset of the first 100 days after Chancellor Merkel addressed the coronavirus outbreak on TV," DoCMA Working Papers 4, TU Dortmund University, Dortmund Center for Data-based Media Analysis (DoCMA).
Garner, Benjamin & Thornton, Corliss & Luo Pawluk, Anita & Mora Cortez, Roberto & Johnston, Wesley & Ayala, Cesar, 2022. "Utilizing text-mining to explore consumer happiness within tourism destinations," Journal of Business Research, Elsevier, vol. 139(C), pages 1366-1377.
Anke Piepenbrink & Elkin Nurmammadov, 2015. "Topics in the literature of transition economies and emerging markets," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2107-2130, March.
Christian WEISMAYER, 2022. "Applied Research in Quality of Life: A Computational Literature Review," Applied Research in Quality of Life, Springer;International Society for Quality-of-Life Studies, vol. 17(3), pages 1433-1458, June.
Arenas Gaitán, Jorge & Ramírez-Correa, Patricio E., 2023. "COVID-19 and telemedicine: A netnography approach," Technological Forecasting and Social Change, Elsevier, vol. 190(C).
Polyzos, Efstathios & Wang, Fang, 2022. "Twitter and market efficiency in energy markets: Evidence using LDA clustered topic extraction," Energy Economics, Elsevier, vol. 114(C).
Jiang, Hanchen & Qiang, Maoshan & Lin, Peng, 2016. "A topic modeling based bibliometric exploration of hydropower research," Renewable and Sustainable Energy Reviews, Elsevier, vol. 57(C), pages 226-237.
Cecilia Elizabeth Bayas Aldaz & Jesus Rodriguez-Pomeda & Leyla Angélica Sandoval Hamón & Fernando Casani, 2020. "Understanding the University-Sustainability Link through Media: A Spanish Perspective," Sustainability, MDPI, vol. 12(12), pages 1-15, June.
Jonas Rieger, 2019. "Mónica Bécue-Bertaut (2019): Textual Data Science with R," Statistical Papers, Springer, vol. 60(5), pages 1797-1798, October.
Andres, Maximilian & Bruttel, Lisa & Friedrichsen, Jana, 2021. "How do sanctions work? The choice between cartel formation and tacit collusion," VfS Annual Conference 2021 (Virtual Conference): Climate Economics 242372, Verein für Socialpolitik / German Economic Association.
Eric A Jensen & Paul Wong & Mark S Reed, 2022. "How research data deliver non-academic impacts: A secondary analysis of UK Research Excellence Framework impact case studies," PLOS ONE, Public Library of Science, vol. 17(3), pages 1-12, March.

More about this item

Keywords

Topic modeling; Latent Dirichlet Allocation; Long texts; Log-likelihood for the model; Best number of topics;
