A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis

Author

  • Sandra Wankmüller (Ludwig-Maximilians-Universität)

Abstract

One of the first steps in many text-based social science studies is to retrieve documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider those documents to be relevant that contain at least one of the keywords. But the application of incomplete keyword lists has a high risk of drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning could have the potential to more accurately separate relevant from irrelevant documents and thereby reduce the potential size of bias. Yet, whether applying these more expensive approaches increases retrieval performance compared to keyword lists at all, and if so, by how much, is unclear as a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder in SSRN, 2017. https://doi.org/10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al. in Social bias frames: reasoning about social and power implications of language. In: Jurafsky et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, p 5477–5490, 2020. https://doi.org/10.18653/v1/2020.acl-main.486), and the Reuters-21578 corpus (Lewis in Reuters-21578 (Distribution 1.0). [Data set], 1997. http://www.daviddlewis.com/resources/testcollections/reuters21578/). Results show that query expansion techniques and topic model-based classification rules in most studied settings tend to decrease rather than increase retrieval performance. Active supervised learning, however, if applied on a not too small set of labeled training instances (e.g., 1000 documents), reaches a substantially higher retrieval performance than keyword lists.
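
The keyword baseline described in the abstract (a document counts as relevant if it contains at least one keyword) can be illustrated with a minimal Python sketch. The toy documents, the keyword list, the gold labels, and the function names `keyword_retrieve` and `precision_recall_f1` are all hypothetical and are not taken from the paper. Precision, recall, and F1 are used here as common retrieval measures; the abstract itself does not state which metrics the study reports.

```python
import re


def keyword_retrieve(documents, keywords):
    """Return indices of documents that contain at least one keyword.

    Matching is case-insensitive and on whole tokens (an assumption;
    the abstract does not specify how keywords are matched).
    """
    keyword_set = {k.lower() for k in keywords}
    hits = []
    for index, doc in enumerate(documents):
        tokens = set(re.findall(r"\w+", doc.lower()))
        if tokens & keyword_set:
            hits.append(index)
    return hits


def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for a set of retrieved indices."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy corpus: documents 0 and 1 are relevant to the topic.
docs = [
    "Refugees arrive at the border",              # relevant, matched
    "New asylum rules debated in parliament",     # relevant, missed by the list
    "Football results from the weekend",          # irrelevant, not matched
    "Migration of birds starts early this year",  # irrelevant, matched
]
keywords = ["refugees", "migration"]  # deliberately incomplete keyword list
gold_relevant = [0, 1]

retrieved = keyword_retrieve(docs, keywords)
precision, recall, f1 = precision_recall_f1(retrieved, gold_relevant)
print(retrieved, precision, recall, f1)
```

In this toy case the incomplete list misses one relevant document and pulls in one irrelevant document, which is the kind of selection error the abstract warns can bias downstream inferences.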

Suggested Citation

  • Sandra Wankmüller, 2023. "A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis," Journal of Computational Social Science, Springer, vol. 6(1), pages 91-163, April.
  • Handle: RePEc:spr:jcsosc:v:6:y:2023:i:1:d:10.1007_s42001-022-00191-7
    DOI: 10.1007/s42001-022-00191-7

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s42001-022-00191-7
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Dehler-Holland, Joris & Schumacher, Kira & Fichtner, Wolf, 2021. "Topic Modeling Uncovers Shifts in Media Framing of the German Renewable Energy Act," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 2(1).
    2. Zhang, Han, 2021. "How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do About It," SocArXiv 453jk, Center for Open Science.
    3. Dehler-Holland, Joris & Okoh, Marvin & Keles, Dogan, 2022. "Assessing technology legitimacy with topic models and sentiment analysis – The case of wind power in Germany," Technological Forecasting and Social Change, Elsevier, vol. 175(C).
    4. Mourtgos, Scott M. & Adams, Ian T., 2019. "The rhetoric of de-policing: Evaluating open-ended survey responses from police officers with machine learning-based structural topic modeling," Journal of Criminal Justice, Elsevier, vol. 64(C), pages 1-1.
    5. Sumeet Sahay & Hemant Kumar Kaushik & Shikha Singh, 2023. "Discovering themes and trends in electricity supply chain area research," OPSEARCH, Springer;Operational Research Society of India, vol. 60(3), pages 1525-1560, September.
    6. Sanders, James & Lisi, Giulio & Schonhardt-Bailey, Cheryl, 2018. "Themes and topics in parliamentary oversight hearings: a new direction in textual data analysis," LSE Research Online Documents on Economics 87624, London School of Economics and Political Science, LSE Library.
    7. McCannon, Bryan & Zhou, Yang & Hall, Joshua, 2021. "Measuring a Contract’s Breadth: A Text Analysis," Working Papers 11013, George Mason University, Mercatus Center.
    8. Marcel Fratzscher & Tobias Heidland & Lukas Menkhoff & Lucio Sarno & Maik Schmeling, 2023. "Foreign Exchange Intervention: A New Database," IMF Economic Review, Palgrave Macmillan;International Monetary Fund, vol. 71(4), pages 852-884, December.
    9. Li Tang & Jennifer Kuzma & Xi Zhang & Xinyu Song & Yin Li & Hongxu Liu & Guangyuan Hu, 2023. "Synthetic biology and governance research in China: a 40-year evolution," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(9), pages 5293-5310, September.
    10. Han, Chunjia & Yang, Mu & Piterou, Athena, 2021. "Do news media and citizens have the same agenda on COVID-19? an empirical comparison of twitter posts," Technological Forecasting and Social Change, Elsevier, vol. 169(C).
    11. Mohamed M. Mostafa, 2023. "A one-hundred-year structural topic modeling analysis of the knowledge structure of international management research," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3905-3935, August.
    12. Ferrara, Federico M. & Masciandaro, Donato & Moschella, Manuela & Romelli, Davide, 2022. "Political voice on monetary policy: Evidence from the parliamentary hearings of the European Central Bank," European Journal of Political Economy, Elsevier, vol. 74(C).
    13. Camilla Salvatore & Silvia Biffignandi & Annamaria Bianchi, 2022. "Corporate Social Responsibility Activities Through Twitter: From Topic Model Analysis to Indexes Measuring Communication Characteristics," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 164(3), pages 1217-1248, December.
    14. Lüdering Jochen & Winker Peter, 2016. "Forward or Backward Looking? The Economic Discourse and the Observed Reality," Journal of Economics and Statistics (Jahrbuecher fuer Nationaloekonomie und Statistik), De Gruyter, vol. 236(4), pages 483-515, August.
    15. Andreas Rehs, 2020. "A structural topic model approach to scientific reorientation of economics and chemistry after German reunification," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1229-1251, November.
    16. Eunhye Park & Junehee Kwon & Bongsug (Kevin) Chae & Sung-Bum Kim, 2021. "What Are the Salient and Memorable Green-Restaurant Attributes? Capturing Customer Perceptions From User-Generated Content," SAGE Open, , vol. 11(3), pages 21582440211, July.
    17. Oliver Wieczorek & Saïd Unger & Jan Riebling & Lukas Erhard & Christian Koß & Raphael Heiberger, 2021. "Mapping the field of psychology: Trends in research topics 1995–2015," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(12), pages 9699-9731, December.
    18. Ulrich Fritsche & Johannes Puckelwald, 2018. "Deciphering Professional Forecasters’ Stories - Analyzing a Corpus of Textual Predictions for the German Economy," Macroeconomics and Finance Series 201804, University of Hamburg, Department of Socioeconomics.
    19. Arina Wischnewsky & David‐Jan Jansen & Matthias Neuenkirch, 2021. "Financial stability and the Fed: Evidence from congressional hearings," Economic Inquiry, Western Economic Association International, vol. 59(3), pages 1192-1214, July.
    20. Lino Wehrheim, 2019. "Economic history goes digital: topic modeling the Journal of Economic History," Cliometrica, Springer;Cliometric Society (Association Francaise de Cliométrie), vol. 13(1), pages 83-125, January.
