Gender identification on Twitter

Authors

Catherine Ikae
Jacques Savoy

Abstract

To determine the gender of a text's author, various feature types have been suggested (e.g., function words, letter n‐grams), leading to a huge number of stylistic markers. To predict the target category, different machine learning models have been proposed (e.g., logistic regression, decision tree, k‐nearest neighbors, support vector machine, naïve Bayes, neural networks, and random forest). In this study, our first objective is to determine whether the same model always achieves the best effectiveness when similar corpora are considered under the same conditions. To this end, the study analyzes the effectiveness of 10 different classifiers on 7 CLEF‐PAN collections. Our second aim is to propose a 2‐stage feature selection that reduces the feature set to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that the feature set can be reduced to around 300 features without penalizing effectiveness. Finally, based on such reduced feature sets, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
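
The abstract does not say what the two stages of the feature selection are, so the following is only an illustrative sketch of how a two-stage term filter could be organized for a two-class problem. Everything in it is an assumption made for illustration: the function names (doc_freq, two_stage_select), the first stage (pruning by document frequency), the second stage (ranking by a smoothed log-ratio of per-class document frequencies), and the toy data. Only the default cutoff of 300 terms echoes the figure given in the abstract.

```python
from collections import Counter
from math import log


def doc_freq(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def two_stage_select(docs, labels, min_df=2, top_k=300):
    """Hypothetical two-stage filter (assumed stages, not the paper's method).

    Stage 1: drop terms occurring in fewer than min_df documents.
    Stage 2: rank the survivors by the absolute log-ratio of smoothed
             per-class document frequencies and keep the top_k terms.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    first, second = classes
    docs_first = [d for d, y in zip(docs, labels) if y == first]
    docs_second = [d for d, y in zip(docs, labels) if y == second]

    # Stage 1: frequency pruning over the whole collection.
    df_all = doc_freq(docs)
    candidates = [term for term, n in df_all.items() if n >= min_df]

    # Stage 2: discriminative ranking with add-one smoothing.
    df_first, df_second = doc_freq(docs_first), doc_freq(docs_second)

    def score(term):
        p_first = (df_first[term] + 1) / (len(docs_first) + 2)
        p_second = (df_second[term] + 1) / (len(docs_second) + 2)
        return abs(log(p_first / p_second))

    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(candidates, key=lambda term: (-score(term), term))
    return ranked[:top_k]


# Toy, made-up documents (already tokenized) with neutral class labels.
docs = [
    ["common", "term_a", "rare1"],
    ["common", "term_a", "rare2"],
    ["common", "term_b", "rare3"],
    ["common", "term_b", "rare4"],
]
labels = [0, 0, 1, 1]

selected = two_stage_select(docs, labels, min_df=2, top_k=2)
print(selected)
```

In this toy run, the rare terms are removed in the first stage, and the term shared equally by both classes scores zero in the second stage, so only the two class-specific terms are retained. A real experiment would then train each classifier on the retained terms only and compare its effectiveness with the full-feature baseline.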

Suggested Citation

Catherine Ikae & Jacques Savoy, 2022. "Gender identification on Twitter," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 73(1), pages 58-69, January.

Handle: RePEc:bla:jinfst:v:73:y:2022:i:1:p:58-69
DOI: 10.1002/asi.24541

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Mansoor, Umer & Jamal, Arshad & Su, Junbiao & Sze, N.N. & Chen, Anthony, 2023. "Investigating the risk factors of motorcycle crash injury severity in Pakistan: Insights and policy recommendations," Transport Policy, Elsevier, vol. 139(C), pages 21-38.
Matthew Smith & Francisco Alvarez, 2022. "Predicting Firm-Level Bankruptcy in the Spanish Economy Using Extreme Gradient Boosting," Computational Economics, Springer;Society for Computational Economics, vol. 59(1), pages 263-295, January.
Peiró-Signes, Ángel & Segarra-Oña, Marival & Trull-Domínguez, Óscar & Sánchez-Planelles, Joaquín, 2022. "Exposing the ideal combination of endogenous–exogenous drivers for companies’ ecoinnovative orientation: Results from machine-learning methods," Socio-Economic Planning Sciences, Elsevier, vol. 79(C).
Richard Berk, 2019. "Accuracy and Fairness for Juvenile Justice Risk Assessments," Journal of Empirical Legal Studies, John Wiley & Sons, vol. 16(1), pages 175-194, March.
Robert Suchting & Michael S. Businelle & Stephen W. Hwang & Nikhil S. Padhye & Yijiong Yang & Diane M. Santa Maria, 2020. "Predicting Daily Sheltering Arrangements among Youth Experiencing Homelessness Using Diary Measurements Collected by Ecological Momentary Assessment," IJERPH, MDPI, vol. 17(18), pages 1-17, September.
Ylinen, Mika & Ranta, Mikko, 2025. "Predicting corporate innovation using machine learning and social media data," Technovation, Elsevier, vol. 148(C).
Müller, Daniel & Leitão, Pedro J. & Sikor, Thomas, 2013. "Comparing the determinants of cropland abandonment in Albania and Romania using boosted regression trees," Agricultural Systems, Elsevier, vol. 117(C), pages 66-77.
Bissan Ghaddar & Ignacio Gómez-Casares & Julio González-Díaz & Brais González-Rodríguez & Beatriz Pateiro-López & Sofía Rodríguez-Ballesteros, 2023. "Learning for Spatial Branching: An Algorithm Selection Approach," INFORMS Journal on Computing, INFORMS, vol. 35(5), pages 1024-1043, September.
Huang Lin & Merete Eggesbø & Shyamal Das Peddada, 2022. "Linear and nonlinear correlation estimators unveil undescribed taxa interactions in microbiome data," Nature Communications, Nature, vol. 13(1), pages 1-16, December.
Akash Malhotra, 2018. "A hybrid econometric-machine learning approach for relative importance analysis: Prioritizing food policy," Papers 1806.04517, arXiv.org, revised Aug 2020.
Somodi, Imelda & Bede-Fazekas, Ákos & Botta-Dukát, Zoltán & Molnár, Zsolt, 2024. "Confidence and consistency in discrimination: A new family of evaluation metrics for potential distribution models," Ecological Modelling, Elsevier, vol. 491(C).
María Jesús Segovia‐Vargas & I. Marta Miranda‐García & Freddy Alejandro Oquendo‐Torres, 2023. "Sustainable finance: The role of savings and credit cooperatives in Ecuador," Annals of Public and Cooperative Economics, Wiley Blackwell, vol. 94(3), pages 951-980, September.
Yuehan Ai & Fan He & Emma Lancaster & Jiyoung Lee, 2022. "Application of machine learning for multi-community COVID-19 outbreak predictions with wastewater surveillance," PLOS ONE, Public Library of Science, vol. 17(11), pages 1-12, November.
Tesfamariam Engida Mengesha & Lulseged Tamene Desta & Paolo Gamba & Getachew Tesfaye Ayehu, 2024. "Multi-Temporal Passive and Active Remote Sensing for Agricultural Mapping and Acreage Estimation in Context of Small Farm Holds in Ethiopia," Land, MDPI, vol. 13(3), pages 1-29, March.
Divya Chandran & N. R. Chithra, 2025. "Predictive Performance of Ensemble Learning Boosting Techniques in Daily Streamflow Simulation," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 39(3), pages 1235-1259, February.
Junming Liu & Mingfei Teng & Weiwei Chen & Hui Xiong, 2023. "A Cost-Effective Sequential Route Recommender System for Taxi Drivers," INFORMS Journal on Computing, INFORMS, vol. 35(5), pages 1098-1119, September.
Nahushananda Chakravarthy H G & Karthik M Seenappa & Sujay Raghavendra Naganna & Dayananda Pruthviraja, 2023. "Machine Learning Models for the Prediction of the Compressive Strength of Self-Compacting Concrete Incorporating Incinerated Bio-Medical Waste Ash," Sustainability, MDPI, vol. 15(18), pages 1-22, September.
Marlene A. Smith & Murray J. Côté, 2022. "Predictive Analytics Improves Sales Forecasts for a Pop-up Retailer," Interfaces, INFORMS, vol. 52(4), pages 379-389, July.
Tim Voigt & Martin Kohlhase & Oliver Nelles, 2021. "Incremental DoE and Modeling Methodology with Gaussian Process Regression: An Industrially Applicable Approach to Incorporate Expert Knowledge," Mathematics, MDPI, vol. 9(19), pages 1-26, October.
Wen, Shaoting & Buyukada, Musa & Evrendilek, Fatih & Liu, Jingyong, 2020. "Uncertainty and sensitivity analyses of co-combustion/pyrolysis of textile dyeing sludge and incense sticks: Regression and machine-learning models," Renewable Energy, Elsevier, vol. 151(C), pages 463-474.
