
Gender identification on Twitter

Authors

  • Catherine Ikae
  • Jacques Savoy

Abstract

To determine the gender of a text's author, various feature types have been suggested (e.g., function words, letter n‐grams, etc.), leading to a huge number of stylistic markers. To predict the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k nearest‐neighbors, support vector machine, naïve Bayes, neural networks, and random forest). In this study, our first objective is to determine whether the same model always achieves the best effectiveness when considering similar corpora under the same conditions. Thus, based on 7 CLEF‐PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2‐stage feature selection that reduces the feature set to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that the feature set can be reduced to around 300 terms without penalizing the effectiveness. Finally, based on such reduced feature sets, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
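
The abstract does not say which criteria the two selection stages use, so the following is only a minimal sketch of what a 2‐stage term selection could look like, not the authors' method. It assumes a document-frequency filter as the first stage and a chi-square ranking against the binary class label as the second; the function name `two_stage_select` and the parameters `min_df` and `k` are hypothetical.

```python
from collections import Counter


def two_stage_select(docs, labels, min_df=2, k=300):
    """Return up to k terms chosen by a two-stage filter.

    docs   : list of token lists (one list per document)
    labels : list of 0/1 class labels, aligned with docs
    Both stages are assumptions for illustration only.
    """
    df = Counter()      # number of documents containing each term
    df_pos = Counter()  # same count, restricted to class-1 documents
    for tokens, y in zip(docs, labels):
        seen = set(tokens)
        df.update(seen)
        if y == 1:
            df_pos.update(seen)

    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos

    # Stage 1 (assumed): drop terms that occur in fewer than min_df documents.
    candidates = [t for t, count in df.items() if count >= min_df]

    # Stage 2 (assumed): rank the survivors by the chi-square statistic of
    # the 2x2 table (term present/absent versus class 1/class 0).
    def chi2(term):
        a = df_pos[term]      # class-1 documents containing the term
        b = df[term] - a      # class-0 documents containing the term
        c = n_pos - a         # class-1 documents without the term
        d = n_neg - b         # class-0 documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(candidates, key=lambda t: (-chi2(t), t))
    return ranked[:k]


if __name__ == "__main__":
    toy_docs = [
        ["love", "the", "game"],
        ["love", "the", "shoes"],
        ["the", "game", "score"],
        ["the", "score", "win"],
    ]
    toy_labels = [1, 1, 0, 0]
    print(two_stage_select(toy_docs, toy_labels, min_df=2, k=2))
```

In this toy data, terms that appear in every document or equally often in both classes score zero and fall to the bottom of the ranking, which is the behavior one would want before passing a reduced vocabulary to any of the classifiers listed in the abstract.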

Suggested Citation

  • Catherine Ikae & Jacques Savoy, 2022. "Gender identification on Twitter," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 73(1), pages 58-69, January.
  • Handle: RePEc:bla:jinfst:v:73:y:2022:i:1:p:58-69
    DOI: 10.1002/asi.24541

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mansoor, Umer & Jamal, Arshad & Su, Junbiao & Sze, N.N. & Chen, Anthony, 2023. "Investigating the risk factors of motorcycle crash injury severity in Pakistan: Insights and policy recommendations," Transport Policy, Elsevier, vol. 139(C), pages 21-38.
    2. Matthew Smith & Francisco Alvarez, 2022. "Predicting Firm-Level Bankruptcy in the Spanish Economy Using Extreme Gradient Boosting," Computational Economics, Springer;Society for Computational Economics, vol. 59(1), pages 263-295, January.
    3. Takahiro Yabe & P. Suresh C. Rao & Satish V. Ukkusuri, 2021. "Modeling the Influence of Online Social Media Information on Post-Disaster Mobility Decisions," Sustainability, MDPI, vol. 13(9), pages 1-13, May.
    4. Stephen J. Tulowiecki & Brice B. Hanberry & Marc D. Abrams, 2025. "Spatial and Temporal Pervasiveness of Indigenous Settlement in Oak Landscapes of Southern New England, US, During the Late Holocene," Land, MDPI, vol. 14(3), pages 1-25, March.
    5. Petra M. Kuhnert & Kerrie Mengersen & Peter Tesar, 2003. "Bridging the Gap between Different Statistical Approaches: An Integrated Framework for Modelling," International Statistical Review, International Statistical Institute, vol. 71(2), pages 335-368, August.
    6. Christian Troost & Julia Parussis-Krech & Matías Mejaíl & Thomas Berger, 2023. "Boosting the Scalability of Farm-Level Models: Efficient Surrogate Modeling of Compositional Simulation Output," Computational Economics, Springer;Society for Computational Economics, vol. 62(3), pages 721-759, October.
    7. Peiró-Signes, Ángel & Segarra-Oña, Marival & Trull-Domínguez, Óscar & Sánchez-Planelles, Joaquín, 2022. "Exposing the ideal combination of endogenous–exogenous drivers for companies’ ecoinnovative orientation: Results from machine-learning methods," Socio-Economic Planning Sciences, Elsevier, vol. 79(C).
    8. Richard Berk, 2019. "Accuracy and Fairness for Juvenile Justice Risk Assessments," Journal of Empirical Legal Studies, John Wiley & Sons, vol. 16(1), pages 175-194, March.
    9. Philippe Goulet Coulombe, 2021. "Slow-Growing Trees," Working Papers 21-02, Chair in macroeconomics and forecasting, University of Quebec in Montreal's School of Management.
    10. Simon J Pittman & Kerry A Brown, 2011. "Multi-Scale Approach for Predicting Fish Species Distributions across Coral Reef Seascapes," PLOS ONE, Public Library of Science, vol. 6(5), pages 1-12, May.
    11. Laviolette, Jérôme & Morency, Catherine & Waygood, E.O.D., 2022. "A kilometer or a mile? Does buffer size matter when it comes to car ownership?," Journal of Transport Geography, Elsevier, vol. 104(C).
    12. Oz, Ibrahim Onur & Yelkenci, Tezer & Meral, Gorkem, 2021. "The role of earnings components and machine learning on the revelation of deteriorating firm performance," International Review of Financial Analysis, Elsevier, vol. 77(C).
    13. Nasios, Ioannis & Vogklis, Konstantinos, 2022. "Blending gradient boosted trees and neural networks for point and probabilistic forecasting of hierarchical time series," International Journal of Forecasting, Elsevier, vol. 38(4), pages 1448-1459.
    14. Robert Suchting & Michael S. Businelle & Stephen W. Hwang & Nikhil S. Padhye & Yijiong Yang & Diane M. Santa Maria, 2020. "Predicting Daily Sheltering Arrangements among Youth Experiencing Homelessness Using Diary Measurements Collected by Ecological Momentary Assessment," IJERPH, MDPI, vol. 17(18), pages 1-17, September.
    15. Scott Wentland & Gary Cornwall & Jeremy G. Moulton, 2023. "For What It's Worth: Measuring Land Value in the Era of Big Data and Machine Learning," BEA Papers 0115, Bureau of Economic Analysis.
    16. Matthias Bogaert & Michel Ballings & Martijn Hosten & Dirk Van den Poel, 2017. "Identifying Soccer Players on Facebook Through Predictive Analytics," Decision Analysis, INFORMS, vol. 14(4), pages 274-297, December.
    17. Eline Auwera & Bert D’Espallier & Roy Mersland, 2024. "Achieving Double Bottom-Line Performance in Hybrid Organisations: A Machine-Learning Approach," Journal of Business Ethics, Springer, vol. 190(3), pages 625-647, March.
    18. Tsao, Yu-Chung & Chen, Yu-Kai & Chiu, Shih-Hao & Lu, Jye-Chyi & Vu, Thuy-Linh, 2022. "An innovative demand forecasting approach for the server industry," Technovation, Elsevier, vol. 110(C).
    19. Sabyasachi Mohapatra & Rohan Mukherjee & Arindam Roy & Anirban Sengupta & Amit Puniyani, 2022. "Can Ensemble Machine Learning Methods Predict Stock Returns for Indian Banks Using Technical Indicators?," JRFM, MDPI, vol. 15(8), pages 1-16, August.
    20. Hoeschle, Lisa & Wang, Hong Holly & Yu, Xiaohua, 2024. "State-level heterogeneities in US food insecurity – an assessment of long-term predictors," 2024 Annual Meeting, July 28-30, New Orleans, LA 343686, Agricultural and Applied Economics Association.
