
Machine-Learning-Based Gender Distribution Prediction from Anonymous News Comments: The Case of Korean News Portal

Author

Listed:
  • Jong Hwan Suh

    (Department of Management Information Systems & BERI, Gyeongsang National University, 501 Jinjudae-ro, Jinju-si 52828, Korea)

Abstract

Anonymous news comment data from naver.com, a news portal in South Korea, can support gender research and help resolve related issues for sustainable societies. However, only a small portion of the gender information (i.e., gender distribution) is open to the public, so the data have rarely been used for gender research. This paper therefore aims to resolve the problem of incomplete gender information and make anonymous news comment data usable for gender research as a new source of social media big data. It proposes a machine-learning-based approach for predicting the gender distribution (i.e., male and female rates) of the anonymous commenters on a news article. First, news articles and their anonymous comments were collected and divided into labeled and unlabeled datasets (i.e., with and without gender information). The word2vec approach was employed to represent each news article by the characteristics of its comments. Then, using the labeled dataset, various prediction techniques were evaluated for predicting the gender distribution of the anonymous commenters on a labeled news article. The neural network was selected as the best prediction technique, and it could accurately predict the gender distribution of the anonymous commenters on the labeled news articles. Thus, this study showed that a machine-learning-based approach can overcome the incomplete gender information problem of anonymous social media users. Moreover, when the gender distributions of the unlabeled news articles were predicted using the best neural network model, trained on the labeled dataset, their distribution turned out to differ from that of the labeled news articles. This result indicates that using only the labeled dataset for gender research can lead to misleading findings and distorted conclusions. The predicted gender distributions for the unlabeled news articles can help to better understand anonymous news commenters as humans for sustainable societies. Ultimately, this study provides a new way to conduct data-driven computational social science with incomplete and anonymous social media big data.
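
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the token names, the 2-dimensional vectors in TOY_VECTORS (a stand-in for trained word2vec embeddings), the toy articles and their male rates, and the single sigmoid unit (a stand-in for the paper's neural network) are all assumptions made for illustration. Each article is represented by the average vector of the tokens in its comments, a model is fitted on the labeled articles, and the male and female rates of an unlabeled article are then predicted.

```python
import math

# ASSUMPTION: hypothetical 2-d vectors standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "topic_a": [1.0, 0.0],
    "topic_b": [0.0, 1.0],
    "topic_c": [0.5, 0.5],
}
DIM = 2


def article_vector(comments):
    """Represent an article by the mean vector of all known tokens in its comments."""
    vecs = [TOY_VECTORS[tok]
            for comment in comments
            for tok in comment.lower().split()
            if tok in TOY_VECTORS]
    if not vecs:
        return [0.0] * DIM
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]


def predict_male_rate(weights, bias, x):
    """Single sigmoid unit: maps an article vector to a male rate in (0, 1)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(features, male_rates, epochs=2000, lr=0.5):
    """Fit the sigmoid unit with per-sample gradient descent on cross-entropy loss."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(features, male_rates):
            err = predict_male_rate(weights, bias, x) - y  # dLoss/dz
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# ASSUMPTION: made-up labeled articles (comments plus observed male rate).
labeled = [
    (["topic_a topic_a", "topic_a topic_c"], 0.80),
    (["topic_b topic_b", "topic_b"], 0.30),
    (["topic_a topic_b"], 0.55),
]
features = [article_vector(comments) for comments, _ in labeled]
targets = [rate for _, rate in labeled]
weights, bias = train(features, targets)

# Predict the gender distribution of an unlabeled article.
unlabeled_comments = ["topic_a topic_a topic_a"]
male = predict_male_rate(weights, bias, article_vector(unlabeled_comments))
female = 1.0 - male
print(f"predicted male rate: {male:.2f}, female rate: {female:.2f}")
```

Predicting only the male rate and deriving the female rate as its complement keeps the two rates summing to one, which matches the notion of a gender distribution used in the abstract.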

Suggested Citation

  • Jong Hwan Suh, 2022. "Machine-Learning-Based Gender Distribution Prediction from Anonymous News Comments: The Case of Korean News Portal," Sustainability, MDPI, vol. 14(16), pages 1-17, August.
  • Handle: RePEc:gam:jsusta:v:14:y:2022:i:16:p:9939-:d:885794

    Download full text from publisher

    File URL: https://www.mdpi.com/2071-1050/14/16/9939/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2071-1050/14/16/9939/
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jean M. Twenge & Hannah VanLandingham & W. Keith Campbell, 2017. "The Seven Words You Can Never Say on Television: Increases in the Use of Swear Words in American Books, 1950-2008," SAGE Open, vol. 7(3), pages 21582440177, August.
    2. Gallus, Jana & Bhatia, Sudeep, 2020. "Gender, power and emotions in the collaborative production of knowledge: A large-scale analysis of Wikipedia editor conversations," Organizational Behavior and Human Decision Processes, Elsevier, vol. 160(C), pages 115-130.
    3. Liang Xu & Min Xu & Zehua Jiang & Xin Wen & Yishan Liu & Zaoyi Sun & Hongting Li & Xiuying Qian, 2023. "How have music emotions been described in Google books? Historical trends and corpus differences," Palgrave Communications, Palgrave Macmillan, vol. 10(1), pages 1-11, December.
    4. Karol Król & Dariusz Zdonek, 2021. "Most Often Motivated by Social Media: The Who, the What, and the How Much—Experience from Poland," Sustainability, MDPI, vol. 13(20), pages 1-20, October.
    5. Jong Hwan Suh, 2019. "SocialTERM-Extractor: Identifying and Predicting Social-Problem-Specific Key Noun Terms from a Large Number of Online News Articles Using Text Mining and Machine Learning Techniques," Sustainability, MDPI, vol. 11(1), pages 1-44, January.
    6. Vivek Kulkarni & Margaret L Kern & David Stillwell & Michal Kosinski & Sandra Matz & Lyle Ungar & Steven Skiena & H Andrew Schwartz, 2018. "Latent human traits in the language of social media: An open-vocabulary approach," PLOS ONE, Public Library of Science, vol. 13(11), pages 1-18, November.
    7. Karel Hrazdil & Jiri Novak & Rafael Rogo & Christine Wiedman & Ray Zhang, 2020. "Measuring executive personality using machine‐learning algorithms: A new approach and audit fee‐based validation tests," Journal of Business Finance & Accounting, Wiley Blackwell, vol. 47(3-4), pages 519-544, March.
    8. Luo, Shuli & He, Sylvia Y., 2021. "Understanding gender difference in perceptions toward transit services across space and time: A social media mining approach," Transport Policy, Elsevier, vol. 111(C), pages 63-73.
    9. Gow, Ian D. & Kaplan, Steven N. & Larcker, David F. & Zakolyukina, Anastasia A., 2016. "CEO Personality and Firm Policies," Research Papers 3444, Stanford University, Graduate School of Business.
    10. Mikkel Wallentin, 2018. "Sex differences in post-stroke aphasia rates are caused by age. A meta-analysis and database query," PLOS ONE, Public Library of Science, vol. 13(12), pages 1-18, December.
    11. Hannes Rosenbusch & Maya Aghaei & Anthony M. Evans & Marcel Zeelenberg, 2021. "Psychological trait inferences from women’s clothing: human and machine prediction," Journal of Computational Social Science, Springer, vol. 4(2), pages 479-501, November.
    12. Eszter Hargittai, 2015. "Is Bigger Always Better? Potential Biases of Big Data Derived from Social Network Sites," The ANNALS of the American Academy of Political and Social Science, vol. 659(1), pages 63-76, May.
    13. Rachel Winter & Anna Lavis, 2021. "The Impact of COVID-19 on Young People’s Mental Health in the UK: Key Insights from Social Media Using Online Ethnography," IJERPH, MDPI, vol. 19(1), pages 1-13, December.
    14. Daniel Preoţiuc-Pietro & Svitlana Volkova & Vasileios Lampos & Yoram Bachrach & Nikolaos Aletras, 2015. "Studying User Income through Language, Behaviour and Affect in Social Media," PLOS ONE, Public Library of Science, vol. 10(9), pages 1-17, September.
    15. Jungmin Kim & Juyong Park & Wonjae Lee, 2018. "Why do people move? Enhancing human mobility prediction using local functions based on public records and SNS data," PLOS ONE, Public Library of Science, vol. 13(2), pages 1-29, February.
    16. Lushi Chen & Tao Gong & Michal Kosinski & David Stillwell & Robert L Davidson, 2017. "Building a profile of subjective well-being for social media users," PLOS ONE, Public Library of Science, vol. 12(11), pages 1-15, November.
    17. Chunhua Ju & Qiuyang Gu & Yi Fang & Fuguang Bao, 2020. "Research on User Influence Model Integrating Personality Traits under Strong Connection," Sustainability, MDPI, vol. 12(6), pages 1-15, March.
    18. Salvatore Giorgi & David B. Yaden & Johannes C. Eichstaedt & Robert D. Ashford & Anneke E.K. Buffone & H. Andrew Schwartz & Lyle H. Ungar & Brenda Curtis, 2020. "Cultural Differences in Tweeting about Drinking Across the US," IJERPH, MDPI, vol. 17(4), pages 1-14, February.
    19. Bianca E. Lopez & Nicholas R. Magliocca & Andrew T. Crooks, 2019. "Challenges and Opportunities of Social Media Data for Socio-Environmental Systems Research," Land, MDPI, vol. 8(7), pages 1-18, July.
    20. Jacob Levy Abitbol & Eric Fleury & Márton Karsai, 2019. "Optimal Proxy Selection for Socioeconomic Status Inference on Twitter," Complexity, Hindawi, vol. 2019, pages 1-15, May.
