Sound symbolism in Japanese names: Machine learning approaches to gender classification

Authors

  • Chun Hau Ngai
  • Alexander J Kilpatrick
  • Aleksandra Ćwiek

Abstract

This study uses machine learning algorithms to investigate the sound symbolic expression of gender in Japanese names. Its main goal is to explore how gender is expressed in the phonemes that make up Japanese names and whether the systematic sound-meaning mappings observed in Indo-European languages extend to Japanese. The study also compares the performance of two machine learning algorithms. Random Forest and XGBoost models are trained with the sounds of names as predictors and the typical gender of the referents as the dependent variable. Each algorithm is cross-validated using k-fold cross-validation (28 folds) and tested on samples not included in the training cycle. Both algorithms classify names into gender categories with reasonable accuracy; however, the XGBoost model performs significantly better than the Random Forest model. Feature importance scores reveal that certain sounds carry gender information: the voiced bilabial nasal /m/ and the voiceless velar consonant /k/ were associated with femininity, and the high front vowel /i/ was associated with masculinity. The associations observed for /i/ and /k/ run contrary to typical patterns found in other languages, suggesting that Japanese is unique in the sound symbolic expression of gender. This study highlights the importance of considering cultural and linguistic nuances in sound symbolism research and underscores the advantage of XGBoost in capturing complex relationships within the data for improved classification accuracy. These findings contribute to the understanding of sound symbolism and gender associations in language.
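
The pipeline described in the abstract (sounds of names as predictors, k-fold cross-validation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the romanization-based `phonemes` splitter, the `DIGRAPHS` list, the toy names, and the helper names `featurize` and `k_fold_indices` are hypothetical and are not taken from the paper, whose actual feature coding and data are not given in this record.

```python
from collections import Counter

# Hypothetical digraphs for romanized names (an illustrative assumption,
# not the phoneme inventory used in the paper).
DIGRAPHS = ("sh", "ch", "ts")


def phonemes(name):
    """Split a romanized name into rough phoneme symbols."""
    name = name.lower()
    out, i = [], 0
    while i < len(name):
        # Try a two-letter symbol first so that "sh" is not read as "s" + "h".
        if name[i:i + 2] in DIGRAPHS:
            out.append(name[i:i + 2])
            i += 2
        else:
            out.append(name[i])
            i += 1
    return out


def featurize(name, inventory):
    """Return phoneme counts for `name`, one value per symbol in `inventory`."""
    counts = Counter(phonemes(name))
    return [counts[p] for p in inventory]


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [j for j in range(n_samples) if j not in held]
        yield train, held_out


# Toy names, made up purely to show the shape of the feature matrix.
names = ["Mika", "Kumiko", "Taichi", "Shinji"]
inventory = sorted({p for n in names for p in phonemes(n)})
X = [featurize(n, inventory) for n in names]
```

A feature matrix like `X`, paired with gender labels, is the kind of input a tree-ensemble classifier such as Random Forest or XGBoost would be trained on within each fold. The abstract reports 28 folds, so `k_fold_indices(len(names), 28)` would be the analogous call on a suitably large dataset.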

Suggested Citation

  • Chun Hau Ngai & Alexander J Kilpatrick & Aleksandra Ćwiek, 2024. "Sound symbolism in Japanese names: Machine learning approaches to gender classification," PLOS ONE, Public Library of Science, vol. 19(3), pages 1-15, March.
  • Handle: RePEc:plo:pone00:0297440
    DOI: 10.1371/journal.pone.0297440

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0297440
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0297440&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0297440?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Wright, Marvin N. & Ziegler, Andreas, 2017. "ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 77(i01).
    2. B. Pawlowski & R. I. M. Dunbar & A. Lipowicz, 2000. "Tall men have more reproductive success," Nature, Nature, vol. 403(6766), pages 156-156, January.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Backer, David & Billing, Trey, 2024. "Forecasting the prevalence of child acute malnutrition using environmental and conflict conditions as leading indicators," World Development, Elsevier, vol. 176(C).
    2. Luis A Barboza & Shu-Wei Chou-Chen & Paola Vásquez & Yury E García & Juan G Calvo & Hugo G Hidalgo & Fabio Sanchez, 2023. "Assessing dengue fever risk in Costa Rica by using climate variables and machine learning techniques," PLOS Neglected Tropical Diseases, Public Library of Science, vol. 17(1), pages 1-13, January.
    3. Mariana Oliveira & Luís Torgo & Vítor Santos Costa, 2021. "Evaluation Procedures for Forecasting with Spatiotemporal Data," Mathematics, MDPI, vol. 9(6), pages 1-27, March.
    4. Hausner, Ryan, 2026. "Genre and Temporal Dynamics in Spotify Popularity Prediction," SocArXiv vba8f_v1, Center for Open Science.
    5. Bokelmann, Björn & Lessmann, Stefan, 2024. "Improving uplift model evaluation on randomized controlled trial data," European Journal of Operational Research, Elsevier, vol. 313(2), pages 691-707.
    6. Joel Podgorski & Oliver Kracht & Luis Araguas-Araguas & Stefan Terzer-Wassmuth & Jodie Miller & Ralf Straub & Rolf Kipfer & Michael Berg, 2024. "Groundwater vulnerability to pollution in Africa’s Sahel region," Nature Sustainability, Nature, vol. 7(5), pages 558-567, May.
    7. Heinisch, Katja & Scaramella, Fabio & Schult, Christoph, 2025. "Assumption errors and forecast accuracy: A partial linear instrumental variable and double machine learning approach," IWH Discussion Papers 6/2025, Halle Institute for Economic Research (IWH).
    8. Nayiri Galestian Pour & Soudabeh Shemehsavar, 2024. "Learning from high dimensional data based on weighted feature importance in decision tree ensembles," Computational Statistics, Springer, vol. 39(1), pages 313-342, February.
    9. Bazyli Czyżewski & Jakub Staniszewski & Joanna Staniszewska & Marta Guth, 2025. "Does Increasing Agricultural Efficiency Contribute to Food Security—Trade‐Offs of Value Addition in Crop Production?," Sustainable Development, John Wiley & Sons, Ltd., vol. 33(S1), pages 939-970, November.
    10. Arjan S. Gosal & Janine A. McMahon & Katharine M. Bowgen & Catherine H. Hoppe & Guy Ziv, 2021. "Identifying and Mapping Groups of Protected Area Visitors by Environmental Awareness," Land, MDPI, vol. 10(6), pages 1-14, May.
    11. David Dorn & Florian Schoner & Moritz Seebacher & Lisa Simon & Ludger Woessmann, 2024. "Multidimensional Skills on LinkedIn Profiles: Measuring Human Capital and the Gender Skill Gap," Papers 2409.18638, arXiv.org, revised May 2025.
    12. Albert Stuart Reece & Gary Kenneth Hulse, 2022. "European Epidemiological Patterns of Cannabis- and Substance-Related Congenital Neurological Anomalies: Geospatiotemporal and Causal Inferential Study," IJERPH, MDPI, vol. 20(1), pages 1-35, December.
    13. Michael Parzinger & Lucia Hanfstaengl & Ferdinand Sigg & Uli Spindler & Ulrich Wellisch & Markus Wirnsberger, 2020. "Residual Analysis of Predictive Modelling Data for Automated Fault Detection in Building’s Heating, Ventilation and Air Conditioning Systems," Sustainability, MDPI, vol. 12(17), pages 1-18, August.
    14. Nance Nerissa & Mertens Andrew & Gerds Thomas Alexander & Wang Zeyi & Torp-Pedersen Christian & van der Laan Mark & Kvist Kajsa & Lange Theis & Zareini Bochra & Petersen Maya L., 2025. "Applying the Causal Roadmap to longitudinal national registry data in Denmark: A case study of second-line diabetes medication and dementia," Journal of Causal Inference, De Gruyter, vol. 13(1), pages 1-18.
    15. Chen, Jianbao & Shen, Jiamin & Ke, Nan, 2025. "Assessing the impact of new energy demonstration city policy on industrial carbon intensity using machine learning," Economic Analysis and Policy, Elsevier, vol. 87(C), pages 1690-1707.
    16. Van Belle, Jente & Guns, Tias & Verbeke, Wouter, 2021. "Using shared sell-through data to forecast wholesaler demand in multi-echelon supply chains," European Journal of Operational Research, Elsevier, vol. 288(2), pages 466-479.
    17. Tania L. Maxwell & Mark D. Spalding & Daniel A. Friess & Nicholas J. Murray & Kerrylee Rogers & Andre S. Rovai & Lindsey S. Smart & Lukas Weilguny & Maria Fernanda Adame & Janine B. Adams & William E., 2024. "Soil carbon in the world’s tidal marshes," Nature Communications, Nature, vol. 15(1), pages 1-16, December.
    18. Albert Stuart Reece & Gary Kenneth Hulse, 2022. "European Epidemiological Patterns of Cannabis- and Substance-Related Body Wall Congenital Anomalies: Geospatiotemporal and Causal Inferential Study," IJERPH, MDPI, vol. 19(15), pages 1-38, July.
    19. Andrew P. Wheeler & Wouter Steenbeek, 2021. "Mapping the Risk Terrain for Crime Using Machine Learning," Journal of Quantitative Criminology, Springer, vol. 37(2), pages 445-480, June.
    20. Philipp Bach & Victor Chernozhukov & Malte S. Kurz & Martin Spindler & Sven Klaassen, 2021. "DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R," Papers 2103.09603, arXiv.org, revised Jun 2024.
