
Evaluation of the Performance and Efficiency of the Automated Linguistic Features for Author Identification in Short Text Messages Using Different Variable Selection Techniques

Author

Listed:
  • Nils-Axel Mörner

Abstract

The aim of this paper was to evaluate the efficiency of automated linguistic features and to test their discriminating power as style markers for author identification in short text messages of the Facebook genre. The corpus used to evaluate the automated linguistic features was compiled from 221 Facebook texts written in English (each about 2 to 3 lines, or 35-40 words), all in the same genre and on the same topic and posted in the same year group, totaling 7530 words. To compose the dataset for evaluating feature performance, frequency values were collected from 16 linguistic feature types: parts of speech, function words, word bigrams, character trigrams, average sentence length in words, average sentence length in characters, Yule's K measure, Simpson's D measure, average word length, FW/CW ratio, average characters, content-specific keywords, type/token ratio, total number of short words (fewer than four characters), contractions, and total number of characters in words. These were selected from five corpora, totaling 328 test features. The evaluation of the 16 linguistic feature types differs from that of other analyses because the study used different variable selection methods, including feature type frequency, variance, term frequency/inverse document frequency (TF.IDF), signal-noise ratio, and Poisson term distribution. The relationships between known and anonymous text messages were examined using hierarchical linear and non-hierarchical nonlinear clustering methods, taking into account the nonlinear patterns among the data. There were similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of function word and parts of speech usage based on the TF.IDF technique, with function word usage (efficiency = 60%) and parts of speech frequencies (efficiency = 50%).
There were no similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of the other features using feature type frequency and variance techniques in this test and the efficiency of these features in the corpus (
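
Several of the measures named in the abstract have standard textbook definitions, and a small example may make them concrete. The Python sketch below is illustrative only and is not taken from the paper. The function names `lexical_richness` and `tfidf_scores` are hypothetical. The TF.IDF weighting used here (total term frequency times the log of the number of documents over document frequency) is one common variant and is an assumption, since this record does not give the paper's exact formula.

```python
import math
from collections import Counter


def lexical_richness(tokens):
    """Return (type/token ratio, Yule's K, Simpson's D) for a list of tokens."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    freqs = Counter(tokens)             # word -> number of occurrences
    spectrum = Counter(freqs.values())  # m -> number of word types occurring m times
    ttr = len(freqs) / n
    # Yule's K: 10^4 * (sum over m of m^2 * V_m, minus N) / N^2
    yule_k = 1e4 * (sum(m * m * v for m, v in spectrum.items()) - n) / (n * n)
    # Simpson's D: chance that two tokens drawn without replacement are the same word
    simpson_d = sum(v * (m / n) * ((m - 1) / (n - 1)) for m, v in spectrum.items())
    return ttr, yule_k, simpson_d


def tfidf_scores(docs):
    """Score each term as total frequency * log(N / document frequency).

    Assumed weighting variant: terms that appear in every document score 0,
    so a variable selection step could keep only the highest-scoring terms.
    """
    n_docs = len(docs)
    tf = Counter()  # total occurrences of each term across all documents
    df = Counter()  # number of documents containing each term
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))
    return {term: tf[term] * math.log(n_docs / df[term]) for term in tf}


if __name__ == "__main__":
    # Toy messages, invented for illustration only.
    messages = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
    ]
    for msg in messages:
        print(lexical_richness(msg))
    ranked = sorted(tfidf_scores(messages).items(), key=lambda kv: -kv[1])
    print(ranked[:3])
```

With texts of only 35-40 words, length-sensitive measures such as the type/token ratio can be unstable, which is one reason a study might compare several variable selection criteria rather than rely on a single one.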

Suggested Citation

  • Nils-Axel Mörner, 2018. "Evaluation of the Performance and Efficiency of the Automated Linguistic Features for Author Identification in Short Text Messages Using Different Variable Selection Techniques," Studies in Media and Communication, Redfame publishing, vol. 6(2), pages 83-102, December.
  • Handle: RePEc:rfa:smcjnl:v:6:y:2018:i:2:p:83-102

    Download full text from publisher

    File URL: http://redfame.com/journal/index.php/smc/article/view/3892/4052
    Download Restriction: no

    File URL: http://redfame.com/journal/index.php/smc/article/view/3892
    Download Restriction: no

    References listed on IDEAS

    1. Refat Aljumily, 2015. "Hierarchical and Non-Hierarchical Linear and Non-Linear Clustering Methods to “Shakespeare Authorship Question”," Social Sciences, MDPI, vol. 4(3), pages 1-42, September.
    2. Efstathios Stamatatos, 2009. "A survey of modern authorship attribution methods," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 60(3), pages 538-556, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Diego R Amancio, 2015. "Probing the Topological Properties of Complex Networks Modeling Short Written Texts," PLOS ONE, Public Library of Science, vol. 10(2), pages 1-17, February.
    2. Ballandonne, Matthieu & Cersosimo, Igor, 2022. "Towards a “Text as Data” Approach in the History of Economics: An Application to Adam Smith’s Classics," OSF Preprints mg3zb, Center for Open Science.
    3. Malik Muhammad Saad Missen & Sajeeha Qureshi & Nadeem Salamat & Nadeem Akhtar & Hina Asmat & Mickaël Coustaty & V. B. Surya Prasath, 2020. "Scientometric analysis of social science and science disciplines in a developing nation: a case study of Pakistan in the last decade," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(1), pages 113-142, April.
    4. Andi Rexha & Mark Kröll & Hermann Ziak & Roman Kern, 2018. "Authorship identification of documents with high content similarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 115(1), pages 223-237, April.
    5. Jacques Savoy & Olena Zubaryeva, 2012. "Simple and efficient classification scheme based on specific vocabulary," Computational Management Science, Springer, vol. 9(3), pages 401-415, August.
    6. Silvia Corbara & Alejandro Moreo & Fabrizio Sebastiani, 2023. "Syllabic quantity patterns as rhythmic features for Latin authorship attribution," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 74(1), pages 128-141, January.
    7. Stefano Sbalchiero & Maria Stella Righettini, 2017. "Rhetorical manifestation of institutional transformation," Quality & Quantity: International Journal of Methodology, Springer, vol. 51(3), pages 1279-1296, May.
    8. Maryam Ebrahimpour & Tālis J Putniņš & Matthew J Berryman & Andrew Allison & Brian W-H Ng & Derek Abbott, 2013. "Automated Authorship Attribution Using Advanced Signal Classification Techniques," PLOS ONE, Public Library of Science, vol. 8(2), pages 1-12, February.
    9. Ahmed Shamsul Arefin & Renato Vimieiro & Carlos Riveros & Hugh Craig & Pablo Moscato, 2014. "An Information Theoretic Clustering Approach for Unveiling Authorship Affinities in Shakespearean Era Plays and Poems," PLOS ONE, Public Library of Science, vol. 9(10), pages 1-12, October.
    10. Sanda-Maria Avram & Mihai Oltean, 2022. "A Comparison of Several AI Techniques for Authorship Attribution on Romanian Texts," Mathematics, MDPI, vol. 10(23), pages 1-35, December.
    11. Matthew J. Schneider & Shawn Mankad, 2021. "A Two-Stage Authorship Attribution Method Using Text and Structured Data for De-Anonymizing User-Generated Content," Customer Needs and Solutions, Springer;Institute for Sustainable Innovation and Growth (iSIG), vol. 8(3), pages 66-83, September.
    12. Kargin, Vladislav, 2016. "On variation of word frequencies in Russian literary texts," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 445(C), pages 328-334.
    13. Haoran Zhu & Lei Lei, 2022. "The Research Trends of Text Classification Studies (2000–2020): A Bibliometric Analysis," SAGE Open, vol. 12(2), pages 21582440221, April.
    14. Oleg Sobchuk & Artjoms Šeļa, 2024. "Computational thematics: comparing algorithms for clustering the genres of literary fiction," Palgrave Communications, Palgrave Macmillan, vol. 11(1), pages 1-12, December.
    15. Jennifer A. Byrne & Cyril Labbé, 2017. "Striking similarities between publications from China describing single gene knockdown experiments in human cancer cell lines," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(3), pages 1471-1493, March.
    16. de Arruda, Henrique F. & Marinho, Vanessa Q. & Lima, Thales S. & Amancio, Diego R. & Costa, Luciano da F., 2018. "An image analysis approach to text analytics based on complex networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 510(C), pages 110-120.
    17. Mihailo Škorić & Ranka Stanković & Milica Ikonić Nešić & Joanna Byszuk & Maciej Eder, 2022. "Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution," Mathematics, MDPI, vol. 10(5), pages 1-27, March.
    18. Matilde Trevisani & Arjuna Tuzzi, 2015. "A portrait of JASA: the History of Statistics through analysis of keyword counts in an early scientific journal," Quality & Quantity: International Journal of Methodology, Springer, vol. 49(3), pages 1287-1304, May.
    19. Catalin Stoean & Daniel Lichtblau, 2020. "Author Identification Using Chaos Game Representation and Deep Learning," Mathematics, MDPI, vol. 8(11), pages 1-18, November.
    20. Ullah, Farhan & Jabbar, Sohail & Al-Turjman, Fadi, 2020. "Programmers' de-anonymization using a hybrid approach of abstract syntax tree and deep learning," Technological Forecasting and Social Change, Elsevier, vol. 159(C).

    More about this item

    Keywords

    stylometry; linguistic features; hierarchical linear clustering; non-hierarchical non-linear clustering; distance metrics; variance; signal-noise ratio; poisson frequency distribution; TF.IDF term-frequency; SOM;

    JEL classification:

    • R00 - Urban, Rural, Regional, Real Estate, and Transportation Economics - - General - - - General
    • Z0 - Other Special Topics - - General
