
Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks

Authors

  • Camilo Akimushkin
  • Diego Raphael Amancio
  • Osvaldo Novais Oliveira Jr.

Abstract

Automatic identification of authorship in disputed documents has benefited from complex network theory, as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and to understand network growth mechanisms, but only a few studies have probed the suitability of networks for modeling small chunks of text to capture stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remainder were integrated of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e., a remarkable 88.75% author-matching success rate was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, paving the way for a robust description of large texts in terms of small evolving networks.
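
The feature-extraction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace tokenization, links only adjacent words, and tracks a single metric (average degree) where the study uses 12 topological metrics. The function names and the chunk size are hypothetical, and the Isomap and K-nearest neighbors steps are left out.

```python
from collections import defaultdict
from statistics import fmean


def cooccurrence_network(tokens):
    """Undirected graph linking adjacent tokens (assumed linking rule)."""
    graph = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # ignore self-loops from immediately repeated words
            graph[left].add(right)
            graph[right].add(left)
    return graph


def average_degree(graph):
    """Mean number of distinct neighbours per word (one illustrative metric)."""
    if not graph:
        return 0.0
    return fmean(len(neighbours) for neighbours in graph.values())


def metric_series(tokens, chunk_size):
    """One metric value per consecutive chunk of chunk_size tokens."""
    n_chunks = len(tokens) // chunk_size  # the incomplete tail is dropped
    return [
        average_degree(
            cooccurrence_network(tokens[i * chunk_size:(i + 1) * chunk_size])
        )
        for i in range(n_chunks)
    ]


def moments(series):
    """Mean, variance, skewness and kurtosis (population form) of a series."""
    mean = fmean(series)
    centred = [x - mean for x in series]
    m2 = fmean(d ** 2 for d in centred)
    if m2 == 0:  # constant series: higher moments are undefined, report zeros
        return mean, 0.0, 0.0, 0.0
    m3 = fmean(d ** 3 for d in centred)
    m4 = fmean(d ** 4 for d in centred)
    return mean, m2, m3 / m2 ** 1.5, m4 / m2 ** 2


# Toy input; a real text would have thousands of tokens per chunk.
tokens = "the cat sat on the mat and the dog sat on the log".split()
series = metric_series(tokens, chunk_size=4)
features = moments(series)  # would serve as learning attributes for one text
print(series, features)
```

In a fuller pipeline, the moment tuples from every metric would be concatenated into one feature vector per text before any dimensionality reduction or classification.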

Suggested Citation

  • Camilo Akimushkin & Diego Raphael Amancio & Osvaldo Novais Oliveira Jr., 2017. "Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks," PLOS ONE, Public Library of Science, vol. 12(1), pages 1-15, January.
  • Handle: RePEc:plo:pone00:0170527
    DOI: 10.1371/journal.pone.0170527

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0170527
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0170527&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. D. R. Amancio & M. G. V. Nunes & O. N. Oliveira & L. F. Costa, 2012. "Using complex networks concepts to assess approaches for citations in scientific papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 91(3), pages 827-842, June.
    2. Ramon Ferrer i Cancho & Ricard V. Solé, 2001. "The Small-World of Human Language," Working Papers 01-03-016, Santa Fe Institute.

    Citations


    Cited by:

    1. Samuel Zanferdini Oliva & Livia Oliveira-Ciabati & Denise Gazotto Dezembro & Mário Sérgio Adolfi Júnior & Maísa Carvalho Silva & Hugo Cesar Pessotti & Juliana Tarossi Pollettini, 2021. "Text structuring methods based on complex network: a systematic review," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1471-1493, February.
    2. Shakibian, Hadi & Charkari, Nasrollah Moghadam, 2018. "Statistical similarity measures for link prediction in heterogeneous complex networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 501(C), pages 248-263.
    3. Corrêa, Edilson A. & Amancio, Diego R., 2019. "Word sense induction using word embeddings and community detection in complex networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 523(C), pages 180-190.
    4. Joseph, Simmi Marina & Citraro, Salvatore & Morini, Virginia & Rossetti, Giulio & Stella, Massimo, 2023. "Cognitive network neighborhoods quantify feelings expressed in suicide notes and Reddit mental health communities," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 610(C).
    5. Liu, Yanyan & Li, Keping & Yan, Dongyang & Gu, Shuang, 2022. "A network-based CNN model to identify the hidden information in text data," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 590(C).
    6. Espitia, Diego & Larralde, Hernán, 2020. "Universal and non-universal text statistics: Clustering coefficient for language identification," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 553(C).
    7. Garg, Muskan & Kumar, Mukesh, 2018. "The structure of word co-occurrence network for microblogs," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 512(C), pages 698-720.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Woon Peng Goh & Kang-Kwong Luke & Siew Ann Cheong, 2018. "Functional shortcuts in language co-occurrence networks," PLOS ONE, Public Library of Science, vol. 13(9), pages 1-18, September.
    2. Ted Briscoe, 2008. "Language learning, power laws, and sexual selection," Mind & Society: Cognitive Studies in Economics and Social Sciences, Springer;Fondazione Rosselli, vol. 7(1), pages 65-76, June.
    3. Diego R Amancio, 2015. "Probing the Topological Properties of Complex Networks Modeling Short Written Texts," PLOS ONE, Public Library of Science, vol. 10(2), pages 1-17, February.
    4. Corrêa Jr., Edilson A. & Silva, Filipi N. & da F. Costa, Luciano & Amancio, Diego R., 2017. "Patterns of authors contribution in scientific manuscripts," Journal of Informetrics, Elsevier, vol. 11(2), pages 498-510.
    5. Amancio, Diego Raphael & Oliveira, Osvaldo Novais & da Fontoura Costa, Luciano, 2012. "Three-feature model to reproduce the topology of citation networks and the effects from authors’ visibility on their h-index," Journal of Informetrics, Elsevier, vol. 6(3), pages 427-434.
    6. Li, Jianyu & Zhou, Jie, 2007. "Chinese character structure analysis based on complex networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 380(C), pages 629-638.
    7. Xiao, Wenjun & Liu, Yanxia & Chen, Guanrong, 2014. "Characterizing vertex-degree sequences in scale-free networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 404(C), pages 291-295.
    8. Adilson Vital & Diego R. Amancio, 2022. "A comparative analysis of local similarity metrics and machine learning approaches: application to link prediction in author citation networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(10), pages 6011-6028, October.
    9. Zhao, Qihang & Feng, Xiaodong, 2022. "Utilizing citation network structure to predict paper citation counts: A Deep learning approach," Journal of Informetrics, Elsevier, vol. 16(1).
    10. Akimushkin, Camilo & Amancio, Diego R. & Oliveira, Osvaldo N., 2018. "On the role of words in the network structure of texts: Application to authorship attribution," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 495(C), pages 49-58.
    11. Danhao Zhu & Dongbo Wang & Saeed-Ul Hassan & Peter Haddawy, 2013. "Small-world phenomenon of keywords network based on complex network," Scientometrics, Springer;Akadémiai Kiadó, vol. 97(2), pages 435-442, November.
    12. Liu, Yanyan & Li, Keping & Yan, Dongyang & Gu, Shuang, 2022. "A network-based CNN model to identify the hidden information in text data," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 590(C).
    13. Cui, Xue-Mei & Yoon, Chang No & Youn, Hyejin & Lee, Sang Hoon & Jung, Jean S. & Han, Seung Kee, 2017. "Dynamic burstiness of word-occurrence and network modularity in textbook systems," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 487(C), pages 103-110.
    14. Sheng, Long & Li, Chunguang, 2009. "English and Chinese languages as weighted complex networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 388(12), pages 2561-2570.
    15. Diego R. Amancio & Osvaldo N. Oliveira jr & Luciano F. Costa, 2015. "Topological-collaborative approach for disambiguating authors’ names in collaborative networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(1), pages 465-485, January.
    16. Petralia, Sergio & Kemeny, Thomas & Storper, Michael, 2023. "The transformative effects of tacit technological knowledge," LSE Research Online Documents on Economics 120154, London School of Economics and Political Science, LSE Library.
    17. STANKOVA, Marija & MARTENS, David & PROVOST, Foster, 2015. "Classification over bipartite graphs through projection," Working Papers 2015001, University of Antwerp, Faculty of Business and Economics.
    18. Xiomara S. Q. Chacon & Thiago C. Silva & Diego R. Amancio, 2020. "Comparing the impact of subfields in scientific journals," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(1), pages 625-639, October.
    19. Ghosh, Dipak & Chakraborty, Sayantan & Samanta, Shukla, 2019. "Study of translational effect in Tagore’s Gitanjali using Chaos based Multifractal analysis technique," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 523(C), pages 1343-1354.
    20. Kavitha Karunan & Hiran H. Lathabai & Thara Prabhakaran, 2017. "Discovering interdisciplinary interactions between two research fields using citation networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 113(1), pages 335-367, October.
