IDEAS home Printed from https://ideas.repec.org/a/gam/jftint/v13y2020i1p3-d468370.html

Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks

Authors

Listed:
  • Aleksandr Romanov

    (Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia)

  • Anna Kurtukova

    (Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia)

  • Alexander Shelupanov

    (Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia)

  • Anastasia Fedotova

    (Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia)

  • Valery Goncharov

    (Department of Automation and Robotics, The National Research Tomsk Polytechnic University, 634050 Tomsk, Russia)

Abstract

The article explores approaches to determining the author of a natural language text, along with the advantages and disadvantages of these approaches. The problem is important because of the active digitalization of society and the shift of most everyday activities online. Authorship identification methods are particularly useful in information security and forensics. For example, such methods can be used to identify the authors of suicide notes and other texts subjected to forensic examination. Another area of application is plagiarism detection, which is relevant both to intellectual property protection in the digital space and to the educational process. The article describes identifying the author of a Russian-language text using a support vector machine (SVM) and deep neural network architectures (long short-term memory (LSTM), convolutional neural networks (CNN) with attention, and Transformer). The results show that all the considered algorithms are suitable for solving the authorship identification problem, but SVM shows the best accuracy. The average accuracy of SVM reaches 96%. This is due to carefully chosen parameters and a feature space that includes statistical and semantic features (including those extracted by aspect analysis). Deep neural networks are inferior to SVM in accuracy and reach only 93%. The study also evaluates how attacks on the method affect the models' accuracy. Experiments show that the SVM-based methods are not robust to deliberate text anonymization. In comparison, the loss in accuracy of deep neural networks does not exceed 20%. The Transformer architecture is the most effective for anonymized texts and achieves 81% accuracy.
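
The abstract refers to a feature space built from statistical features but does not list them here. As a rough illustration only, the sketch below computes two kinds of statistical features commonly used in stylometry: character n-gram frequencies and a few surface ratios. The function names, the choice of features, and the punctuation set are assumptions made for this example and are not taken from the article; a classifier such as an SVM would be trained on vectors assembled from values like these.

```python
from collections import Counter
from typing import Dict

# Hypothetical punctuation set chosen for this sketch, not from the article.
PUNCT = ".,;:!?()\"'-"


def char_ngram_profile(text: str, n: int = 3) -> Dict[str, float]:
    """Relative frequency of each character n-gram in the lowercased text."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def surface_stats(text: str) -> Dict[str, float]:
    """Average word length, punctuation rate, and type-token ratio."""
    # Strip surrounding punctuation from whitespace-separated tokens.
    words = [w.strip(PUNCT).lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return {"avg_word_len": 0.0, "punct_rate": 0.0, "type_token_ratio": 0.0}
    punct_count = sum(1 for ch in text if ch in PUNCT)
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "punct_rate": punct_count / len(text),
        "type_token_ratio": len(set(words)) / len(words),
    }


if __name__ == "__main__":
    sample = "Да, да. Нет!"
    print(char_ngram_profile(sample, n=2))
    print(surface_stats(sample))
```

Character n-grams work directly on Cyrillic text without tokenization or morphological analysis, which is one reason they are a common baseline for non-English authorship tasks. The semantic features mentioned in the abstract are not covered by this sketch.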

Suggested Citation

  • Aleksandr Romanov & Anna Kurtukova & Alexander Shelupanov & Anastasia Fedotova & Valery Goncharov, 2020. "Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks," Future Internet, MDPI, vol. 13(1), pages 1-16, December.
  • Handle: RePEc:gam:jftint:v:13:y:2020:i:1:p:3-:d:468370

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/13/1/3/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/13/1/3/
    Download Restriction: no