
Conditional Complexity of Compression for Authorship Attribution

Author

Listed:
  • Mikhail B. Malyutov
  • Chammi I. Wickramasinghe
  • Sufeng Li

Abstract

We introduce new stylometry tools based on the sliced conditional compression complexity of literary texts. They are inspired by the nearly optimal application of the incomputable Kolmogorov conditional complexity, which they presumably approximate. Whereas other stylometry tools can occasionally give very close values for different authors, our statistic is apparently strictly minimal for the true author, provided that the query and training texts are sufficiently large, the compressor is sufficiently good, and sampling bias is avoided (as in poll sampling). We tune the statistic and test its performance on attributing the Federalist papers (Madison vs. Hamilton). Our results confirm the previous attribution of the Federalist papers to Madison by Mosteller and Wallace (1964), who used the Naive Bayes classifier, as well as the same attribution based on alternative classifiers such as SVM and the second-order Markov model of language. We then apply our method to study the attribution of the early poems from the Shakespeare Canon and of the continuation of Marlowe’s poem ‘Hero and Leander’ ascribed to G. Chapman.
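The idea behind the statistic can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that conditional compression complexity is estimated as the compressed size of the training text followed by the query, minus the compressed size of the training text alone, and that "sliced" means averaging this quantity over fixed-length pieces of the query. The compressor (zlib), the slice length, and all function names are hypothetical choices made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (assumed compressor)."""
    return len(zlib.compress(data, 9))


def conditional_complexity(training: str, query: str) -> int:
    """Estimate of C(query | training): the extra compressed bytes
    needed for the query once the training text is already known."""
    t = training.encode("utf-8")
    q = query.encode("utf-8")
    return compressed_size(t + q) - compressed_size(t)


def sliced_complexity(training: str, query: str, slice_len: int = 200) -> float:
    """Mean conditional complexity over consecutive query slices.
    The slicing scheme and slice_len are assumptions of this sketch."""
    if not query:
        raise ValueError("query must be non-empty")
    slices = [query[i:i + slice_len] for i in range(0, len(query), slice_len)]
    total = sum(conditional_complexity(training, s) for s in slices)
    return total / len(slices)


def attribute(query: str, corpora: dict, slice_len: int = 200):
    """Return the candidate author with the smallest sliced complexity,
    together with the score of every candidate."""
    scores = {
        author: sliced_complexity(text, query, slice_len)
        for author, text in corpora.items()
    }
    best = min(scores, key=scores.get)
    return best, scores
```

Calling attribute(query_text, {"Madison": madison_text, "Hamilton": hamilton_text}) would return the candidate whose training text makes the query cheapest to compress. Note that zlib only looks back about 32 KB, so for long training texts a compressor with a larger context would be needed for the estimate to be meaningful.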

Suggested Citation

  • Mikhail B. Malyutov & Chammi I. Wickramasinghe & Sufeng Li, 2007. "Conditional Complexity of Compression for Authorship Attribution," SFB 649 Discussion Papers SFB649DP2007-057, Sonderforschungsbereich 649, Humboldt University, Berlin, Germany.
  • Handle: RePEc:hum:wpaper:sfb649dp2007-057

    Download full text from publisher

    File URL: http://sfb649.wiwi.hu-berlin.de/papers/pdf/SFB649DP2007-057.pdf
    Download Restriction: no

    More about this item

    Keywords

    compression complexity; authorship attribution

    JEL classification:

    • C12 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Hypothesis Testing: General
    • C15 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Statistical Simulation Methods: General
    • C63 - Mathematical and Quantitative Methods - - Mathematical Methods; Programming Models; Mathematical and Simulation Modeling - - - Computational Techniques
