
Comparing the topological properties of real and artificially generated scientific manuscripts

Author

Listed:
  • Diego Raphael Amancio

    (University of São Paulo)

Abstract

Recent years have witnessed increasing competition in science. While promoting the quality of research in many cases, intense competition among scientists can also trigger unethical scientific behaviors. To increase the total number of published papers, some authors even resort to software tools that are able to produce grammatical, but meaningless scientific manuscripts. Because automatically generated papers can be mistaken for real papers, it becomes of paramount importance to develop means to identify these scientific frauds. In this paper, I devise a methodology to distinguish real manuscripts from those generated with SCIGen, an automatic paper generator. Upon modeling texts as complex networks (CN), it was possible to discriminate real from fake papers with at least 89% accuracy. A systematic analysis of feature relevance revealed that accessibility and betweenness were useful in particular cases, even though the relevance depended upon the dataset. The successful application of the methods described here shows, as a proof of principle, that network features can be used to identify scientific gibberish papers. In addition, the CN-based approach can be combined in a straightforward fashion with traditional statistical language processing methods to improve the performance in identifying artificially generated papers.
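
To make the idea of modeling a text as a network concrete, the following is a minimal sketch, not the paper's implementation. It assumes a simple word-adjacency model (each word linked to its immediate neighbour) and computes only betweenness, using Brandes' algorithm; the abstract does not specify the preprocessing, the accessibility computation, or the classifier, so none of those are reproduced here. The function names `cooccurrence_network` and `betweenness` are hypothetical.

```python
from collections import deque


def cooccurrence_network(text):
    """Build an undirected word-adjacency graph (assumed model, not the paper's)."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    graph = {w: set() for w in words}
    # Link every pair of consecutive, distinct words.
    for a, b in zip(words, words[1:]):
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def betweenness(graph):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        stack = []                          # nodes in order of non-decreasing distance
        preds = {v: [] for v in graph}      # predecessors on shortest paths
        sigma = dict.fromkeys(graph, 0)     # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:           # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # In an undirected graph every pair is counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}


if __name__ == "__main__":
    sample = "network features can be used to identify papers and network features vary"
    ranking = sorted(betweenness(cooccurrence_network(sample)).items(),
                     key=lambda item: item[1], reverse=True)
    print(ranking[:3])
```

In a setting like the one the abstract describes, per-word values such as these would presumably be summarised into a feature vector for each document and passed to a supervised classifier; that step is left out because the abstract gives no details about it.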

Suggested Citation

  • Diego Raphael Amancio, 2015. "Comparing the topological properties of real and artificially generated scientific manuscripts," Scientometrics, Springer;Akadémiai Kiadó, vol. 105(3), pages 1763-1779, December.
  • Handle: RePEc:spr:scient:v:105:y:2015:i:3:d:10.1007_s11192-015-1637-z
    DOI: 10.1007/s11192-015-1637-z

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-015-1637-z
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Citations


    Cited by:

    1. Jennifer A. Byrne & Cyril Labbé, 2017. "Striking similarities between publications from China describing single gene knockdown experiments in human cancer cell lines," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(3), pages 1471-1493, March.
    2. Jorge A. V. Tohalino & Laura V. C. Quispe & Diego R. Amancio, 2021. "Analyzing the relationship between text features and grants productivity," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 4255-4275, May.
    3. Silva, Filipi N. & Amancio, Diego R. & Bardosova, Maria & Costa, Luciano da F. & Oliveira, Osvaldo N., 2016. "Using network science and text analytics to produce surveys in a scientific topic," Journal of Informetrics, Elsevier, vol. 10(2), pages 487-502.
    4. Dejian Yu & Wanru Wang & Shuai Zhang & Wenyu Zhang & Rongyu Liu, 2017. "Hybrid self-optimized clustering model based on citation links and textual features to detect research topics," PLOS ONE, Public Library of Science, vol. 12(10), pages 1-21, October.
    5. Guillaume Cabanac & Cyril Labbé, 2021. "Prevalence of nonsensical algorithmically generated papers in the scientific literature," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 72(12), pages 1461-1476, December.
    6. Shang, Ronghua & Zhang, Weitong & Zhang, Jingwen & Feng, Jie & Jiao, Licheng, 2022. "Local community detection based on higher-order structure and edge information," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 587(C).
    7. Brito, Ana C.M. & Silva, Filipi N. & de Arruda, Henrique F. & Comin, Cesar H. & Amancio, Diego R. & Costa, Luciano da F., 2021. "Classification of abrupt changes along viewing profiles of scientific articles," Journal of Informetrics, Elsevier, vol. 15(2).
    8. Tohalino, Jorge V. & Amancio, Diego R., 2018. "Extractive multi-document summarization using multilayer networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 503(C), pages 526-539.
    9. de Arruda, Henrique F. & Silva, Filipi N. & Comin, Cesar H. & Amancio, Diego R. & Costa, Luciano da F., 2019. "Connecting network science and information theory," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 515(C), pages 641-648.
    10. Nguyen Minh Tien & Cyril Labbé, 2018. "Detecting automatically generated sentences with grammatical structure similarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1247-1271, August.
    11. Corrêa, Edilson A. & Amancio, Diego R., 2019. "Word sense induction using word embeddings and community detection in complex networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 523(C), pages 180-190.
    12. Corrêa, Edilson A. & Marinho, Vanessa Q. & Amancio, Diego R., 2020. "Semantic flow in language networks discriminates texts by genre and publication date," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 557(C).
    13. Adam Day, 2022. "Exploratory analysis of text duplication in peer-review reveals peer-review fraud and paper mills," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(10), pages 5965-5987, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Adil El Aichouchi & Philippe Gorry, 2018. "Delayed recognition of Judah Folkman’s hypothesis on tumor angiogenesis: when a Prince awakens a Sleeping Beauty by self-citation," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(1), pages 385-399, July.
    2. Diego R Amancio, 2015. "Probing the Topological Properties of Complex Networks Modeling Short Written Texts," PLOS ONE, Public Library of Science, vol. 10(2), pages 1-17, February.
    3. Lutz Bornmann & Adam Y. Ye & Fred Y. Ye, 2018. "Identifying “hot papers” and papers with “delayed recognition” in large-scale datasets by using dynamically normalized citation impact scores," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 655-674, August.
    4. Amancio, Diego R. & Oliveira Jr., Osvaldo N. & Costa, Luciano da F., 2012. "Structure–semantics interplay in complex networks and its effects on the predictability of similarity in texts," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 391(18), pages 4406-4419.
    5. Hui Fang, 2018. "Analysing the variation tendencies of the numbers of yearly citations for sleeping beauties in science by using derivative analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 115(2), pages 1051-1070, May.
    6. Miura, Takahiro & Asatani, Kimitaka & Sakata, Ichiro, 2023. "Revisiting the uniformity and inconsistency of slow-cited papers in science," Journal of Informetrics, Elsevier, vol. 17(1).
    7. Martin Szomszor & David A. Pendlebury & Jonathan Adams, 2020. "How much is too much? The difference between research influence and self-citation excess," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(2), pages 1119-1147, May.
    8. Vîiu, Gabriel-Alexandru, 2016. "A theoretical evaluation of Hirsch-type bibliometric indicators confronted with extreme self-citation," Journal of Informetrics, Elsevier, vol. 10(2), pages 552-566.
    9. D. R. Amancio & M. G. V. Nunes & O. N. Oliveira & L. F. Costa, 2012. "Using complex networks concepts to assess approaches for citations in scientific papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 91(3), pages 827-842, June.
    10. S. R. Goldberg & H. Anthony & T. S. Evans, 2015. "Modelling citation networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 105(3), pages 1577-1604, December.
    11. Adilson Vital & Diego R. Amancio, 2022. "A comparative analysis of local similarity metrics and machine learning approaches: application to link prediction in author citation networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(10), pages 6011-6028, October.
    12. Jorge A. V. Tohalino & Laura V. C. Quispe & Diego R. Amancio, 2021. "Analyzing the relationship between text features and grants productivity," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 4255-4275, May.
    13. Ferraz de Arruda, Henrique & Reia, Sandro Martinelli & Silva, Filipi Nascimento & Amancio, Diego Raphael & da Fontoura Costa, Luciano, 2022. "Finding contrasting patterns in rhythmic properties between prose and poetry," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 598(C).
    14. José Osvaldo De Sordi & Marco Antonio Conejero & Manuel Meireles, 2016. "Bibliometric indicators in the context of regional repositories: proposing the D-index," Scientometrics, Springer;Akadémiai Kiadó, vol. 107(1), pages 235-258, April.
    15. Silvio Peroni & Paolo Ciancarini & Aldo Gangemi & Andrea Giovanni Nuzzolese & Francesco Poggi & Valentina Presutti, 2020. "The practice of self-citations: a longitudinal study," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(1), pages 253-282, April.
    16. Nguyen Minh Tien & Cyril Labbé, 2018. "Detecting automatically generated sentences with grammatical structure similarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1247-1271, August.
    17. Hélder Ferreira & Aurora A.C. Teixeira, 2013. "‘Welcome to the experience economy’: assessing the influence of customer experience literature through bibliometric analysis," FEP Working Papers 481, Universidade do Porto, Faculdade de Economia do Porto.
    18. Taşkın, Zehra & Doğan, Güleda & Kulczycki, Emanuel & Zuccala, Alesia Ann, 2021. "Self-Citation Patterns of Journals Indexed in the Journal Citation Reports," Journal of Informetrics, Elsevier, vol. 15(4).
    19. Quispe, Laura V.C. & Tohalino, Jorge A.V. & Amancio, Diego R., 2021. "Using virtual edges to improve the discriminability of co-occurrence text networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 562(C).
    20. Tohalino, Jorge V. & Amancio, Diego R., 2018. "Extractive multi-document summarization using multilayer networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 503(C), pages 526-539.
