
Modeling Statistical Properties of Written Text

Author

Listed:
  • M Ángeles Serrano
  • Alessandro Flammini
  • Filippo Menczer

Abstract

Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the nontrivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
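
The three document-level regularities named in the abstract can each be measured with a few lines of code. The following is a minimal illustrative sketch, not the authors' generative model: the function names (`rank_frequency`, `vocabulary_growth`, `recurrence_gaps`) and the toy sentence are assumptions introduced here. It computes the rank-ordered word counts examined under Zipf's law, the vocabulary-size curve examined under Heaps' law, and the gaps between repeated occurrences of a word, which is one simple way to look at burstiness.

```python
from collections import Counter


def rank_frequency(tokens):
    """Word counts sorted from most to least frequent (the Zipf view)."""
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token (the Heaps view)."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def recurrence_gaps(tokens, word):
    """Distances between successive occurrences of `word`.

    Many short gaps mixed with a few long ones suggest bursty usage.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [b - a for a, b in zip(positions, positions[1:])]


# Hypothetical toy input; real analyses would use a tokenized corpus.
tokens = "the cat sat on the mat the cat ran".split()
print(rank_frequency(tokens))
print(vocabulary_growth(tokens))
print(recurrence_gaps(tokens, "the"))
```

On a real corpus, plotting the first list against rank and the second against text length on log-log axes is the usual way to check for the power-law and sublinear behavior the abstract refers to.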

Suggested Citation

  • M Ángeles Serrano & Alessandro Flammini & Filippo Menczer, 2009. "Modeling Statistical Properties of Written Text," PLOS ONE, Public Library of Science, vol. 4(4), pages 1-8, April.
  • Handle: RePEc:plo:pone00:0005372
    DOI: 10.1371/journal.pone.0005372

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0005372
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0005372&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Martin A. Nowak & Natalia L. Komarova & Partha Niyogi, 2002. "Computational and evolutionary aspects of language," Nature, Nature, vol. 417(6889), pages 611-617, June.
    2. A. Saichev & Y. Malevergne & D. Sornette, 2008. "Theory of Zipf's Law and of General Power Law Distributions with Gibrat's law of Proportional Growth," Papers 0808.1828, arXiv.org.

    Citations



    Cited by:

    1. Servedio, Vito D.P. & Ferreira, Márcia R. & Reisz, Niklas & Costas, Rodrigo & Thurner, Stefan, 2023. "Scale-free growth in regional scientific capacity building explains long-term scientific dominance," Chaos, Solitons & Fractals, Elsevier, vol. 167(C).
    2. Bernardo Monechi & Álvaro Ruiz-Serrano & Francesca Tria & Vittorio Loreto, 2017. "Waves of novelties in the expansion into the adjacent possible," PLOS ONE, Public Library of Science, vol. 12(6), pages 1-18, June.
    3. Petersen, Alexander M. & Rotolo, Daniele & Leydesdorff, Loet, 2016. "A triple helix model of medical innovation: Supply, demand, and technological capabilities in terms of Medical Subject Headings," Research Policy, Elsevier, vol. 45(3), pages 666-681.
    4. Cui, Xue-Mei & Yoon, Chang No & Youn, Hyejin & Lee, Sang Hoon & Jung, Jean S. & Han, Seung Kee, 2017. "Dynamic burstiness of word-occurrence and network modularity in textbook systems," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 487(C), pages 103-110.
    5. Chen, Yanguang, 2012. "Zipf’s law, 1/f noise, and fractal hierarchy," Chaos, Solitons & Fractals, Elsevier, vol. 45(1), pages 63-73.
    6. Rodrick Wallace, 2024. "“Neuroscience” models of institutional conflict under fog, friction, and adversarial intent," The Journal of Defense Modeling and Simulation, vol. 21(1), pages 75-86, January.
    7. Chen, Yanguang, 2012. "The mathematical relationship between Zipf’s law and the hierarchical scaling law," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 391(11), pages 3285-3299.
    8. Eduardo G Altmann & Janet B Pierrehumbert & Adilson E Motter, 2011. "Niche as a Determinant of Word Fate in Online Groups," PLOS ONE, Public Library of Science, vol. 6(5), pages 1-12, May.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Petra Štamfestová & Lukáš Sobíšek & Jiří Hnilica, 2023. "Firm Size Distribution in the Central European Context," Central European Business Review, Prague University of Economics and Business, vol. 2023(5), pages 151-175.
    2. Bakalis, Evangelos & Galani, Alexandra, 2012. "Modeling language evolution: Aromanian, an endangered language in Greece," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 391(20), pages 4963-4969.
    3. Safarzynska, Karolina & van den Bergh, Jeroen C.J.M., 2011. "Beyond replicator dynamics: Innovation-selection dynamics and optimal diversity," Journal of Economic Behavior & Organization, Elsevier, vol. 78(3), pages 229-245, May.
    4. Michael J Weir & Catherine M Ashcraft & Natallia Leuchanka Diessner & Bridie McGreavy & Emily Vogler & Todd Guilfoos, 2020. "Language effects on bargaining," PLOS ONE, Public Library of Science, vol. 15(3), pages 1-20, March.
    5. Stanisz, Tomasz & Drożdż, Stanisław & Kwapień, Jarosław, 2023. "Universal versus system-specific features of punctuation usage patterns in major Western languages," Chaos, Solitons & Fractals, Elsevier, vol. 168(C).
    6. Adam Gifford, 2012. "John R. Searle: The making of the social world: the structure of human civilization," Journal of Bioeconomics, Springer, vol. 14(1), pages 95-99, April.
    7. David Bodoff & Ron Bekkerman & Julie Dai, 2017. "Evolution of language: An empirical study at eBay Big Data Lab," PLOS ONE, Public Library of Science, vol. 12(12), pages 1-17, December.
    8. Dirk Helbing & Anders Johansson, 2010. "Cooperation, Norms, and Revolutions: A Unified Game-Theoretical Approach," PLOS ONE, Public Library of Science, vol. 5(10), pages 1-15, October.
    9. Patriarca, Marco & Heinsalu, Els, 2009. "Influence of geography on language competition," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 388(2), pages 174-186.
    10. Simon M. Huttegger & Kevin J. S. Zollman, 2016. "The Robustness of Hybrid Equilibria in Costly Signaling Games," Dynamic Games and Applications, Springer, vol. 6(3), pages 347-358, September.
    11. Marcelo A Montemurro & Damián H Zanette, 2011. "Universal Entropy of Word Ordering Across Linguistic Families," PLOS ONE, Public Library of Science, vol. 6(5), pages 1-9, May.
    12. Karolina Safarzyńska & Jeroen Bergh, 2013. "An evolutionary model of energy transitions with interactive innovation-selection dynamics," Journal of Evolutionary Economics, Springer, vol. 23(2), pages 271-293, April.
    13. Montebruno, Piero & Bennett, Robert J. & van Lieshout, Carry & Smith, Harry, 2019. "A tale of two tails: Do Power Law and Lognormal models fit firm-size distributions in the mid-Victorian era?," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 523(C), pages 858-875.
    14. Parshad, Rana D. & Bhowmick, Suman & Chand, Vineeta & Kumari, Nitu & Sinha, Neha, 2016. "What is India speaking? Exploring the “Hinglish” invasion," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 449(C), pages 375-389.
    15. Domingo Docampo & Lawrence Cram, 2017. "Academic performance and institutional resources: a cross-country analysis of research universities," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(2), pages 739-764, February.
    16. G. A. Kohring, 2009. "Complex Dependencies In Large Software Systems," Advances in Complex Systems (ACS), World Scientific Publishing Co. Pte. Ltd., vol. 12(06), pages 565-581.
    17. Nie, Lin-Fei & Teng, Zhi-Dong & Nieto, Juan J. & Jung, Il Hyo, 2015. "State impulsive control strategies for a two-languages competitive model with bilingualism and interlinguistic similarity," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 430(C), pages 136-147.
    18. Rodriguez, E. & Aguilar-Cornejo, M. & Femat, R. & Alvarez-Ramirez, J., 2014. "Scale and time dependence of serial correlations in word-length time series of written texts," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 414(C), pages 378-386.
