Modeling Statistical Properties of Written Text

M. Serrano, A. Flammini, Filippo Menczer

Published 2009 in PLoS ONE

ABSTRACT

Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts, and we identify dynamic word ranking and memory across documents as key mechanisms explaining the nontrivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science, and linguistics.
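
As a rough illustration of two of the quantities named in the abstract, the sketch below computes a rank-frequency list (the input to a Zipf's law plot) and a vocabulary growth curve (the input to a Heaps' law plot) from a list of tokens. It is not the generative model introduced in the paper; the function names and the toy sentence are assumptions made only for this example.

```python
from collections import Counter


def rank_frequency(tokens):
    """Word counts sorted from most to least frequent (Zipf's law input)."""
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token (Heaps' law input)."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


# Toy text chosen only for illustration; it is not data from the paper.
tokens = "the cat sat on the mat and the dog sat on the log".split()
freqs = rank_frequency(tokens)
growth = vocabulary_growth(tokens)

print(freqs)   # counts by rank, most frequent word first
print(growth)  # vocabulary size after each token
```

On a real corpus, plotting `freqs` against rank and `growth` against text length, both on log-log axes, is the usual way to inspect the two laws; a toy sentence like this one is far too short to show either.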

PUBLICATION RECORD

Publication year: 2009
Venue: PLoS ONE
Publication date: 2009-04-29
Fields of study: Medicine, Linguistics, Computer Science
Identifiers: DOI 10.1371/journal.pone.0005372; PMID 19401762; PMCID 2670513
Source metadata: Semantic Scholar, PubMed

CITED BY

Component systems: do null models explain everything? (2026)
The Science of the New (2026)
Cognitive Limits Shape Language Statistics (2025)
What Does YouTube Advise Students About Bypassing AI-Text Detection Tools? A Pragmatic Analysis (2025)
Complete asymptotic type-token relationship for growing complex systems with inverse power-law count rankings (2025)
Zipf's and Heaps' Laws for Tokens and LLM-generated Texts (2025)
CAT-LLM: Style-enhanced Large Language Models with Text Style Definition for Chinese Article-style Transfer (2024)
Entropy and type-token ratio in gigaword corpora (2024)
Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach (2024)
A Recognition System for Devanagari Handwritten Digits Using CNN (2024)
Predict the Next Word: <Humans exhibit uncertainty in this task and language models _____> (2024)
Scale-free growth in regional scientific capacity building explains long-term scientific dominance (2023)
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks (2023)
Extending Heaps' Law for Sublinear Vocabulary Growth on a Logarithmic Scale (2023)
Social interactions affect discovery processes (2022)
Discourse with few words: Coherence statistics, parent-infant actions on objects, and object names (2022)
Data Distributional Properties Drive Emergent In-Context Learning in Transformers (2022)
Model Criticism for Long-Form Text Generation (2022)
“Neuroscience” models of institutional conflict under fog, friction, and adversarial intent (2022)
Ways to build text collections for training classifiers (2021)
Heaps’ law and vocabulary richness in the history of classical music harmony (2021)
Negative correlation of word rank sequence in written texts (2021)
A Framework to Understand Attitudes towards Immigration through Twitter (2021)
Способы построения текстовых коллекций для обучения классификаторов [Ways to build text collections for training classifiers] (2021)
Correlates in the evolution of phonotactic diversity in English: Linguistic structure, demographics, and network characteristics (2021)
Related Statistical Universals (2021)
Modifications of Simon text model (2020)
Beauty in artistic expressions through the eyes of networks and physics (2020)
Comparative Computational Analysis of Global Structure in Canonical, Non-Canonical and Non-Literary Texts (2020)
A Unified Framework for Processing Exact and Approximate Top-k Set Similarity Join (2020)
First Passage Times (2019)
An Overview of the Available Corpora for Evaluation of the Automatic Keyword Extraction Algorithms (2018)
Geometric randomization of real networks with prescribed degree sequence (2018)
Feature space learning model (2018)
Zipf’s, Heaps’ and Taylor’s Laws are Determined by the Expansion into the Adjacent Possible (2018)
Quantity and Diversity: Simulating Early Word Learning Environments (2018)
Dynamic burstiness of word-occurrence and network modularity in textbook systems (2017)
Dynamics on Expanding Spaces: Modeling the Emergence of Novelties (2017)
Human exploration of complex knowledge spaces (2017)
Real-world visual statistics and infants' first-learned object names (2017)
Multifractal correlations in natural language written texts: Effects of language family and long word statistics (2017)
A Feature Space Learning Model Based on Semi-Supervised Clustering (2017)
A statistical test for the Zipf's law by deviations from the Heaps' law (2017)
Context as an Organizing Principle of the Lexicon (2017)
Waves of novelties in the expansion into the adjacent possible (2017)
Long-Range Correlation Underlying Childhood Language and Generative Models (2017)
Dependence of exponents on text length versus finite-size scaling for word-frequency distributions (2017)
Probability distribution of intersymbol distances in random symbolic sequences: Applications to improving detection of keywords in texts and of amino acid clustering in proteins (2016)
Verifying Heaps' law using Google Books Ngram data (2016)
Supporting online health queries by modeling patterns of creation, modification and retrieval of medical knowledge (2016)
Empirical analysis of vocabulary growth based on word connectivity (2015)
Zipf and Related Scaling Laws. 2. Literature Overview of Applications in Linguistics (2015)
Similarity of symbol frequency distributions with heavy tails (2015)
The Interaction of 'Supply', 'Demand', and 'Technological Capabilities' in terms of Medical Subject Headings: A Triple Helix Model of Medical Innovation (2015)
A Triple Helix Model of Medical Innovation: Supply, Demand, and Technological Capabilities in Terms of Medical Subject Headings (2015)
First Women, Second Sex: Gender Bias in Wikipedia (2015, influential citation)
Semantic Space as a Metapopulation System: Modelling the Wikipedia Information Flow Network (2015)
Statistical laws in linguistics (2015)
Big data and Wikipedia research: social science knowledge across disciplinary divides (2015)
Semantic facilitation in bilingual first language acquisition (2015)
Bridging Between Information Retrieval and Databases (2014)
Statistical models for the analysis of short user-generated documents (2014)
Provenance, propagation and quality of biological annotation (2014, influential citation)
Universals versus historical contingencies in lexical evolution (2014)
Spam Detection: An Unsupervised Approach using Generative Models (2014)
Scaling laws in human speech, decreasing emergence of new words and a generalized model (2014)
Scaling laws and fluctuations in the statistics of word frequencies (2014)
The Matthew effect in empirical data (2014)
Language Acquisition of Bilingual Children: A Network Analysis, in Proceedings of the Annual Meeting of the Cognitive Science Society (2014)
Understanding Zipf's law of word frequencies through sample-space collapse in sentence formation (2014)
Regulation of burstiness by network-driven activation (2014)
Log-Log Convexity of Type-Token Growth in Zipf's Systems (2014)
Statistical universality and the method of Poissonian randomizations (2013)
Graphical law beneath each written natural language (2013)
Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis (2013)
Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript (2013)
Deviation of Zipf's and Heaps' Laws in Human Languages with Limited Dictionary Sizes (2013)
The dynamics of correlated novelties (2013)
Identifying Trends in Word Frequency Dynamics (2013)
Vocabulary influences older and younger listeners' processing of dysarthric speech (2013)
An Introduction to the Novel Challenges in Information Retrieval for Social Media (2013)
Document Clustering with Bursty Information (2013)
A scaling law beyond Zipf's law and its relation to Heaps' law (2013)
Innovation and nested preferential growth in chess playing behavior (2013)
Lognormal distributions of user post lengths in Internet discussions - a consequence of the Weber-Fechner law? (2013)
Computational Linguistic Models of Deceptive Opinion Spam (2013)
A Practical Approach to Language Complexity: A Wikipedia Case Study (2012)
Power-law connections: From Zipf to Heaps and beyond (2012)
Geometric theory for Weibull's distribution (2012)
Stochastic model for the vocabulary growth in natural languages (2012, influential citation)
Languages cool as they expand: Allometric scaling and the decreasing need for new words (2012)
Value Production in a Collaborative Environment (2012)
Clustering with Bursty Information (2012)
An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB (2012)
Emergent semantics from game-induced folksonomies (2012)
Interdisciplinary applications of statistical physics to complex systems: seismic physics, econophysics, and sociophysics (2012)
Investigating the Statistical Properties of User-Generated Documents (2011)
The growth statistics of Zipfian ensembles: Beyond Heaps’ law (2011)
The dynamic features of Delicious, Flickr, and YouTube (2011)
Zipf’s law, 1/f noise, and fractal hierarchy (2011)