Zipf's and Heaps' Laws for Tokens and LLM-generated Texts

Published 2025 in Conference on Empirical Methods in Natural Language Processing

ABSTRACT

The frequency distribution of words in human-written texts roughly follows a simple mathematical form known as Zipf’s law. Somewhat less well known is the related Heaps’ law, which describes a sublinear power-law growth of vocabulary size with document size. We study the applicability of Zipf’s and Heaps’ laws to texts generated by Large Language Models (LLMs). We empirically show that Heaps’ and Zipf’s laws hold for LLM-generated texts only in a narrow, model-dependent temperature range. These temperatures have an optimal value close to t = 1 for all the base models except the large Llama models, are higher for instruction-finetuned models, and do not depend on the model size or prompting. This independently confirms the recent discovery of sampling-temperature-dependent phase transitions in LLM-generated texts.
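
As a rough illustration of the two laws named in the abstract, the sketch below estimates a Zipf exponent (alpha in f(r) ∝ r^(-alpha), frequency against rank) and a Heaps exponent (beta in V(n) ∝ n^beta, vocabulary size against text length) from a list of tokens. It is not the paper's procedure: the function names are hypothetical, and a plain least-squares fit in log-log space is assumed here only for brevity, although it is known to be a crude estimator for power laws.

```python
import math
from collections import Counter


def fit_loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var


def zipf_exponent(tokens):
    """Estimate alpha in f(r) ~ r^(-alpha) from rank-frequency counts."""
    # Frequencies sorted in descending order; position + 1 is the rank.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return -fit_loglog_slope(ranks, freqs)


def heaps_exponent(tokens):
    """Estimate beta in V(n) ~ n^beta from the vocabulary growth curve."""
    seen = set()
    sizes, vocab = [], []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        sizes.append(n)          # text length so far
        vocab.append(len(seen))  # distinct tokens so far
    return fit_loglog_slope(sizes, vocab)
```

Under these assumptions, one could call both functions on token sequences sampled at several temperatures and compare how well a single exponent describes each curve; a value of beta below 1 corresponds to the sublinear vocabulary growth described above.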

PUBLICATION RECORD

Publication year
2025
Venue
Conference on Empirical Methods in Natural Language Processing
Fields of study
Computer Science
Identifiers
DOI 10.18653/v1/2025.findings-emnlp.837
Source metadata
Semantic Scholar

REFERENCES

States of LLM-generated Texts and Phase Transitions between them (2025)
Qwen2.5 Technical Report (2024)
Autocorrelations Decay in Texts and Applicability Limits of Language Models (2023)
Training language models to follow instructions with human feedback (2022)
Finetuned Language Models Are Zero-Shot Learners (2021)
The Curious Case of Neural Text Degeneration (2019)
Evaluating Computational Language Models with Scaling Properties of Natural Language (2019)
Mutual Information Scaling and Expressive Power of Sequence Models (2019)
Improving Language Understanding by Generative Pre-Training (2018, influential reference)
Natural Language Statistical Features of LSTM-Generated Texts (2018)
Do neural nets learn statistical laws behind natural language? (2017)
Zipf’s word frequency law in natural language: A critical review and future directions (2014)
A scaling law beyond Zipf's law and its relation to Heaps' law (2013)
Modeling Statistical Properties of Written Text (2009)
Power laws, Pareto distributions and Zipf's law (2005)
A Brief History of Generative Models for Power Law and Lognormal Distributions (2004)
The physics of phase transitions: concepts and applications (2002)
Word frequency distributions (2002)
Zipf and Heaps Laws' Coefficients Depend on Language (2001)
A new algorithm for data compression (1994)
I. Quantitative Linguistics (1992)
Adaptive Mixtures of Local Experts (1991, influential reference)
Information retrieval, computational and theoretical aspects (1978)

CITED BY

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment (2026)
Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy (2026)