The frequency distribution of words in human-written texts roughly follows a simple mathematical form known as Zipf’s law. Somewhat less well known is the related Heaps’ law, which describes a sublinear power-law growth of vocabulary size with document size. We study the applicability of Zipf’s and Heaps’ laws to texts generated by large language models (LLMs). We empirically show that Heaps’ and Zipf’s laws hold for LLM-generated texts only in a narrow, model-dependent temperature range. These temperatures have an optimal value close to t = 1 for all the base models except the large Llama models, are higher for instruction-finetuned models, and do not depend on the model size or prompting. This independently confirms the recent discovery of sampling-temperature-dependent phase transitions in LLM-generated texts.
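As a rough illustration of the two laws named in the abstract, the sketch below estimates a Zipf exponent (slope of frequency against rank) and a Heaps exponent (slope of vocabulary size against text length) from a list of tokens. It is a minimal sketch, not the paper's method: the function names are hypothetical, and the plain least-squares fit in log-log space is an assumption chosen for brevity.

```python
import math
from collections import Counter


def fit_loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x) (assumed fitting method)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


def zipf_exponent(tokens):
    """Zipf: frequency ~ rank**(-s). Returns the estimated s."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return -fit_loglog_slope(ranks, freqs)


def heaps_exponent(tokens):
    """Heaps: vocabulary ~ length**beta. Returns the estimated beta."""
    seen = set()
    lengths, vocab_sizes = [], []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        lengths.append(n)
        vocab_sizes.append(len(seen))
    return fit_loglog_slope(lengths, vocab_sizes)
```

Here `tokens` can be any sequence of hashable items, for example whitespace-split words or tokenizer IDs. A Heaps exponent below 1 corresponds to the sublinear vocabulary growth described above.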
Zipf's and Heaps' Laws for Tokens and LLM-generated Texts
Published 2025 in Conference on Empirical Methods in Natural Language Processing
PUBLICATION RECORD
- Publication year
2025
- Venue
Conference on Empirical Methods in Natural Language Processing
- Fields of study
Computer Science
- Source metadata
Semantic Scholar