Pragmatic Constraint on Distributional Semantics
Elizaveta Zhemchuzhina, N. Filippov, Ivan P. Yamshchikov
Published 2022 in arXiv.org
ABSTRACT
This paper studies the limits of language models' statistical learning in the context of Zipf's law. First, we demonstrate that a Zipf-law token distribution emerges irrespective of the chosen tokenization. Second, we show that the Zipf distribution is characterized by two distinct groups of tokens that differ both in frequency and in semantics: tokens with a one-to-one correspondence to a single semantic concept have different statistical properties than semantically ambiguous ones. Finally, we demonstrate how these properties interfere with statistical learning procedures motivated by distributional semantics.
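The abstract's first claim, that a Zipf-like rank-frequency distribution appears regardless of tokenization, can be illustrated with a minimal sketch. This is not the paper's methodology; it is a toy example on a synthetic corpus that estimates the rank-frequency slope in log-log space for two different tokenizations (whitespace words vs. character bigrams) and checks that both curves are decreasing power-law-like.

```python
from collections import Counter
import math

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank).

    A roughly Zipfian distribution yields a negative slope
    (ideal Zipf's law would give a slope near -1).
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Toy corpus (an assumption for illustration, not the paper's data).
text = ("the quick brown fox jumps over the lazy dog " * 50
        + "language models learn token statistics from data " * 30)

# Two different tokenizations of the same text.
word_tokens = text.split()
bigram_tokens = [text[i:i + 2] for i in range(len(text) - 1)]

# Both rank-frequency curves decay, i.e. both slopes are negative.
print(zipf_slope(word_tokens), zipf_slope(bigram_tokens))
```

On real corpora the word-level slope is typically close to -1; the toy corpus here only shows that the decaying rank-frequency shape survives a change of tokenization, which is the qualitative point of the claim.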
PUBLICATION RECORD
- Publication year
2022
- Venue
arXiv.org
- Publication date
2022-11-20
- Fields of study
Mathematics, Linguistics, Computer Science
- Source metadata
Semantic Scholar
REFERENCES
23 references
CITED BY
2 citing papers