Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou
Published 2025 in International Conference on Machine Learning
ABSTRACT
Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insights for tokenizer design, paving the way for more efficient and powerful LLMs.
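To make the decoupling described in the abstract concrete, the sketch below shows one way an over-tokenized input embedding could be wired up: each position's representation is its ordinary 1-gram embedding plus hashed multi-gram embeddings drawn from a much larger input table, while the output vocabulary and prediction head are left unchanged. This is a minimal illustration under stated assumptions, not the authors' implementation; the class name, table sizes, hashing scheme, and n-gram range are all hypothetical choices for the example.

# Minimal sketch (not the paper's code) of an over-encoded input embedding:
# the input vocabulary is enlarged via multi-gram tokens, the output vocabulary is not.
import torch
import torch.nn as nn

class OverEncodedEmbedding(nn.Module):
    """Embeds each position as its 1-gram embedding plus hashed n-gram embeddings."""

    def __init__(self, base_vocab: int, ngram_vocab: int, d_model: int, max_n: int = 3):
        super().__init__()
        self.base_embed = nn.Embedding(base_vocab, d_model)    # ordinary 1-gram table
        self.ngram_embed = nn.Embedding(ngram_vocab, d_model)  # large shared n-gram table
        self.base_vocab = base_vocab
        self.ngram_vocab = ngram_vocab
        self.max_n = max_n

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of 1-gram token ids
        h = self.base_embed(token_ids)
        for n in range(2, self.max_n + 1):
            # Combine each token with its n-1 predecessors into an n-gram id,
            # then hash that id into the (large but finite) n-gram table.
            ngram_ids = token_ids.clone()
            for k in range(1, n):
                shifted = torch.roll(token_ids, shifts=k, dims=1)
                shifted[:, :k] = 0  # positions without a full history fall back to id 0
                ngram_ids = ngram_ids * self.base_vocab + shifted
            h = h + self.ngram_embed(ngram_ids % self.ngram_vocab)
        return h

# Usage: a 32k output vocabulary with a 1M-entry n-gram input table (illustrative sizes).
emb = OverEncodedEmbedding(base_vocab=32_000, ngram_vocab=1_000_000, d_model=512)
x = torch.randint(0, 32_000, (2, 16))
print(emb(x).shape)  # torch.Size([2, 16, 512])

Because only the embedding lookup changes, the added input capacity costs little compute at training or inference time, which is consistent with the abstract's claim of improvement "with no additional cost".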
PUBLICATION RECORD
- Publication year: 2025
- Venue: International Conference on Machine Learning
- Publication date: 2025-01-28
- Fields of study: Computer Science
- Source metadata: Semantic Scholar