Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou
Published 2025 in International Conference on Machine Learning
ABSTRACT
Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insights for tokenizer design, paving the way for more efficient and powerful LLMs.
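To make the decoupling described in the abstract concrete, the sketch below shows one way an over-tokenized input embedding could be wired up: each position's representation is its ordinary 1-gram embedding plus hashed multi-gram embeddings drawn from a much larger input table, while the output vocabulary and prediction head are left unchanged. This is a minimal illustration under stated assumptions, not the authors' implementation; the class name, table sizes, hashing scheme, and n-gram range are all hypothetical choices for the example.

# Minimal sketch (not the paper's code) of an over-encoded input embedding:
# the input vocabulary is enlarged via multi-gram tokens, the output vocabulary is not.
import torch
import torch.nn as nn

class OverEncodedEmbedding(nn.Module):
    """Embeds each position as its 1-gram embedding plus hashed n-gram embeddings."""

    def __init__(self, base_vocab: int, ngram_vocab: int, d_model: int, max_n: int = 3):
        super().__init__()
        self.base_embed = nn.Embedding(base_vocab, d_model)    # ordinary 1-gram table
        self.ngram_embed = nn.Embedding(ngram_vocab, d_model)  # large shared n-gram table
        self.base_vocab = base_vocab
        self.ngram_vocab = ngram_vocab
        self.max_n = max_n

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of 1-gram token ids
        h = self.base_embed(token_ids)
        for n in range(2, self.max_n + 1):
            # Combine each token with its n-1 predecessors into an n-gram id,
            # then hash that id into the (large but finite) n-gram table.
            ngram_ids = token_ids.clone()
            for k in range(1, n):
                shifted = torch.roll(token_ids, shifts=k, dims=1)
                shifted[:, :k] = 0  # positions without a full history fall back to id 0
                ngram_ids = ngram_ids * self.base_vocab + shifted
            h = h + self.ngram_embed(ngram_ids % self.ngram_vocab)
        return h

# Usage: a 32k output vocabulary with a 1M-entry n-gram input table (illustrative sizes).
emb = OverEncodedEmbedding(base_vocab=32_000, ngram_vocab=1_000_000, d_model=512)
x = torch.randint(0, 32_000, (2, 16))
print(emb(x).shape)  # torch.Size([2, 16, 512])

Because only the embedding lookup changes, the added input capacity costs little compute at training or inference time, which is consistent with the abstract's claim of improvement "with no additional cost".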
PUBLICATION RECORD
- Publication year: 2025
- Venue: International Conference on Machine Learning
- Publication date: 2025-01-28
- Fields of study: Computer Science
- Source metadata: Semantic Scholar