Scaling Embedding Layers in Language Models

Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang

Published 2025 in arXiv.org

ABSTRACT

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. After training, the embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency due to the low complexity of embedding lookups. SCONE enables two new scaling strategies: increasing the number of n-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference (in terms of FLOPS and memory). We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference.
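
The inference-time lookup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function name `lookup_contextual_embeddings`, the use of plain Python dictionaries as a stand-in for the off-accelerator table, the toy vectors, and in particular the rule of picking the longest stored n-gram that ends at each position (the abstract does not say how a match is selected).

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
NGram = Tuple[int, ...]


def lookup_contextual_embeddings(
    token_ids: Sequence[int],
    token_table: Dict[int, Vector],
    ngram_table: Dict[NGram, Vector],
    max_n: int = 3,
) -> List[Vector]:
    """Hypothetical helper: for each position, return the vector of the
    longest stored n-gram ending there, else the plain token vector."""
    out: List[Vector] = []
    for i in range(len(token_ids)):
        # Default: the ordinary per-token embedding (vocabulary unchanged).
        chosen = token_table[token_ids[i]]
        # Assumed matching rule: try the longest n-gram first, down to bigrams.
        for n in range(min(max_n, i + 1), 1, -1):
            key = tuple(token_ids[i - n + 1 : i + 1])
            if key in ngram_table:
                chosen = ngram_table[key]
                break
        out.append(chosen)
    return out


# Toy tables; in the setting the abstract describes, the n-gram table would be
# precomputed after training and kept in off-accelerator memory.
token_table = {7: [0.1, 0.0], 8: [0.0, 0.2], 9: [0.3, 0.3]}
ngram_table = {(7, 8): [1.0, 1.0], (7, 8, 9): [2.0, 2.0]}
embeddings = lookup_contextual_embeddings([7, 8, 9, 7], token_table, ngram_table)
print(embeddings)
```

Because the sketch only changes which vector is fed in for each input token, the output vocabulary stays the same size, and each position costs at most `max_n - 1` dictionary lookups regardless of how many n-grams the table holds.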

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-02-03
Fields of study: Computer Science
Identifiers: DOI 10.48550/arXiv.2502.01637; arXiv 2502.01637
Source metadata: Semantic Scholar
