MambaByte: Token-free Selective State Space Model

Junxiong Wang, Tushaar Gangavarapu, J. Yan, Alexander M. Rush

Published 2024 in arXiv.org

ABSTRACT

Token-free language models learn directly from raw bytes and remove the inductive bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences. In this setting, standard autoregressive Transformers scale poorly because the effective memory they require grows with sequence length. The recent development of the Mamba state space model (SSM) offers an appealing alternative with a fixed-size memory state and efficient decoding. We propose MambaByte, a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences. In terms of modeling, we show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks while maintaining the benefits of token-free language models, such as robustness to noise. In terms of efficiency, we develop an adaptation of speculative decoding with tokenized drafting and byte-level verification. This results in a $2.6\times$ inference speedup over the standard MambaByte implementation, with decoding efficiency similar to that of the subword Mamba. These findings establish the viability of SSMs in enabling token-free language modeling.
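
The idea of tokenized drafting with byte-level verification can be illustrated with a toy sketch. This is not the paper's implementation: both "models" below are hypothetical stand-ins (a fixed target string and a drafter that deliberately misspells one word), decoding is greedy, and `verifier_next_byte`, `drafter_next_chunk`, and `speculative_decode` are names invented for this example. It only shows the control flow: a drafter proposes a word-sized chunk of bytes, a byte-level verifier accepts the longest matching prefix, and on a mismatch the verifier's own byte is used instead.

```python
from typing import Tuple

# Hypothetical toy data: the byte sequence the "verifier" would produce greedily.
TARGET = b"state space models read raw bytes"


def verifier_next_byte(prefix: bytes) -> int:
    """Toy byte-level model: greedily continues TARGET, then emits spaces."""
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else ord(" ")


def drafter_next_chunk(prefix: bytes) -> bytes:
    """Toy subword-level drafter: proposes the next word, but misspells 'raw'."""
    rest = TARGET[len(prefix):]
    chunk = rest.split(b" ", 1)[0] + b" " if b" " in rest else rest
    return chunk.replace(b"raw", b"row")


def speculative_decode(max_len: int) -> Tuple[bytes, int]:
    """Draft a chunk, verify it byte by byte, and correct the first mismatch.

    Returns the decoded bytes and the number of draft/verify rounds used.
    """
    out = b""
    rounds = 0
    while len(out) < max_len:
        draft = drafter_next_chunk(out)
        rounds += 1  # assumption: one verifier pass scores a whole draft
        accepted = b""
        for byte in draft:
            if verifier_next_byte(out + accepted) != byte:
                break  # first disagreement: discard the rest of the draft
            accepted += bytes([byte])
        out += accepted
        if len(accepted) < len(draft) or not draft:
            # Mismatch (or empty draft): fall back to the verifier's own byte.
            out += bytes([verifier_next_byte(out)])
    return out[:max_len], rounds


decoded, rounds = speculative_decode(len(TARGET))
print(decoded, rounds, len(TARGET))
```

In this toy setup the output matches what byte-by-byte greedy decoding with the verifier would give, while the number of draft/verify rounds is much smaller than the number of bytes. Any real speedup depends on how often the drafter agrees with the byte-level model and on how cheaply a draft can be verified.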

PUBLICATION RECORD

Publication year
2024
Venue
arXiv.org
Publication date
2024-01-24
Fields of study
Computer Science
Identifiers
DOI 10.48550/arXiv.2401.13660 arXiv 2401.13660
Source metadata
Semantic Scholar

REFERENCES

Speculative Streaming: Fast LLM Inference without Auxiliary Models
2024, cited by this paper
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
2024, cited by this paper
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling
2024, cited by this paper
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
2024, cited by this paper
Simple linear attention language models balance the recall-throughput tradeoff
2024, cited by this paper
Diffusion Models Without Attention
2023, cited by this paper
REST: Retrieval-Based Speculative Decoding
2023, cited by this paper
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
2023, cited by this paper
Online Speculative Decoding
2023, cited by this paper
Accelerating LLM Inference with Staged Speculative Decoding
2023, cited by this paper
Focused Transformer: Contrastive Training for Context Scaling
2023, cited by this paper
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
2023, influential reference
Inference with Reference: Lossless Acceleration of Large Language Models
2023, cited by this paper
Accelerating Large Language Model Decoding with Speculative Sampling
2023, cited by this paper
Resurrecting Recurrent Neural Networks for Long Sequences
2023, cited by this paper
Cascade Speculative Drafting for Even Faster LLM Inference
2023, cited by this paper
Gated Linear Attention Transformers with Hardware-Efficient Training
2023, cited by this paper
Zoology: Measuring and Improving Recall in Efficient Language Models
2023, cited by this paper
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
2023, influential reference
Pretraining Without Attention
2022, cited by this paper
General-purpose, long-context autoregressive modeling with Perceiver AR
2022, cited by this paper
It's Raw! Audio Generation with State-Space Models
2022, influential reference
Block-Recurrent Transformers
2022, cited by this paper
Diagonal State Spaces are as Effective as Structured State Spaces
2022, cited by this paper
OPT: Open Pre-trained Transformer Language Models
2022, cited by this paper
On the Parameterization and Initialization of Diagonal State Space Models
2022, cited by this paper
How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections
2022, cited by this paper
Long Range Language Modeling via Gated State Spaces
2022, cited by this paper
Simplified State Space Layers for Sequence Modeling
2022, cited by this paper
Fast Inference from Transformers via Speculative Decoding
2022, cited by this paper
ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models
2022, cited by this paper
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
2022, cited by this paper
S4ND: Modeling Images and Videos as Multidimensional Signals with State Spaces
2022, cited by this paper
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation
2022, cited by this paper
Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
2021, cited by this paper
Efficiently Modeling Long Sequences with Structured State Spaces
2021, influential reference
Shortformer: Better Language Modeling using Shorter Inputs
2021, cited by this paper
Hierarchical Transformers Are More Efficient Language Models
2021, cited by this paper
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
2021, cited by this paper
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models
2021, influential reference
RoFormer: Enhanced Transformer with Rotary Position Embedding
2021, influential reference
Efficient Content-Based Sparse Attention with Routing Transformers
2020, cited by this paper
CharBERT: Character-aware Pre-trained Language Model
2020, cited by this paper
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
2020, cited by this paper
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
2020, cited by this paper
Language Models are Few-Shot Learners
2020, cited by this paper
Character-Level Translation with Self-attention
2020, cited by this paper
The Curious Case of Neural Text Degeneration
2019, influential reference
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019, cited by this paper
Compressive Transformers for Long-Range Sequence Modelling
2019, influential reference
Neural Machine Translation with Byte-Level Subwords
2019, cited by this paper
Bridging the Gap for Tokenizer-Free Language Models
2019, cited by this paper
Searching for Activation Functions
2018, cited by this paper
A Simple Method for Commonsense Reasoning
2018, cited by this paper
Character-Level Language Modeling with Deeper Self-Attention
2018, cited by this paper
(Preprint)
2018, cited by this paper
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
2018, cited by this paper
Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model
2018, cited by this paper
Attention is All you Need
2017, cited by this paper
Gaussian Error Linear Units (GELUs)
2016, cited by this paper
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
2016, cited by this paper
Neural Machine Translation of Rare Words with Subword Units
2015, cited by this paper
Japanese and Korean voice search
2012, cited by this paper
A Neural Probabilistic Language Model
2003, cited by this paper

CITED BY

ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
2026, cites this paper
You Can Learn Tokenization End-to-End with Reinforcement Learning
2026, cites this paper
TH-Mamba: Spatial-Temporal Correlation Learning for Mamba-Based Talking Head Generation
2026, cites this paper
Unified Packet Compression and Model Adaptation for Integrated Sensing and Multi-Modal Communications
2026, cites this paper
MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance
2026, cites this paper
Proxy Compression for Language Modeling
2026, cites this paper
Rank-Based Modeling for Universal Packets Compression in Multi-Modal Communications
2025, cites this paper
Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
2025, cites this paper
CivicMorph: Generative Modeling for Public Space Form Development
2025, cites this paper
Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking
2025, cites this paper
MambaStyle: Efficient StyleGAN Inversion for Real Image Editing with State-Space Models
2025, cites this paper
Sampling from Your Language Model One Byte at a Time
2025, influential citation
Memba: Membrane-driven Parameter-Efficient Fine-Tuning for Mamba
2025, cites this paper
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
2025, cites this paper
FLEXITOKENS: Flexible Tokenization for Evolving Language Models
2025, cites this paper
Differential Mamba
2025, cites this paper
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
2025, cites this paper
A Comprehensive Survey on Mamba: Architectures, Challenges, and Opportunities
2025, cites this paper
CompletionMamba: Taming State Space Model for Point Cloud Completion
2025, cites this paper
Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs
2025, cites this paper
S2M2ECG: Spatio-temporal bi-directional State Space Model Enabled Multi-branch Mamba for ECG
2025, cites this paper
Bolmo: Byteifying the Next Generation of Language Models
2025, cites this paper
Towards the Machine Translation of Scientific Neologisms
2025, cites this paper
Scaling Embedding Layers in Language Models
2025, cites this paper
Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models
2025, cites this paper
Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence
2025, cites this paper
Position: Prospective of Autonomous Driving - Multimodal LLMs, World Models, Embodied Intelligence, AI Alignment, and Mamba
2025, cites this paper
Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
2024, cites this paper
Selective Attention: Enhancing Transformer through Principled Context Control
2024, cites this paper
Mamba Models a possible replacement for Transformers?
2024, cites this paper
PointMamba: A Simple State Space Model for Point Cloud Analysis
2024, cites this paper
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
2024, cites this paper
The Hidden Attention of Mamba Models
2024, cites this paper
SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces
2024, cites this paper
ZigMa: A DiT-style Zigzag Mamba Diffusion Model
2024, cites this paper
ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
2024, cites this paper
SpaceByte: Towards Deleting Tokenization from Large Language Modeling
2024, influential citation
Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges
2024, cites this paper
Matten: Video Generation with Mamba-Attention
2024, cites this paper
PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning
2024, cites this paper
Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks
2024, cites this paper
Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation
2024, cites this paper
MambaLRP: Explaining Selective State Space Sequence Models
2024, cites this paper
DeciMamba: Exploring the Length Extrapolation Potential of Mamba
2024, cites this paper
Venturing into Uncharted Waters: The Navigation Compass from Transformer to Mamba
2024, cites this paper
Exploring the Capability of Mamba in Speech Applications
2024, cites this paper
MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
2024, cites this paper
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
2024, cites this paper
MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders
2024, cites this paper
Mamba-ST: State Space Model for Efficient Style Transfer
2024, cites this paper
PixelBytes: Catching Unified Embedding for Multimodal Generation
2024, cites this paper
PixelBytes: Catching Unified Representation for Multimodal Generation
2024, influential citation
Quamba: A Post-Training Quantization Recipe for Selective State Space Models
2024, cites this paper
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
2024, cites this paper
Exploring contextual modeling with linear complexity for point cloud segmentation
2024, cites this paper
Mamba-Based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition
2024, cites this paper
Gated Linear Attention Transformers with Hardware-Efficient Training
2023, cites this paper
Theoretical Analysis of the Selection Mechanism in Mamba: Training Dynamics and Generalization
year unknown, cites this paper
Multi-stream Sequence Learning
year unknown, cites this paper
3DET-Mamba: State Space Model for End-to-End 3D Object Detection
year unknown, cites this paper
Sequence Learning from Continuous Streams of Data
year unknown, cites this paper