2Mamba2Furious: Linear in Complexity, Competitive in Accuracy

Published 2026 in Unknown venue

ABSTRACT

Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive and yields lower accuracy than softmax attention. To bridge this accuracy gap, we modify Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental and important components, evaluating which specific choices make it most accurate. Starting from this simplified variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state. The resulting method, which we call 2Mamba, is nearly as accurate as softmax attention yet much more memory efficient at long context lengths. We also investigate elements of Mamba-2 that help surpass softmax attention accuracy. Code is provided for all our experiments.
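
To make the memory claim concrete, the following is a minimal NumPy sketch of generic causal linear attention with a scalar decay mask, in the spirit of the A-mask that Mamba-2 applies. It is not the 2Mamba method and not the authors' code: the function names (`decay_mask`, `parallel_form`, `recurrent_form`), the single-head layout, and the per-step scalar decay `a` are assumptions made for illustration. The sketch shows that the masked matrix form and the recurrent form give the same output, while the recurrent form keeps only a fixed-size state instead of an n-by-n score matrix.

```python
import numpy as np

def decay_mask(a):
    """Lower-triangular mask with L[i, j] = a[j+1] * ... * a[i] for i >= j.

    `a` holds one decay value in (0, 1] per time step (an illustrative assumption).
    """
    log_cum = np.cumsum(np.log(a))
    diff = log_cum[:, None] - log_cum[None, :]  # log of the product a[j+1..i]
    # Zero the upper triangle before exp to avoid overflow, then mask it out.
    return np.tril(np.exp(np.tril(diff)))

def parallel_form(q, k, v, a):
    """Masked matrix form: materializes an (n, n) score matrix."""
    return (decay_mask(a) * (q @ k.T)) @ v

def recurrent_form(q, k, v, a):
    """Recurrent form: carries only a (d_k, d_v) state, independent of n."""
    n, d_k = k.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.empty((n, d_v))
    for t in range(n):
        state = a[t] * state + np.outer(k[t], v[t])  # decay old state, add new pair
        out[t] = q[t] @ state
    return out

rng = np.random.default_rng(0)
n, d_k, d_v = 16, 4, 5
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))
v = rng.normal(size=(n, d_v))
a = rng.uniform(0.5, 1.0, size=n)

same = np.allclose(parallel_form(q, k, v, a), recurrent_form(q, k, v, a))
```

Under these assumptions the recurrent state has size d_k × d_v regardless of sequence length, which is the source of the long-context memory advantage the abstract refers to; softmax attention has no equivalent fixed-size state.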

PUBLICATION RECORD

Publication year
2026
Venue
Unknown venue
Publication date
2026-02-19
Fields of study
Computer Science
Identifiers
arXiv 2602.17363
Source metadata
Semantic Scholar

REFERENCES

On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
2025, influential reference
Forgetting Transformer: Softmax Attention with a Forget Gate
2025, cited by this paper
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
2024, influential reference
Gated Delta Networks: Improving Mamba2 with Delta Rule
2024, cited by this paper
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
2024, cited by this paper
Parallelizing Linear Transformers with the Delta Rule over Sequence Length
2024, influential reference
SlimPajama-DC: Understanding Data Combinations for LLM Training
2023, cited by this paper
Gated Linear Attention Transformers with Hardware-Efficient Training
2023, cited by this paper
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
2023, cited by this paper
Retentive Network: A Successor to Transformer for Large Language Models
2023, cited by this paper
Hierarchical Text-Conditional Image Generation with CLIP Latents
2022, cited by this paper
cosFormer: Rethinking Softmax in Attention
2022, cited by this paper
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
2022, influential reference
Robust Speech Recognition via Large-Scale Weak Supervision
2022, cited by this paper
RT-1: Robotics Transformer for Real-World Control at Scale
2022, cited by this paper
Decision Transformer: Reinforcement Learning via Sequence Modeling
2021, cited by this paper
Efficiently Modeling Long Sequences with Structured State Spaces
2021, cited by this paper
Highly accurate protein structure prediction with AlphaFold
2021, cited by this paper
Linear Transformers Are Secretly Fast Weight Programmers
2021, cited by this paper
Rethinking Attention with Performers
2020, cited by this paper
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
2020, cited by this paper
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020, cited by this paper
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
2020, cited by this paper
Online normalizer calculation for softmax
2018, cited by this paper
Attention is All you Need
2017, influential reference
Neural Machine Translation by Jointly Learning to Align and Translate
2014, influential reference
A New Approach to Linear Filtering and Prediction Problems
2002, cited by this paper
Learning representations by back-propagating errors
1986, influential reference
