Infinite Self-Attention

Published 2026 in Unknown venue

ABSTRACT

The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to per-head dimension dh (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
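The link between the discounted Neumann series and the absorbing-chain fundamental matrix can be checked numerically on a toy example. The sketch below is illustrative only and is not the paper's implementation. It assumes a single row-stochastic softmax attention matrix `A` reused at every hop, a discount `gamma` in (0, 1), and a series that starts at the zero-hop term; the names `softmax_rows`, `discounted_neumann`, and `gamma` are hypothetical. Under those assumptions, `gamma * A` is the transient block of a Markov chain that is absorbed with probability `1 - gamma` at each step, so the series sums to `(I - gamma * A)^{-1}`, and its column sums give a Katz-style score: the expected number of visits to each token before absorption.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def discounted_neumann(A, gamma, hops):
    """Truncated series sum_{k=0..hops} (gamma * A)^k."""
    n = A.shape[0]
    term = np.eye(n)           # (gamma * A)^0
    total = np.eye(n)
    for _ in range(hops):
        term = gamma * (term @ A)   # next power of gamma * A
        total += term
    return total

rng = np.random.default_rng(0)
N, d = 6, 4                    # toy sizes (assumed for illustration)
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
A = softmax_rows(Q @ K.T / np.sqrt(d))   # row-stochastic token graph
gamma = 0.5                    # assumed discount; must satisfy 0 < gamma < 1

F_series = discounted_neumann(A, gamma, hops=60)
F_closed = np.linalg.inv(np.eye(N) - gamma * A)   # fundamental matrix

# Expected visits to each token before absorption, from a uniform start.
centrality = F_closed.sum(axis=0) / N

print("series matches closed form:", np.allclose(F_series, F_closed))
print("row sums (each 1 / (1 - gamma)):", F_closed.sum(axis=1))
print("token centrality:", centrality)
```

Because every row of `A` sums to 1, each row of the fundamental matrix sums to `1 / (1 - gamma)`, the expected walk length before absorption. This dense version costs O(N^2) memory and is only useful as a reference check; the linear-time variant described in the abstract avoids forming `A` at all.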

PUBLICATION RECORD

Publication year: 2026
Venue: Unknown venue
Publication date: 2026-02-26
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 2603.00175
External record: Semantic Scholar
Source metadata: Semantic Scholar
