Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning
Gautam Goel, Mahdi Soltanolkotabi, Peter L. Bartlett
Published 2026 in Unknown venue
ABSTRACT
We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel "structure-aware" variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over standard gradient descent, including a preconditioner and regularizer that help avoid spurious stationary points, and a data-dependent spectral initialization whose parameters lie near the manifold of global minima with high probability.
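The abstract describes the setting and the algorithmic ingredients only at a high level; the authors' exact parameterization, preconditioner, regularizer, and spectral initialization are not reproduced in this record. As a rough illustration of the setup, the following Python/JAX sketch trains a single softmax self-attention layer on in-context linear regression with an ordinary gradient-descent loop, leaving placeholder slots where a preconditioner and a data-dependent initialization would enter. The token parameterization, the ridge coefficient, the identity preconditioner, and all names in the snippet are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only (assumptions, not the authors' algorithm):
# in-context linear regression solved by a single softmax self-attention
# layer, trained by gradient descent with a placeholder preconditioner.
import jax
import jax.numpy as jnp

d, n_ctx, n_tasks = 5, 20, 64      # input dim, context length, tasks per batch

def sample_task(key):
    """One task: context pairs (x_i, y_i) with y_i = <w, x_i>, plus a query."""
    kw, kx, kq = jax.random.split(key, 3)
    w = jax.random.normal(kw, (d,))
    X = jax.random.normal(kx, (n_ctx, d))
    x_q = jax.random.normal(kq, (d,))
    return X, X @ w, x_q, x_q @ w

def predict(params, X, y, x_q):
    """Softmax attention over context tokens z_i = [x_i; y_i]; query [x_q; 0]."""
    W, v = params                                  # attention matrix, value vector
    Z = jnp.concatenate([X, y[:, None]], axis=1)   # (n_ctx, d+1)
    z_q = jnp.concatenate([x_q, jnp.zeros(1)])
    alpha = jax.nn.softmax(Z @ W @ z_q / jnp.sqrt(d + 1.0))
    return alpha @ (Z @ v)                         # scalar prediction for the query

def loss(params, keys, lam=1e-3):
    """Mean squared query error plus a small ridge term (hypothetical choice)."""
    def one(k):
        X, y, x_q, y_q = sample_task(k)
        return (predict(params, X, y, x_q) - y_q) ** 2
    W, v = params
    return jnp.mean(jax.vmap(one)(keys)) + lam * (jnp.sum(W ** 2) + jnp.sum(v ** 2))

key = jax.random.PRNGKey(0)
k_w, k_v, key = jax.random.split(key, 3)
# Small random initialization; the paper's data-dependent spectral
# initialization would replace this step.
params = (0.01 * jax.random.normal(k_w, (d + 1, d + 1)),
          0.01 * jax.random.normal(k_v, (d + 1,)))

eta = 0.5
grad_fn = jax.jit(jax.grad(loss))
for step in range(200):
    key, sub = jax.random.split(key)
    task_keys = jax.random.split(sub, n_tasks)
    gW, gv = grad_fn(params, task_keys)
    # Placeholder preconditioner (identity): the paper's structure-aware
    # preconditioning of the attention-matrix gradient would replace P_inv.
    P_inv = jnp.eye(d + 1)
    params = (params[0] - eta * P_inv @ gW @ P_inv,
              params[1] - eta * gv)
    if step % 50 == 0:
        print(step, float(loss(params, task_keys)))
```

In the paper's terminology, the structure-aware modifications would replace the identity preconditioner and the small random initialization above; the snippet only shows where those pieces plug into an otherwise standard training loop.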
PUBLICATION RECORD
- Publication year: 2026
- Venue: Unknown venue
- Publication date: 2026-03-02
- Fields of study: Mathematics, Computer Science
- Source metadata: Semantic Scholar