Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, S. Kwon, Dongsoo Lee

Published 2026 in Unknown venue

ABSTRACT

Transformer attention is typically implemented using softmax normalization, which constrains each query's attention weights to sum to one. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that applies an input-dependent scaling factor and a corresponding bias term to the softmax-normalized attention weights. This design relaxes the strict normalization constraint while still aggregating value representations, allowing the model to adjust both the relative distribution and the overall scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.
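
The abstract does not give the exact parameterization, so the following is only a minimal sketch of one possible reading: the softmax weights p are replaced by alpha * p + beta, where alpha and beta depend on the query. The sigmoid gate for alpha, the uniform spreading of beta over keys, and the names w_scale and w_bias are all assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def affine_scaled_attention(q, keys, values, w_scale, w_bias):
    """Single-query, single-head attention with affine-scaled weights.

    q: query vector (length d); keys: n vectors of length d;
    values: n vectors of length d_v.
    w_scale, w_bias: hypothetical projection vectors (length d) that
    produce an input-dependent scale and bias from the query.
    """
    d = len(q)
    n = len(keys)
    # Standard scaled dot-product scores and softmax weights (sum to 1).
    scores = [dot(q, k) / math.sqrt(d) for k in keys]
    p = softmax(scores)
    # Assumption: scale is a sigmoid gate in (0, 1) computed from the query.
    alpha = 1.0 / (1.0 + math.exp(-dot(q, w_scale)))
    # Assumption: bias is a query-dependent scalar spread evenly over keys.
    beta = dot(q, w_bias) / n
    # Affine transform of the weights; they no longer need to sum to 1.
    weights = [alpha * pi + beta for pi in p]
    d_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return out, weights
```

Under these assumptions the weights sum to alpha + n * beta rather than 1, which is the sense in which the unit-sum constraint is relaxed; setting alpha = 1 and beta = 0 recovers standard softmax attention.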

PUBLICATION RECORD

Publication year
2026
Venue
Unknown venue
Publication date
2026-02-26
Fields of study
Computer Science
Identifiers
arXiv 2602.23057
Source metadata
Semantic Scholar

REFERENCES

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (2025, influential reference)
gpt-oss-120b & gpt-oss-20b Model Card (2025, influential reference)
What are you sinking? A geometric approach on attention sink (2025, cited by this paper)
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective (2025, cited by this paper)
In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness (2024, cited by this paper)
Bridging the Divide: Reconsidering Softmax and Linear Attention (2024, cited by this paper)
When Attention Sink Emerges in Language Models: An Empirical View (2024, cited by this paper)
Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling (2024, cited by this paper)
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks (2023, cited by this paper)
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes (2023, cited by this paper)
Approximation Rate of the Transformer Architecture for Sequence Modeling (2023, cited by this paper)
Efficient Streaming Language Models with Attention Sinks (2023, cited by this paper)
Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention (2023, cited by this paper)
Normalized Attention Without Probability Cage (2020, cited by this paper)
HellaSwag: Can a Machine Really Finish Your Sentence? (2019, cited by this paper)
PIQA: Reasoning about Physical Commonsense in Natural Language (2019, cited by this paper)
An Adversarial Winograd Schema Challenge at Scale (2019, cited by this paper)
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions (2019, cited by this paper)
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge (2018, cited by this paper)
Attention Is All You Need (2017, cited by this paper)
Effective Approaches to Attention-based Neural Machine Translation (2015, cited by this paper)
Neural Machine Translation by Jointly Learning to Align and Translate (2014, cited by this paper)
