Why are Sensitive Functions Hard for Transformers?

Published 2024 in Annual Meeting of the Association for Computational Linguistics

ABSTRACT

Empirical studies have identified a range of learnability biases and limitations of transformers, including a persistent difficulty in learning simple formal languages such as PARITY and a bias towards low-degree functions. However, theoretical understanding remains limited, with existing expressiveness theory either overpredicting or underpredicting realistic learning abilities. We prove that, under the transformer architecture, the loss landscape is constrained by the input-space sensitivity: transformers whose output is sensitive to many parts of the input string inhabit isolated points in parameter space, leading to a low-sensitivity bias in generalization. We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers, such as their generalization bias towards low sensitivity and low degree, and their difficulty in length generalization for PARITY. These results show that understanding transformers' inductive biases requires studying not just their in-principle expressivity, but also their loss landscape.
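To make the notion of input-space sensitivity concrete, the sketch below computes average sensitivity as it is usually defined in the analysis of Boolean functions: the number of single-bit flips that change the output, averaged over all inputs. This is an illustrative assumption, and the paper's exact definition may differ. The function names (`average_sensitivity`, `parity`, `majority`, `first_bit`) are hypothetical helpers, not taken from the paper.

```python
from itertools import product


def average_sensitivity(f, n):
    """Average, over all 2**n bit strings, of the number of
    single-bit flips that change the value of f."""
    total = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            # Flip bit i and check whether the output changes.
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(flipped) != fx:
                total += 1
    return total / 2 ** n


def parity(x):
    # 1 if the number of ones is odd; every bit flip changes the output.
    return sum(x) % 2


def majority(x):
    # 1 if more than half of the bits are ones.
    return int(2 * sum(x) > len(x))


def first_bit(x):
    # Depends on a single input position only.
    return x[0]


if __name__ == "__main__":
    n = 5
    for name, f in [("parity", parity), ("majority", majority), ("first_bit", first_bit)]:
        print(name, average_sensitivity(f, n))
    # parity 5.0, majority 1.875, first_bit 1.0
```

For length 5, PARITY reaches the maximum possible value of 5, since flipping any bit flips the output, while majority and a single-bit function stay well below it. This is the sense in which PARITY is a highly sensitive function.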

PUBLICATION RECORD

Publication year
2024
Venue
Annual Meeting of the Association for Computational Linguistics
Publication date
2024-02-15
Fields of study
Computer Science
Identifiers
DOI: 10.48550/arXiv.2402.09963; arXiv: 2402.09963
Source metadata
Semantic Scholar

REFERENCES

The Expressive Power of Transformers with Chain of Thought
2024, cited by this paper
Representational Strengths and Limitations of Transformers
2023, cited by this paper
A modern look at the relationship between sharpness and generalization
2023, cited by this paper
Randomized Positional Encodings Boost Length Generalization of Transformers
2023, cited by this paper
Towards Revealing the Mystery behind Chain of Thought: a Theoretical Perspective
2023, cited by this paper
Average-Hard Attention Transformers are Constant-Depth Uniform Threshold Circuits
2023, influential reference
Transformers as Algorithms: Generalization and Stability in In-context Learning
2023, influential reference
Generalization on the Unseen, Logic Reasoning and Degree Curriculum
2023, influential reference
Tighter Bounds on the Expressivity of Transformer Encoders
2023, cited by this paper
What Algorithms can Transformers Learn? A Study in Length Generalization
2023, cited by this paper
Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability
2022, cited by this paper
On the Maximum Hessian Eigenvalue and Generalization
2022, cited by this paper
Overcoming a Theoretical Limitation of Self-Attention
2022, influential reference
On Layer Normalizations and Residual Connections in Transformers
2022, cited by this paper
Formal Language Recognition by Hard Attention Transformers: Perspectives from Circuit Complexity
2022, cited by this paper
A Logic for Expressing Log-Precision Transformers
2022, cited by this paper
Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions
2022, influential reference
The Parallelism Tradeoff: Limitations of Log-Precision Transformers
2022, cited by this paper
How Does Sharpness-Aware Minimization Minimize Sharpness?
2022, cited by this paper
Inductive Biases and Variable Creation in Self-Attention Mechanisms
2021, influential reference
Sensitivity as a Complexity Measure for Sequence Classification Tasks
2021, cited by this paper
Self-Attention Networks Can Process Bounded Hierarchical Languages
2021, cited by this paper
Thinking Like Transformers
2021, influential reference
Spectral Bias in Practice: The Role of Function Frequency in Generalization
2021, cited by this paper
Saturated Transformers are Constant-Depth Threshold Circuits
2021, cited by this paper
Volume 35
2021, cited by this paper
On the Ability and Limitations of Transformers to Recognize Formal Languages
2020, influential reference
Sharpness-Aware Minimization for Efficiently Improving Generalization
2020, cited by this paper
Are Transformers universal approximators of sequence-to-sequence functions?
2019, cited by this paper
Fantastic Generalization Measures and Where to Find Them
2019, cited by this paper
PyTorch: An Imperative Style, High-Performance Deep Learning Library
2019, cited by this paper
Theoretical Limitations of Self-Attention in Neural Sequence Models
2019, influential reference
On the Spectral Bias of Neural Networks
2018, cited by this paper
Decoupled Weight Decay Regularization
2017, cited by this paper
Attention is All you Need
2017, cited by this paper
Layer Normalization
2016, cited by this paper
Boolean Function Complexity Advances and Frontiers
2012, cited by this paper
Analysis of Boolean Functions
2012, influential reference
Concise Formulas for the Area and Volume of a Hyperspherical Cap
2011, cited by this paper
Variations on the Sensitivity Conjecture
2010, cited by this paper
A Brief Introduction to Fourier Analysis on the Boolean Cube
2008, cited by this paper
The influence of variables on Boolean functions
1988, cited by this paper
Computational limitations of small-depth circuits
1987, cited by this paper

CITED BY

Context-Free Recognition with Transformers
2026, cites this paper
The Expressive Limits of Diagonal SSMs for State-Tracking
2026, cites this paper
Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth
2026, cites this paper
Unlearnable phases of matter
2026, cites this paper
Noise Stability of Transformer Models
2026, cites this paper
Trapped by simplicity: When Transformers fail to learn from noisy features
2026, cites this paper
Parity, Sensitivity, and Transformers
2026, cites this paper
On the Spatiotemporal Dynamics of Generalization in Neural Networks
2026, influential citation
No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs
2026, cites this paper
Rethinking Memorization Measures and their Implications in Large Language Models
2025, cites this paper
Lost in Transmission: When and Why LLMs Fail to Reason Globally
2025, cites this paper
The Counting Power of Transformers
2025, cites this paper
Continuity and Isolation Lead to Doubts or Dilemmas in Large Language Models
2025, cites this paper
ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs
2025, cites this paper
Minimalist Softmax Attention Provably Learns Constrained Boolean Functions
2025, cites this paper
Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities
2025, cites this paper
Learning Moderately Input-Sensitive Functions: A Case Study in QR Code Decoding
2025, cites this paper
Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic
2025, cites this paper
Parity Requires Unified Input Dependence and Negative Eigenvalues in SSMs
2025, cites this paper
Is In-Context Learning Learning?
2025, cites this paper
Hierarchical Resolution Transformers: A Wavelet-Inspired Architecture for Multi-Scale Language Understanding
2025, cites this paper
The Transformer Cookbook
2025, cites this paper
On the Limitations and Capabilities of Position Embeddings for Length Generalization
2025, cites this paper
On the Reasoning Abilities of Masked Diffusion Language Models
2025, cites this paper
Benefits and Limitations of Communication in Multi-Agent Reasoning
2025, cites this paper
How do autoregressive transformers solve full addition?
2025, cites this paper
Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently
2025, cites this paper
Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers
2025, influential citation
Emergent Stack Representations in Modeling Counter Languages Using Transformers
2025, cites this paper
Provably Overwhelming Transformer Models with Designed Inputs
2025, cites this paper
You Do Not Fully Utilize Transformer's Representation Capacity
2025, cites this paper
The Role of Sparsity for Length Generalization in Transformers
2025, cites this paper
Machine Learning meets Algebraic Combinatorics: A Suite of Datasets Capturing Research-level Conjecturing Ability in Pure Mathematics
2025, cites this paper
Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More
2025, influential citation
Unique Hard Attention: A Tale of Two Sides
2025, cites this paper
Geometric Generality of Transformer-Based Gröbner Basis Computation
2025, cites this paper
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
2025, cites this paper
On the Pathological Path-star Task for Language Models (Extended Abstract)
2024, cites this paper
Transformers Learn Low Sensitivity Functions: Investigations and Implications
2024, cites this paper
The Mystery of the Pathological Path-star Task for Language Models
2024, cites this paper
From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency
2024, influential citation
Extracting Moore Machines from Transformers Using Queries and Counterexamples
2024, cites this paper
Training the Untrainable: Introducing Inductive Bias via Representational Alignment
2024, cites this paper
Training Neural Networks as Recognizers of Formal Languages
2024, influential citation
Machine Learning meets Algebraic Combinatorics: A Suite of Benchmark Datasets to Accelerate AI for Mathematics Research
2024, cites this paper
Reorganizing attention-space geometry with expressive attention
2024, cites this paper
How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad
2024, cites this paper
The Expressive Capacity of State Space Models: A Formal Language Perspective
2024, cites this paper
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
2024, cites this paper
Training the Untrainable: Introducing Inductive Bias via Representational Alignment
year unknown, cites this paper
The Acquisition Process
year unknown, cites this paper
Excessive Supervision and Shortcuts Prevent In-domain Learning of Trivial Graph Search
year unknown, cites this paper
Rethinking Memorization Measures in LLMs: Recollection vs. Counterfactual vs. Contextual Memorization
year unknown, cites this paper
Machine Learning meets Algebraic Combinatorics: A Suite of Datasets to Accelerate AI for Mathematics Research
year unknown, cites this paper