On the Expressive Power of Transformers for Maxout Networks and Continuous Piecewise Linear Functions

Published 2026 in Unknown venue

ABSTRACT

Transformer networks have achieved remarkable empirical success across a wide range of applications, yet their theoretical expressive power remains insufficiently understood. In this paper, we study the expressive capabilities of Transformer architectures. We first establish an explicit approximation of maxout networks by Transformer networks while preserving comparable model complexity. As a consequence, Transformers inherit the universal approximation capability of ReLU networks under similar complexity constraints. Building on this connection, we develop a framework to analyze the approximation of continuous piecewise linear functions by Transformers and quantitatively characterize their expressivity via the number of linear regions, which grows exponentially with depth. Our analysis establishes a theoretical bridge between approximation theory for standard feedforward neural networks and Transformer architectures. It also yields structural insights into Transformers: self-attention layers implement max-type operations, while feedforward layers realize token-wise affine transformations.
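The relationship between maxout units and max-type operations mentioned in the abstract can be illustrated with a small sketch. It is not the paper's construction: it assumes, purely for illustration, that a softmax-weighted average with a large inverse temperature stands in for the max operation an attention layer would implement. The function names and the parameter `beta` are hypothetical. A maxout unit computes the maximum of k affine functions of its input; the softmax-weighted average of those same k values never exceeds that maximum and falls short of it by at most log(k)/beta.

```python
import math


def affine_scores(x, weights, biases):
    """Evaluate k affine functions w_j . x + b_j at the input x."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def maxout_unit(x, weights, biases):
    """Maxout unit: the exact maximum of the k affine functions."""
    return max(affine_scores(x, weights, biases))


def softmax_weighted_value(scores, beta):
    """Softmax-weighted average of the scores, with the scores themselves
    as logits (scaled by beta). Illustrative stand-in for an attention-style
    max; assumed here, not taken from the paper."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(beta * (s - m)) for s in scores]
    z = sum(exps)
    return sum(e * s for e, s in zip(exps, scores)) / z


def attention_style_maxout(x, weights, biases, beta):
    """Approximate the maxout unit by a softmax-weighted average."""
    return softmax_weighted_value(affine_scores(x, weights, biases), beta)


if __name__ == "__main__":
    # Hypothetical example values chosen only for illustration.
    x = [1.0, -2.0]
    weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    biases = [0.0, 0.5, 0.25]
    exact = maxout_unit(x, weights, biases)
    for beta in (1.0, 10.0, 100.0):
        approx = attention_style_maxout(x, weights, biases, beta)
        print(beta, exact - approx, math.log(len(biases)) / beta)
```

Each printed row shows the gap between the exact maxout value and the softmax-based approximation next to the bound log(k)/beta, so the gap can be seen to shrink as `beta` grows.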

PUBLICATION RECORD

Publication year
2026
Venue
Unknown venue
Publication date
2026-03-03
Fields of study
Mathematics, Computer Science
Identifiers
arXiv 2603.03084
Source metadata
Semantic Scholar


