Are Sixteen Heads Really Better than One?

Published 2019 in Neural Information Processing Systems

ABSTRACT

Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art NLP models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention "head" potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom. Finally, we analyze the results with respect to which parts of the model are more reliant on having multiple heads, and provide precursory evidence that training dynamics play a role in the gains provided by multi-head attention.
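
As an illustration of the head removal the abstract describes, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a head is removed by multiplying its output by a 0/1 gate before the heads are summed. The function name, the `head_mask` argument, the weight layout, and the random weights are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_multi_head_attention(x, wq, wk, wv, wo, head_mask):
    """Self-attention over x with one 0/1 gate per head (hypothetical helper).

    x:          (seq_len, d_model) input vectors
    wq, wk, wv: (n_heads, d_model, d_head) per-head projections
    wo:         (n_heads, d_head, d_model) per-head output projections
    head_mask:  (n_heads,) gate values; 0 removes a head, 1 keeps it
    """
    q = np.einsum("sd,hde->hse", x, wq)
    k = np.einsum("sd,hde->hse", x, wk)
    v = np.einsum("sd,hde->hse", x, wv)
    # Scaled dot-product attention weights, computed independently per head.
    scores = np.einsum("hse,hte->hst", q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    per_head = np.einsum("hst,hte->hse", attn, v)
    # Project each head back to d_model, then gate and sum over heads.
    out_per_head = np.einsum("hse,hed->hsd", per_head, wo)
    return (head_mask[:, None, None] * out_per_head).sum(axis=0)


# Random, untrained weights: for checking shapes and gating only.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 5, 16, 4, 4
x = rng.normal(size=(seq_len, d_model))
wq, wk, wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
wo = rng.normal(size=(n_heads, d_head, d_model))

# All heads active, then only the first head active.
full = masked_multi_head_attention(x, wq, wk, wv, wo, np.ones(n_heads))
single = masked_multi_head_attention(
    x, wq, wk, wv, wo, np.array([1.0, 0.0, 0.0, 0.0])
)
```

Because the gated per-head outputs are summed, an all-ones mask reproduces ordinary multi-head attention and a one-hot mask leaves a single head active. Measuring the effect on accuracy would require a trained model and an evaluation set, which this sketch does not include.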

PUBLICATION RECORD

Publication year: 2019
Venue: Neural Information Processing Systems
Publication date: 2019-05-01
Fields of study: Computer Science
Identifiers: arXiv 1905.10650
Source metadata: Semantic Scholar

REFERENCES

Understand in 5 Minutes!? A Quick Skim of Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2020; influential reference)
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context (2019; cited by this paper)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019; cited by this paper)
compare-mt: A Tool for Holistic Comparison of Language Generation Systems (2019; cited by this paper)
Language Models are Unsupervised Multitask Learners (2019; cited by this paper)
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned (2019; cited by this paper)
An Analysis of Encoder Representations in Transformer-Based Machine Translation (2018; cited by this paper)
Linguistically-Informed Self-Attention for Semantic Role Labeling (2018; cited by this paper)
Neural Network Acceptability Judgments (2018; cited by this paper)
Scaling Neural Machine Translation (2018; influential reference)
Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures (2018; cited by this paper)
MTNT: A Testbed for Machine Translation of Noisy Text (2018; cited by this paper)
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (2017; cited by this paper)
Opening the Black Box of Deep Neural Networks via Information (2017; cited by this paper)
Attention is All you Need (2017; influential reference)
Weighted Transformer Network for Machine Translation (2017; cited by this paper)
A Deep Reinforced Model for Abstractive Summarization (2017; cited by this paper)
DiSAN: Directional Self-Attention Network for RNN/CNN-free Language Understanding (2017; cited by this paper)
Pruning Convolutional Neural Networks for Resource Efficient Inference (2016; influential reference)
Sequence-Level Knowledge Distillation (2016; cited by this paper)
A Decomposable Attention Model for Natural Language Inference (2016; cited by this paper)
Pruning Filters for Efficient ConvNets (2016; cited by this paper)
Compression of Neural Machine Translation Models via Pruning (2016; cited by this paper)
Layer-Wise Relevance Propagation for Neural Networks with Local Renormalization Layers (2016; cited by this paper)
Long Short-Term Memory-Networks for Machine Reading (2016; cited by this paper)
Learning both Weights and Connections for Efficient Neural Network (2015; cited by this paper)
Auto-Sizing Neural Networks: With Applications to n-gram Language Models (2015; cited by this paper)
Effective Approaches to Attention-based Neural Machine Translation (2015; cited by this paper)
Structured Pruning of Deep Convolutional Neural Networks (2015; cited by this paper)
Report on the 11th IWSLT evaluation campaign (2014; cited by this paper)
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation (2014; cited by this paper)
Neural Machine Translation by Jointly Learning to Align and Translate (2014; cited by this paper)
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank (2013; cited by this paper)
Moses: Open Source Toolkit for Statistical Machine Translation (2007; influential reference)
Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Lattice Decoding (2006; cited by this paper)
Automatically Constructing a Corpus of Sentential Paraphrases (2005; cited by this paper)
Statistical Significance Tests for Machine Translation Evaluation (2004; cited by this paper)
Second Order Derivatives for Network Pruning: Optimal Brain Surgeon (1992; cited by this paper)
Optimal Brain Damage (1989; cited by this paper)

CITED BY

POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models (2026; cites this paper)
Multi-Head Attention Is a Multi-Player Game (2026; cites this paper)
Entropy Reveals Block Importance in Masked Self-Supervised Vision Transformers (2026; cites this paper)
FlattenGPT: Depth Compression for Transformer with Layer Flattening (2026; cites this paper)
Lightweight and Post-Training Structured Pruning for On-Device Large Language Models (2026; cites this paper)
Elastic Spectral State Space Models for Budgeted Inference (2026; cites this paper)
Multi-task non-contact ballistocardiogram-based vital signs monitoring in acupuncture (2026; cites this paper)
From independent patches to coordinated attention: Controlling information flow in vision transformers (2026; cites this paper)
Compressed Sensing for Capability Localization in Large Language Models (2026; cites this paper)
A Differentiable Gating Mechanism for DETR: Improving Attention Efficiency in Real-Time Road Anomaly Detection (2026; cites this paper)
Entropy-Guided Condensing for Vision Transformer (2026; cites this paper)
Dynamic spectral weighting in CausalSelfAttention: Enhancing transformer performance through frequency-based head modulation (2026; cites this paper)
SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis (2026; cites this paper)
ShapLoRA: Allocation of Low-rank Adaption on Large Language Models via Shapley Value Inspired Importance Estimation (2026; cites this paper)
Interpreting Transformers Through Attention Head Intervention (2026; influential citation)
Low-Rank Key Value Attention (2026; influential citation)
Sparse or Dense? A Mechanistic Estimation of Computation Density in Transformer-based LLMs (2026; cites this paper)
Greedy-Gnorm: A Gradient Matrix Norm-Based Alternative to Attention Entropy for Head Pruning (2026; cites this paper)
COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression (2026; cites this paper)
Specialization of softmax attention heads: insights from the high-dimensional single-location model (2026; cites this paper)
Layer-dependent dynamic spectral weighting for efficient transformer models (2026; cites this paper)
Retrieval Heads are Dynamic (2026; cites this paper)
Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals (2026; cites this paper)
IG-3D: Integrated-Gradients 3D Optimization for Private Transformer Inference (2026; cites this paper)
Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel (2026; cites this paper)
From Mice to Trains: Amortized Bayesian Inference on Graph Data (2026; cites this paper)
Free energy of neural network can predict accuracy after pruning (2026; cites this paper)
Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis (2026; cites this paper)
TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026; cites this paper)
Transformer-based intelligent detection model for early dental caries in panoramic radiographs (2026; cites this paper)
A singular learning theory for unified large language model pruning (2026; cites this paper)
Spectral Archaeology: The Causal Topology of Model Evolution (2026; cites this paper)
Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations (2026; cites this paper)
Pruning Attention Heads Based on Semantic and Code Structure for Smart Contract Vulnerability Detection (2026; cites this paper)
SEGA: Selective cross-lingual representation via sparse guided attention for low-resource multilingual named entity recognition (2026; cites this paper)
Component-Aware Pruning Framework for Neural Network Controllers via Gradient-Based Importance Estimation (2026; cites this paper)
Universal Redundancies in Time Series Foundation Models (2026; influential citation)
Every Bit Counts: A Theoretical Study of Precision-Expressivity Tradeoffs in Quantized Transformers (2026; cites this paper)
Do Multilingual LLMs have specialized language heads? (2026; cites this paper)
Lightweight plant phenotypic feature extraction via transferable attention head pruning in Vision Transformers (2026; cites this paper)
ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting (2026; cites this paper)
Spectral Attention Steering for Prompt Highlighting (2026; cites this paper)
Physics of generative AI’s atom: Repetition, bias, and beyond (2026; cites this paper)
GRAIL: Post-hoc Compensation by Linear Reconstruction for Compressed Networks (2026; cites this paper)
The Anxiety of Influence: Bloom Filters in Transformer Attention Heads (2026; cites this paper)
Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning (2026; cites this paper)
ACL: Aligned Contrastive Learning Improves BERT and Multi-exit BERT Fine-tuning (2026; cites this paper)
Multi-Scale Manifold Alignment for Interpreting Large Language Models: A Unified Information-Geometric Framework (2025; cites this paper)
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models (2025; cites this paper)
Sparsified State-Space Models are Efficient Highway Networks (2025; influential citation)
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought (2025; cites this paper)
The Way We Prompt: Conceptual Blending, Neural Dynamics, and Prompt-Induced Transitions in LLMs (2025; cites this paper)
SRLoRA: Subspace Recomposition in Low-Rank Adaptation via Importance-Based Fusion and Reinitialization (2025; cites this paper)
Understanding Differential Transformer Unchains Pretrained Self-Attentions (2025; cites this paper)
LoKI: Low-damage Knowledge Implanting of Large Language Models (2025; cites this paper)
K-MSHC: Unmasking Minimally Sufficient Head Circuits in Large Language Models with Experiments on Syntactic Classification Tasks (2025; cites this paper)
Unveiling Knowledge Utilization Mechanisms in LLM-based Retrieval-Augmented Generation (2025; cites this paper)
Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers (2025; cites this paper)
AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer (2025; cites this paper)
AI for Customer Journeys: A Transformer Approach (2025; cites this paper)
Communication-Efficient Multi-Device Inference Acceleration for Transformer Models (2025; cites this paper)
Safety Alignment via Constrained Knowledge Unlearning (2025; cites this paper)
Exploring Religions and Cross-Cultural Sensitivities in Conversational AI (2025; cites this paper)
SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models (2025; cites this paper)
An overview of transformers for video anomaly detection (2025; cites this paper)
Polysemy of Synthetic Neurons Towards a New Type of Explanatory Categorical Vector Spaces (2025; cites this paper)
ICE-Pruning: An Iterative Cost-Efficient Pruning Pipeline for Deep Neural Networks (2025; cites this paper)
Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments (2025; influential citation)
Adaptive Parameter Compression for Language Models (2025; influential citation)
Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models (2025; cites this paper)
Detecting the Root Cause Code Lines in Bug-Fixing Commits by Heterogeneous Graph Learning (2025; cites this paper)
On the effectiveness of Large Language Models in the mechanical design domain (2025; cites this paper)
Adaptive Head Pruning for Attention Mechanism in the Maritime Domain (2025; cites this paper)
Efficient Transformer Inference Through Hybrid Dynamic Pruning (2025; cites this paper)
A Conceptual Framework for Efficient and Sustainable Pruning Techniques in Deep Learning Models (2025; cites this paper)
Emotion Classification With Visibility Graphs (2025; cites this paper)
Tiny-ParsBERT: an optimized hybrid model for efficient sentiment analysis in Persian texts (2025; cites this paper)
RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding (2025; cites this paper)
Efficient Compressing and Tuning Methods for Large Language Models: A Systematic Literature Review (2025; cites this paper)
MDF-FND: A dynamic fusion model for multimodal fake news detection (2025; cites this paper)
AttentionDrop: A Novel Regularization Method for Transformer Models (2025; cites this paper)
Efficient Token Compression for Vision Transformer with Spatial Information Preserved (2025; cites this paper)
Transformers and large language models are efficient feature extractors for electronic health record studies (2025; cites this paper)
ZeroLM: Data-Free Transformer Architecture Search for Language Models (2025; cites this paper)
Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B (2025; influential citation)
Efficient and performant Transformer private inference with heterogeneous attention mechanisms (2025; cites this paper)
Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing (2025; cites this paper)
Temporal Action Detection Model Compression by Progressive Block Drop (2025; cites this paper)
Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration (2025; cites this paper)
As easy as PIE: understanding when pruning causes language models to disagree (2025; cites this paper)
Grouped multi-scale vision transformer for medical image segmentation (2025; cites this paper)
MDP: Multidimensional Vision Model Pruning with Latency Constraint (2025; cites this paper)
Saliency-driven Dynamic Token Pruning for Large Language Models (2025; cites this paper)
Identifying and Evaluating Inactive Heads in Pretrained LLMs (2025; cites this paper)
Condition-Guided Urban Traffic Co-Prediction With Multiple Sparse Surveillance Data (2025; cites this paper)
On the Analysis and Distillation of Emergent Outlier Properties in Pre-trained Language Models (2025; cites this paper)
Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning (2025; cites this paper)
Jekyll-and-Hyde Tipping Point in an AI's Behavior (2025; cites this paper)
MF2N: Multiview feature fusion network for pancreatic cancer segmentation (2025; cites this paper)
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference (2025; cites this paper)