Revealing the Dark Secrets of BERT
Olga Kovaleva, Alexey Romanov, Anna Rogers, Anna Rumshisky
Published 2019 in Conference on Empirical Methods in Natural Language Processing
ABSTRACT
BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to their success. In the current work, we focus on the interpretation of self-attention, one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose a methodology and carry out a qualitative and quantitative analysis of the information encoded by BERT's individual heads. Our findings suggest that there is a limited set of attention patterns that are repeated across different heads, indicating that the overall model is overparametrized. While different heads consistently use the same attention patterns, they have varying impact on performance across different tasks. We show that manually disabling attention in certain heads leads to a performance improvement over the regular fine-tuned BERT models.
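The head-disabling experiment mentioned in the abstract can be approximated with the `head_mask` argument exposed by the HuggingFace `transformers` library. The sketch below is an illustrative approximation, not the authors' code: the checkpoint name (`bert-base-uncased`) and the particular head being disabled are assumptions chosen only for demonstration.

```python
# Minimal sketch (not the paper's implementation): zero out individual
# attention heads of a BERT model via the `head_mask` argument of
# HuggingFace `transformers`. Checkpoint and disabled head are illustrative.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# head_mask has shape (num_layers, num_heads); 1.0 keeps a head, 0.0 disables it.
num_layers = model.config.num_hidden_layers   # 12 for bert-base
num_heads = model.config.num_attention_heads  # 12 for bert-base
head_mask = torch.ones(num_layers, num_heads)
head_mask[10, 3] = 0.0  # hypothetical choice: disable head 3 in layer 10

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, head_mask=head_mask)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```

In the paper's setting, the effect of such masking would be measured by comparing downstream GLUE task scores of the fine-tuned model with and without the selected heads disabled.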
PUBLICATION RECORD
- Publication year: 2019
- Venue: Conference on Empirical Methods in Natural Language Processing
- Publication date: 2019-08-21
- Fields of study: Mathematics, Computer Science, Psychology
- Source metadata: Semantic Scholar