Vision Transformers Need Registers

Timothée Darcet, Maxime Oquab, J. Mairal, Piotr Bojanowski

Published 2023 in International Conference on Learning Representations

ABSTRACT

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
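The mechanism the abstract describes (extra tokens added to the ViT input sequence, and artifacts that show up as high-norm tokens) can be illustrated with a minimal sketch. This is not the authors' implementation. The token order ([CLS], then registers, then patches), the step that drops register outputs after the encoder, the identity stand-in for the encoder, the norm threshold, and all function names are assumptions made for illustration. In a real model the register tokens would be learnable parameters and the encoder a stack of self-attention blocks.

```python
import math

def add_registers(cls_token, register_tokens, patch_tokens):
    """Build the input sequence: [CLS] + registers + patch tokens (assumed order)."""
    return [cls_token] + list(register_tokens) + list(patch_tokens)

def drop_registers(output_tokens, num_registers):
    """Keep the [CLS] and patch outputs; register outputs are not used downstream."""
    cls_out = output_tokens[0]
    patch_out = output_tokens[1 + num_registers:]
    return cls_out, patch_out

def high_norm_indices(tokens, threshold):
    """Return indices of tokens whose L2 norm exceeds a (hypothetical) threshold."""
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    return [i for i, n in enumerate(norms) if n > threshold]

# Toy sizes, chosen only for illustration.
dim, num_patches, num_registers = 4, 9, 4
cls_token = [0.0] * dim
register_tokens = [[0.1] * dim for _ in range(num_registers)]  # learnable in practice
patch_tokens = [[1.0] * dim for _ in range(num_patches)]
patch_tokens[5] = [50.0] * dim  # one artificially large token to mimic an artifact

def encoder(tokens):
    """Identity stand-in for the transformer encoder (assumption)."""
    return tokens

sequence = add_registers(cls_token, register_tokens, patch_tokens)
cls_out, patch_out = drop_registers(encoder(sequence), num_registers)
outliers = high_norm_indices(patch_out, threshold=10.0)

print(len(sequence), len(patch_out), outliers)
```

With these toy values the sequence has 1 + 4 + 9 = 14 tokens, 9 patch outputs remain after the registers are dropped, and only the artificially enlarged patch is flagged as high-norm.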

PUBLICATION RECORD

Publication year
2023
Venue
International Conference on Learning Representations
Publication date
2023-09-28
Fields of study
Computer Science
Identifiers
DOI: 10.48550/arXiv.2309.16588; arXiv: 2309.16588
Source metadata
Semantic Scholar

REFERENCES

Cut and Learn for Unsupervised Object Detection and Instance Segmentation
2023, cited by this paper
Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?
2023, cited by this paper
Emergence of Segmentation with Minimalistic White-Box Transformers
2023, cited by this paper
DINOv2: Learning Robust Visual Features without Supervision
2023, influential reference
Segment Anything
2023, cited by this paper
Adaptive Computation with Elastic Input Sequence
2023, cited by this paper
DeiT III: Revenge of the ViT
2022, influential reference
Recurrent Memory Transformer
2022, cited by this paper
Fine-tuning Image Transformers using Learnable Memory
2022, cited by this paper
You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
2021, cited by this paper
Learning Transferable Visual Models From Natural Language Supervision
2021, influential reference
Perceiver: General Perception with Iterative Attention
2021, cited by this paper
Emerging Properties in Self-Supervised Vision Transformers
2021, influential reference
When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations
2021, cited by this paper
BEiT: BERT Pre-Training of Image Transformers
2021, cited by this paper
Perceiver IO: A General Architecture for Structured Inputs & Outputs
2021, cited by this paper
Localizing Objects with Self-Supervised Transformers and no Labels
2021, influential reference
ViDT: An Efficient and Effective Fully Transformer-based Object Detector
2021, cited by this paper
Masked Autoencoders Are Scalable Vision Learners
2021, cited by this paper
iBOT: Image BERT Pre-Training with Online Tokenizer
2021, cited by this paper
Object-Centric Learning with Slot Attention
2020, cited by this paper
End-to-End Object Detection with Transformers
2020, influential reference
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
2020, cited by this paper
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020, influential reference
Memory Transformer
2020, influential reference
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019, cited by this paper
Momentum Contrast for Unsupervised Visual Representation Learning
2019, cited by this paper
Unsupervised Visual Representation Learning by Context Prediction
2015, cited by this paper
ImageNet Large Scale Visual Recognition Challenge
2014, cited by this paper
ImageNet classification with deep convolutional neural networks
2012, cited by this paper
Unbiased look at dataset bias
2011, cited by this paper
Distinctive Image Features from Scale-Invariant Keypoints
2004, cited by this paper

CITED BY

Optimizing Point-of-Care Ultrasound Video Acquisition for Probabilistic Multi-Task Heart Failure Detection
2026, cites this paper
ChartReLA: A compact vision-language model for comprehensive chart reasoning via relationship modeling
2026, cites this paper
Laminating Representation Autoencoders for Efficient Diffusion
2026, influential citation
Mixed Magnification Aggregation for Generalizable Region-Level Representations in Computational Pathology
2026, cites this paper
SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models
2026, cites this paper
One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection
2026, cites this paper
Decoding vision transformer variations for image classification: A guide to performance and usability
2026, cites this paper
FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion
2026, influential citation
CADC: Content Adaptive Diffusion-Based Generative Image Compression
2026, cites this paper
A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography
2026, cites this paper
OAT: Ordered Action Tokenization
2026, cites this paper
Toward Real-World High-Precision Image Matting and Segmentation
2026, cites this paper
Boosting Adversarial Transferability of Vision Transformers
2026, cites this paper
Driving on Registers
2026, cites this paper
KonvLiNA: integrating Kolmogorov-Arnold network with linear Nyström attention for feature fusion in crop disease detection
2026, cites this paper
PGSMamba: Prompt-Guided Shuffle State Space Model for Hyperspectral Image Classification
2026, cites this paper
DeFM: Learning Foundation Representations from Depth for Robotics
2026, cites this paper
SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?
2026, influential citation
Enhancing IMU-Based Online Handwriting Recognition via Contrastive Learning with Zero Inference Overhead
2026, cites this paper
Xray-Visual Models: Scaling Vision models on Industry Scale Data
2026, cites this paper
Specialization of softmax attention heads: insights from the high-dimensional single-location model
2026, cites this paper
Vision Transformers Need More Than Registers
2026, influential citation
Causal-JEPA: Learning World Models through Object-Level Latent Interventions
2026, influential citation
Multiview Self-Representation Learning across Heterogeneous Views
2026, cites this paper
Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage
2026, cites this paper
ViT Registers and Fractal ViT
2026, influential citation
Bi-Orthogonal Factor Decomposition for Vision Transformers
2026, cites this paper
FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection
2026, cites this paper
FADiaFrame: Improving Fairness and Accuracy of Deep Learning-Based Diagnosis for Dermatological Lesions via a Novel Post-Processing Framework
2026, cites this paper
Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
2026, cites this paper
PET-TURTLE: Deep Unsupervised Support Vector Machines for Imbalanced Data Clusters
2026, cites this paper
LTX-2: Efficient Joint Audio-Visual Foundation Model
2026, cites this paper
Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
2026, cites this paper
KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?
2026, cites this paper
Beyond efficient fine-tuning: Efficient hybrid fine-tuning of CLIP models guided by explainable ViT attention
2026, cites this paper
C-RADIOv4 (Tech Report)
2026, cites this paper
Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering
2026, cites this paper
Preserving Localized Patch Semantics in VLMs
2026, cites this paper
Taming SAM3 in the Wild: A Concept Bank for Open-Vocabulary Segmentation
2026, cites this paper
Revisiting [CLS] and Patch Token Interaction in Vision Transformers
2026, influential citation
LiDAR-Anchored Collaborative Distillation for Robust 2D Representations
2026, cites this paper
MTRAG: Multi-Target Referring and Grounding via Hybrid Semantic-Spatial Integration
2026, cites this paper
UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception
2026, cites this paper
Coarse-to-Fine Monocular Re-Localization in OpenStreetMap via Semantic Alignment
2026, cites this paper
Locality-Attending Vision Transformer
2026, influential citation
Online Register for Dual-Mode Self-Supervised Speech Models: Mitigating The Lack of Future Context
2026, cites this paper
HyperMLP: An Integrated Perspective for Sequence Modeling
2026, cites this paper
Coden: Efficient Temporal Graph Neural Networks for Continuous Prediction
2026, cites this paper
ViT-5: Vision Transformers for The Mid-2020s
2026, influential citation
Deep Learning-Based Fixation Type Prediction for Quality Assurance in Digital Pathology
2026, cites this paper
MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations
2026, cites this paper
When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs
2026, cites this paper
GAN and DINOv2 Framework for Robust Cross-Condition Gait Recognition
2026, cites this paper
Semi-Supervised Hierarchical Open-Set Classification
2026, cites this paper
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
2025, cites this paper
Open Ad-hoc Categorization with Contextualized Feature Learning
2025, cites this paper
An Empirical Study on Prompt Compression for Large Language Models
2025, cites this paper
Transferable Adversarial Attacks on Black-Box Vision-Language Models
2025, cites this paper
CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones
2025, influential citation
Learning Multi-view Multi-class Anomaly Detection
2025, cites this paper
Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook
2025, cites this paper
Predicting MammaPrint Recurrence Risk from Breast Cancer Pathological Images Using a Weakly Supervised Transformer
2025, cites this paper
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
2025, cites this paper
When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
2025, cites this paper
Perception Encoder: The best visual embeddings are not at the output of the network
2025, cites this paper
Unlocking Generative Priors: A New Membership Inference Framework for Diffusion Models
2025, cites this paper
IXGS-Intraoperative 3D Reconstruction from Sparse, Arbitrarily Posed Real X-rays
2025, cites this paper
PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition
2025, cites this paper
DINO-Reg: Efficient Multimodal Image Registration With Distilled Features
2025, cites this paper
Hypergraph Vision Transformers: Images are More than Nodes, More than Edges
2025, cites this paper
MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation
2025, cites this paper
Argus: A Compact and Versatile Foundation Model for Vision
2025, cites this paper
Latent Diffusion U-Net Representations Contain Positional Embeddings and Anomalies
2025, cites this paper
Accurate Ab-initio Neural-network Solutions to Large-Scale Electronic Structure Problems
2025, cites this paper
Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception
2025, cites this paper
Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
2025, cites this paper
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
2025, cites this paper
Refining CLIP's Spatial Awareness: A Visual-Centric Perspective
2025, influential citation
Generating ensembles of spatially-coherent in-situ forecasts using flow matching
2025, cites this paper
ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
2025, cites this paper
Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision
2025, cites this paper
Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning
2025, cites this paper
Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation
2025, influential citation
Search is All You Need for Few-shot Anomaly Detection
2025, influential citation
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
2025, cites this paper
Self-Supervised Incremental Learning of Object Representations from Arbitrary Image Sets
2025, cites this paper
Learning 3D Scene Analogies with Neural Contextual Scene Maps
2025, cites this paper
Pixel-Wise Shuffling with Collaborative Sparsity for Melanoma Hyperspectral Image Classification
2025, cites this paper
Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration
2025, cites this paper
An interpretable approach to automating the assessment of biofouling in video footage
2025, cites this paper
The Power of One: A Single Example is All it Takes for Segmentation in VLMs
2025, cites this paper
Sonata: Self-Supervised Learning of Reliable Point Representations
2025, cites this paper
Exploring Few-Shot Object Detection on Blood Smear Images: A Case Study of Leukocytes and Schistocytes
2025, cites this paper
Beyond Accuracy: What Matters in Designing Well-Behaved Models?
2025, cites this paper
Your ViT is Secretly an Image Segmentation Model
2025, cites this paper
Beyond Intermediate States: Explaining Visual Redundancy through Language
2025, cites this paper
LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification
2025, cites this paper
Leveraging Diffusion Model and Image Foundation Model for Improved Correspondence Matching in Coronary Angiography
2025, cites this paper
Prompt-Guided Attention Head Selection for Focus-Oriented Image Retrieval
2025, cites this paper
DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers
2025, cites this paper