Scaling Vision Transformers to 22 Billion Parameters
Mostafa Dehghani, J. Djolonga, Basil Mustafa, Piotr Padlewski, J. Heek, J. Gilmer, A. Steiner, Mathilde Caron, Robert Geirhos, Ibrahim M. Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, C. Riquelme, M. Minderer, J. Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, F. Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, A. Gritsenko, Vighnesh Birodkar, C. Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, N. Houlsby
Published 2023 in International Conference on Machine Learning
ABSTRACT
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
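The "lightweight linear model on frozen features" evaluation mentioned in the abstract is the standard linear-probe protocol: the pre-trained backbone is kept frozen and only a linear classifier is trained on its output embeddings. The snippet below is a minimal sketch of that protocol, not the authors' implementation; the synthetic feature arrays, the 1024-dimensional embedding size, and the use of scikit-learn are illustrative assumptions.

```python
# Minimal sketch of linear probing on frozen features (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for embeddings produced by a frozen vision backbone
# (e.g. pooled ViT features); in practice these would come from a
# forward pass with gradients disabled.
train_features = rng.normal(size=(1000, 1024))   # (num_images, feature_dim)
train_labels = rng.integers(0, 10, size=1000)    # toy 10-class task
test_features = rng.normal(size=(200, 1024))
test_labels = rng.integers(0, 10, size=200)

# Only this "lightweight linear model" is trained; the backbone stays frozen.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_features, train_labels)

print("linear-probe accuracy:", probe.score(test_features, test_labels))
```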
PUBLICATION RECORD
- Publication year: 2023
- Venue: International Conference on Machine Learning
- Publication date: 2023-02-10
- Fields of study: Computer Science
- Identifiers
- External record
- Source metadata: Semantic Scholar