On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization

Published 2020 in International Joint Conference on Artificial Intelligence

ABSTRACT

The prevailing view is that orthogonal weights are crucial to enforcing dynamical isometry and speeding up training. The increase in learning speed that results from orthogonal initialization in linear networks has been rigorously proven. However, while the same is believed to hold for nonlinear networks when the dynamical isometry condition is satisfied, the training dynamics behind this contention have not been thoroughly explored. In this work, we use the neural tangent kernel (NTK) to study the dynamics of ultra-wide networks with orthogonal initialization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). Through a series of propositions and lemmas, we prove that two NTKs, one corresponding to Gaussian weights and one to orthogonal weights, are equal when the network width is infinite. Further, during training, the NTK of an orthogonally initialized infinite-width network should theoretically remain constant. This suggests that orthogonal initialization cannot speed up training in the NTK (lazy training) regime, contrary to the prevailing view. To explore under what circumstances orthogonality can accelerate training, we conduct a thorough empirical investigation outside the NTK regime. We find that when the hyper-parameters are set so that the nonlinear activation operates in its linear regime, orthogonal initialization can improve the learning speed with a large learning rate or large depth.
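The equality of the two kernels at large width can be probed numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes a three-layer tanh FCN in the NTK parameterization, and the function names (`sample_weights`, `empirical_ntk`), the width, and the input sizes are illustrative choices rather than settings taken from the paper. It computes the empirical NTK at initialization by manual backpropagation for Gaussian and for scaled orthogonal weights, then measures the relative gap between the two kernel matrices.

```python
import numpy as np


def sample_weights(rng, rows, cols, kind):
    """Return a (rows, cols) matrix (rows >= cols) whose entries have mean square ~1."""
    g = rng.standard_normal((rows, cols))
    if kind == "gaussian":
        return g
    # Orthogonal case: QR of a Gaussian matrix; the sign fix makes Q Haar distributed.
    q, r = np.linalg.qr(g)
    q = q * np.sign(np.diag(r))
    # Scale so that W.T @ W = rows * I, matching the Gaussian case in expectation.
    return np.sqrt(rows) * q


def empirical_ntk(x, width, kind, seed):
    """Empirical NTK at initialization of f(x) = v . tanh(W2 tanh(W1 x / sqrt(d)) / sqrt(n)) / sqrt(n)."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    w1 = sample_weights(rng, width, d, kind)
    w2 = sample_weights(rng, width, width, kind)
    v = sample_weights(rng, width, 1, kind).ravel()

    # Forward pass.
    a1 = np.tanh(x @ w1.T / np.sqrt(d))
    a2 = np.tanh(a1 @ w2.T / np.sqrt(width))

    # Backward pass: derivatives of the scalar output w.r.t. the pre-activations.
    d2 = (v / np.sqrt(width)) * (1.0 - a2**2)
    d1 = (d2 @ w2 / np.sqrt(width)) * (1.0 - a1**2)

    # Sum of parameter-gradient inner products, one term per layer (v, W2, W1).
    return (
        a2 @ a2.T / width
        + (d2 @ d2.T) * (a1 @ a1.T) / width
        + (d1 @ d1.T) * (x @ x.T) / d
    )


# Illustrative inputs: 4 points in 8 dimensions, each rescaled to norm sqrt(8).
data_rng = np.random.default_rng(0)
x = data_rng.standard_normal((4, 8))
x = x * np.sqrt(x.shape[1]) / np.linalg.norm(x, axis=1, keepdims=True)

width = 2048
k_gauss = empirical_ntk(x, width, "gaussian", seed=1)
k_orth = empirical_ntk(x, width, "orthogonal", seed=2)
gap = np.linalg.norm(k_gauss - k_orth) / np.linalg.norm(k_gauss)
print(f"relative gap between Gaussian and orthogonal NTK at width {width}: {gap:.3f}")
```

Under these assumptions the gap should shrink as `width` grows, which is the finite-width counterpart of the infinite-width equality stated in the abstract. The sketch only looks at the kernel at initialization; it says nothing about the regime outside the NTK limit, where the abstract reports that orthogonal initialization can help.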

PUBLICATION RECORD

Publication year
2020
Venue
International Joint Conference on Artificial Intelligence
Publication date
2020-04-13
Fields of study
Mathematics, Computer Science
Identifiers
DOI 10.24963/ijcai.2021/355 arXiv 2004.05867
Source metadata
Semantic Scholar

REFERENCES

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks
2020 (influential reference)
Feature Extraction and Image Processing for Computer Vision
2020 (cited by this paper)
Understand It in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2020 (cited by this paper)
On the linearity of large non-linear models: when and why the tangent kernel is constant
2020 (influential reference)
Tensor Programs II: Neural Tangent Kernel for Any Architecture
2020 (cited by this paper)
Constructing exchangeable pairs by diffusion on manifolds and its application
2020 (cited by this paper)
The large learning rate phase of deep learning: the catapult mechanism
2020 (cited by this paper)
On the infinite width limit of neural networks with a standard parameterization
2020 (cited by this paper)
Gradient descent optimizes over-parameterized deep ReLU networks
2019 (cited by this paper)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019 (cited by this paper)
Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation
2019 (cited by this paper)
Wide neural networks of any depth evolve as linear models under gradient descent
2019 (influential reference)
A Mean Field Theory of Batch Normalization
2019 (cited by this paper)
On Exact Computation with an Infinitely Wide Neural Net
2019 (cited by this paper)
Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks
2019 (cited by this paper)
Dynamics of Deep Neural Networks and Neural Tangent Hierarchy
2019 (cited by this paper)
Neural Tangents: Fast and Easy Infinite Neural Networks in Python
2019 (cited by this paper)
Mean field theory for deep dropout networks: digging up gradient backpropagation deeply
2019 (cited by this paper)
Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks
2018 (cited by this paper)
Information Geometry of Orthogonal Initializations and Training
2018 (cited by this paper)
Gaussian Process Behaviour in Wide Deep Neural Networks
2018 (cited by this paper)
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
2018 (cited by this paper)
Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function
2018 (cited by this paper)
Spectrum Concentration in Deep Residual Learning: A Free Probability Approach
2018 (cited by this paper)
On Lazy Training in Differentiable Programming
2018 (cited by this paper)
A Convergence Theory for Deep Learning via Over-Parameterization
2018 (cited by this paper)
The Emergence of Spectral Universality in Deep Networks
2018 (cited by this paper)
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
2018 (influential reference)
Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks
2018 (cited by this paper)
Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
2017 (influential reference)
Deep Neural Networks as Gaussian Processes
2017 (cited by this paper)
Mean Field Residual Networks: On the Edge of Chaos
2017 (cited by this paper)
Deep Information Propagation
2016 (cited by this paper)
Exponential expressivity in deep neural networks through transient chaos
2016 (cited by this paper)
Human-level control through deep reinforcement learning
2015 (cited by this paper)
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
2015 (cited by this paper)
Deep Residual Learning for Image Recognition
2015 (cited by this paper)
Dropout: a simple way to prevent neural networks from overfitting
2014 (cited by this paper)
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
2013 (cited by this paper)
Multivariate Normal Approximation Using Exchangeable Pairs
2007 (cited by this paper)

CITED BY

UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models
2025 (cites this paper)
Phase-Adaptive Reinforcement Learning for Self-Tuning PID Control of Cruise Missiles
2025 (cites this paper)
Fourth-Order Dimension Preserved Tensor Completion With Temporal Constraint for Missing Traffic Data Imputation
2025 (cites this paper)
Stochastic PFRCosSim layer for solving filter redundancy problem in CNNs applied on plant disease classification
2025 (cites this paper)
Regularized Over-Parametrized Neural Networks Learned by Gradient Descent Can Generalize Well
2025 (cites this paper)
Mitigating Update Conflict in Non-IID Federated Learning via Orthogonal Class Gradients
2025 (cites this paper)
Classification Performance Boosting for Interpolation Kernel Machines by Training Set Pruning Using Genetic Algorithm
2024 (cites this paper)
Analysis of the rate of convergence of an over-parametrized convolutional neural network image classifier learned by gradient descent
2024 (cites this paper)
Three Mechanisms of Feature Learning in a Linear Network
2024 (cites this paper)
On the convergence analysis of over-parameterized variational autoencoders: a neural tangent kernel perspective
2024 (cites this paper)
When Does Feature Learning Happen? Perspective from an Analytically Solvable Model
2024 (cites this paper)
Rate of Convergence of an Over-Parametrized Convolutional Neural Network Image Classifier Learned by Gradient Descent
2024 (cites this paper)
Adaptive Methods for Kernel Initialization of Convolutional Neural Network Model Applied to Plant Disease Classification
2024 (cites this paper)
Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning
2023 (cites this paper)
TLNets: Transformation Learning Networks for long-range time-series prediction
2023 (cites this paper)
Distortion-Disentangled Contrastive Learning
2023 (cites this paper)
Implicit Bias of Deep Learning in the Large Learning Rate Phase: A Data Separability Perspective
2023 (cites this paper)
Uniform Convergence of Deep Neural Networks With Lipschitz Continuous Activation Functions and Variable Widths
2023 (cites this paper)
Feature learning and generalization in deep networks with orthogonal weights
2023 (influential citation)
Hierarchical Kernels in Deep Kernel Learning
2023 (cites this paper)
Demystify Optimization and Generalization of Over-parameterized PAC-Bayesian Learning
2022 (cites this paper)
Rethinking adversarial domain adaptation: Orthogonal decomposition for unsupervised domain adaptation in medical image segmentation
2022 (cites this paper)
Deep equilibrium networks are sensitive to initialization statistics
2022 (cites this paper)
Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis
2022 (cites this paper)
Convergence of Deep ReLU Networks
2021 (cites this paper)
On the Equivalence between Neural Network and Support Vector Machine
2021 (cites this paper)
of the Bernoulli Society for Mathematical Statistics and Probability Volume Twenty Seven Number Four November 2021
2021 (cites this paper)
Kajian Kinerja Metode Support Vector Machine Dan Neural Tangent Kernel Untuk Memprediksi Hasil Ujian Siswa [A Performance Study of the Support Vector Machine and Neural Tangent Kernel Methods for Predicting Student Exam Results]
2021 (cites this paper)
Activation function design for deep networks: linearity and effective initialisation
2021 (cites this paper)
On the validity of kernel approximations for orthogonally-initialized neural networks
2021 (cites this paper)
Wide Graph Neural Networks: Aggregation Provably Leads to Exponentially Trainability Loss
2021 (cites this paper)
Eigenspace Restructuring: a Principle of Space and Frequency in Neural Networks
2021 (cites this paper)
Deep Active Learning by Leveraging Training Dynamics
2021 (cites this paper)
Towards Deepening Graph Neural Networks: A GNTK-based Optimization Perspective
2021 (cites this paper)
Constructing exchangeable pairs by diffusion on manifolds and its application
2020 (cites this paper)
Tensor Programs II: Neural Tangent Kernel for Any Architecture
2020 (cites this paper)
Finite Versus Infinite Neural Networks: an Empirical Study
2020 (cites this paper)
Towards NNGP-guided Neural Architecture Search
2020 (cites this paper)
The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry
2020 (cites this paper)
Implicit bias of deep linear networks in the large learning rate phase
2020 (cites this paper)