Visualizing the Loss Landscape of Neural Nets

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein

Published 2018 in Neural Information Processing Systems (NeurIPS)

ABSTRACT

Neural network training relies on our ability to find "good" minimizers of highly non-convex loss functions. It is well-known that certain network architecture designs (e.g., skip connections) produce loss functions that are easier to train, and that well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effects on the underlying loss landscape, are not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple "filter normalization" method that helps us visualize loss function curvature and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.
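
The "filter normalization" idea from the abstract is straightforward to sketch. Below is a minimal illustration in PyTorch; it is not the authors' released code, and the function names, the single-batch loss evaluation, and the choice to zero 1-D parameters (biases, BatchNorm) are assumptions made here for brevity. The core step matches the abstract's description: draw a random Gaussian direction d with the same shape as the trained weights theta, rescale each filter of d to the norm of the corresponding filter of theta, and plot L(theta + alpha * d) over a range of alpha.

    import torch

    def filter_normalized_direction(model):
        """Random direction d with the same shapes as the weights theta,
        rescaled filter-wise so ||d_f|| = ||theta_f|| for every filter f.
        1-D tensors (biases, BatchNorm scales) are zeroed here, one common
        convention for parameters that are not filters."""
        direction = []
        for p in model.parameters():
            d = torch.randn_like(p)
            if p.dim() <= 1:
                d.zero_()
            else:
                # Treat each slice along dim 0 (a conv filter or a fully
                # connected row) as one "filter" and match its norm to the
                # corresponding filter of the trained weights.
                d_flat = d.view(d.size(0), -1)
                p_flat = p.view(p.size(0), -1)
                d_flat.mul_(p_flat.norm(dim=1, keepdim=True)
                            / (d_flat.norm(dim=1, keepdim=True) + 1e-10))
            direction.append(d)
        return direction

    @torch.no_grad()
    def loss_along_direction(model, loss_fn, data, target, direction, alphas):
        """Evaluate L(theta + alpha * d) for each alpha in alphas,
        restoring the original weights afterwards."""
        theta = [p.detach().clone() for p in model.parameters()]
        losses = []
        for alpha in alphas:
            for p, t, d in zip(model.parameters(), theta, direction):
                p.copy_(t + alpha * d)
            losses.append(loss_fn(model(data), target).item())
        for p, t in zip(model.parameters(), theta):
            p.copy_(t)
        return losses

The paper's 2-D surface plots follow the same recipe with two such directions d1 and d2, evaluating f(alpha, beta) = L(theta + alpha * d1 + beta * d2) on a grid; the filter-wise rescaling is what makes the resulting plots comparable across architectures.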

PUBLICATION RECORD

  • Publication year: 2018
  • Venue: Neural Information Processing Systems (NeurIPS)
  • Publication date: 2017-12-28 (arXiv preprint)
  • Fields of study: Mathematics, Computer Science
  • Source metadata: Semantic Scholar

REFERENCES

  • 47 references (per Semantic Scholar)

CITED BY

  • 2,212 citing papers (per Semantic Scholar)