Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

T. Garipov, Pavel Izmailov, Dmitrii Podoprikhin, D. Vetrov, A. Wilson

Published 2018 in Neural Information Processing Systems

ABSTRACT

The loss functions of deep neural networks are complex and their geometric properties are not well understood. We show that the optima of these complex loss functions are in fact connected by simple curves over which training and test accuracy are nearly constant. We introduce a training procedure to discover these high-accuracy pathways between modes. Inspired by this new geometric insight, we also propose a new ensembling method entitled Fast Geometric Ensembling (FGE). Using FGE we can train high-performing ensembles in the time required to train a single model. We achieve improved performance compared to the recent state-of-the-art Snapshot Ensembles, on CIFAR-10, CIFAR-100, and ImageNet.
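As a rough illustration of the curve-finding idea described above, the sketch below connects two minima of a toy loss with a single bend point trained to keep the loss low along the whole path. Everything in it is an assumption made for illustration and is not taken from this record: the quadratic Bezier parameterization, the two-dimensional `toy_loss` (whose minima form a ring), the endpoints `W1` and `W2`, the fixed grid of curve positions, and the step size. It is not the authors' code or experimental setup.

```python
import numpy as np

# Hypothetical endpoints: two minima of the toy loss on opposite sides of a ring.
W1 = np.array([-1.0, 0.0])
W2 = np.array([1.0, 0.0])

def toy_loss(w):
    """Toy loss (assumption): zero on the unit circle, high at its center."""
    return (np.sum(w * w, axis=-1) - 1.0) ** 2

def toy_loss_grad(w):
    """Gradient of toy_loss with respect to w."""
    return 4.0 * (np.sum(w * w, axis=-1, keepdims=True) - 1.0) * w

def curve(theta, ts):
    """Quadratic Bezier curve from W1 (t=0) to W2 (t=1) with bend point theta."""
    t = ts[:, None]
    return (1 - t) ** 2 * W1 + 2 * t * (1 - t) * theta + t ** 2 * W2

# Fixed grid of positions along the curve (a deterministic stand-in for
# sampling t uniformly at random at every step).
ts = np.linspace(0.0, 1.0, 21)

# Start near the straight segment, slightly off-axis to break symmetry.
theta = np.array([0.0, 0.1])

for _ in range(500):
    points = curve(theta, ts)
    # Chain rule: d(curve)/d(theta) = 2 t (1 - t) at each position t.
    weight = (2 * ts * (1 - ts))[:, None]
    grad = np.mean(weight * toy_loss_grad(points), axis=0)
    theta = theta - 0.5 * grad

# Compare the average loss on the straight segment with the trained curve.
line_points = (1 - ts)[:, None] * W1 + ts[:, None] * W2
mean_line_loss = float(np.mean(toy_loss(line_points)))
mean_curve_loss = float(np.mean(toy_loss(curve(theta, ts))))

print("mean loss on straight segment:", mean_line_loss)
print("mean loss on trained curve:   ", mean_curve_loss)
```

In this toy setting the straight segment crosses the high-loss center of the ring, while the trained bend point pulls the path around it. Only the bend point is updated, so the endpoints stay fixed throughout.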

PUBLICATION RECORD

Publication year: 2018
Venue: Neural Information Processing Systems
Publication date: 2018-02-27
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 1802.10026
External record: Semantic Scholar
Source metadata: Semantic Scholar

REFERENCES

Using Mode Connectivity for Loss Landscape Analysis
2018, cited by this paper
Averaging Weights Leads to Wider Optima and Better Generalization
2018, cited by this paper
Essentially No Barriers in Neural Network Energy Landscape
2018, cited by this paper
Visualizing the Loss Landscape of Neural Nets
2017, cited by this paper
Snapshot Ensembles: Train 1, get M for free
2017, influential reference
On Calibration of Modern Neural Networks
2017, cited by this paper
Exploring loss function topology with cyclical learning rates
2017, influential reference
Sharp Minima Can Generalize For Deep Nets
2017, cited by this paper
Wide Residual Networks
2016, cited by this paper
Gradient Descent Only Converges to Minimizers
2016, cited by this paper
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
2016, cited by this paper
Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles
2016, cited by this paper
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
2016, cited by this paper
Topology and Geometry of Half-Rectified Network Optimization
2016, cited by this paper
Deep Residual Learning for Image Recognition
2015, cited by this paper
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
2015, cited by this paper
Very Deep Convolutional Networks for Large-Scale Image Recognition
2014, influential reference
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
2014, cited by this paper
ImageNet Large Scale Visual Recognition Challenge
2014, influential reference
Qualitatively characterizing neural network optimization problems
2014, cited by this paper
The Loss Surfaces of Multilayer Networks
2014, cited by this paper
Horizontal and Vertical Ensemble with Deep Representation for Classification
2013, cited by this paper
Nudged elastic band method for finding minimum energy paths of transitions
1998, cited by this paper
Flat Minima
1997, cited by this paper
Exponentially many local minima for single neurons
1995, cited by this paper
Building a Large Annotated Corpus of English: The Penn Treebank
1993, cited by this paper

CITED BY

ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
2026, cites this paper
MERGETUNE: Continued fine-tuning of vision-language models
2026, influential citation
Model Agreement via Anchoring
2026, cites this paper
Essentially No Energy Barrier Between Independent Fermionic Neural Quantum State Minima
2026, cites this paper
The Mean is the Mirage: Entropy-Adaptive Model Merging under Heterogeneous Domain Shifts in Medical Imaging
2026, cites this paper
Training Memory in Deep Neural Networks: Mechanisms, Evidence, and Measurement Gaps
2026, influential citation
Manifold-Aware Temporal Domain Generalization for Large Language Models
2026, cites this paper
Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts
2026, cites this paper
Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
2026, cites this paper
Why are there many equally good models? An Anatomy of the Rashomon Effect
2026, cites this paper
M-Loss: Quantifying Model Merging Compatibility with Limited Unlabeled Data
2026, cites this paper
Mapping Networks
2026, cites this paper
The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
2026, cites this paper
Astro: Activation-guided Structured Regularization for Outlier-Robust LLM Post-Training Quantization
2026, cites this paper
Depth, Not Data: An Analysis of Hessian Spectral Bifurcation
2026, cites this paper
Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks
2026, cites this paper
Communication-Efficient Personalized Adaptation via Federated-Local Model Merging
2026, cites this paper
Relatron: Automating Relational Machine Learning over Relational Databases
2026, cites this paper
CF-STAR: Highly compressible adapters for model merging via centralized task vectors
2026, cites this paper
SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
2026, cites this paper
Visualizing the loss landscapes of physics-informed neural networks
2026, cites this paper
Transient learning dynamics drive escape from sharp valleys in Stochastic Gradient Descent
2026, cites this paper
Rethinking LoRA for Data Heterogeneous Federated Learning: Subspace and State Alignment
2026, cites this paper
Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate
2026, cites this paper
Riemannian Dueling Optimization
2026, cites this paper
Leap+Verify: Regime-Adaptive Speculative Weight Prediction for Accelerating Neural Network Training
2026, cites this paper
Transformers converge to invariant algorithmic cores
2026, cites this paper
Neural network optimization strategies and the topography of the loss landscape
2026, cites this paper
Quantifying LLM Attention-Head Stability: Implications for Circuit Universality
2026, cites this paper
Model soups need only one ingredient
2026, cites this paper
Aggregation on Learnable Manifolds for Asynchronous Federated Optimization
2025, cites this paper
How does the optimizer implicitly bias the model merging loss landscape?
2025, cites this paper
Understanding the Effects of Domain Finetuning on LLMs
2025, cites this paper
Non-Linear Trajectory Modeling for Multi-Step Gradient Inversion Attacks in Federated Learning
2025, influential citation
Research on Understanding and Improving Deep Hashing Retrieval under Incremental Data
2025, cites this paper
Closing the Oracle Gap: Increment Vector Transformation for Class Incremental Learning
2025, cites this paper
Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models
2025, cites this paper
Interplay between Bayesian neural networks and deep learning: A survey
2025, cites this paper
Enhanced Predictive Modeling for Anomaly Detection in Financial Transactions Using Machine Learning
2025, cites this paper
Harnessing Optimization Dynamics for Curvature-Informed Model Merging
2025, cites this paper
Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown
2025, cites this paper
Walking on the Fiber: A Simple Geometric Approximation for Bayesian Neural Networks
2025, influential citation
Ensemble-Based Fish Species Recognition in Challenging Underwater Environments
2025, cites this paper
Exploring the Relationship between Brain Hemisphere States and Frequency Bands through Deep Learning Optimization Techniques
2025, cites this paper
Structure of solutions to continuous constraint satisfaction problems through the statistics of wedged and inscribed spheres
2025, cites this paper
Circumventing Backdoor Space via Weight Symmetry
2025, influential citation
Uncertainty in Deep Learning for EEG under Dataset Shifts
2025, cites this paper
Forgetting of task-specific knowledge in model merging-based continual learning
2025, influential citation
Learning from Oblivion: Predicting Knowledge Overflowed Weights via Retrodiction of Forgetting
2025, cites this paper
SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures
2025, cites this paper
Spectral momentum integration: hybrid optimization of frequency and time domain gradients
2025, cites this paper
Attentive neural networks and meta-learning integration for revolutionary vehicular engine health monitoring
2025, cites this paper
Distribution Shift Aware Neural Tabular Learning
2025, cites this paper
Characterizing Fitness Landscape Structures in Prompt Engineering
2025, cites this paper
Pre-training under infinite compute
2025, cites this paper
Federated Domain Generalization with Decision Insight Matrix
2025, cites this paper
Physics-Informed Neuro-Evolution (PINE): A Survey and Prospects
2025, cites this paper
The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging
2025, cites this paper
Sharpness-Aware Minimization Can Hallucinate Minimizers
2025, influential citation
Categorical Invariants of Learning Dynamics
2025, cites this paper
Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates
2025, cites this paper
Do We Really Need Permutations? Impact of Model Width on Linear Mode Connectivity
2025, cites this paper
Benchmarking Clustered Federated Learning Algorithms for Next-Point Prediction
2025, cites this paper
Symmetry in Neural Network Parameter Spaces
2025, influential citation
Benignity of loss landscape with weight decay requires both large overparametrization and initialization
2025, cites this paper
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
2025, cites this paper
Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking
2025, cites this paper
The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions
2025, cites this paper
Virtual neural networks: hundreds of souls in a body
2025, cites this paper
The effect of the number of parameters and the number of local feature patches on loss landscapes in distributed quantum neural networks
2025, cites this paper
MCU: Improving Machine Unlearning through Mode Connectivity
2025, influential citation
The Intrinsic Dimension of Neural Network Ensembles
2025, cites this paper
LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks
2025, cites this paper
Seeking Flat Minima over Diverse Surrogates for Improved Adversarial Transferability: A Theoretical Framework and Algorithmic Instantiation
2025, cites this paper
PEER pressure: Model-to-Model Regularization for Single Source Domain Generalization
2025, cites this paper
Flat Channels to Infinity in Neural Loss Landscapes
2025, cites this paper
Adiabatic Fine-Tuning of Neural Quantum States Enables Detection of Phase Transitions in Weight Space
2025, cites this paper
On Local Posterior Structure in Deep Ensembles
2025, cites this paper
Understanding Machine Unlearning Through the Lens of Mode Connectivity
2025, influential citation
Boosting-inspired online learning with transfer for railway maintenance
2025, cites this paper
Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning
2025, cites this paper
A Combinatorial Theory of Dropout: Subnetworks, Graph Geometry, and Generalization
2025, cites this paper
Data-Adaptive Weight-Ensembling for Multi-task Model Fusion
2025, cites this paper
Dynamic Fisher-weighted Model Merging via Bayesian Optimization
2025, cites this paper
Connecting Independently Trained Modes via Layer-Wise Connectivity
2025, influential citation
Epistemic Artificial Intelligence is Essential for Machine Learning Models to Truly 'Know When They Do Not Know'
2025, cites this paper
Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging
2025, cites this paper
Unveiling the Basin-Like Loss Landscape in Large Language Models
2025, cites this paper
The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs
2025, cites this paper
Exploring the Hidden Capacity of LLMs for One-Step Text Generation
2025, cites this paper
SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training
2025, cites this paper
Update Your Transformer to the Latest Release: Re-Basin of Task Vectors
2025, cites this paper
Frequentist uncertainties on neural density ratios with w_i f_i ensembles
2025, cites this paper
A Tale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models
2025, cites this paper
Toward Efficient Federated Load Forecasting: Personalization Mechanisms and Their Impact
2025, cites this paper
SE-Merging: A Self-Enhanced Approach for Dynamic Model Merging
2025, cites this paper
Generalized Linear Mode Connectivity for Transformers
2025, cites this paper
How Weight Resampling and Optimizers Shape the Dynamics of Continual Learning and Forgetting in Neural Networks
2025, cites this paper
ArnoldiGCL: Graph Contrastive Learning via Learnable Arnoldi-Based Guided Spectral Chebyshev Polynomial Filters
2025, cites this paper
Finding Stable Subnetworks at Initialization with Dataset Distillation
2025, cites this paper