Gauss-Newton Unlearning for the LLM Era

Lev McKinney, Anvith Thudi, Juhan Bae, Tara Rezaei, Nicolas Papernot, Sheila A. McIlraith, Roger B. Grosse

Published 2026 in Unknown venue

ABSTRACT

Standard large language model training can create models that produce outputs their trainer deems unacceptable in deployment. The probability of these outputs can be reduced using methods such as LLM unlearning. However, unlearning a set of data (called the forget set) can degrade model performance on other distributions where the trainer wants to retain the model's behavior. To improve this trade-off, we demonstrate that using the forget set to compute only a few uphill Gauss-Newton steps provides a conceptually simple, state-of-the-art unlearning approach for LLMs. While Gauss-Newton steps adapt Newton's method to non-linear models, it is non-trivial to efficiently and accurately compute such steps for LLMs. Hence, our approach crucially relies on parametric Hessian approximations such as Kronecker-Factored Approximate Curvature (K-FAC). We call this combined approach K-FADE (K-FAC for Distribution Erasure). Our evaluation on the WMDP and TOFU benchmarks demonstrates that K-FADE suppresses outputs from the forget set and approximates, in output space, the results of retraining without the forget set. Critically, our method does this while altering the outputs on the retain set less than previous methods. This is because K-FADE transforms a constraint on the model's outputs across the entire retain set into a constraint on the model's weights, allowing the algorithm to minimally change the model's behavior on the retain set at each step. Moreover, the unlearning updates computed by K-FADE can be reapplied later if the model undergoes further training, allowing unlearning to be cheaply maintained.
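
The kind of update the abstract describes can be illustrated on a toy model. The following is a minimal sketch, not the authors' implementation. It assumes a single linear layer with a softmax output, assumes the Kronecker-factored curvature is estimated on retain-set inputs, and takes one damped, preconditioned ascent step on the forget-set loss. All names (kfac_factors, forget_gradient, uphill_step, lr, damping) are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def kfac_factors(W, X, damping):
    """Damped Kronecker factors for a linear-softmax layer, estimated on X.

    A is the input second-moment matrix; G is the mean output-space
    curvature of softmax cross-entropy, diag(p) - p p^T.
    """
    n = X.shape[0]
    P = softmax(X @ W.T)
    A = X.T @ X / n                              # (d_in, d_in)
    G = (np.diag(P.sum(axis=0)) - P.T @ P) / n   # (d_out, d_out)
    A = A + damping * np.eye(A.shape[0])
    G = G + damping * np.eye(G.shape[0])
    return A, G


def forget_gradient(W, X, y):
    """Gradient of the mean cross-entropy on the forget set w.r.t. W."""
    P = softmax(X @ W.T)
    P[np.arange(len(y)), y] -= 1.0               # p - onehot(y)
    return P.T @ X / len(y)                      # (d_out, d_in)


def uphill_step(W, X_forget, y_forget, X_retain, lr=0.1, damping=1e-2):
    """One preconditioned ascent step: W + lr * G^{-1} grad A^{-1}.

    Assumption of this sketch: the curvature comes from the retain set,
    so directions that change retain-set outputs strongly are damped.
    """
    A, G = kfac_factors(W, X_retain, damping)
    grad = forget_gradient(W, X_forget, y_forget)
    step = np.linalg.solve(G, grad)              # G^{-1} grad
    step = np.linalg.solve(A, step.T).T          # (G^{-1} grad) A^{-1}, A symmetric
    return W + lr * step, step, grad
```

Because both damped factors are positive definite, the preconditioned step has a positive inner product with the forget-set gradient, so a sufficiently small lr raises the forget loss to first order. The damping value and step size here are arbitrary illustrative choices, not values from the paper.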

PUBLICATION RECORD

Publication year: 2026
Venue: Unknown venue
Publication date: 2026-02-11
Fields of study: Computer Science
Identifiers: arXiv 2602.10568
Source metadata: Semantic Scholar

REFERENCES

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (2025, cited by this paper)
Position: Curvature Matrices Should Be Democratized via Linear Operators (2025, cited by this paper)
On Evaluating the Durability of Safeguards for Open-Weight LLMs (2024, cited by this paper)
Attribute-to-Delete: Machine Unlearning via Datamodel Matching (2024, cited by this paper)
Do Unlearning Methods Remove Information from Language Model Weights? (2024, cited by this paper)
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning (2024, influential reference)
Erasing Conceptual Knowledge from Language Models (2024, influential reference)
Tamper-Resistant Safeguards for Open-Weight LLMs (2024, cited by this paper)
MUSE: Machine Unlearning Six-Way Evaluation for Language Models (2024, cited by this paper)
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models (2024, cited by this paper)
Representation Noising: A Defence Mechanism Against Harmful Finetuning (2024, cited by this paper)
SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning (2024, cited by this paper)
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning (2024, cited by this paper)
Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models (2024, cited by this paper)
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (2024, influential reference)
Machine Unlearning of Pre-trained Large Language Models (2024, cited by this paper)
TOFU: A Task of Fictitious Unlearning for LLMs (2024, influential reference)
Rethinking machine unlearning for large language models (2024, cited by this paper)
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice (2024, cited by this paper)
Llama 2: Open Foundation and Fine-Tuned Chat Models (2023, cited by this paper)
Poisoning Web-Scale Training Datasets is Practical (2023, cited by this paper)
Towards Unbounded Machine Unlearning (2023, cited by this paper)
TRAK: Attributing Model Behavior at Scale (2023, cited by this paper)
Model Sparsity Can Simplify Machine Unlearning (2023, cited by this paper)
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel (2023, cited by this paper)
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training (2023, cited by this paper)
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena (2023, influential reference)
Gradients Look Alike: Sensitivity is Often Overestimated in DP-SGD (2023, cited by this paper)
Studying Large Language Model Generalization with Influence Functions (2023, influential reference)
Textbooks Are All You Need II: phi-1.5 technical report (2023, influential reference)
Mistral 7B (2023, cited by this paper)
Large Language Model Unlearning (2023, cited by this paper)
Continual Learning and Private Unlearning (2022, cited by this paper)
Locating and Editing Factual Associations in GPT (2022, cited by this paper)
Editing Models with Task Arithmetic (2022, cited by this paper)
Mass-Editing Memory in a Transformer (2022, cited by this paper)
Knowledge Unlearning for Mitigating Privacy Risks in Language Models (2022, cited by this paper)
If Influence Functions are the Answer, Then What is the Question? (2022, cited by this paper)
Are Large Pre-Trained Language Models Leaking Your Personal Information? (2022, cited by this paper)
On the Necessity of Auditable Algorithmic Definitions for Machine Unlearning (2021, cited by this paper)
Measuring Massive Multitask Language Understanding (2020, influential reference)
Approximate Data Deletion from Machine Learning Models: Algorithms and Evaluations (2020, cited by this paper)
Extracting Training Data from Large Language Models (2020, cited by this paper)
GLU Variants Improve Transformer (2020, cited by this paper)
Eigenvalue and Generalized Eigenvalue Problems: Tutorial (2019, cited by this paper)
Machine Unlearning (2019, cited by this paper)
Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks (2019, cited by this paper)
Certified Data Removal from Machine Learning Models (2019, cited by this paper)
PyTorch: An Imperative Style, High-Performance Deep Learning Library (2019, cited by this paper)
Metrics (2018, cited by this paper)
Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis (2018, cited by this paper)
A Robust and Efficient Implementation of LOBPCG (2017, cited by this paper)
Understanding Black-box Predictions via Influence Functions (2017, cited by this paper)
Distributed Second-Order Optimization using Kronecker-Factored Approximations (2016, cited by this paper)
Pointer Sentinel Mixture Models (2016, cited by this paper)
Optimizing Neural Networks with Kronecker-factored Approximate Curvature (2015, influential reference)
New Insights and Perspectives on the Natural Gradient Method (2014, cited by this paper)
Adam: A Method for Stochastic Optimization (2014, cited by this paper)
Deep learning via Hessian-free optimization (2010, cited by this paper)
ROUGE: A Package for Automatic Evaluation of Summaries (2004, cited by this paper)
Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method (2001, cited by this paper)
Natural Gradient Works Efficiently in Learning (1998, cited by this paper)

CITED BY

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities (2025, cites this paper)
Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods (2025, cites this paper)