Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, N. A. Moubayed, Chen Lin

Published 2025 in arXiv.org

ABSTRACT

Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.
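As a rough illustration of what "removing instance-level principal components" could look like, the sketch below strips the top-k principal directions from the token embeddings of a single input. The abstract does not specify the procedure, so the function name `remove_top_components`, the per-instance mean-centring, the use of a full SVD, and the default `k` are all assumptions made for illustration and should not be read as the authors' implementation.

```python
import numpy as np


def remove_top_components(token_embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Hypothetical helper: drop the top-k principal components of one instance.

    token_embeddings has shape (num_tokens, hidden_dim) and holds the
    embeddings of a single input, so no dataset-level statistics are used.
    """
    # Assumption: centre using the mean of this instance's tokens only.
    centred = token_embeddings - token_embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top = vt[:k]  # shape (k, hidden_dim)
    # Subtract each token's projection onto the top-k directions.
    return centred - centred @ top.T @ top
```

Calling `remove_top_components(x, k=2)` on a `(num_tokens, hidden_dim)` array returns an array of the same shape whose largest remaining singular value equals the third singular value of the centred input. How such a step would be attached to a PLM as an add-on module is left open here.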

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-07-29
Fields of study: Computer Science
Identifiers: DOI 10.48550/arXiv.2507.21750; arXiv 2507.21750
Source metadata: Semantic Scholar

CITED BY

Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth (2025)