Adversarial NLI: A New Benchmark for Natural Language Understanding

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, J. Weston, Douwe Kiela

Published 2019 in Annual Meeting of the Association for Computational Linguistics

ABSTRACT

We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.

PUBLICATION RECORD

Publication year
2019
Venue
Annual Meeting of the Association for Computational Linguistics
Publication date
2019-10-31
Fields of study
Linguistics, Computer Science
Identifiers
DOI: 10.18653/v1/2020.acl-main.441; arXiv: 1910.14599
Source metadata
Semantic Scholar

CONCEPTS

adversarial nli
dataset, benchmark

A large-scale NLI benchmark dataset collected via an iterative adversarial human-and-model-in-the-loop procedure introduced in this paper.

Aliases: ANLI

human-and-model-in-the-loop
method

An iterative data collection procedure where human annotators and models interact adversarially to generate challenging examples.

never-ending learning
learning paradigm

A continual learning scenario in which the benchmark evolves as a moving target rather than remaining a fixed static evaluation.

nli benchmarks
evaluation setting

Existing natural language inference evaluation datasets used to measure model performance in this paper.

Aliases: NLI datasets

nlu models
model

Neural models for natural language understanding whose weaknesses are probed by non-expert annotators in this paper.

Aliases: state-of-the-art models

REFERENCES

Visual Question Answering: From Theory to Application
2022cited by this paper
Adversarial Filters of Dataset Biases
2020cited by this paper
No Training Required: Exploring Random Encoders for Sentence Classification
2019cited by this paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019influential reference
Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
2019cited by this paper
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
2019cited by this paper
Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack
2019cited by this paper
Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study
2019cited by this paper
Abductive Commonsense Reasoning
2019cited by this paper
RoBERTa: A Robustly Optimized BERT Pretraining Approach
2019influential reference
HellaSwag: Can a Machine Really Finish Your Sentence?
2019cited by this paper
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
2019cited by this paper
Learning from On-Line User Feedback in Neural Question Answering on the Web
2019cited by this paper
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
2019influential reference
Multi-Task Deep Neural Networks for Natural Language Understanding
2019influential reference
XLNet: Generalized Autoregressive Pretraining for Language Understanding
2019cited by this paper
Trick Me If You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples
2019cited by this paper
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
2018influential reference
Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment
2018cited by this paper
Stress Test Evaluation for Natural Language Inference
2018influential reference
Trick Me If You Can: Adversarial Writing of Trivia Challenge Questions
2018cited by this paper
How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
2018cited by this paper
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
2018cited by this paper
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
2018cited by this paper
Do CIFAR-10 Classifiers Generalize to CIFAR-10?
2018cited by this paper
Stress-Testing Neural Models of Natural Language Inference with Multiply-Quantified Sentences
2018cited by this paper
Combining Fact Extraction and Verification with Neural Semantic Matching Networks
2018cited by this paper
Trick Me If You Can: Human-in-the-Loop Generation of Adversarial Examples for Question Answering
2018cited by this paper
Breaking NLI Systems with Sentences that Require Simple Lexical Inferences
2018influential reference
SentEval: An Evaluation Toolkit for Universal Sentence Representations
2018influential reference
Annotation Artifacts in Natural Language Inference Data
2018influential reference
FEVER: a Large-scale Dataset for Fact Extraction and VERification
2018cited by this paper
Hypothesis Only Baselines in Natural Language Inference
2018influential reference
Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task
2017influential reference
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
2017cited by this paper
Synthetic and Natural Noise Both Break Neural Machine Translation
2017cited by this paper
Reading Wikipedia to Answer Open-Domain Questions
2017cited by this paper
Teaching Machines to Describe Images via Natural Language Feedback
2017cited by this paper
Mastering the Dungeon: Grounded Language Learning by Mechanical Turker Descent
2017cited by this paper
A Continuously Growing Dataset of Sentential Paraphrases
2017cited by this paper
Adversarial Examples for Evaluating Reading Comprehension Systems
2017cited by this paper
Natural Language Inference over Interaction Space
2017cited by this paper
Shortcut-Stacked Sentence Encoders for Multi-Domain Inference
2017cited by this paper
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
2017cited by this paper
Teaching Machines to Describe Images with Natural Language Feedback
2017cited by this paper
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2016cited by this paper
Build It, Break It, Fix It: Contesting Secure Development
2016cited by this paper
The LAMBADA dataset: Word prediction requiring a broad discourse context
2016cited by this paper
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
2016cited by this paper
SQuAD: 100,000+ Questions for Machine Comprehension of Text
2016cited by this paper
VQA: Visual Question Answering
2015cited by this paper
Deep Residual Learning for Image Recognition
2015cited by this paper
Exploring Nearest Neighbor Approaches for Image Captioning
2015cited by this paper
The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations
2015influential reference
A large annotated corpus for learning natural language inference
2015influential reference
Never-Ending Learning
2015cited by this paper
ImageNet Large Scale Visual Recognition Challenge
2014influential reference
Regularization of Neural Networks using DropConnect
2013cited by this paper
Multi-column deep neural networks for image classification
2012cited by this paper
The Seventh PASCAL Recognizing Textual Entailment Challenge
2011cited by this paper
ImageNet: A large-scale hierarchical image database
2009cited by this paper
The Sixth PASCAL Recognizing Textual Entailment Challenge
2009cited by this paper
Gradient-based learning applied to document recognition
1998cited by this paper
The Tragedy of Hamlet, Prince of Denmark
1985cited by this paper

CITED BY

DNF: Dual-Layer Nested Fingerprinting for Large Language Model Intellectual Property Protection
2026cites this paper
CATTO: Balancing Preferences and Confidence in Language Models
2026influential citation
Overton Pluralistic Reinforcement Learning for Large Language Models
2026cites this paper
ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models
2026cites this paper
Evaluating Robustness and Generalization in LLMs under Adversarial and Real-World Conditions
2026cites this paper
Inhibitory Attacks on Backdoor-based Fingerprinting for Large Language Models
2026cites this paper
CoSA: Compressed Sensing-Based Adaptation of Large Language Models
2026cites this paper
Secure Semantic Communications via AI Defenses: Fundamentals, Solutions, and Future Directions
2026cites this paper
Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models
2026cites this paper
EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
2026cites this paper
Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning
2026cites this paper
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
2026cites this paper
Fine-Grained Model Merging via Modular Expert Recombination
2026cites this paper
From coarse to fine-grained decomposition: Hierarchical question generation and learned importance for automated fact-checking
2026cites this paper
Retrieval augmentation for out-of-distribution robustness in non-knowledge intensive in-context learning
2026cites this paper
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
2026cites this paper
Assessing LLM Reliability on Temporally Recent Open-Domain Questions
2026cites this paper
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning
2026cites this paper
LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules
2026cites this paper
D2A2: Enhancing LLM knowledge distillation efficiency and performance with difficulty-aware and adaptive distillation framework
2026cites this paper
KinGuard: Hierarchical Kinship-Aware Fingerprinting to Defend Against Large Language Model Stealing
2026cites this paper
Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference
2026cites this paper
Retrieve-Refine-Calibrate: A Framework for Complex Claim Fact-Checking
2026cites this paper
OSNIP: Breaking the Privacy-Utility-Efficiency Trilemma in LLM Inference via Obfuscated Semantic Null Space
2026cites this paper
Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models
2026cites this paper
Augmenting Small Language Model for Better Medical Question Answering through Source Authentication
2026cites this paper
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
2026cites this paper
Krause Synchronization Transformers
2026cites this paper
AdNanny: One Reasoning LLM for All Offline Ads Recommendation Tasks
2026cites this paper
Filling the Gap: Is Commonsense Knowledge Generation useful for Natural Language Inference?
2025cites this paper
SAGE: A Context-Aware Approach for Mining Privacy Requirements Relevant Reviews from Mental Health Apps
2025cites this paper
A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation
2025cites this paper
On the Effect of Uncertainty on Layer-wise Inference Dynamics
2025cites this paper
Structured Discourse Representation for Factual Consistency Verification
2025influential citation
CMER: A Context-Aware Approach for Mining Ethical Concern-related App Reviews
2025cites this paper
Improving Efficiency in Large Language Models via Extendable Block Floating Point Representation
2025cites this paper
QueueEDIT: Structural Self-Correction for Sequential Model Editing in LLMs
2025cites this paper
Red teaming large language models: A comprehensive review and critical analysis
2025cites this paper
PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models
2025cites this paper
You Only Fine-tune Once: Many-Shot In-Context Fine-Tuning for Large Language Model
2025cites this paper
NeurIPS 2025 E2LM Competition : Early Training Evaluation of Language Models
2025cites this paper
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers
2025influential citation
From Seed to Harvest: Augmenting Human Creativity with AI for Red-teaming Text-to-Image Models
2025cites this paper
Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal
2025cites this paper
Improving the OOD Performance of Closed-Source LLMs on NLI Through Strategic Data Selection
2025cites this paper
Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer
2025cites this paper
A Comparative Performance Analysis of Locally Deployed Large Language Models Through a Retrieval-Augmented Generation Educational Assistant Application for Textual Data Extraction
2025cites this paper
SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing
2025cites this paper
Probabilistic distances-based hallucination detection in LLMs with RAG
2025cites this paper
DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts
2025cites this paper
MEraser: An Effective Fingerprint Erasure Approach for Large Language Models
2025cites this paper
Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches
2025cites this paper
Beyond the Benchmark: A Customizable Platform for Real-Time, Preference-Driven LLM Evaluation
2025cites this paper
AI-Powered Assessment of Resistance to Change in the Context of Digital Transformation
2025cites this paper
Diffusion Beats Autoregressive in Data-Constrained Settings
2025cites this paper
AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs
2025cites this paper
Less Mature is More Adaptable for Sentence-level Language Modeling
2025cites this paper
OBELLA: Open the Book for Evaluating Long-Form Large Language Model Answers in Open-Domain Question Answering
2025cites this paper
Train-before-Test Harmonizes Language Model Rankings
2025cites this paper
Towards a Principled Evaluation of Knowledge Editors
2025cites this paper
Towards Compute-Optimal Many-Shot In-Context Learning
2025cites this paper
Agent Identity Evals: Measuring Agentic Identity
2025cites this paper
Mitigating error propagation in multi-hop fact verification with logic reasoning
2025cites this paper
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
2025cites this paper
Domain Regeneration: How well do LLMs match syntactic properties of text domains?
2025cites this paper
When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
2025cites this paper
Navigating the Accuracy-Size Trade-Off with Flexible Model Merging
2025cites this paper
Pushing the boundary on Natural Language Inference
2025cites this paper
LoCal: Logical and Causal Fact-Checking with LLM-Based Multi-Agents
2025cites this paper
Do Entailment Models know about Reasoning Temporal Ordering on Clinical Texts?
2025cites this paper
From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs
2025cites this paper
Debate-Feedback: A Multi-Agent Framework for Efficient Legal Judgment Prediction
2025cites this paper
Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models
2025cites this paper
AdaptMI: Adaptive Skill-based In-context Math Instruction for Small Language Models
2025cites this paper
LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference
2025cites this paper
An Efficient Plugin Method for Metric Optimization of Black-Box Models
2025cites this paper
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning
2025influential citation
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
2025cites this paper
J&H: Evaluating the Robustness of Large Language Models Under Knowledge-Injection Attacks in Legal Domain
2025cites this paper
Disentangling Reasoning Factors for Natural Language Inference
2025cites this paper
Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness
2025cites this paper
Myanmar XNLI: building a dataset and exploring low-resource approaches to natural language inference with Myanmar
2025cites this paper
CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs
2025cites this paper
aiXamine: Simplified LLM Safety and Security
2025cites this paper
FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking
2025influential citation
FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
2025cites this paper
Evaluating Numeracy of Language Models as a Natural Language Inference Task
2025cites this paper
Always Tell Me The Odds: Fine-grained Conditional Probability Estimation
2025influential citation
Analog Foundation Models
2025cites this paper
Forest for the Trees: Overarching Prompting Evokes High-Level Reasoning in Large Language Models
2025influential citation
AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners
2025cites this paper
Generalizable Process Reward Models via Formally Verified Training Data
2025influential citation
The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
2025cites this paper
Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning
2025cites this paper
Enhancing logical reasoning in language models: An investigation of the Capybara dataset
2025cites this paper
Exploring Explanations Improves the Robustness of In-Context Learning
2025cites this paper
A MISMATCHED Benchmark for Scientific Natural Language Inference
2025cites this paper
Tau-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization
2025influential citation
Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning
2025cites this paper
A Systematic Survey of Automatic Prompt Optimization Techniques
2025cites this paper