Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, Matt Gardner

Published 2019 in Conference on Empirical Methods in Natural Language Processing

ABSTRACT

Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging, because it is hard to avoid lexical cues that shortcut complex reasoning. We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark—the best model performance is 70.5 F1, while the estimated human performance is 93.4 F1.
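The abstract reports F1 scores without saying how they are computed. As a rough illustration only, the sketch below assumes the token-overlap F1 commonly used for span-selection question answering; this is an assumption, not something the abstract confirms. The function name `token_f1` is hypothetical, and the normalization is deliberately simplified (lowercasing and whitespace splitting only, with no punctuation or article stripping).

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted span and a gold span.

    Hypothetical helper: a simplified stand-in for span-level F1,
    not the scoring code used by the paper.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty spans count as a match; one empty span counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection: each shared token is counted at most as
    # often as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a prediction that recovers only part of a gold span receives partial credit, which is why F1 is usually reported alongside, or instead of, exact match for span-selection tasks.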

PUBLICATION RECORD

Publication year
2019
Venue
Conference on Empirical Methods in Natural Language Processing
Publication date
Unknown publication date
Fields of study
Computer Science
Identifiers
DOI 10.18653/v1/D19-1606 arXiv 1908.05803
Source metadata
Semantic Scholar

REFERENCES

Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets
2019, cited by this paper
Model-based Annotation of Coreference
2019, cited by this paper
A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation
2019, cited by this paper
XLNet: Generalized Autoregressive Pretraining for Language Understanding
2019, cited by this paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2019, influential reference
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
2019, influential reference
Natural Questions: A Benchmark for Question Answering Research
2019, cited by this paper
PreCo: A Large-scale Dataset in Preschool Vocabulary for Coreference Resolution
2018, cited by this paper
AllenNLP: A Deep Semantic Natural Language Processing Platform
2018, cited by this paper
Annotation Artifacts in Natural Language Inference Data
2018, cited by this paper
QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension
2018, cited by this paper
Anaphora Resolution with the ARRAU Corpus
2018, cited by this paper
How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
2018, cited by this paper
RACE: Large-scale ReAding Comprehension Dataset From Examinations
2017, influential reference
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
2017, cited by this paper
Simple and Effective Multi-Paragraph Reading Comprehension
2017, cited by this paper
Attention is All you Need
2017, cited by this paper
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
2016, cited by this paper
SQuAD: 100,000+ Questions for Machine Comprehension of Text
2016, influential reference
WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles
2016, cited by this paper
Removing the Training Wheels: A Coreference Dataset that Entertains Humans and Challenges Computers
2015, cited by this paper
Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books
2015, cited by this paper
MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
2013, cited by this paper
Annotated Gigaword
2012, cited by this paper
Identity, non-identity, and near-identity: Addressing the complexity of coreference
2011, cited by this paper
Experiments with ClueWeb09: Relevance Feedback and Web Tracks
2009, cited by this paper
Vagueness and Referential Ambiguity in a Large-Scale Annotated Corpus
2008, cited by this paper
Elements of a National SemanticWeb Infrastructure--Case Study Finland on the Semantic Web
2007, cited by this paper

CITED BY

Moving Toward a Reader-Centred Classroom: Students’ Reflection on Reading
2025, cites this paper
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
2025, cites this paper
SMARTMiner: Extracting and Evaluating SMART Goals from Low-Resource Health Coaching Notes
2025, cites this paper
A new benchmark dataset and mixture-of-experts language models for adversarial natural language inference in Vietnamese
2025, cites this paper
Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner
2025, cites this paper
AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play
2025, cites this paper
ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese
2025, cites this paper
Testing Question Answering Software with Context-Driven Question Generation
2025, cites this paper
None of the above: comparing scenarios for answerability detection in question answering systems
2025, cites this paper
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
2025, cites this paper
Unlocking Speech Instruction Data Potential with Query Rewriting
2025, influential citation
AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs
2025, influential citation
Large language models-based metric for generative question answering systems
2025, cites this paper
Paraphrasing in Affirmative Terms Improves Negation Understanding
2024, cites this paper
DAPT: A Dual Attention Framework for Parameter-Efficient Continual Learning of Large Language Models
2024, cites this paper
SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains
2024, cites this paper
Continuous Risk Prediction
2024, cites this paper
QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory
2024, cites this paper
Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends
2024, cites this paper
ChatQA: Building GPT-4 Level Conversational QA Models
2024, cites this paper
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
2024, cites this paper
Multi-Task Learning with Adapters for Plausibility Prediction: Bridging the Gap or Falling into the Trenches?
2024, cites this paper
GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory
2024, cites this paper
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
2024, cites this paper
Auto-Grading Comprehension on Reference-Student Answer Pairs using the Siamese-based Transformer
2024, cites this paper
Numerical reasoning reading comprehension on Vietnamese COVID-19 news: task, corpus, and challenges
2024, cites this paper
Enhancing Pre-Trained Generative Language Models with Question Attended Span Extraction on Machine Reading Comprehension
2024, influential citation
Continual Learning of Large Language Models: A Comprehensive Survey
2024, influential citation
QASE Enhanced PLMs: Improved Control in Text Generation for MRC
2024, influential citation
Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?
2024, cites this paper
A Dataset of Open-Domain Question Answering with Multiple-Span Answers
2024, influential citation
Harnessing PubMed User Query Logs for Post Hoc Explanations of Recommended Similar Articles
2024, cites this paper
ChatQA: Surpassing GPT-4 on Conversational QA and RAG
2024, cites this paper
SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models
2024, cites this paper
LIQUID: A Framework for List Question Answering Dataset Generation
2023, influential citation
SMDDH: Singleton Mention Detection using Deep Learning in Hindi Text
2023, cites this paper
A brief survey on recent advances in coreference resolution
2023, cites this paper
Beyond Output Matching: Bidirectional Alignment for Enhanced In-Context Learning
2023, cites this paper
ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
2023, influential citation
Releasing the CRaQAn (Coreference Resolution in Question-Answering): An open-source dataset and dataset creation methodology using instruction-following models
2023, cites this paper
What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception
2023, influential citation
KTRL+F: Knowledge-Augmented In-Document Search
2023, cites this paper
Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks
2023, cites this paper
Instructive Dialogue Summarization with Query Aggregations
2023, cites this paper
InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining
2023, influential citation
Knowledgeable In-Context Tuning: Exploring and Exploiting Factual Knowledge for In-Context Learning
2023, cites this paper
Limitations of Open-Domain Question Answering Benchmarks for Document-level Reasoning
2023, cites this paper
Causal Intervention for Mitigating Name Bias in Machine Reading Comprehension
2023, cites this paper
Complex Reasoning in Natural Language
2023, cites this paper
FiD-ICL: A Fusion-in-Decoder Approach for Efficient In-Context Learning
2023, cites this paper
Dual Cache for Long Document Neural Coreference Resolution
2023, cites this paper
End-to-End Learning with Text & Knowledge Bases
2023, cites this paper
Boosting In-Context Learning with Factual Knowledge
2023, cites this paper
Ellipsis-Dependent Reasoning: a New Challenge for Large Language Models
2023, cites this paper
Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
2023, cites this paper
Enhancing In-Context Learning with Answer Feedback for Multi-Span Question Answering
2023, influential citation
How Many Answers Should I Give? An Empirical Study of Multi-Answer Reading Comprehension
2023, cites this paper
FERMAT: An Alternative to Accuracy for Numerical Reasoning
2023, cites this paper
Few-shot Unified Question Answering: Tuning Models or Prompts?
2023, cites this paper
Prompting with Pseudo-Code Instructions
2023, cites this paper
Pre-Training to Learn in Context
2023, cites this paper
Document Understanding Dataset and Evaluation (DUDE)
2023, cites this paper
Long-Tailed Question Answering in an Open World
2023, influential citation
Pachinko: Patching Interpretable QA Models through Natural Language Feedback
2023, influential citation
Domain Incremental Lifelong Learning in an Open World
2023, cites this paper
Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information
2023, cites this paper
Does Noise Really Matter? Investigation into the Influence of Noisy Labels on Bert-Based Question Answering System
2023, cites this paper
Challenges to Evaluating the Generalization of Coreference Resolution Models: A Measurement Modeling Perspective
2023, cites this paper
Learning to Initialize: Can Meta Learning Improve Cross-task Generalization in Prompt Tuning?
2023, cites this paper
Analyzing the Effectiveness of the Underlying Reasoning Tasks in Multi-hop Question Answering
2023, cites this paper
Lifelong Learning for Question Answering with Hierarchical Prompts
2022, cites this paper
Reasoning Like Program Executors
2022, cites this paper
UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training
2022, cites this paper
ConTinTin: Continual Learning from Task Instructions
2022, cites this paper
Automatically Solving Elementary Science Questions: A Survey
2022, influential citation
UKP-SQUARE: An Online Platform for Question Answering Research
2022, influential citation
Polyglot Prompt: Multilingual Multitask Prompt Training
2022, influential citation
Don’t Blame the Annotator: Bias Already Starts in the Annotation Instructions
2022, influential citation
Improving In-Context Few-Shot Learning via Self-Supervised Training
2022, cites this paper
ProQA: Structural Prompt-based Pre-training for Unified Question Answering
2022, cites this paper
Prompt Tuning for Discriminative Pre-trained Language Models
2022, cites this paper
LogiGAN: Learning Logical Reasoning via Adversarial Pre-training
2022, cites this paper
On Measuring Social Biases in Prompt-Based Multi-Task Learning
2022, cites this paper
Eliciting Transferability in Multi-task Learning with Task-level Mixture-of-Experts
2022, cites this paper
Graph convolutional networks in language and vision: A survey
2022, cites this paper
KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Few-Shot NLP
2022, cites this paper
Enhanced Story Comprehension for Large Language Models through Dynamic Document-Based Knowledge Graphs
2022, influential citation
MultiSpanQA: A Dataset for Multi-Span Question Answering
2022, cites this paper
Few-shot Adaptation Works with UnpredicTable Data
2022, cites this paper
F-coref: Fast, Accurate and Easy to Use Coreference Resolution
2022, cites this paper
Machine Reading, Fast and Slow: When Do Models “Understand” Language?
2022, influential citation
CorefDiffs: Co-referential and Differential Knowledge Flow in Document Grounded Conversations
2022, cites this paper
Honest Students from Untrusted Teachers: Learning an Interpretable Question-Answering Pipeline from a Pretrained Language Model
2022, cites this paper
Evaluating Coreference Resolvers on Community-based Question Answering: From Rule-based to State of the Art
2022, cites this paper
CMQA: A Dataset of Conditional Question Answering with Multiple-Span Answers
2022, cites this paper
Learning Instructions with Unlabeled Data for Zero-Shot Cross-Task Generalization
2022, cites this paper
Continuous QA Learning with Structured Prompts
2022, cites this paper
Different Tunes Played with Equal Skill: Exploring a Unified Optimization Subspace for Delta Tuning
2022, cites this paper
CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation
2022, cites this paper
RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question
2022, cites this paper