NewsQA: A Machine Comprehension Dataset
Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, Kaheer Suleman
Published 2016 in Rep4NLP@ACL
ABSTRACT
We present NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs. Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text in the articles. We collect this dataset through a four-stage process designed to solicit exploratory questions that require reasoning. Analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment. We measure human performance on the dataset and compare it to several strong neural models. The performance gap between humans and machines (13.3% F1) indicates that significant progress can be made on NewsQA through future research. The dataset is freely available online.
PUBLICATION RECORD
- Publication year
2016
- Venue
Rep4NLP@ACL
- Publication date
2016-11-29
- Fields of study
Computer Science
- Source metadata
Semantic Scholar
LINKED PAPERS
- SQuAD: 100,000+ Questions for Machine Comprehension of Text
- reading comprehension related to · NewsQA is described as a machine comprehension dataset of question-answer pairs with span answers, which matches the reading-comprehension task of answering questions grounded in a passage.
- BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
- reading comprehension related to
CONCEPTS
- exploratory questions
Questions intended to probe article understanding and require reasoning rather than direct lookup.
- F1 score
The overlap-based evaluation metric used to compare human and model answers.
Aliases: F1
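The overlap-based F1 used for span-answer evaluation can be sketched as a token-level precision/recall harmonic mean. This is a minimal illustration in the style of standard span-QA scoring; the exact normalization rules (punctuation and article stripping, multi-reference handling) used for NewsQA may differ.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer span and a reference span.

    Sketch of the standard overlap-based metric for span-answer QA;
    real evaluation scripts typically also normalize punctuation and
    articles before tokenizing.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the red car", "red car")` yields 0.8: two of three predicted tokens match (precision 2/3) and both reference tokens are covered (recall 1.0).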
- four-stage process
The four-step annotation workflow used to collect the dataset from crowdworkers.
- NewsQA
A machine comprehension dataset of over 100,000 human-generated question-answer pairs from CNN news articles, with answers selected as text spans.
- performance gap between humans and machines
The measured difference in F1 between human answers and model answers on NewsQA.
Aliases: performance gap
- simple word matching and recognizing textual entailment
Shallow lexical matching and entailment-style inference abilities mentioned as comparison points in the analysis.
Aliases: word matching, textual entailment
- strong neural models
Neural machine comprehension systems used as comparison baselines in the paper.
Aliases: neural models