Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord

Published 2018 in arXiv.org

ABSTRACT

We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community.
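
The partition rule stated in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `Question` type, the `partition_questions` and `random_baseline_accuracy` helpers, and the two toy solvers are hypothetical stand-ins, and the sketch assumes each solver returns a single answer index. Only the rule itself (a question goes to the Challenge Set when both baseline solvers answer it incorrectly) comes from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Question:
    """Hypothetical multiple-choice question record (not the ARC file format)."""
    stem: str
    choices: Tuple[str, ...]
    answer_index: int


# A solver maps a question to the index of the choice it selects.
Solver = Callable[[Question], int]


def partition_questions(
    questions: Sequence[Question],
    retrieval_solver: Solver,
    cooccurrence_solver: Solver,
) -> Tuple[List[Question], List[Question]]:
    """Split questions into (challenge, easy).

    A question is placed in the challenge list only when BOTH solvers
    pick a wrong choice; every other question goes to the easy list.
    """
    challenge: List[Question] = []
    easy: List[Question] = []
    for q in questions:
        retrieval_wrong = retrieval_solver(q) != q.answer_index
        cooccurrence_wrong = cooccurrence_solver(q) != q.answer_index
        if retrieval_wrong and cooccurrence_wrong:
            challenge.append(q)
        else:
            easy.append(q)
    return challenge, easy


def random_baseline_accuracy(questions: Sequence[Question]) -> float:
    """Expected accuracy of uniform random guessing over each question's choices."""
    return sum(1.0 / len(q.choices) for q in questions) / len(questions)


# Made-up toy data and toy solvers, for illustration only.
toy_questions = [
    Question("q1", ("a", "b", "c", "d"), 0),
    Question("q2", ("a", "b", "c", "d"), 1),
    Question("q3", ("a", "b", "c", "d"), 2),
]


def always_first(q: Question) -> int:
    return 0  # stand-in for the retrieval-based solver


def always_second(q: Question) -> int:
    return 1  # stand-in for the word co-occurrence solver


challenge, easy = partition_questions(toy_questions, always_first, always_second)
print("challenge:", [q.stem for q in challenge])
print("easy:", [q.stem for q in easy])
print("random baseline:", random_baseline_accuracy(toy_questions))
```

With four answer choices per question, the random-guessing figure computed here is 0.25, which is the kind of reference point the abstract compares the tested baselines against.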

PUBLICATION RECORD

Publication year: 2018
Venue: arXiv.org
Publication date: 2018-03-14
Fields of study: Computer Science
Identifiers: arXiv 1803.05457
Source metadata: Semantic Scholar

REFERENCES

SciTaiL: A Textual Entailment Dataset from Science Question Answering (2018, influential reference)
Annotation Artifacts in Natural Language Inference Data (2018, cited by this paper)
Question Answering as Global Reasoning Over Semantic Abstractions (2018, cited by this paper)
Answering Complex Questions Using Open Information Extraction (2017, cited by this paper)
Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension (2017, cited by this paper)
Crowdsourcing Multiple Choice Science Questions (2017, cited by this paper)
Adversarial Examples for Evaluating Reading Comprehension Systems (2017, cited by this paper)
Constructing Datasets for Multi-hop Reading Comprehension Across Documents (2017, cited by this paper)
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension (2017, influential reference)
My Computer Is an Honor Student - but How Intelligent Is It? Standardized Tests as a Measure of AI (2016, influential reference)
Query-Reduction Networks for Question Answering (2016, influential reference)
SQuAD: 100,000+ Questions for Machine Comprehension of Text (2016, influential reference)
Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions (2016, influential reference)
Moving beyond the Turing Test with the Allen AI Science Challenge (2016, cited by this paper)
Question Answering via Integer Programming over Semi-Structured Knowledge (2016, cited by this paper)
How to Write Science Questions that Are Easy for People and Hard for Computers (2016, cited by this paper)
A Decomposable Attention Model for Natural Language Inference (2016, influential reference)
Bidirectional Attention Flow for Machine Comprehension (2016, influential reference)
NewsQA: A Machine Comprehension Dataset (2016, cited by this paper)
Tracking the World State with Recurrent Entity Networks (2016, cited by this paper)
Teaching Machines to Read and Comprehend (2015, cited by this paper)
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks (2015, cited by this paper)
Memory Networks (2014, cited by this paper)
Overview of Todai Robot Project and Evaluation Framework of its NLP-based Problem Solving (2014, cited by this paper)
Diagram Understanding in Geometry Questions (2014, cited by this paper)
MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text (2013, cited by this paper)
Can an AI get into the University of Tokyo (2013, cited by this paper)
Word Association Norms, Mutual Information, and Lexicography (1989, cited by this paper)

CITED BY

KinGuard: Hierarchical Kinship-Aware Fingerprinting to Defend Against Large Language Model Stealing (2026, cites this paper)
Learning the Mechanism of Catastrophic Forgetting: A Perspective from Gradient Similarity (2026, cites this paper)
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models (2026, cites this paper)
From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning (2026, influential citation)
Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction (2026, cites this paper)
Discovering Hidden Gems in Model Repositories (2026, cites this paper)
Modular Prompt Optimization: Optimizing Structured Prompts with Section-Local Textual Gradients (2026, cites this paper)
FinForge: Semi-Synthetic Financial Benchmark Generation (2026, cites this paper)
RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering (2026, cites this paper)
Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models (2026, influential citation)
Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection (2026, cites this paper)
SDUs DAISY: A Benchmark for Danish Culture (2026, cites this paper)
CoFrGeNet: Continued Fraction Architectures for Language Generation (2026, cites this paper)
Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis (2026, cites this paper)
The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining (2026, cites this paper)
FLEx: Language Modeling with Few-shot Language Explanations (2026, cites this paper)
DR-LoRA: Dynamic Rank LoRA for Mixture-of-Experts Adaptation (2026, influential citation)
Learning to Trust the Crowd: A Multi-Model Consensus Reasoning Engine for Large Language Models (2026, cites this paper)
Attention Projection Mixing with Exogenous Anchors (2026, cites this paper)
D2Prune: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness (2026, cites this paper)
Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models (2026, influential citation)
HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network (2026, cites this paper)
Suppressing Final Layer Hidden State Jumps in Transformer Pretraining (2026, influential citation)
RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents (2026, cites this paper)
TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching (2026, cites this paper)
The Epistemological AI Turn: From JTB to KnowledgeS (2026, cites this paper)
RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization (2026, cites this paper)
MAR: Efficient Large Language Models via Module-aware Architecture Refinement (2026, cites this paper)
KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices (2026, cites this paper)
Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models (2026, cites this paper)
Entropy-Based Data Selection for Language Models (2026, cites this paper)
When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents (2026, cites this paper)
Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration (2026, cites this paper)
Benchmark^2: Systematic Evaluation of LLM Benchmarks (2026, influential citation)
STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules (2026, cites this paper)
Shadow Unlearning: A Neuro-Semantic Approach to Fidelity-Preserving Faceless Forgetting in LLMs (2026, cites this paper)
Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths (2026, cites this paper)
XBTorch: A Unified Framework for Modeling and Co-Design of Crossbar-Based Deep Learning Accelerators (2026, cites this paper)
Coverage Improvement and Fast Convergence of On-policy Preference Learning (2026, cites this paper)
When Models Know When They Do Not Know: Calibration, Cascading, and Cleaning (2026, cites this paper)
Sliced-Wasserstein Distribution Alignment Loss Improves the Ultra-Low-Bit Quantization of Large Language Models (2026, cites this paper)
Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats (2026, cites this paper)
Advancing Model Refinement: Muon-Optimized Distillation and Quantization for LLM Deployment (2026, cites this paper)
BYOL: Bring Your Own Language Into LLMs (2026, cites this paper)
A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms (2026, cites this paper)
Threshold Differential Attention for Sink-Free, Ultra-Sparse, and Non-Dispersive Language Modeling (2026, cites this paper)
Martingale Foresight Sampling: A Principled Approach to Inference-Time LLM Decoding (2026, cites this paper)
Sycophancy Hides Linearly in the Attention Heads (2026, cites this paper)
ShapLoRA: Allocation of Low-rank Adaption on Large Language Models via Shapley Value Inspired Importance Estimation (2026, cites this paper)
Grounded Concreteness: Human-Like Concreteness Sensitivity in Vision-Language Models (2026, cites this paper)
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep (2026, cites this paper)
One Token Is Enough: Improving Diffusion Language Models with a Sink Token (2026, cites this paper)
M2XFP: A Metadata-Augmented Microscaling Data Format for Efficient Low-bit Quantization (2026, cites this paper)
GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs (2026, cites this paper)
HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs (2026, cites this paper)
Demystifying Multi-Agent Debate: The Role of Confidence and Diversity (2026, cites this paper)
MoCo: A One-Stop Shop for Model Collaboration Research (2026, cites this paper)
GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization (2026, cites this paper)
Making Foundation Models Probabilistic via Singular Value Ensembles (2026, cites this paper)
Shaping capabilities with token-level data filtering (2026, cites this paper)
TBDFiltering: Sample-Efficient Tree-Based Data Filtering (2026, influential citation)
KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement (2026, cites this paper)
Oiso: Outlier-Isolated Data Format for Low-Bit Large Language Model Quantization (2026, cites this paper)
Rising From Pieces: Effective Inference at the Edge via Robust Split ML (2026, cites this paper)
HybridMoE: LoRA-Based LLMs Fine-Tune With Hybrid Mixture of Experts (2026, cites this paper)
DEFT: Data-Efficient Fine-Tuning Through Multi-Dimensional Data Selection (2026, cites this paper)
HFRWKV: A High-Performance Fully On-Chip Hardware Accelerator for RWKV (2026, cites this paper)
Attention Needs to Focus: A Unified Perspective on Attention Allocation (2026, cites this paper)
MiMo-V2-Flash Technical Report (2026, cites this paper)
SLIM: Stealthy Low-Coverage Black-Box Watermarking via Latent-Space Confusion Zones (2026, cites this paper)
When Models Decide and When They Bind: A Two-Stage Computation for Multiple-Choice Question-Answering (2026, cites this paper)
Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks (2026, cites this paper)
Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding (2026, cites this paper)
Inhibitory Attacks on Backdoor-based Fingerprinting for Large Language Models (2026, cites this paper)
ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging (2026, cites this paper)
ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs (2026, cites this paper)
Structured Reasoning for Large Language Models (2026, cites this paper)
Breaking Model Lock-in: Cost-Efficient Zero-Shot LLM Routing via a Universal Latent Space (2026, cites this paper)
Does Inference Scaling Improve Reasoning Faithfulness? A Multi-Model Analysis of Self-Consistency Tradeoffs (2026, cites this paper)
Monkey Jump : MoE-Style PEFT for Efficient Multi-Task Learning (2026, cites this paper)
Ministral 3 (2026, cites this paper)
Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification (2026, cites this paper)
ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models (2026, cites this paper)
Controlled LLM Training on Spectral Sphere (2026, cites this paper)
Dynamic Consensus Communication Mechanism for Large Language Model-Based Multi-Agent Systems (2026, cites this paper)
GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization (2026, cites this paper)
STEM: Scaling Transformers with Embedding Modules (2026, cites this paper)
Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure (2026, cites this paper)
Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers (2026, cites this paper)
FAQ: Mitigating Quantization Error via Regenerating Calibration Data with Family-Aware Quantization (2026, cites this paper)
Low-Rank Key Value Attention (2026, cites this paper)
Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction (2026, cites this paper)
CTPD: Cross Tokenizer Preference Distillation (2026, cites this paper)
DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable, AI-Hostile Documents for Academic Integrity (2026, cites this paper)
AutoDriDM: An Explainable Benchmark for Decision-Making of Vision-Language Models in Autonomous Driving (2026, cites this paper)
On the Runway Cascade of Transformers for Language Modeling (2026, cites this paper)
Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow (2026, cites this paper)
A Chinese Elementary Science Question Dataset in Problem-Solving Process Generation (2026, cites this paper)
From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation (2026, cites this paper)
L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts (2026, influential citation)