Rissanen Data Analysis: Examining Dataset Characteristics via Description Length

Published 2021 in International Conference on Machine Learning

ABSTRACT

We introduce a method to determine whether a given capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels' minimum description length (MDL) as a proxy, giving us a theoretically grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability in a wide variety of NLP settings, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, to uncovering dataset gender bias.
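The comparison described in the abstract (is the labels' description length shorter when a capability is available than when it is not?) can be illustrated with a small sketch. It assumes prequential (online) coding as the MDL estimator, which is one common choice and is not named in the abstract, and it uses a toy Laplace-smoothed count model in place of a learned model. The function name `prequential_codelength`, the XOR toy data, and the variable names are hypothetical and exist only for illustration.

```python
import math
import random
from collections import defaultdict


def prequential_codelength(features, labels, num_classes):
    """Bits needed to transmit the labels one at a time.

    Each label is coded with a Laplace-smoothed count model that has been
    fit only on the examples already transmitted (prequential coding).
    This is an illustrative estimator, not the paper's implementation.
    """
    counts = defaultdict(lambda: [0] * num_classes)
    total_bits = 0.0
    for x, y in zip(features, labels):
        seen = counts[x]
        # Probability assigned to the true label before observing it.
        p = (seen[y] + 1) / (sum(seen) + num_classes)
        total_bits += -math.log2(p)
        seen[y] += 1  # update the model only after paying for the label
    return total_bits


# Toy data (assumption): the label is the XOR of two input bits, so the
# second bit plays the role of the "capability" under test.
rng = random.Random(0)
a = [rng.randint(0, 1) for _ in range(200)]
b = [rng.randint(0, 1) for _ in range(200)]
labels = [x ^ y for x, y in zip(a, b)]

bits_without = prequential_codelength(a, labels, num_classes=2)
bits_with = prequential_codelength(list(zip(a, b)), labels, num_classes=2)

print(f"codelength without second bit: {bits_without:.1f} bits")
print(f"codelength with second bit:    {bits_with:.1f} bits")
```

In this toy setup the labels are far cheaper to encode once the second bit is available, which is the sense in which a capability is judged useful: it shortens the description of the labels given the inputs.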

PUBLICATION RECORD

Publication year: 2021
Venue: International Conference on Machine Learning
Publication date: 2021-03-05
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 2103.03872
Source metadata: Semantic Scholar

CITED BY

Revisiting Generalization Across Difficulty Levels: It's Not So Easy
2025 (cites this paper)
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
2024 (cites this paper)
Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information
2024 (cites this paper)
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning
2024 (cites this paper)
GeoHard: Towards Measuring Class-wise Hardness through Modelling Class Semantics
2024 (cites this paper)
Thrust: Adaptively Propels Large Language Models with External Knowledge
2023 (cites this paper)
Investigating UD Treebanks via Dataset Difficulty Measures
2023 (cites this paper)
On Dataset Transferability in Active Learning for Transformers
2023 (cites this paper)
The Map Equation Goes Neural
2023 (cites this paper)
Out-of-Distribution Detection by Leveraging Between-Layer Transformation Smoothness
2023 (cites this paper)
Measuring Pointwise $\mathcal{V}$-Usable Information In-Context-ly
2023 (cites this paper)
Bridging Information-Theoretic and Geometric Compression in Language Models
2023 (cites this paper)
Data Similarity is Not Enough to Explain Language Model Performance
2023 (cites this paper)
Incorporating Syntactic Knowledge into Pre-trained Language Model using Optimization for Overcoming Catastrophic Forgetting
2023 (cites this paper)
A Latent-Variable Model for Intrinsic Probing
2022 (cites this paper)
Sequential Learning Of Neural Networks for Prequential MDL
2022 (cites this paper)
On Measuring the Intrinsic Few-Shot Hardness of Datasets
2022 (influential citation)
Which Shortcut Solution Do Question Answering Models Prefer to Learn?
2022 (cites this paper)
Understanding Dataset Difficulty with V-Usable Information
2021 (cites this paper)
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little
2021 (cites this paper)
True Few-Shot Learning with Language Models
2021 (cites this paper)
What Context Features Can Transformer Language Models Use?
2021 (cites this paper)
Comparing Text Representations: A Theory-Driven Approach
2021 (cites this paper)
Rationales for Sequential Predictions
2021 (cites this paper)
MarkupMnA: Markup-Based Segmentation of M&A Agreements
year unknown (cites this paper)