A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances

Brian B. Moser, Arundhati S. Shanbhag, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel

Published 2025 in arXiv.org

ABSTRACT

Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-05-23
Fields of study: Mathematics, Computer Science
Identifiers: DOI 10.48550/arXiv.2505.17799; arXiv 2505.17799
Source metadata: Semantic Scholar

REFERENCES

SynCo-OOD: Synthetic-contrastive learning for graph out-of-distribution detection (2026)
STAFF: Speculative Coreset Selection for Task-Specific Fine-tuning (2025)
Coreset-Based Task Selection for Sample-Efficient Meta-Reinforcement Learning (2025)
Reducing annotation effort in agricultural data: simple and fast unsupervised coreset selection with DINOv2 and K-means (2025)
HyperCore: Coreset Selection under Noise via Hypersphere Models (2025)
Unsupervised Wind Turbine Blade Damage Detection With Memory-Aided Denoising Reconstruction (2025)
Efficient and Effective In-context Demonstration Selection with Coreset (2025)
Machine (2025)
The Evolution of Dataset Distillation: Toward Scalable and Generalizable Solutions (2025)
Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images (2025)
UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning (2025)
Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty (2025)
Distillation vs. Sampling for Efficient Training of Learning to Rank Models (2024)
Fast Static and Dynamic Approximation Algorithms for Geometric Optimization Problems: Piercing, Independent Set, Vertex Cover, and Matching (2024)
D4M: Dataset Distillation via Disentangled Diffusion Model (2024)
SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training (2024, influential reference)
Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning (2024)
Coreset Selection for Object Detection (2024)
DRoP: Distributionally Robust Data Pruning (2024)
SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution (2024)
A Study in Dataset Pruning for Image Super-Resolution (2024)
Coreset Discovery for Machine Learning Problems (2024)
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (2024)
Gradient Coreset for Federated Learning (2024)
Towards Effective Multiple-in-One Image Restoration: A Sequential and Prompt Learning Strategy (2024)
A CLIP-Powered Framework for Robust and Generalizable Data Selection (2024)
Dual-Enhanced Coreset Selection with Class-Wise Collaboration for Online Blurry Class Incremental Learning (2024)
BloomCoreset: Fast Coreset Sampling using Bloom Filters for Fine-Grained Self-Supervised Learning (2024)
Mind the Boundary: Coreset Selection via Reconstructing the Decision Boundary (2024)
Label-Guided Coreset Generation for Computationally Efficient Chest X-Ray Diagnosis (2024)
Application of Dataset Pruning and Dynamic Transfer Learning on Vision Transformers for MGMT Prediction on Brain MRI Images (2024)
Distill the Best, Ignore the Rest: Improving Dataset Distillation with Loss-Value-Based Pruning (2024)
RedPajama: an Open Dataset for Training Large Language Models (2024)
FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular Data (2024)
Distill Gold from Massive Ores: Bi-level Data Pruning Towards Efficient Dataset Distillation (2024)
In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models (2024)
Towards Sustainable Learning: Coresets for Data-efficient Deep Learning (2023)
Probabilistic Bilevel Coreset Selection (2023)
Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints (2023)
Not All Patches Are Equal: Hierarchical Dataset Condensation for Single Image Super-Resolution (2023)
LLMaAA: Making Large Language Models as Active Annotators (2023)
You Only Condense Once: Two Rules for Pruning Condensed Datasets (2023)
ASP: Automatic Selection of Proxy dataset for efficient AutoML (2023)
D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning (2023)
Coreset selection can accelerate quantum machine learning models with provable generalization (2023)
Dynamic Attention-Guided Diffusion for Image Super-Resolution (2023)
Counterfactual Active Learning for Out-of-Distribution Generalization (2023)
Moderate Coreset: A Universal Method of Data Selection for Real-world Data-efficient Deep Learning (2023)
Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective (2023)
NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks (2023)
Near-Optimal Quantum Coreset Construction Algorithms for Clustering (2023)
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (2023)
Map-based experience replay: a memory-efficient solution to catastrophic forgetting in reinforcement learning (2023)
Generalizing Dataset Distillation via Deep Generative Prior (2023)
Exploiting redundancy in large materials datasets for efficient machine learning with less data (2023)
DINOv2: Learning Robust Visual Features without Supervision (2023)
Exploring the Limits of Deep Image Clustering using Pretrained Models (2023)
Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning (2023)
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (2023)
Data-Efficient Training of CNNs and Transformers with Coresets: A Stability Perspective (2023)
Selective experience replay compression using coresets for lifelong deep reinforcement learning in medical imaging (2023)
Neural Architecture Search Survey: A Computer Vision Perspective (2023)
A Comprehensive Survey of Continual Learning: Theory, Method and Application (2023)
Reproducible Scaling Laws for Contrastive Language-Image Learning (2022)
Coverage-centric Coreset Selection for High Pruning Rates (2022, influential reference)
A survey on computationally efficient neural architecture search (2022)
Adaptive Ranking-based Sample Selection for Weakly Supervised Class-imbalanced Text Classification (2022)
Fed-CBS: A Heterogeneity-Aware Client Sampling Mechanism for Federated Learning via Class-Imbalance Reduction (2022)
ORIENT: Submodular Mutual Information Measures for Data Subset Selection under Distribution Shift (2022)
Beyond neural scaling laws: beating power law scaling via data pruning (2022)
Performance analysis of coreset selection for quantum implementation of K-Means clustering algorithm (2022)
Sampling with Trustworthy Constraints: A Variational Gradient Framework (2022)
DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning (2022, influential reference)
Scaling Laws for Reward Model Overoptimization (2022)
Hierarchical Text-Conditional Image Generation with CLIP Latents (2022)
Dataset Distillation by Matching Training Trajectories (2022)
Adaptive Patch Exiting for Scalable Single Image Super-Resolution (2022)
Less is More: Proxy Datasets in NAS approaches (2022)
Predictability and Surprise in Large Generative Models (2022)
Deep Clustering: A Comprehensive Survey (2022)
PRISM: A Rich Class of Parameterized Submodular Information Measures for Guided Data Subset Selection (2021)
Learning Transferable Visual Models From Natural Language Supervision (2021)
Understanding deep learning (still) requires rethinking generalization (2021)
ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic (2021)
In the light of feature distributions: moment matching for Neural Style Transfer (2021)
Emerging Properties in Self-Supervised Vision Transformers (2021)
Submodular Mutual Information for Targeted Data Subset Selection (2021)
Minority Class Oriented Active Learning for Imbalanced Datasets (2021)
Online Coreset Selection for Rehearsal-based Continual Learning (2021)
Accelerating Neural Architecture Search via Proxy Data (2021)
GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training (2021)
RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning (2021)
VAE-based Deep SVDD for anomaly detection (2021)
SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios (2021)
Deep Learning on a Data Diet: Finding Important Examples Early in Training (2021)
Batch Active Learning at Scale (2021)
Active Learning by Acquiring Contrastive Examples (2021)
Dataset Condensation with Distribution Matching (2021)
Self-feature Learning: An Efficient Deep Lightweight Network for Image Super-resolution (2021)
Robust and Fully-Dynamic Coreset for Continuous-and-Bounded Learning (With Outliers) Problems (2021)

CITED BY

A Dataset is Worth 1 MB (2026)
ULNet: Federated Unlearning for SDN Control-Plane Anomaly Detection (2026)
A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't) (2026)
ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios (2026)
Stop Preaching and Start Practising Data Frugality for Responsible Development of AI (2026)
Learning from Complexity: Exploring Dynamic Sample Pruning of Spatio-Temporal Training (2026)
Bound to Disagree: Generalization Bounds via Certifiable Surrogates (2026)
Iterative Misclassification Error Training (IMET): An Optimized Neural Network Training Technique for Image Classification (2025)
DUSE: A Data Expansion Framework for Low-resource Automatic Modulation Recognition based on Active Learning (2025)
SubZeroCore: A Submodular Approach with Zero Training for Coreset Selection (2025)
SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone (2025)
Adaptive Data Selection for Multi-Layer Perceptron Training: A Sub-linear Value-Driven Method (2025)
VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction (2025)
Choose Before You Label: Efficient Node Selection in Constrained Federated Learning (2025)