Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics

Published 2026 in arXiv.org

ABSTRACT

Curriculum learning changes the order of pre-training data, but it remains unclear whether it changes the learning trajectory or mainly reorders exposure over a fixed trajectory. We train Pythia models (14M-410M parameters) for 300B tokens under three linguistically motivated curricula, namely Age-of-Acquisition, word frequency, and Verb Variation (VV), and compare each against Random ordering; at 1B parameters we compare Random and VV. Across orderings, training follows a shared sequence of latent phases, while curricula mainly change within-phase data exposure. In smaller models (up to 160M parameters), Random ordering exhibits higher gradient noise and stronger late-training output-head spectral saturation, alongside lower final accuracy; curricula reduce both effects at matched compute. At larger scales, saturation differences are smaller and curriculum gains shrink. We formalize the link between difficulty pacing and optimization stability in an idealized analysis based on gradient-variance control, and our results point to a practical takeaway: curricula help by stabilizing within-phase optimization rather than by creating new phases.
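
To make the contrast between a curriculum and Random ordering concrete, the following is a minimal sketch of a word-frequency curriculum: documents are sorted so that those made of common words come first. This record does not describe how the paper scores difficulty, so the scoring rule (mean negative log relative frequency, estimated from the corpus itself) and the helper names `frequency_difficulty` and `order_documents` are assumptions made purely for illustration.

```python
import math
import random
from collections import Counter


def frequency_difficulty(doc, counts, total):
    """Hypothetical difficulty score: mean negative log relative frequency.

    Documents made of rarer words receive a higher score.
    """
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(-math.log(counts[w] / total) for w in words) / len(words)


def order_documents(docs, curriculum=True, seed=0):
    """Return docs ordered easy-to-hard (curriculum) or shuffled (Random)."""
    # Assumption: word frequencies are estimated from the corpus itself,
    # not from an external frequency database.
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    if curriculum:
        return sorted(docs, key=lambda d: frequency_difficulty(d, counts, total))
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    return shuffled


docs = [
    "the cat sat on the mat",
    "the dog sat on the cat",
    "quantum chromodynamics constrains hadron spectra",
]
curriculum_order = order_documents(docs, curriculum=True)
random_order = order_documents(docs, curriculum=False)
```

Both orderings contain exactly the same documents; only the exposure order differs, which is the variable the abstract isolates when it compares runs at matched compute.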

PUBLICATION RECORD

Publication year: 2026
Venue: arXiv.org
Publication date: 2026-01-29
Fields of study: Linguistics, Computer Science
Identifiers: DOI 10.48550/arXiv.2601.21698; arXiv 2601.21698
Source metadata: Semantic Scholar

REFERENCES

PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs (2025, influential reference)
On Training Data Influence of GPT Models (2024, cited by this paper)
Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck (2024, influential reference)
Emergent inabilities? Inverse scaling over the course of pretraining (2023, cited by this paper)
Irreducible Curriculum for Language Model Pretraining (2023, cited by this paper)
Latent State Models of Training Dynamics (2023, cited by this paper)
A Mathematical Model for Curriculum Learning (2023, cited by this paper)
Handbook of Convergence Theorems for (Stochastic) Gradient Methods (2023, cited by this paper)
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (2023, cited by this paper)
Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs (2023, cited by this paper)
Training Compute-Optimal Large Language Models (2022, cited by this paper)
Length-Based Curriculum Learning for Efficient Pre-training of Language Models (2022, cited by this paper)
Adaptive Curriculum Learning (2021, cited by this paper)
The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models (2021, cited by this paper)
Curriculum learning for language modeling (2021, cited by this paper)
Scaling Laws for Neural Language Models (2020, cited by this paper)
Estimating Training Data Influence by Tracking Gradient Descent (2020, cited by this paper)
The Pile: An 800GB Dataset of Diverse Text for Language Modeling (2020, cited by this paper)
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019, cited by this paper)
PIQA: Reasoning about Physical Commonsense in Natural Language (2019, cited by this paper)
On The Power of Curriculum Learning in Training Deep Networks (2019, cited by this paper)
An Adversarial Winograd Schema Challenge at Scale (2019, cited by this paper)
One Epoch Is All You Need (2019, cited by this paper)
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge (2018, cited by this paper)
Minimax Curriculum Learning: Machine Teaching with Desirable Difficulties and Scheduled Diversity (2018, cited by this paper)
An Empirical Model of Large-Batch Training (2018, influential reference)
Curriculum Learning by Transfer Learning: Theory and Experiments with Deep Networks (2018, cited by this paper)
Understanding Black-box Predictions via Influence Functions (2017, cited by this paper)
Crowdsourcing Multiple Choice Science Questions (2017, cited by this paper)
The LAMBADA dataset: Word prediction requiring a broad discourse context (2016, cited by this paper)
Optimization Methods for Large-Scale Machine Learning (2016, cited by this paper)
Subtlex-UK: A New and Improved Word Frequency Database for British English (2014, cited by this paper)
Age-of-acquisition ratings for 30,000 English words (2012, cited by this paper)
Curriculum learning (2009, cited by this paper)
Learning and development in neural networks: the importance of starting small. (1993, cited by this paper)
Problèmes et méthodes de la statistique linguistique (1960, cited by this paper)
