Accelerating Training Speed of Tiny Recursive Models with Curriculum Guided Adaptive Recursion

Published 2025 in Unknown venue

ABSTRACT

Background: Recursive reasoning models achieve strong performance through iterative refinement, allowing small networks to match large language models. However, training is computationally expensive, often requiring 36 GPU-hours on Sudoku-Extreme. Existing models use a fixed recursion depth and uniform supervision weighting, which makes training inefficient.

Objectives: We propose CGAR (Curriculum-Guided Adaptive Recursion), which applies curriculum learning to architectural depth. CGAR introduces a Progressive Depth Curriculum (PDC) that dynamically adjusts recursion depth during training, and Hierarchical Supervision Weighting (HSW), which assigns exponentially decaying importance to supervision steps.

Methods: PDC implements a three-stage schedule that transitions from a shallow (2, 1) configuration to the full-depth (6, 3) configuration, providing a 41.4% reduction in FLOPs. HSW applies exponential decay to the supervision-step weights, achieving a 40% reduction in gradient variance and accelerated convergence.

Results: On Sudoku-Extreme, CGAR achieves a 1.71x training speedup (10.93 to 6.38 hours) with only a 0.63 percentage-point accuracy drop (86.65% to 86.02%). PDC alone achieves a 2.26x speedup with 85.47% accuracy, showing a Pareto improvement in efficiency and quality. HSW alone provides a 1.61x speedup. CGAR-trained models also show superior inference efficiency, with 100% halting accuracy and 11% fewer reasoning steps.

Conclusions: CGAR enables efficient training of recursive models on modest hardware. By treating depth as a scheduled parameter, it achieves substantial savings and prevents overfitting, making these models practical for neurosymbolic AI and program synthesis. Code and models are available at https://github.com/Kaleemullahqasim/CGAR and huggingface.co/Kaleemullah/trm-cgar-sudoku.
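The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Only the endpoint configurations (2, 1) and (6, 3) and the use of exponential decay come from the abstract; the intermediate stage (4, 2), the stage boundaries at 30% and 60% of training, the decay rate 0.7, the normalization of the weights, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical stage table: (upper bound on training progress, depth pair).
# Only (2, 1) and (6, 3) appear in the abstract; the middle stage and the
# 0.3 / 0.6 boundaries are assumed values.
STAGES: List[Tuple[float, Tuple[int, int]]] = [
    (0.3, (2, 1)),
    (0.6, (4, 2)),
    (1.0, (6, 3)),
]


def depth_for_progress(progress: float) -> Tuple[int, int]:
    """Return the depth pair for a training progress value in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    for upper, depth in STAGES:
        if progress < upper:
            return depth
    # progress == 1.0 falls through the loop and uses the final stage.
    return STAGES[-1][1]


def hsw_weights(num_steps: int, decay: float = 0.7) -> List[float]:
    """Exponentially decaying weights over supervision steps, summing to 1.

    The decay rate and the normalization are assumptions of this sketch.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    raw = [decay ** t for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_loss(step_losses: Sequence[float], decay: float = 0.7) -> float:
    """Combine per-step losses using the decaying weights."""
    weights = hsw_weights(len(step_losses), decay)
    return sum(w * loss for w, loss in zip(weights, step_losses))
```

In a training loop, `depth_for_progress(step / total_steps)` would select the recursion depth for the current step, and `weighted_loss` would replace a uniform average over the per-step supervision losses. Under this sketch, earlier supervision steps receive larger weights than later ones.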

PUBLICATION RECORD

Publication year: 2025
Venue: Unknown venue
Publication date: 2025-11-11
Fields of study: Computer Science
Identifiers: arXiv 2511.08653
Source metadata: Semantic Scholar

REFERENCES

Recursive Decomposition of Logical Thoughts: Framework for Superior Reasoning and Knowledge Propagation in Large Language Models (2025)
Why Warmup the Learning Rate? Underlying Mechanisms and Improvements (2024)
Loop Neural Networks for Parameter Sharing (2024)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (2024)
Self-Refine: Iterative Refinement with Self-Feedback (2023)
Reflexion: language agents with verbal reinforcement learning (2023)
Improving Deep Neural Networks’ Training for Image Classification With Nonlinear Conjugate Gradient-Style Adaptive Momentum (2023)
A Survey on Efficient Training of Transformers (2023)
PaLM: Scaling Language Modeling with Pathways (2022)
EfficientNetV2: Smaller Models and Faster Training (2021)
PonderNet: Learning to Ponder (2021)
Dynamic Neural Networks: A Survey (2021)
BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression (2021)
Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search (2020)
Language Models are Few-Shot Learners (2020)
On The Power of Curriculum Learning in Training Deep Networks (2019)
Universal Transformers (2018)
Neural Ordinary Differential Equations (2018)
PAD-Net: Multi-tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing (2018)
SkipNet: Learning Dynamic Routing in Convolutional Networks (2017)
GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks (2017)
Automated Curriculum Learning for Neural Networks (2017)
Adaptive Computation Time for Recurrent Neural Networks (2016)
Progressive Neural Networks (2016)
Deep Networks with Stochastic Depth (2016)
Deeply-Supervised Nets (2014)
A measure of intelligence (2012)
Curriculum learning (2009)

CITED BY

Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models (2026)