Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence

Sean McLeish, Ang Li, John Kirchenbauer, Dayal Singh Kalra, Brian R. Bartoldson, B. Kailkhura, Avi Schwarzschild, Jonas Geiping, Tom Goldstein, Micah Goldblum

Published 2025 in arXiv.org

ABSTRACT

Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments on mathematics, we observe that converting pretrained models to recurrent ones yields better performance at a given compute budget than simply post-training the original non-recurrent language model.
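
The two ideas in the abstract, looping part of a layer stack and raising the loop count over training, can be illustrated with a small sketch. The abstract does not say which layers are looped, what shape the curriculum takes, or how state is passed between loops, so everything below is an assumption made for illustration: the prelude/core/coda split, the linear ramp, the `warmup_fraction` parameter, and all function names are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# A "layer" here is any function from a state to a state. Real transformer
# blocks would take tensors; plain floats keep the sketch dependency-free.
Layer = Callable[[float], float]


def split_layers(
    layers: Sequence[Layer], n_prelude: int, n_coda: int
) -> Tuple[List[Layer], List[Layer], List[Layer]]:
    """Partition a pretrained layer stack into prelude, looped core, and coda.

    Assumption: the first n_prelude and last n_coda layers run once and the
    layers in between are reused on every recurrence.
    """
    if n_prelude < 0 or n_coda < 0 or n_prelude + n_coda >= len(layers):
        raise ValueError("need at least one layer left for the recurrent core")
    core_end = len(layers) - n_coda
    return (
        list(layers[:n_prelude]),
        list(layers[n_prelude:core_end]),
        list(layers[core_end:]),
    )


def recurrent_forward(
    x: float,
    prelude: Sequence[Layer],
    core: Sequence[Layer],
    coda: Sequence[Layer],
    num_recurrences: int,
) -> float:
    """Run prelude once, the core num_recurrences times, then coda once."""
    for layer in prelude:
        x = layer(x)
    for _ in range(num_recurrences):
        for layer in core:
            x = layer(x)
    for layer in coda:
        x = layer(x)
    return x


def recurrence_curriculum(
    step: int, total_steps: int, max_recurrences: int, warmup_fraction: float = 0.5
) -> int:
    """Hypothetical linear curriculum: ramp the loop count from 1 up to
    max_recurrences over the first warmup_fraction of training, then hold."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    progress = min(1.0, step / warmup_steps)
    return 1 + int(progress * (max_recurrences - 1))


def effective_depth(n_prelude: int, n_core: int, n_coda: int, r: int) -> int:
    """Number of layer applications in one forward pass with r recurrences."""
    return n_prelude + r * n_core + n_coda


if __name__ == "__main__":
    # Toy stack of 8 layers that each add 1, so the output counts layer calls.
    stack = [lambda v: v + 1.0 for _ in range(8)]
    prelude, core, coda = split_layers(stack, n_prelude=2, n_coda=2)
    for step in (0, 25, 50, 100):
        r = recurrence_curriculum(step, total_steps=100, max_recurrences=8)
        depth = effective_depth(len(prelude), len(core), len(coda), r)
        print(step, r, depth, recurrent_forward(0.0, prelude, core, coda, r))
```

Under these assumptions the parameter count stays fixed while the effective depth grows with the loop count, and early training steps use fewer loops and therefore less compute, which is one way to read the abstract's claim about reduced total computational cost.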

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-11-10
Fields of study: Mathematics, Computer Science
Identifiers: DOI 10.48550/arXiv.2511.07384; arXiv 2511.07384
Source metadata: Semantic Scholar

CITED BY

Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?
2026, cites this paper
A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs
2026, cites this paper
Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning
2026, cites this paper
Understanding Dynamic Compute Allocation in Recurrent Transformers
2026, cites this paper
Step-resolved data attribution for looped transformers
2026, cites this paper
From Growing to Looping: A Unified View of Iterative Computation in LLMs
2026, cites this paper