Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

Ivan Rodkin, Daniil Orel, Konstantin Smirnov, Arman Bolatov, Bilal Elbouardi, Besher Hassan, Yuri Kuratov, Aydar Bulatov, Preslav Nakov, Timothy Baldwin, Artem Shelmanov, Mikhail Burtsev

2025-08-26

Summary

This research investigates how well large language models can perform complex, multi-step reasoning, specifically by testing whether they learn the underlying rules of simple, rule-based systems called cellular automata rather than just memorizing patterns.

What's the problem?

Large language models are good at many things, but it's not clear *how* they learn to reason through a series of steps to solve a problem. Researchers wanted to understand what makes some models better at this kind of thinking than others. A key issue is that models often memorize training data instead of learning the underlying rules, which makes it hard to tell whether they are truly reasoning. The study focuses on the gap between predicting the very next step in a sequence, which models do well, and thinking several steps ahead, where their performance drops sharply.

What's the solution?

The researchers used a simplified world called a 'cellular automaton' – think of it as a grid where each cell changes state based on simple rules. They trained different types of models to predict how this grid evolves over time, using randomly chosen rules and random starting conditions so the models could not simply memorize the answers. They found that making the models 'deeper' (more layers) helped with sequential calculations. Even more importantly, adding recurrence, memory, and extra computation at test time (test-time compute scaling) substantially improved the models' ability to reason through multiple steps.
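To make the setup concrete, here is a minimal sketch of how training data like this could be generated: a 1D binary cellular automaton with a randomly sampled Boolean rule and a random initial state. The function names, neighborhood size, and grid width are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def random_boolean_rule(neighborhood_size=3, rng=None):
    """Sample a random Boolean update rule as a lookup table over all
    2**neighborhood_size possible neighborhood patterns."""
    rng = rng or np.random.default_rng()
    return rng.integers(0, 2, size=2 ** neighborhood_size)

def step(state, rule, neighborhood_size=3):
    """Apply the rule once to a 1D binary state with periodic boundaries."""
    offsets = range(-(neighborhood_size // 2), neighborhood_size // 2 + 1)
    index = np.zeros(len(state), dtype=int)
    # Encode each cell's neighborhood (left..right) as an index into the rule table.
    for off in offsets:
        index = index * 2 + np.roll(state, -off)
    return rule[index]

def generate_trajectory(width=32, steps=8, rng=None):
    """Random rule + random initial condition -> a sequence of states,
    so a model must infer the rule instead of memorizing trajectories."""
    rng = rng or np.random.default_rng()
    rule = random_boolean_rule(rng=rng)
    state = rng.integers(0, 2, size=width)
    states = [state]
    for _ in range(steps):
        state = step(state, rule)
        states.append(state)
    return np.stack(states)

trajectory = generate_trajectory()
print(trajectory.shape)  # (steps + 1, width)
```

Because both the rule and the initial grid are resampled for every trajectory, a model can only succeed by abstracting the update rule from the observed states.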

Why it matters?

Understanding how models learn to reason is crucial for building more intelligent AI. If we can figure out which architectural changes and training techniques help models think step by step, we can create AI systems that are better at solving complex, multi-step problems and making reliable predictions. This work provides insights into how to improve these reasoning abilities, moving beyond pattern recognition toward genuinely applying learned rules.

Abstract

Reasoning is a core capability of large language models, yet understanding how they learn and perform multi-step reasoning remains an open problem. In this study, we explore how different architectures and training methods affect model multi-step reasoning capabilities within a cellular automata framework. By training on state sequences generated with random Boolean functions for random initial conditions to exclude memorization, we demonstrate that most neural architectures learn to abstract the underlying rules. While models achieve high accuracy in next-state prediction, their performance declines sharply if multi-step reasoning is required. We confirm that increasing model depth plays a crucial role for sequential computations. We demonstrate that an extension of the effective model depth with recurrence, memory, and test-time compute scaling substantially enhances reasoning capabilities.
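The abstract contrasts next-state prediction with multi-step reasoning. Below is a minimal sketch of how such a multi-step evaluation might look: a trained next-state predictor is rolled forward autoregressively and compared against the ground-truth trajectory at each depth. The `model` callable and its interface are assumptions for illustration, not the paper's actual evaluation code.

```python
import numpy as np

def multistep_accuracy(model, initial_state, true_states, depth):
    """Roll a next-state predictor forward `depth` steps, feeding its own
    predictions back in, and report per-cell accuracy at each step.
    Accuracy typically degrades with depth, as the abstract describes."""
    state = initial_state
    accuracies = []
    for t in range(depth):
        state = model(state)  # predicted state becomes the next input
        accuracies.append(float(np.mean(state == true_states[t])))
    return accuracies
```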