Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?
Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung
2026-03-04
Summary
This paper explores a way to automatically create harder math problems for advanced AI models, such as those targeting International Mathematical Olympiad (IMO)-level performance.
What's the problem?
As AI gets better at math, it's becoming difficult to find enough challenging and new problems to properly train and test these systems. Simply using existing problems isn't enough because the AI might memorize solutions instead of truly learning to *solve* math. There's a need for a constant stream of fresh, difficult problems.
What's the solution?
The researchers used 'code agents' – AI programs that can write and run code – to take existing math problems and automatically make them more complex. They set up a system where these agents could modify problems, then check that each new problem was still solvable and actually harder than the original. In effect, the agents experimented with changes until they produced suitable new challenges.
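The evolve-then-validate loop described above can be illustrated with a minimal toy sketch. This is not the paper's framework: the "problems" here are just nested linear equations with a planted integer answer, a "mutation" wraps one more layer around the equation, and the validator accepts a variant only if it remains solvable and is measurably deeper. All function and field names (`make_problem`, `evolve`, `evolve_and_validate`, `depth`) are hypothetical stand-ins for the paper's agents.

```python
import random

def make_problem(depth=1, seed=0):
    """Toy 'math problem': nested linear layers applied to a planted integer x."""
    rng = random.Random(seed)
    layers = [(rng.randint(2, 9), rng.randint(1, 9)) for _ in range(depth)]
    x = rng.randint(1, 10)                 # planted solution
    v = x
    for a, b in layers:                    # target = a_n(...(a_0*x + b_0)...) + b_n
        v = a * v + b
    return {"layers": layers, "target": v, "depth": depth}

def solve(problem):
    """The validator's solver: invert the layers to recover x, or None if stuck."""
    v = problem["target"]
    for a, b in reversed(problem["layers"]):
        if (v - b) % a != 0:
            return None                    # not solvable over the integers
        v = (v - b) // a
    return v

def evolve(problem, rng):
    """Mutation: wrap one more linear layer around the problem (raises depth)."""
    a, b = rng.randint(2, 9), rng.randint(1, 9)
    return {
        "layers": problem["layers"] + [(a, b)],
        "target": a * problem["target"] + b,
        "depth": problem["depth"] + 1,
    }

def evolve_and_validate(problem, budget=20, seed=0):
    """Explore mutations; keep a variant only if it is solvable and harder."""
    rng = random.Random(seed)
    for _ in range(budget):
        candidate = evolve(problem, rng)
        if solve(candidate) is not None and candidate["depth"] > problem["depth"]:
            problem = candidate            # accepted: harder but still solvable
    return problem

base = make_problem(depth=1, seed=0)
hard = evolve_and_validate(base, budget=5, seed=1)
assert solve(hard) == solve(base)          # the planted answer is preserved
assert hard["depth"] > base["depth"]       # difficulty proxy strictly increased
```

The design choice mirrored here is that validation is separate from mutation: the mutator proposes freely, and a solver independently certifies both solvability and increased difficulty before a variant is kept, which is what makes the exploration budget ("test-time exploration" in the abstract) safe to scale.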
Why it matters?
This work is important because it offers a way to automatically generate a large number of difficult math problems, which is crucial for continuing to improve AI's mathematical reasoning abilities. It shows that AI can be used not just to *solve* problems, but also to *create* them, opening up possibilities for more effective AI training and evaluation in mathematics.
Abstract
As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.