Learning to Discover at Test Time

Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun

2026-01-23

Summary

This paper explores a new way to use artificial intelligence, specifically large language models (LLMs), to discover state-of-the-art solutions to difficult scientific problems.

What's the problem?

Traditionally, using LLMs to solve complex problems means prompting them with information and hoping for a good answer. Existing search methods like AlphaEvolve treat the LLM as a 'black box' whose weights never change during the problem-solving process: the model proposes candidates, but it never learns from how those candidates score. The challenge is to let the LLM improve *while* it works on one specific problem, rather than just generalizing across many problems at once, and to find a single, excellent solution instead of many merely decent ones. A caricature of the frozen-model search loop appears below.
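
To make the contrast concrete, here is a minimal sketch of what such a frozen-model search loop looks like. It is an illustrative caricature, not AlphaEvolve's actual algorithm; `llm.generate` and `evaluate` are hypothetical stand-ins for a model call and the problem's scoring function.

```python
def frozen_llm_search(llm, prompt, evaluate, n_rounds=100, pool_size=8):
    """Search with a frozen model: only the prompt evolves, never the weights."""
    pool = []  # (score, solution) pairs, best first
    for _ in range(n_rounds):
        # Seed the prompt with the best solutions found so far.
        exemplars = "".join(f"\n# score {s:.3f}:\n{sol}" for s, sol in pool[:3])
        candidate = llm.generate(prompt + exemplars)   # hypothetical model call
        pool.append((evaluate(candidate), candidate))  # continuous reward
        pool = sorted(pool, key=lambda p: p[0], reverse=True)[:pool_size]
    return pool[0]  # the single best (score, solution) found
```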

What's the solution?

The researchers developed a method called Test-Time Training to Discover (TTT-Discover). Instead of keeping the model frozen, it uses reinforcement learning so the LLM actually *trains* on its own attempts at the one problem in front of it: each attempt receives a score, and that feedback updates the model's weights. They focused on problems with continuous rewards, where every attempt gets a numeric score indicating how good it is, and they designed the learning process to concentrate on the most promising solutions rather than average ones. They used OpenAI's open-weight model, gpt-oss-120b, and ran the training through Tinker, a fine-tuning API from Thinking Machines.
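
A minimal sketch of such a test-time training loop is shown below. It assumes a hypothetical `policy.sample` method that returns a candidate with its text and log-probability, and an `evaluate` function returning the problem's continuous reward; the simple group-centered policy-gradient update is a generic stand-in, not the paper's actual objective.

```python
import torch

def ttt_discover_sketch(policy, problem, evaluate, steps=200, group=8):
    """Test-time RL on a single problem: the model's weights change as it works."""
    best_score, best_solution = float("-inf"), None
    opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)
    for _ in range(steps):
        # Sample a group of candidate solutions for this one problem.
        candidates = [policy.sample(problem) for _ in range(group)]  # hypothetical API
        rewards = torch.tensor([evaluate(c.text) for c in candidates])
        if rewards.max().item() > best_score:
            best_score = rewards.max().item()
            best_solution = candidates[rewards.argmax()].text
        # Center rewards within the group so only relative quality matters,
        # then push the policy toward the better-than-average candidates.
        advantages = rewards - rewards.mean()
        logprobs = torch.stack([c.logprob for c in candidates])
        loss = -(advantages * logprobs).mean()
        opt.zero_grad(); loss.backward(); opt.step()  # weights update at test time
    return best_score, best_solution
```

The key difference from the frozen-search loop above is the `opt.step()` call: the experience gathered on this specific problem flows back into the model's weights.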

Why it matters?

TTT-Discover achieved state-of-the-art results across several fields: mathematics, GPU kernel engineering (optimizing code that runs on GPUs), algorithm design, and biology. Importantly, these results were obtained with a publicly available model and are reproducible, unlike previous best results that relied on proprietary, closed frontier models. Other researchers can therefore build on this work, and the method is relatively inexpensive, costing only a few hundred dollars per problem.

Abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
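
The abstract's emphasis on "one great solution rather than many good ones on average" suggests an objective that concentrates learning on the best candidates. As a hedged illustration (an assumption, not the paper's published objective), one way to do this is to replace the uniform average in the update above with softmax weights that sharpen toward the highest-reward candidate:

```python
import torch

def best_focused_weights(rewards, beta=5.0):
    """Illustrative assumption: softmax weights that concentrate on the
    highest-reward candidates, so the gradient is dominated by the current
    best attempt rather than the group average."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return torch.softmax(beta * (r - r.max()), dim=0)  # approaches one-hot as beta grows

# Example: weights skew heavily toward the best-scoring candidate.
print(best_focused_weights([0.1, 0.5, 0.9]))  # ~tensor([0.016, 0.117, 0.867])
```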