Language Models can Self-Improve at State-Value Estimation for Better Search
Ethan Mendes, Alan Ritter
2025-03-05
Summary
This paper introduces self-taught lookahead, a method that helps AI language models improve at solving multi-step problems without needing expensive human input or ground-truth rewards.
What's the problem?
Teaching AI to solve complex tasks usually requires giving it many examples or rewards for doing things right. Collecting these is expensive and time-consuming, especially for tasks that involve interacting with websites or other complex environments.
What's the solution?
The researchers created self-taught lookahead, which lets the AI learn from its own attempts at solving problems. Instead of waiting for an external reward, the model looks one step ahead: it estimates how good each possible next state is by simulating what follows from it, then trains itself on those improved estimates. This helps the AI make better decisions about what to do next when solving a problem.
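The core idea above, bootstrapping value estimates from one-step lookahead instead of ground-truth rewards, can be sketched in a few lines. This is a minimal illustration with a toy integer environment and a value lookup table, not the paper's code; all names and the environment are invented for the example.

```python
# Minimal sketch of learning state values from one-step lookahead.
# The toy environment and all function names are illustrative.

def successors(state):
    """Toy stand-in for state-transition dynamics:
    from integer state s, the agent can move to s+1 or s+2."""
    return [state + 1, state + 2]

def value(state, table):
    """Current value estimate; unseen states default to 0.0."""
    return table.get(state, 0.0)

def lookahead_target(state, table, discount=0.9):
    """Self-supervised training target: the (discounted) value of the
    best successor. This replaces a ground-truth reward with the
    model's own estimate after looking one step ahead."""
    return discount * max(value(s, table) for s in successors(state))

def self_improve(states, table, rounds=3):
    """Repeatedly refit the value table on its own lookahead targets,
    so value information propagates backward from high-value states."""
    for _ in range(rounds):
        targets = {s: lookahead_target(s, table) for s in states}
        table.update(targets)
    return table

# Toy usage: state 10 is the goal (value 1.0); after a few rounds,
# earlier states acquire discounted values purely via lookahead.
table = self_improve(states=range(10), table={10: 1.0})
```

In the paper this role is played by a learned value model over language-described states rather than a lookup table, but the bootstrapping loop is the same: score successors, take the best lookahead value as the target, retrain.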
Why does it matter?
This matters because it could make AI much better at solving complex, real-world problems without constant human guidance. A moderately sized open model trained this way works almost as well as a frontier model like gpt-4o as the value model, while being much cheaper and faster. This could lead to more capable AI assistants that handle a wider range of tasks more efficiently.
Abstract
Collecting ground truth task completion rewards or human demonstrations for multi-step reasoning tasks is often cost-prohibitive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead, a self-supervised method that leverages state-transition dynamics to train a value model capable of effectively guiding language model-controlled search. We find that moderately sized (8 billion parameters) open-weight value models improved with self-taught lookahead can match the performance of using a frontier LLM such as gpt-4o as the value model. Furthermore, we find that self-taught lookahead improves performance by 20% while reducing costs 37x compared to previous LLM-based tree search, without relying on ground truth rewards.
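The abstract describes the trained value model guiding search. A generic way to do this is best-first search over states ordered by estimated value; the sketch below shows that pattern on a toy problem. The value function here is a hand-written stand-in for the learned value model, and all names are illustrative assumptions, not the paper's implementation.

```python
import heapq

def best_first_search(start, successors, value, is_goal, budget=50):
    """Best-first tree search guided by a value function.
    In the paper's setting the value function would be the learned
    value model scoring language-described states; here it is a
    toy heuristic."""
    # Min-heap ordered by negated value, so high-value states pop first.
    frontier = [(-value(start), start, [start])]
    seen = {start}
    for _ in range(budget):
        if not frontier:
            break
        _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-value(nxt), nxt, path + [nxt]))
    return None  # budget exhausted without reaching a goal

# Toy usage: reach state 10 from 0, expanding higher-valued states first.
path = best_first_search(
    start=0,
    successors=lambda s: [s + 1, s + 2],
    value=lambda s: -abs(10 - s),   # toy value: closeness to the goal
    is_goal=lambda s: s == 10,
)
```

Because the value function prefers states closer to the goal, the search greedily takes the larger step each time and finds the path [0, 2, 4, 6, 8, 10]; a better value model means fewer wasted expansions, which is where the reported cost savings come from.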