Self-Improving LLM Agents at Test-Time

Emre Can Acikgoz, Cheng Qian, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur

2025-10-14

Summary

This paper explores a way to make language models, which are AI systems that understand and generate text, better at solving problems without needing huge amounts of training data.

What's the problem?

Currently, improving language models often involves collecting massive datasets and training on them for a long time, which is expensive and doesn't always guarantee better performance. A big issue is that we often don't know whether the data we're using actually contains *new* information for the model or just repeats things it already knows, leading to wasted effort. In short, it's hard to make these models genuinely capable and adaptable without a ton of resources, and even then improvement isn't guaranteed.

What's the solution?

The researchers developed a method called Test-Time Self-Improvement (TT-SI). It works in three steps: first, the model identifies the questions or tasks it is uncertain about. Then, it generates new, similar examples based on those difficult cases. Finally, it uses these newly generated examples to quickly fine-tune itself *while* it's being used, without requiring a full retraining process. They also compared this to a variant called Test-Time Distillation (TT-D), where a more powerful model generates the examples for a less powerful one to learn from.

Why it matters?

This research is important because it shows a way to significantly improve language models with far less training data – in their experiments, accuracy improved by 5.48% on average across benchmarks while using 68 times fewer training samples. This could make it easier and cheaper to build more capable AI agents that can learn and adapt on their own, moving towards a future where AI can 'self-evolve' and become more intelligent over time.

Abstract

One paradigm of language model (LM) fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this work, we explore a new test-time self-improvement method to create more effective and generalizable agentic LMs on-the-fly. The proposed algorithm can be summarized in three steps: (i) first it identifies the samples that the model struggles with (self-awareness), (ii) then generates similar examples from detected uncertain samples (self-data augmentation), and (iii) uses these newly generated samples for test-time fine-tuning (self-improvement). We study two variants of this approach: Test-Time Self-Improvement (TT-SI), where the same model generates additional training examples from its own uncertain cases and then learns from them, and contrast this approach with Test-Time Distillation (TT-D), where a stronger model generates similar examples for uncertain cases, enabling the student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT-SI improves performance with a +5.48% absolute accuracy gain on average across all benchmarks and surpasses other standard learning methods, yet uses 68x fewer training samples. Our findings highlight the promise of TT-SI, demonstrating the potential of self-improvement algorithms at test-time as a new paradigm for building more capable agents toward self-evolution.
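The difference between the two variants in the abstract comes down to who generates the augmentation data for uncertain cases. The sketch below is an illustrative assumption, not the paper's code: `augment` stands in for prompting a named model for similar examples.

```python
def augment(generator_name, uncertain_sample, k=2):
    # Stand-in for prompting `generator_name` to produce k similar examples.
    return [f"[{generator_name}] example like '{uncertain_sample}' #{i}" for i in range(k)]

def build_finetune_set(uncertain_samples, mode="TT-SI"):
    # TT-SI: the same (student) model generates its own training data.
    # TT-D:  a stronger teacher model generates distilled supervision.
    generator = "student" if mode == "TT-SI" else "teacher"
    batch = []
    for sample in uncertain_samples:
        batch.extend(augment(generator, sample))
    return batch

tt_si_batch = build_finetune_set(["hard case"], mode="TT-SI")
tt_d_batch = build_finetune_set(["hard case"], mode="TT-D")
```

In both variants the student is then briefly fine-tuned on the resulting batch; only the source of supervision changes.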