Learning from Failures in Multi-Attempt Reinforcement Learning
Stephen Chung, Wenyu Du, Jie Fu
2025-03-10
Summary
This paper introduces a new way to train AI language models by giving them multiple attempts to answer each question, which helps them learn from feedback and improve their reasoning skills.
What's the problem?
Current methods for training AI language models typically give them only one chance to answer each question, so they cannot learn from their mistakes or refine their answers.
What's the solution?
The researchers created a 'multi-attempt' training method in which the AI model gets several chances to answer each question, receiving feedback after each incorrect attempt. They tested this method on a small language model and found that it performed much better on math problems than the traditional single-attempt approach.
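The core loop described above can be sketched in code. The following is a minimal, hypothetical illustration of a multi-attempt rollout, not the paper's actual implementation: the function names, the feedback message, and the +1/-1 reward scheme are all illustrative assumptions.

```python
# Hypothetical sketch of a multi-attempt rollout: the model gets up to
# `max_attempts` tries per question; after each wrong answer a feedback
# message is appended and it tries again. The reward scheme (+1 if solved
# within the attempt budget, -1 otherwise) is an illustrative assumption.

def multi_attempt_rollout(generate, question, check_answer, max_attempts=3):
    """Run one multi-attempt episode; return (transcript, reward)."""
    transcript = [{"role": "user", "content": question}]
    for attempt in range(1, max_attempts + 1):
        answer = generate(transcript)           # model produces an attempt
        transcript.append({"role": "assistant", "content": answer})
        if check_answer(answer):                # correct within budget
            return transcript, 1.0
        if attempt < max_attempts:              # wrong: give feedback, retry
            transcript.append({
                "role": "user",
                "content": f"Attempt {attempt} is incorrect. Please try again.",
            })
    return transcript, -1.0                     # all attempts failed

# Toy usage with a scripted "model" that succeeds on its second try.
if __name__ == "__main__":
    attempts = iter(["41", "42"])
    transcript, reward = multi_attempt_rollout(
        generate=lambda _history: next(attempts),
        question="What is 6 * 7?",
        check_answer=lambda ans: ans.strip() == "42",
    )
    print(reward)  # 1.0: solved on the second attempt
```

During RL training, the whole transcript (including the feedback turns) forms the trajectory whose final reward is used to update the policy, which is what pushes the model to actually use the feedback rather than repeat the same answer.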
Why it matters?
This matters because it shows a way to make AI language models better at solving complex problems without requiring huge amounts of data or computing power. It could lead to more efficient and capable AI assistants that learn from their mistakes and refine their answers over time, much as humans do.
Abstract
Recent advancements in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after incorrect responses. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback. Full code is available at https://github.com/DualityRL/multi-attempt