SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li

2025-03-20

Summary

This paper is about training AI models to work together with humans on complex tasks that require multiple back-and-forth turns of interaction, such as writing a program or designing a webpage collaboratively.

What's the problem?

It's difficult to train AI to collaborate with humans effectively over multiple turns. In particular, existing reinforcement learning methods struggle with credit assignment: figuring out which individual turns in a long interaction actually helped or hurt the final outcome.

What's the solution?

The researchers developed a new training method called SWEET-RL, along with a benchmark called ColBench for evaluating it. SWEET-RL trains a separate critic model, which gets access to extra information available only during training (such as the reference solution), to score each individual turn of the interaction. These step-level rewards give the main model much more precise feedback than a single end-of-task score.

Why does it matter?

This work matters because it can lead to AI systems that can work together with humans more seamlessly and effectively, enabling them to tackle complex challenges in fields like programming and design.

Abstract

Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs, and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT-4o in realistic collaborative content creation.
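The core idea of step-level credit assignment can be illustrated with a minimal sketch. The snippet below is purely illustrative and not the paper's implementation: it assumes a critic has already assigned a score to each turn the agent took, and computes per-turn advantages by subtracting a baseline score (for example, the average score of alternative turns sampled at the same point). Those per-turn advantages, rather than one end-of-task reward, would then drive the policy update. The function name, scores, and baseline here are all invented for the example.

```python
def step_advantages(turn_scores, baseline_scores):
    """Per-turn advantage: critic score of the turn actually taken,
    minus a baseline score at the same point in the interaction."""
    return [s - b for s, b in zip(turn_scores, baseline_scores)]

# Example: a 3-turn collaboration where the critic rates each agent turn.
chosen = [0.7, 0.2, 0.9]    # hypothetical critic scores for the agent's turns
baseline = [0.5, 0.5, 0.5]  # hypothetical baseline (e.g. mean over alternatives)

advantages = [round(a, 2) for a in step_advantages(chosen, baseline)]
print(advantages)  # a positive value credits a turn, a negative one penalizes it
```

Here turn 2 would be penalized and turns 1 and 3 reinforced, even though a trajectory-level reward would have treated all three turns identically; that finer-grained signal is what step-wise evaluation buys.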