AT^2PO: Agentic Turn-based Policy Optimization via Tree Search

Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang

2026-01-09

Summary

This paper introduces AT^2PO, a new method for improving how AI agents learn to complete tasks that take multiple steps and involve interacting with tools, such as searching the web or using a calculator.

What's the problem?

When teaching AI agents to carry out complex, multi-step tasks, a few key difficulties arise. First, it is hard for the agent to explore the many possible ways of solving a problem, so it tends to get stuck in a narrow set of strategies. Second, if the agent only receives feedback at the very end of a long task, it is difficult to tell which specific steps were good or bad. Finally, standard training updates are not aligned with the turn-by-turn way the agent actually makes decisions, so learning can drift away from improving each individual turn.

What's the solution?

AT^2PO tackles these problems by organizing the agent's decision-making process into a tree, where each branch represents a different path the agent could take. The tree serves two purposes: it pushes the agent to explore diverse strategies by branching at the turns where it is most uncertain, and it assigns credit or blame to each turn based on the final outcome. The learning objective itself is also reformulated to focus on improving the agent's choices at each individual turn, matching the task's natural turn-by-turn structure. Importantly, this method can be added to existing agentic RL pipelines without major changes.
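To make the tree idea concrete, here is a minimal sketch of how a turn-level search tree might branch at the agent's most uncertain turn and spread a sparse end-of-task reward back over earlier turns. The names (TurnNode, expand_most_uncertain, assign_turn_credit) and the simple discounting scheme are our own illustrative assumptions, not the paper's implementation.

```python
# Minimal illustrative sketch (not the authors' code) of the two tree-based ideas
# above: branching the search tree at the turn the policy is least certain about,
# and propagating a sparse end-of-task reward back to per-turn credits.
# TurnNode, expand_most_uncertain, and assign_turn_credit are hypothetical names.
from dataclasses import dataclass, field

@dataclass
class TurnNode:
    """One agent turn (reasoning plus a tool call) in the search tree."""
    action: str                       # e.g. a tool call such as search(query)
    entropy: float                    # policy entropy when this turn was sampled
    children: list = field(default_factory=list)
    credit: float = 0.0               # per-turn credit, filled in after the rollout

def expand_most_uncertain(frontier, sample_actions, k=2):
    """Entropy-guided expansion: branch where the policy is least sure of itself."""
    node = max(frontier, key=lambda n: n.entropy)
    for action, entropy in sample_actions(node, k):
        node.children.append(TurnNode(action=action, entropy=entropy))
    return node.children

def assign_turn_credit(path, final_reward, gamma=0.95):
    """Turn-wise credit assignment: spread the sparse final reward over the turns
    that led to it, giving earlier turns a discounted share."""
    for steps_from_end, node in enumerate(reversed(path)):
        node.credit = (gamma ** steps_from_end) * final_reward

if __name__ == "__main__":
    # Toy rollout: three turns, the last one finishes the task with reward 1.0.
    path = [TurnNode("search(query)", entropy=1.2),
            TurnNode("read(result)", entropy=0.4),
            TurnNode("answer()", entropy=0.1)]
    assign_turn_credit(path, final_reward=1.0)
    print([round(n.credit, 3) for n in path])   # earlier turns receive smaller credit

    # Branch at the most uncertain frontier node using a dummy sampler.
    sampler = lambda node, k: [(f"option_{i}", 0.5) for i in range(k)]
    children = expand_most_uncertain(path, sampler)
    print(len(children))                        # 2 new branches at the uncertain turn
```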

Why it matters?

This research is important because it makes AI agents better at handling complex tasks that require planning and interaction with external tools. By improving exploration, credit assignment, and the alignment of policy updates, AT^2PO produces more reliable and effective agents, showing consistent improvements over current methods across a range of challenging benchmarks.

Abstract

LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT^2PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT^2PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization, a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline by up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.
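One way to picture a turn-level learning objective such as the one in the abstract is a PPO-style clipped update in which every token in a turn shares that turn's advantage, rather than a single trajectory-level one. The sketch below is written under that assumption and is our own illustration of the general idea, not the paper's exact loss.

```python
# Hedged sketch of a turn-level policy objective: tokens are grouped by the turn
# they belong to, and every token in a turn shares that turn's advantage.
# This is a simplified illustration, not the loss defined in the paper.
import numpy as np

def turn_level_clipped_loss(logp_new, logp_old, turn_ids, turn_advantages, clip_eps=0.2):
    """PPO-style clipped surrogate where the advantage is assigned per turn.

    logp_new, logp_old : per-token log-probs under the new / old policy
    turn_ids           : the turn index of each token (e.g. [0, 0, 1, 1, 1, 2])
    turn_advantages    : one scalar advantage per turn (e.g. from turn-wise credit)
    """
    ratio = np.exp(logp_new - logp_old)                  # per-token importance ratio
    adv = turn_advantages[turn_ids]                      # broadcast turn advantage to its tokens
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))      # negate to maximize the surrogate

if __name__ == "__main__":
    logp_new = np.array([-1.0, -0.8, -1.2, -0.5, -0.9, -0.3])
    logp_old = np.array([-1.1, -0.9, -1.0, -0.6, -1.0, -0.4])
    turn_ids = np.array([0, 0, 1, 1, 1, 2])
    turn_adv = np.array([0.2, 0.5, 1.0])                 # e.g. discounted final reward per turn
    print(round(turn_level_clipped_loss(logp_new, logp_old, turn_ids, turn_adv), 4))
```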