ATLaS: Agent Tuning via Learning Critical Steps

Zhixun Chen, Ming Li, Yuxuan Huang, Yali Du, Meng Fang, Tianyi Zhou

2025-03-05

Summary

This paper introduces ATLaS, a method for tuning LLM agents that identifies the critical steps in expert demonstrations and finetunes the model only on those steps, rather than on entire trajectories.

What's the problem?

Current agent tuning approaches finetune LLMs on entire expert trajectories via behavior cloning. Copying every step can bake in the expert's biases and causes overfitting, which weakens the agent's ability to handle states the expert data never covered. It is also inefficient, since only a few steps, such as planning, complex reasoning for intermediate subtasks, and strategic decision-making, actually determine whether the agent succeeds.

What's the solution?

The researchers propose ATLaS, which pinpoints the critical steps in expert trajectories and finetunes the LLM solely on those steps, reducing training costs. By steering training toward a small fraction of the data, the method avoids overfitting whole trajectories and generalizes better across environments and tasks. In experiments, an LLM finetuned on only 30% of the steps selected by ATLaS outperformed both the same LLM finetuned on all steps and recent open-source LLM agents.

Why it matters?

This matters because LLM agents are being deployed across many tasks and environments, and training them efficiently without eroding their general skills is a real challenge. ATLaS shows that less data, chosen well, can produce stronger and more general agents while preserving the base LLM's abilities as a generalist interacting with diverse environments.

Abstract

Large Language Model (LLM) agents have demonstrated remarkable generalization capabilities across multi-domain tasks. Existing agent tuning approaches typically employ supervised finetuning on entire expert trajectories. However, behavior-cloning of full trajectories can introduce expert bias and weaken generalization to states not covered by the expert data. Additionally, critical steps, such as planning, complex reasoning for intermediate subtasks, and strategic decision-making, are essential to success in agent tasks, so learning these steps is the key to improving LLM agents. For more effective and efficient agent tuning, we propose ATLaS that identifies the critical steps in expert trajectories and finetunes LLMs solely on these steps with reduced costs. By steering the training's focus to a few critical steps, our method mitigates the risk of overfitting entire trajectories and promotes generalization across different environments and tasks. In extensive experiments, an LLM finetuned on only 30% critical steps selected by ATLaS outperforms the LLM finetuned on all steps and recent open-source LLM agents. ATLaS maintains and improves base LLM skills as generalist agents interacting with diverse environments.
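To make the idea concrete, here is a minimal sketch of the select-then-finetune loop the abstract describes: score each step of an expert trajectory, keep the top 30%, and restrict training to those steps. The scoring heuristic (`score_step`) is an illustrative assumption, not the paper's actual criterion, and the trajectory data is toy.

```python
# Hypothetical sketch of ATLaS-style critical-step selection.
# NOTE: the paper's real scoring method is not shown in the abstract;
# `score_step` below is a stand-in keyword heuristic (an assumption).

def score_step(step: dict) -> float:
    # Assumption: steps that mention planning/reasoning/decision cues
    # are treated as more critical; a tiny length term breaks ties.
    keywords = ("plan", "because", "therefore", "decide")
    text = step["action"].lower()
    return sum(text.count(k) for k in keywords) + 0.001 * len(text)


def select_critical_steps(trajectory: list[dict], ratio: float = 0.3) -> list[int]:
    """Return indices of the top `ratio` fraction of steps, by score."""
    k = max(1, int(len(trajectory) * ratio))
    ranked = sorted(range(len(trajectory)),
                    key=lambda i: score_step(trajectory[i]),
                    reverse=True)
    return sorted(ranked[:k])  # keep original trajectory order


# Toy expert trajectory: only the selected steps would contribute to
# the finetuning loss; the remaining steps stay as context but are masked.
trajectory = [
    {"action": "look around the room"},
    {"action": "plan: first find the key, therefore check the drawer"},
    {"action": "open drawer"},
    {"action": "take key"},
    {"action": "decide to unlock the door because the key fits"},
]

critical = select_critical_steps(trajectory, ratio=0.3)
print(critical)  # indices of the ~30% highest-scoring steps
```

In an actual finetuning setup, the selected indices would be used to mask the training loss so gradients flow only through the critical steps, which is how training on 30% of the data can still cover the decisions that matter.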