
DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models

Sunghee Jung, Donghun Lee, Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Junrae Cho, Kihyun Kim, Eunggyun Kim, Myeongcheol Shin

2025-04-08


Summary

This paper introduces DiaTool-DPO, a training method that teaches tool-using AI assistants to answer questions more naturally by learning from examples of both good and bad conversations, like a student learning from both right and wrong answers.

What's the problem?

Current tool-augmented AI models often struggle with incomplete questions or requests beyond their abilities, either giving bad answers or failing to ask for clarification or say 'I don't know' when they should.

What's the solution?

DiaTool-DPO trains AI using examples of good and bad conversations, teaching it to recognize different question types and respond appropriately while sticking to its capabilities.

Why does it matter?

This helps create AI assistants that work better in real-world situations, like customer service bots that can politely decline impossible requests or clarify unclear questions.

Abstract

Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances TA-LLMs' dialogue capabilities through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o's performance (94.8% in information gathering, 91% in tool call rejection) with substantial improvements over baseline (44% and 9.6% respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
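The paper's specialized dialogue-control loss is not spelled out in this summary, but it builds on the standard DPO objective: given a preferred (correct) and a dispreferred (incorrect) dialogue trajectory, the policy is pushed to assign a higher log-probability margin to the preferred one relative to a frozen reference model. A minimal sketch of that standard objective on summed sequence log-probabilities (function name, scalar inputs, and the `beta` default are illustrative assumptions, not the paper's implementation):

```python
import math

def dpo_loss(chosen_logp, rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full dialogue
    trajectory under the policy or the frozen reference model.
    beta scales how strongly the policy may deviate from the reference.
    """
    # Implicit reward of each trajectory: log-ratio vs. the reference model.
    chosen_logratio = chosen_logp - ref_chosen_logp
    rejected_logratio = rejected_logp - ref_rejected_logp

    # -log(sigmoid(beta * margin)): small when the preferred (correct)
    # trajectory already has the larger implicit reward.
    margin = beta * (chosen_logratio - rejected_logratio)
    return math.log(1.0 + math.exp(-margin))

# The loss rewards a positive margin: if the policy ranks the correct
# dialogue flow above the incorrect one (relative to the reference),
# the loss is lower than in the reversed case.
loss_good = dpo_loss(chosen_logp=-5.0, rejected_logp=-10.0,
                     ref_chosen_logp=-7.0, ref_rejected_logp=-7.0)
loss_bad = dpo_loss(chosen_logp=-10.0, rejected_logp=-5.0,
                    ref_chosen_logp=-7.0, ref_rejected_logp=-7.0)
```

In DiaTool-DPO the preference pairs come from the automatically constructed correct/incorrect dialogue flows (e.g. asking a clarifying question for an incomplete query vs. calling a tool prematurely), so no human preference labels are needed.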