DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue
Yijun Liao
2025-12-09
Summary
This paper focuses on improving how chatbots handle long conversations, specifically addressing a problem where the chatbot gets 'stuck' on earlier parts of the discussion and struggles to adapt to new requests or changing goals from the user.
What's the problem?
In long conversations, chatbots often exhibit 'State Inertia'. Imagine you start talking about movies and then switch to planning a party, but the chatbot keeps bringing up movies. This happens because the chatbot's internal understanding becomes overly anchored to the initial topic, making it hard to shift focus and resolve conflicting information as the conversation evolves. Existing methods often try to fix this by heavily retraining the model's core weights, which can harm its general abilities.
What's the solution?
The researchers developed a new technique called DZ-TDPO. It 'guides' the chatbot's attention without destructively rewriting its underlying model. DZ-TDPO combines two main ideas: first, it detects when a new user request conflicts with established context and adjusts the strength of its training constraints accordingly ('dynamic KL constraints'); second, it applies a calibrated bias that shifts the model's attention toward recent turns and away from older, potentially stale details ('temporal attention bias'). Together, these carefully control where the chatbot looks for information as the conversation evolves.
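To make the two ideas concrete, here is a minimal, self-contained sketch of how a recency-weighted attention bias and a conflict-dependent KL strength could plug into a standard DPO-style objective. All function names, the linear decay schedule, and the `conflict_score` input are illustrative assumptions, not the paper's actual implementation; DZ-TDPO's precise calibration is defined in the paper and repository.

```python
import math

def temporal_attention_bias(seq_len, decay=0.05):
    # Additive bias on attention logits: older positions get a larger
    # penalty, so attention weight drifts toward recent turns.
    # (Linear decay is an assumption; the paper calibrates its own bias.)
    return [-decay * (seq_len - 1 - i) for i in range(seq_len)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_beta(base_beta, conflict_score, scale=1.0):
    # Hypothetical dynamic KL constraint: when the current request
    # conflicts strongly with history (high conflict_score), relax the
    # pull toward the reference policy so the model can update its state.
    return base_beta / (1.0 + scale * conflict_score)

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta):
    # Standard DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin))),
    # where beta controls the implicit KL constraint to the reference model.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

For example, with uniform attention scores over four turns, adding the bias shifts the softmax weight toward the most recent turn; and a high conflict score yields a smaller effective beta, i.e. a weaker KL constraint for that training pair.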
Why it matters?
This work is important because it shows we can build chatbots that are better at handling complex, evolving conversations without sacrificing their general knowledge or abilities. The research demonstrates that by carefully managing the chatbot's attention, we can overcome the 'State Inertia' problem and create more natural and helpful conversational experiences. They also found that larger models need less 'fixing' than smaller ones, suggesting that simply increasing model size can partially address this issue.
Abstract
Long-context dialogue systems suffer from State Inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context. To address this, we propose DZ-TDPO, a non-destructive alignment framework that synergizes conflict-aware dynamic KL constraints with a calibrated temporal attention bias. Experiments on the Multi-Session Chat (MSC) dataset demonstrate that DZ-TDPO achieves state-of-the-art win rates (55.4% on Phi-3.5) while maintaining robust zero-shot generalization. Our scaling analysis reveals a "Capacity-Stability Trade-off": while smaller models incur an "alignment tax" (perplexity surge) to overcome historical inertia, the larger Qwen2.5-7B model achieves a 50.8% win rate with negligible perplexity overhead. This confirms that State Inertia can be alleviated via precise attention regulation rather than destructive weight updates, preserving general capabilities (MMLU) across model scales. Code and data are available at https://github.com/lyj20071013/DZ-TDPO