Multi-Domain Explainability of Preferences
Nitay Calderon, Liat Ein-Dor, Roi Reichart
2025-05-30
Summary
This paper introduces a method that helps explain and predict why people, or AI judges, prefer certain answers or responses, using concepts that carry across many different topics or domains.
What's the problem?
AI models are often trained to match human preferences or to judge which of two answers is better, but it is usually unclear which specific ideas or concepts drive these choices, especially when the data spans many different subjects.
What's the solution?
The researchers created an automated pipeline that uses AI to discover concepts that separate preferred answers from rejected ones, and then represents each answer as a vector of scores over those concepts. On top of these concept vectors, they fit a model called Hierarchical Multi-Domain Regression, which estimates how each concept influences preferences both in general and within specific domains. This approach predicts preferences more accurately than older methods while also making it easier to understand why certain choices are made; a simplified sketch of the regression step follows.
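This summary does not spell out the model internals, so the code below is a minimal sketch of one plausible formulation: a logistic preference model whose weights combine a shared global vector with per-domain offsets that are shrunk toward zero (a simple form of partial pooling). The function names, the gradient-descent fitting, and the L2 shrinkage penalty are illustrative assumptions, not the paper's exact method, and the inputs are assumed to be pre-computed concept vectors.

```python
# Sketch of hierarchical multi-domain preference regression.
# Assumption: each pair of responses is already encoded as a
# concept-difference vector (chosen minus rejected concept scores).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_hierarchical(X, y, domains, n_domains, lr=0.1, l2=1.0, epochs=500):
    """X: (n, d) concept-difference vectors; y: (n,) 1 if the first
    response was preferred; domains: (n,) integer domain ids.
    Learns a shared weight vector plus per-domain offsets that are
    shrunk toward zero, so data-poor domains fall back on the
    global concept effects (partial pooling)."""
    n, d = X.shape
    w_global = np.zeros(d)
    w_domain = np.zeros((n_domains, d))
    for _ in range(epochs):
        w = w_global[None, :] + w_domain[domains]   # (n, d) effective weights
        p = sigmoid(np.sum(X * w, axis=1))          # predicted preference prob
        err = p - y                                 # log-loss gradient wrt logits
        grad = X * err[:, None]                     # per-sample weight gradients
        w_global -= lr * grad.mean(axis=0)
        for k in range(n_domains):
            mask = domains == k
            if mask.any():
                # shrink the domain offset toward zero via the l2 term
                g = grad[mask].mean(axis=0) + l2 * w_domain[k]
                w_domain[k] -= lr * g
    return w_global, w_domain

# Toy usage: 3 concepts, 2 domains; concept 0 matters everywhere,
# concept 2 only matters in domain 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
domains = rng.integers(0, 2, size=400)
logits = 2.0 * X[:, 0] + 1.5 * X[:, 2] * (domains == 1)
y = (rng.uniform(size=400) < sigmoid(logits)).astype(float)
w_global, w_domain = fit_hierarchical(X, y, domains, n_domains=2)
print("global concept effects:", np.round(w_global, 2))
print("per-domain offsets:", np.round(w_domain, 2))
```

The shrinkage term is the key design choice in this sketch: domains with few examples inherit the global concept effects, while data-rich domains can learn their own deviations, which is what makes both general and domain-specific concept influences readable from the learned weights.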
Why it matters?
This is important because it makes AI systems more transparent and trustworthy: people can see which concepts drive a model's judgments, audit those reasons, and use them to improve or guide these systems in the future.
Abstract
A new automated method using concept-based vectors and a Hierarchical Multi-Domain Regression model improves preference explanations and predictions for large language models.