Hybrid Attribution Priors for Explainable and Robust Model Training
Zhuoran Zhang, Feng Zhang, Shangyuan Li, Yang Shi, Yuanxing Zhang, Wei Chen, Tengjiao Wang, Kam-Fai Wong
2025-12-18
Summary
This paper focuses on improving how well we can understand *why* small language models (SLMs) make the decisions they do, especially when classifying text. It's about making these models not just accurate, but also transparent and reliable.
What's the problem?
When current attribution methods are used to identify which parts of a text an SLM relies on for a classification, they often point to common, obvious keywords. These keywords are relevant to the general topic, but they don't help the model distinguish between *similar* categories. For example, when classifying texts about apples versus oranges, the attribution for both classes might highlight the word 'fruit,' which isn't helpful in telling them apart. This limits the model's ability to learn subtle differences and make accurate classifications.
What's the solution?
The researchers developed a new technique called Class-Aware Attribution Prior (CAP). CAP helps the model focus on the specific details that *differentiate* between classes, rather than just the general keywords. They also created CAP Hybrid, which combines CAP with existing methods to get a more complete picture of what's important. Essentially, they're teaching the model to pay attention to the right clues to make better decisions and provide clearer explanations for those decisions.
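The combination step can be pictured with a small sketch. This is not the authors' code, and the weighting scheme is an assumption for illustration only: it mixes a class-aware (CAP) prior with a prior from a standard attribution method by taking a weighted average of their normalized token-score distributions.

```python
# Hypothetical sketch of a CAP-Hybrid-style prior: the function names,
# the mixing weight `alpha`, and the normalization are assumptions,
# not the paper's actual formulation.
import numpy as np

def normalize(scores, eps=1e-8):
    """Map nonnegative token attribution scores to a distribution over tokens."""
    s = np.clip(scores, 0.0, None) + eps
    return s / s.sum()

def hybrid_prior(cap_scores, base_scores, alpha=0.5):
    """Mix a class-aware (CAP) prior with a standard attribution prior.

    alpha controls how much weight the discriminative CAP signal gets
    relative to the general class-relevance signal.
    """
    return alpha * normalize(cap_scores) + (1.0 - alpha) * normalize(base_scores)

# Toy example with 4 tokens: CAP emphasizes a discriminative token (index 2),
# while the standard method emphasizes a generic topic keyword (index 0).
cap = np.array([0.1, 0.1, 0.7, 0.1])
base = np.array([0.7, 0.1, 0.1, 0.1])
prior = hybrid_prior(cap, base, alpha=0.5)
```

The resulting `prior` keeps both signals: the generic keyword and the discriminative token each retain substantial mass, which is the "more complete picture" the hybrid is after.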
Why it matters?
This work is important because it makes SLMs more trustworthy and useful. By improving interpretability, we can understand *why* a model made a certain prediction, which is crucial in sensitive applications. Also, by increasing robustness, the model is less likely to be fooled by small changes in the input text, making it more reliable in real-world scenarios. This is especially valuable for situations where resources are limited and smaller, more efficient models are needed.
Abstract
Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, particularly classification. As interpretability and robustness gain increasing importance, explanation-guided learning has emerged as an effective framework by introducing attribution-based supervision during training; however, deriving general and reliable attribution priors remains a significant challenge. Through an analysis of representative attribution methods in classification settings, we find that although these methods can reliably highlight class-relevant tokens, they often focus on common keywords shared by semantically similar classes. Because such classes are already difficult to distinguish under standard training, these attributions provide insufficient discriminative cues, limiting their ability to improve model differentiation. To overcome this limitation, we propose Class-Aware Attribution Prior (CAP), a novel attribution prior extraction framework that guides language models toward capturing fine-grained class distinctions and producing more salient, discriminative attribution priors. Building on this idea, we further introduce CAP Hybrid, which combines priors from CAP with those from existing attribution techniques to form a more comprehensive and balanced supervisory signal. By aligning a model's self-attribution with these enriched priors, our approach encourages the learning of diverse, decision-relevant features. Extensive experiments in full-data, few-shot, and adversarial scenarios demonstrate that our method consistently enhances both interpretability and robustness.
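The alignment objective described above ("aligning a model's self-attribution with these enriched priors") can be sketched as an auxiliary loss term. The specific choice of KL divergence here is an assumption for illustration; the paper's actual loss may differ.

```python
# Hypothetical sketch (not the authors' code): penalize the model when its
# token-level self-attribution diverges from a given attribution prior,
# using KL(prior || self-attribution) as the alignment loss.
import numpy as np

def normalize(scores, eps=1e-8):
    """Map nonnegative token attribution scores to a distribution over tokens."""
    s = np.clip(scores, 0.0, None) + eps
    return s / s.sum()

def attribution_alignment_loss(self_attr, prior_attr):
    """KL divergence from the prior to the model's self-attribution.

    Added to the task loss during training, this pushes the model to
    attend to the tokens the prior marks as decision-relevant.
    """
    p = normalize(prior_attr)
    q = normalize(self_attr)
    return float(np.sum(p * np.log(p / q)))

# Toy example: the prior emphasizes a discriminative token (index 2),
# but the model currently attributes mostly to a generic keyword (index 0),
# so the alignment loss is large.
self_attr = np.array([0.7, 0.1, 0.1, 0.1])
prior = np.array([0.1, 0.1, 0.7, 0.1])
loss = attribution_alignment_loss(self_attr, prior)
```

When the self-attribution already matches the prior, the loss is near zero, so the term only steers the model where its explanation and the prior disagree.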