ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng
2025-10-15
Summary
This paper focuses on making large reasoning models, language models that can 'think through' problems step by step (called long chain-of-thought reasoning), more trustworthy. Current models are good at reaching the right answer efficiently, but they don't explain *how* they got there in a way humans can easily understand or verify.
What's the problem?
When these AI models solve problems, it's often like looking into a 'black box'. We see the answer, but not the reasoning process. This makes it hard to trust the answer, especially if it's wrong. The problem is that existing methods for improving these models prioritize accuracy and speed, ignoring whether the reasoning is clear, honest about its sources, and consistently reliable.
What's the solution?
The researchers developed a new training method called ReFIne, which combines supervised fine-tuning with reinforcement learning (GRPO). This method teaches the model not only to solve problems but also to show its work in a structured way, using tags and high-level plans. It also requires the model to disclose the specific information that drives each step, and to assess how confident it is in its own derivation and final answer. They tested ReFIne on Qwen3 models at three scales (1.7B, 4B, and 8B) using math benchmarks of varying difficulty.
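To make the idea of a structured, tag-based trace concrete, here is a minimal sketch of how such output could be parsed and scored for completeness. The tag names (`plan`, `solution`, `self_check`, `answer`, `confidence`) and the scoring rule are illustrative assumptions, not the paper's actual schema or reward function.

```python
import re

# Hypothetical section tags; the paper's real tag schema may differ.
SECTIONS = ["plan", "solution", "self_check", "answer"]

def parse_trace(trace: str) -> dict:
    """Split a tag-structured reasoning trace into named sections."""
    parsed = {}
    for tag in SECTIONS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", trace, re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    # A self-reported confidence in [0, 1], if the model emitted one.
    conf = re.search(r"<confidence>([01](?:\.\d+)?)</confidence>", trace)
    parsed["confidence"] = float(conf.group(1)) if conf else None
    return parsed

def structure_score(parsed: dict) -> float:
    """Fraction of expected sections present: a toy stand-in for the
    kind of format signal one could use during RL training."""
    present = sum(parsed[tag] is not None for tag in SECTIONS)
    present += parsed["confidence"] is not None
    return present / (len(SECTIONS) + 1)

trace = (
    "<plan>Factor the quadratic, then read off the roots.</plan>"
    "<solution>x^2-5x+6 = (x-2)(x-3), so x = 2 or x = 3.</solution>"
    "<self_check>Both roots satisfy the original equation.</self_check>"
    "<answer>x = 2 or x = 3</answer>"
    "<confidence>0.95</confidence>"
)
parsed = parse_trace(trace)
print(parsed["answer"])         # x = 2 or x = 3
print(structure_score(parsed))  # 1.0
```

A trace missing a section (say, no self-check) would score below 1.0, which is what lets a format-aware reward push the model toward complete, followable traces.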
Why does it matter?
This work is important because it shows that building trustworthy AI isn't just about getting the right answer. It's about making the reasoning process transparent and understandable. The results demonstrate significant improvements in interpretability, faithfulness (being honest about its reasoning), and reliability (providing confidence estimates), suggesting that AI models should be optimized for these qualities alongside accuracy.
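A common way to judge whether confidence estimates are informative is calibration: stated confidence should track empirical accuracy. The expected-calibration-error sketch below illustrates that idea in general; it is not the paper's evaluation protocol, and the numbers are made-up examples.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Toy ECE: bin answers by stated confidence and compare each
    bin's mean confidence to its empirical accuracy."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0
        bins[idx].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / total * abs(avg_conf - accuracy)
    return ece

# Illustrative data: five answers with self-reported confidences.
confs = [0.9, 0.8, 0.95, 0.6, 0.3]
right = [True, True, True, False, False]
print(round(expected_calibration_error(confs, right), 3))
```

A lower ECE means the model's self-assessments are more trustworthy: when it says "95% confident," it should be right about 95% of the time.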
Abstract
Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation's soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine