LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, Nanyun Peng
2024-10-10

Summary
This paper introduces DeCRIM, a self-correction method for large language models (LLMs) that helps them better follow complex instructions with multiple constraints.
What's the problem?
LLMs often struggle to follow instructions that impose multiple requirements at once, such as writing a social media post in a funny tone without using hashtags. Moreover, most evaluations of this ability rely on synthetic data rather than real user requests, which can overstate how well models perform in real-world use. On the paper's real-user benchmark, even GPT-4 fails to meet at least one requirement on over 21% of instructions.
What's the solution?
To tackle this issue, the authors developed RealInstruct, a benchmark built from actual user queries to evaluate how well LLMs handle multi-constrained instructions. They also created the Decompose, Critique, and Refine (DeCRIM) pipeline, which decomposes the original instruction into an explicit list of constraints and uses a Critic model to decide when and where the response needs refinement. The model then revises its answer, repeating the critique step until all constraints are satisfied (a minimal sketch of this loop appears below).
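The paper describes DeCRIM only at this high level, so the following is a minimal sketch of the Decompose-Critique-Refine loop, not the authors' implementation. The function names (call_llm, decompose, critique, decrim), the prompt wording, and the round limit are all illustrative assumptions.

```python
# Minimal sketch of a Decompose, Critique, and Refine (DeCRIM) loop.
# `call_llm` is a placeholder for any chat-completion API; the prompts
# below are illustrative assumptions, not the paper's exact ones.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError

def decompose(instruction: str) -> list[str]:
    """Decompose the instruction into an explicit list of constraints."""
    reply = call_llm(
        "List every constraint in the following instruction, one per line:\n"
        f"{instruction}"
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]

def critique(response: str, constraint: str) -> bool:
    """Ask a Critic model whether the response satisfies one constraint."""
    verdict = call_llm(
        f"Response:\n{response}\n\nConstraint: {constraint}\n"
        "Does the response satisfy the constraint? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def decrim(instruction: str, max_rounds: int = 3) -> str:
    """Generate, then refine until the Critic accepts every constraint."""
    response = call_llm(instruction)          # initial generation
    constraints = decompose(instruction)      # Decompose
    for _ in range(max_rounds):
        # Critique: collect only the constraints the response still violates.
        failed = [c for c in constraints if not critique(response, c)]
        if not failed:
            break                             # all constraints satisfied
        # Refine: revise the draft targeting just the failed constraints.
        response = call_llm(
            f"Instruction: {instruction}\n"
            f"Draft response: {response}\n"
            "Revise the draft so it also satisfies these constraints:\n"
            + "\n".join(f"- {c}" for c in failed)
        )
    return response
```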
Why it matters?
This research matters because it improves how well LLMs understand and respond to complex instructions in real-world scenarios. Notably, with strong feedback, open-source models using DeCRIM can outperform GPT-4 on both benchmarks, pointing toward more capable AI assistants in applications from customer service to creative writing.
Abstract
Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
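The abstract also mentions model-based evaluation as a cost-effective alternative to human annotation. One plausible way to score a benchmark under that setup, with a hypothetical per-constraint `judge` function, is sketched below: an instruction counts as failed if the judge flags any one of its constraints, which is how the "fails at least one constraint on over 21% of instructions" figure is framed.

```python
# Sketch of instruction-level scoring under model-based evaluation.
# `judge` is a placeholder for an LLM judge; its prompt and the exact
# aggregation used in the paper are assumptions for illustration.

def judge(response: str, constraint: str) -> bool:
    """Placeholder: an LLM judge returning True if the constraint is met."""
    raise NotImplementedError

def instruction_pass_rate(examples: list[tuple[str, list[str]]]) -> float:
    """Fraction of (response, constraints) pairs meeting every constraint."""
    passed = sum(
        all(judge(resp, c) for c in cons) for resp, cons in examples
    )
    return passed / len(examples)
```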