
Course-Correction: Safety Alignment Using Synthetic Preferences

Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu

2024-07-26


Summary

This paper studies course-correction, the ability of large language models (LLMs) to steer away from harmful content they have started to generate and correct themselves autonomously. It introduces an evaluation benchmark to measure this ability and a synthetic preference dataset to train it.

What's the problem?

Large language models can sometimes generate harmful or inappropriate content, which is a serious concern. The challenge is to make these models recognize, mid-generation, that they are producing harmful content and correct course before finishing. Current safety-tuned models vary widely in how well they do this, and collecting the data needed to improve it typically demands substantial manual effort.

What's the solution?

The researchers first built a benchmark called C^2-Eval to measure how well different LLMs perform course-correction, and used it to analyze 10 popular models. They then created a synthetic dataset, C^2-Syn, containing 750,000 pairwise preferences that favor responses which correct course as early as possible. Fine-tuning Llama2-Chat 7B and Qwen2 7B on this dataset with preference learning improved the models' ability to correct harmful outputs without hurting their general performance, and also made them more resistant to jailbreak attacks, which try to trick models into generating harmful content.
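To make "pairwise preferences that favor timely course-correction" concrete, here is a minimal sketch of what one such preference record could look like and of a standard pairwise preference loss (DPO is used here as one common choice of preference-learning objective, assumed rather than taken from the paper). The field names, the prompt, and the numbers are illustrative, not drawn from C^2-Syn.

```python
import torch
import torch.nn.functional as F

# One hypothetical pairwise preference record: the "chosen" response corrects
# course early, the "rejected" one continues down the harmful path.
preference_example = {
    # Hypothetical harmful prompt; the real dataset is built by an automated pipeline.
    "prompt": "Explain how to pick a lock.",
    # Preferred: the model steers away as soon as possible.
    "chosen": "I can't give instructions for bypassing locks, but I can explain "
              "how lock mechanisms work in general terms.",
    # Dispreferred: the model keeps generating harmful content.
    "rejected": "Sure, first insert a tension wrench into the keyway...",
}

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over summed log-probs of the chosen and rejected
    responses under the policy model and a frozen reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with made-up log-probabilities for a single preference pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```

Training on many such pairs nudges the model to place higher probability on responses that abandon a harmful trajectory early rather than after the damage is done.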

Why it matters?

This research is important because it enhances the safety of AI systems that use language models, making them more reliable for users. By improving how these models can self-correct, the study helps ensure that AI technologies can be used in a responsible way, reducing the risk of spreading harmful information.

Abstract

The risk of harmful content generated by large language models (LLMs) becomes a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform the task of course-correction, i.e., the model can steer away from generating harmful content autonomously. To start with, we introduce the C^2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C^2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on 2 LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs' safety, particularly in resisting jailbreak attacks.
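For intuition about what a course-correction benchmark has to measure, here is a deliberately simplified sketch, not the C^2-Eval protocol itself: feed the model a prompt plus the start of a harmful response, then check whether its continuation steers away. The marker list, the keyword heuristic, and the example continuations are all assumptions for illustration; the real benchmark scores corrections far more carefully.

```python
# Crude illustrative heuristic for detecting a mid-generation correction.
CORRECTION_MARKERS = ("i can't", "i cannot", "i won't", "instead", "sorry")

def continues_harmfully(continuation: str) -> bool:
    """Flag continuations that never signal a correction."""
    text = continuation.lower()
    return not any(marker in text for marker in CORRECTION_MARKERS)

def course_correction_rate(continuations: list[str]) -> float:
    """Fraction of continuations that appear to correct course."""
    corrected = sum(not continues_harmfully(c) for c in continuations)
    return corrected / len(continuations) if continuations else 0.0

# Toy usage with two made-up continuations of a partially harmful response.
print(course_correction_rate([
    "Actually, I can't help with that. Instead, here is some safe information...",
    "Step 2: apply pressure to the cylinder...",
]))  # -> 0.5
```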