System Message Generation for User Preferences using Open-Source Models

Minbyul Jeong, Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong

2025-02-18

Summary

This paper introduces SysGen, a pipeline for automatically creating system messages for AI language models. System messages are instructions that tell a model what the user wants, such as a role to play, a task to perform, or an output format to follow, and SysGen generates them automatically instead of relying on hand-written ones.

What's the problem?

Right now, it's hard to find good examples of system messages for training AI. Many datasets don't include them, and the ones that do often have strict rules about how they can be used. Making these messages by hand takes a lot of time and effort, which makes it tough to improve AI's ability to understand what people want.

What's the solution?

The researchers created SysGen, a pipeline that automatically generates system messages. It takes existing conversations between users and AI that have no system messages and produces instructions, along with better-aligned responses, that match what the user wanted. They trained several open-source models on SysGen data and found that the models' responses aligned better with system messages and user instructions on the Multifacet benchmark, with minimal impact on other benchmarks such as Open LLM Leaderboard 2.

Why does it matter?

This matters because it could make AI assistants much better at understanding and doing what people ask. By creating better system messages, SysGen helps AI adapt to different situations more easily. This could lead to AI that's more helpful and easier to use in many different areas, from answering questions to helping with complex tasks.

Abstract

System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, and specify various output formats and communication styles. Despite such versatility, publicly available data often lack system messages and are subject to strict license constraints in industry. Manually labeling publicly available data with system messages that align with user instructions demands significant resources. In view of these challenges, our work introduces SysGen, a pipeline for generating system messages, together with better-aligned assistant responses, from supervised fine-tuning datasets that lack system messages. Training on SysGen data yields substantial improvements in the alignment of model responses with system messages and user instructions, as demonstrated across various open-source models on the Multifacet benchmark, while having minimal impact on other unseen benchmarks such as Open LLM Leaderboard 2. Our qualitative analysis highlights the importance of diverse system messages for better adaptability across different contexts.
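
The overall shape of such a pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the individual steps, so the `Example` class, the `augment` function, and the two `toy_*` callables are hypothetical names, and the toy callables stand in for what would be calls to an open-source LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Example:
    """One supervised fine-tuning record; `system` is None when missing."""
    user: str
    assistant: str
    system: Optional[str] = None


def augment(
    examples: List[Example],
    generate_system: Callable[[str, str], str],
    regenerate_response: Callable[[str, str], str],
) -> List[Example]:
    """Attach a generated system message to each record, then produce a
    response conditioned on that system message and the user instruction.
    Both callables are assumptions standing in for LLM calls."""
    augmented = []
    for ex in examples:
        system = generate_system(ex.user, ex.assistant)
        response = regenerate_response(system, ex.user)
        augmented.append(Example(user=ex.user, assistant=response, system=system))
    return augmented


# Hypothetical placeholders so the sketch runs without a model.
def toy_generate_system(user: str, assistant: str) -> str:
    return "You are a concise assistant. Answer in one sentence."


def toy_regenerate_response(system: str, user: str) -> str:
    return f"[answer to: {user}]"


if __name__ == "__main__":
    data = [Example(user="What is 2+2?", assistant="4")]
    for record in augment(data, toy_generate_system, toy_regenerate_response):
        print(record)
```

Passing the two steps in as callables keeps the loop independent of any particular model or prompt, so the placeholders can be swapped for real generation calls without changing the surrounding code.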
