
From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding

Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao

2025-06-17


Summary

This paper introduces a method that synthesizes millions of diverse and complicated user instructions for training large language models, using a technique called attributed grounding. The idea is that each instruction is tied to specific attributes or facts, which makes the generated instructions more meaningful and varied. Models trained on this large, rich dataset perform strongly on benchmarks that measure their understanding and abilities.

What's the problem?

The problem is that training large language models requires large amounts of high-quality instruction data covering many different kinds of tasks and situations. Existing datasets are often limited in variety or complexity, so models struggle to learn to handle the full range of requests real users might make. At the same time, gathering enough detailed and varied real instructions is expensive and time-consuming.

What's the solution?

The solution is to generate synthetic instruction data at massive scale by grounding each instruction in specific attributes, which guide the creation of realistic and complex commands. This attributed grounding keeps the instructions meaningful while covering a broad range of topics and difficulty levels. The resulting synthesized dataset is then used to train language models, improving their ability to follow complicated and varied user instructions.
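To make the idea concrete, here is a minimal sketch of attribute-grounded instruction synthesis. The attribute pools (topics, personas, difficulty levels) and the template are hypothetical illustrations, not the paper's actual taxonomy or prompts; in the real method an LLM would generate the instruction from the sampled attributes rather than a fixed template.

```python
import itertools
import random

# Hypothetical attribute pools; the paper's real attribute set
# (e.g., grounded documents, user scenarios) is richer than this.
TOPICS = ["tax law", "gardening", "linear algebra", "travel planning"]
PERSONAS = ["a small-business owner", "a graduate student", "a retiree"]
DIFFICULTIES = ["simple", "multi-step", "constraint-heavy"]

def synthesize_instruction(topic, persona, difficulty):
    """Ground one instruction in a sampled attribute combination."""
    return (f"As {persona}, write a {difficulty} request about {topic} "
            f"that a real user might send to an assistant.")

def synthesize_dataset(n, seed=0):
    """Sample n attribute combinations and ground an instruction in each."""
    rng = random.Random(seed)
    combos = list(itertools.product(TOPICS, PERSONAS, DIFFICULTIES))
    return [synthesize_instruction(*rng.choice(combos)) for _ in range(n)]

dataset = synthesize_dataset(5)
```

Because instructions are composed from attribute combinations, scaling the attribute pools multiplies the number of distinct, controllable instructions, which is what lets this style of pipeline reach millions of examples.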

Why it matters?

This matters because having more and better instruction data allows AI models to understand and follow user requests more accurately and flexibly. By automating the creation of diversified instructions, this approach makes it easier and cheaper to train powerful language models that can assist with a wide range of tasks, making AI more helpful and reliable in everyday use.

Abstract

The paper presents a method for generating diverse and complex instruction data for large language models using attributed grounding, achieving top performance on benchmarks with a large synthesized dataset.