Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu
2026-02-16
Summary
This paper focuses on how to make large language models (LLMs) better by carefully choosing the data they are trained on *after* their initial training. It introduces a new way to measure how diverse that post-training data is, and then uses that measurement to generate new data that fills the gaps the original set leaves uncovered.
What's the problem?
Large language models need a lot of extra training data to perform well on specific tasks. Current methods for picking this data focus on how different the *text* looks, but that doesn't always mean the data is actually helpful for improving the model's performance. Simply having varied wording isn't enough; the data needs to cover a wide range of important 'features' the model uses to understand and respond.
What's the solution?
The researchers developed a metric called Feature Activation Coverage (FAC) that looks at which important internal features of the model are activated by the training data, where the features are extracted with a sparse autoencoder. They then built a system, FAC Synthesis, that uses this metric to identify which features a starting (seed) dataset fails to activate, and automatically creates new, synthetic examples that specifically target those missing features. This process essentially fills in the coverage gaps of the training data.
Why it matters?
This work is important because it provides a more effective way to improve LLMs. Instead of just throwing more data at the problem, it focuses on *smart* data selection and creation. The fact that the identified important features are similar across different model families (like LLaMA, Mistral, and Qwen) means this approach could be widely applicable and help advance data-centric optimization of LLMs.
Abstract
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.