ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning
Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, Junchi Yan
2025-10-22
Summary
This paper focuses on improving how we train large language models (LLMs) by carefully choosing which pieces of text data to use during a process called supervised fine-tuning. It's about making the training process smarter, not just throwing all the data at the model.
What's the problem?
Current methods for selecting training data at the fine-grained, token-by-token level have two main weaknesses. First, they typically need a separate, already-trained 'reference' model to judge which tokens are worth learning from. Second, they rely solely on loss: they favor tokens the model struggles with, and can overlook tokens that are crucial for meaning even though they don't produce a high loss.
What's the solution?
The researchers developed a new method called ssToken. Instead of relying on a separate reference model, it compares the current model's per-token loss with that of an earlier checkpoint of the *same* model (a 'history model'); this self-modulated signal lets the model adaptively pick tokens along its own optimization trajectory. ssToken also adds a semantic-aware, attention-based estimate of how important each token is, capturing how it relates to other words in the sentence rather than just how hard it is to predict. The two signals are complementary and are combined to make better selections.
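The self-modulated part can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the function names (`self_modulated_scores`, `select_tokens`), the keep ratio, and the convention that a higher current-vs-history loss difference means "keep this token" are all assumptions based on the description above.

```python
def self_modulated_scores(current_losses, history_losses):
    # Self-modulated signal: per-token loss difference between the
    # current model and an earlier checkpoint of the same model.
    # Convention assumed here: tokens whose loss is still high relative
    # to the history model score high (they are "not yet learned").
    return [cur - hist for cur, hist in zip(current_losses, history_losses)]

def select_tokens(scores, keep_ratio=0.5):
    # Keep the top fraction of tokens by score; return their indices in
    # sequence order. The keep ratio is a hypothetical hyperparameter.
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Example: four tokens, two models' per-token losses.
scores = self_modulated_scores([2.0, 0.5, 1.5, 0.1], [1.0, 0.25, 0.5, 0.2])
kept = select_tokens(scores, keep_ratio=0.5)  # indices of the selected tokens
```

Only the tokens in `kept` would then contribute to the fine-tuning loss; the rest are masked out for that step.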
Why it matters?
This work is important because it makes training LLMs more efficient and effective. By intelligently selecting data, ssToken can achieve better performance than training on all the data, and it does so without needing extra models or complex setups. This means we can build better language models with less computational effort, which is a big deal as these models get larger and more expensive to train.
Abstract
Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction for its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) requiring training or accessing an additional reference model, and (2) relying solely on loss information for token selection, which cannot well preserve semantically important tokens that are not favored by loss-based metrics. To address these challenges, we propose ssToken, a Self-modulated and Semantic-aware Token Selection approach. ssToken leverages readily accessible history models to compute the per-token loss difference with the current model, which serves as a self-modulated signal that enables the model to adaptively select tokens along its optimization trajectory, rather than relying on excess loss from an offline-trained reference model as in prior works. We further introduce a semantic-aware, attention-based token importance estimation metric, orthogonal to loss-based selection and providing complementary semantic information for more effective filtering. Extensive experiments across different model families and scales demonstrate that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning, while their integration, ssToken, achieves synergistic gains and further surpasses prior token-level selection methods, delivering performance improvements while maintaining training efficiency.
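The abstract describes the semantic-aware metric only at a high level. As a hedged sketch of one plausible instantiation (the averaging scheme below is an assumption, not the paper's exact formula), a token's semantic importance can be estimated from how much attention it receives across heads and query positions:

```python
def attention_importance(attn):
    # attn: nested list [head][query][key] of attention weights for one
    # sequence (each query row sums to 1). Score each key token by the
    # average attention it receives over all heads and query positions,
    # a rough proxy for semantic importance. This is an illustrative
    # assumption about the metric, not the authors' definition.
    n_heads = len(attn)
    n_queries = len(attn[0])
    n_keys = len(attn[0][0])
    scores = [0.0] * n_keys
    for head in attn:
        for row in head:
            for k, weight in enumerate(row):
                scores[k] += weight
    return [s / (n_heads * n_queries) for s in scores]

# One head, two query positions, two key tokens:
importance = attention_importance([[[0.5, 0.5], [1.0, 0.0]]])
```

Because this score is independent of the loss, it can rescue tokens that loss-only selection would discard, which is why the paper calls the two signals orthogonal and combines them.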