
mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun

2026-03-24


Summary

This paper introduces a new method, called mSFT, to improve how we train large language models on multiple tasks at once.

What's the problem?

When training a language model to do many different things, like translating between languages and answering questions, it's common to train on all of those tasks simultaneously. However, some tasks are easier for the model to learn than others. The easier tasks get learned *too* quickly, so the model starts memorizing them instead of truly understanding them, while the harder tasks don't get enough attention and remain poorly learned. This uneven learning process limits the overall performance of the model.

What's the solution?

The researchers developed mSFT, a smarter way to combine these tasks during training. Instead of training on everything all the time, mSFT continually monitors how well the model is learning each task. When it detects that a task is starting to overfit (meaning the model is just memorizing it), it removes that task from the training mix and reverts the model to the checkpoint taken just before the overfitting began, then continues training on the remaining tasks. Repeating this process ensures that no single task dominates training and that every task gets a fair chance to be learned well.
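To make the detect-revert-exclude loop concrete, here is a minimal toy sketch in Python. Everything in it is a hypothetical stand-in, not the authors' implementation: the per-task validation-loss curves are made-up numbers, the overfitting criterion (validation loss rising for one step) is a simplification, and `msft_schedule` is an illustrative name. The sketch only computes which task would be excluded first and which checkpoint the model would revert to.

```python
def msft_schedule(val_curves):
    """Toy illustration of mSFT's revert-and-exclude search.

    val_curves maps each sub-dataset name to its validation-loss
    trajectory (one value per training step). Returns a list of
    (task, checkpoint_step) pairs: the task detected as overfitting
    earliest, paired with the step of the checkpoint to revert to
    before continuing on the remaining mixture.
    """
    active = set(val_curves)
    removed = []
    start = 0
    horizon = max(len(c) for c in val_curves.values())
    while active:
        overfit = None
        # Scan forward from the last revert point for the first active
        # task whose validation loss starts rising (stand-in criterion).
        for step in range(start, horizon - 1):
            for task in sorted(active):
                curve = val_curves[task]
                if step + 1 < len(curve) and curve[step + 1] > curve[step]:
                    overfit = (task, step)
                    break
            if overfit:
                break
        if overfit is None:
            break  # no remaining task overfits within the budget
        task, step = overfit
        removed.append((task, step))  # "revert" to the checkpoint at `step`
        active.remove(task)           # exclude the task from the mixture
        start = step                  # resume training from that checkpoint
    return removed


# Made-up loss curves: the fast-learning task bottoms out and rises,
# while the slow-learning task keeps improving.
curves = {
    "math":        [3.0, 2.0, 1.5, 1.6],
    "translation": [3.0, 2.8, 2.6, 2.4],
}
print(msft_schedule(curves))  # -> [('math', 2)]
```

In a real training loop, "step" would be a saved model checkpoint and the overfitting test would use held-out validation loss per sub-dataset, but the control flow (exclude the earliest overfitter, rewind, keep going) is the same shape as described above.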

Why it matters?

This method is important because it allows us to get better performance out of language models without necessarily needing to use more computing power. In some cases, mSFT even *reduces* the amount of computation needed for training. This means we can build more capable and efficient language models, which is crucial as these models become increasingly important in many areas of technology.

Abstract

Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that specific optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms mSFT maintains robust gains across diverse dataset sizes, task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.