TCIA: A Task-Centric Instruction Augmentation Method for Instruction Finetuning
Simin Ma, Shujian Liu, Jun Tan, Yebowen Hu, Song Wang, Sathish Reddy Indurthi, Sanqiang Zhao, Liwei Wu, Jianbing Han, Kaiqiang Song
2025-08-29
Summary
This paper focuses on improving how we create training data for large language models (LLMs) so they perform better on specific tasks, rather than just being generally good at everything.
What's the problem?
Currently, methods for creating diverse training data for LLMs often focus on making the data varied and high-quality, but they don't always ensure the data is actually relevant to the specific job the LLM will be doing. Most real-world applications don't need a model that can do *everything*; they need a model that's really good at *one* thing. Ignoring this task-specific need limits how well LLMs perform in practical situations.
What's the solution?
The researchers developed a new method called Task-Centric Instruction Augmentation (TCIA). TCIA treats each instruction as a combination of what the user is asking for (the 'query') and the specific rules or limitations attached to it (the 'constraints'). By holding the query fixed and varying the constraints, the system can generate many different instructions that all stay focused on the task at hand, ensuring both variety and relevance. In short, it expands existing instructions in a way that keeps them aligned with the task; a rough sketch of the idea follows.
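The core idea is that an instruction decomposes into a fixed query plus a set of constraints, and new task-aligned instructions come from recombining constraints around that query. Here is a minimal sketch of what such a decomposition and constraint-swap augmentation could look like; all names and the constraint pool are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass, field
import itertools

@dataclass(frozen=True)
class Instruction:
    query: str  # the core task the user is asking for
    constraints: frozenset = field(default_factory=frozenset)  # rules/limits attached to the query

    def render(self) -> str:
        # Turn the (query, constraints) pair back into a natural-language instruction.
        return " ".join([self.query] + sorted(self.constraints))

# Hypothetical pool of task-relevant constraints; the actual method derives
# candidates from seed instructions rather than a hand-written list.
CONSTRAINT_POOL = [
    "Respond in under 100 words.",
    "Use a formal tone.",
    "Answer in bullet points.",
    "Include exactly one example.",
]

def augment(seed: Instruction, pool: list[str], k: int = 2) -> list[Instruction]:
    """Expand one seed into many task-aligned variants by layering new
    constraint combinations on top while keeping the query fixed."""
    return [
        Instruction(seed.query, seed.constraints | frozenset(combo))
        for combo in itertools.combinations(pool, k)
    ]

seed = Instruction(
    "Summarize the following meeting transcript.",
    frozenset({"Highlight action items."}),
)
for variant in augment(seed, CONSTRAINT_POOL):
    print(variant.render())
```

Keeping the query fixed is what preserves task relevance; the diversity comes from the different constraint combinations layered on top of it.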
Why it matters?
This work is important because it shows a way to significantly improve the performance of open-source LLMs on real-world tasks – by an average of almost 9% in their tests! Importantly, it does this *without* making the models worse at following general instructions. This means TCIA is a practical and efficient way to customize LLMs for specific applications, potentially even surpassing the performance of more expensive, closed-source models.
Abstract
Diverse instruction data is vital for effective instruction tuning of large language models, as it enables the model to generalize across different types of inputs. Building such a diversified instruction dataset is an essential step in this process. Existing approaches often leverage large language models to automatically explore and generate diverse instructions, ensuring both data diversity and quality. However, they tend to overlook an important factor in real-world applications: on-task relevance. In practice, only a few real-world applications require a truly general-purpose model; most benefit from task-specific knowledge tailored to their particular use case. It is therefore important to develop instruction augmentation methods that not only maintain diversity but are also optimized for specific, real-world scenarios. We thus introduce Task-Centric Instruction Augmentation (TCIA), a framework that systematically expands instructions while preserving both diversity and task alignment. By representing instructions in a discrete query-constraints space, TCIA creates a rich set of task-relevant instructions and enables models to generalize to these task-specific instructions without sacrificing overall performance. Experiments show that TCIA improves open-source LLMs' performance by an average of 8.7% across four real-world, task-specific applications, in some cases outperforming leading closed-source models. These improvements do not compromise general instruction-following ability, making TCIA a scalable and efficient solution for adapting LLMs to real-world, task-focused applications.