SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation
Iman Barati, Mostafa Amiri, Heshaam Faili
2025-09-16

Summary
This paper introduces a new technique called SearchInstruct, which helps create better training data for large language models (LLMs) that need to perform well on specific tasks or subject areas.
What's the problem?
Large language models need a lot of training data to become really good at following instructions and learning from examples. However, getting enough high-quality training data for specialized areas, like medical information or legal documents, is difficult: such data is scarce and expensive to collect. Simply reusing existing general-purpose data isn't always enough to make the model perform well in these specific fields.
What's the solution?
SearchInstruct starts with a small set of questions written by people who understand the specific subject. Then, it uses a large language model to create more questions based on those initial ones. Crucially, for each question, it automatically searches for relevant information from reliable sources and uses it to generate accurate and helpful answers. This process builds a larger, higher-quality training dataset than could easily be created by hand.
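The three steps above (expand seed questions, retrieve supporting context, generate grounded answers) can be sketched as a simple pipeline. This is a minimal illustration, not the authors' implementation: the LLM and search calls are replaced with deterministic stubs, and all function names are hypothetical.

```python
# Sketch of a SearchInstruct-style pipeline. The LLM and retrieval calls
# are stubs; a real system would call a model API and a search backend.

def expand_questions(seed_questions, n_variants=2):
    """Stub: an LLM would paraphrase and extend each seed question."""
    expanded = []
    for q in seed_questions:
        expanded.append(q)  # keep the original question
        for i in range(1, n_variants + 1):
            expanded.append(f"{q} (variant {i})")  # stand-in for LLM output
    return expanded

def retrieve_context(question):
    """Stub: a search backend would return passages relevant to the question."""
    return [f"Reference passage relevant to: {question}"]

def generate_answer(question, context):
    """Stub: an LLM would write an answer grounded in the retrieved context."""
    return f"Answer to '{question}' based on {len(context)} retrieved passage(s)."

def build_sft_dataset(seed_questions):
    """Assemble instruction-response pairs for supervised fine-tuning."""
    dataset = []
    for q in expand_questions(seed_questions):
        ctx = retrieve_context(q)
        dataset.append({
            "instruction": q,
            "context": ctx,
            "response": generate_answer(q, ctx),
        })
    return dataset

if __name__ == "__main__":
    seeds = ["What are the side effects of aspirin?"]
    data = build_sft_dataset(seeds)
    print(len(data))  # 1 seed -> 3 pairs (original + 2 variants)
```

The key design point is that answers are generated from retrieved sources rather than from the model's parametric memory alone, which is what makes the resulting dataset accurate for specialized domains.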
Why does it matter?
This method is important because it improves the performance of large language models in specialized areas without requiring huge amounts of manually created data. It also offers an efficient way to update existing models with new information, making them more adaptable and useful across a wider range of applications. The researchers have made their code and data publicly available so others can use and build upon their work.
Abstract
Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: [https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct)