
Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

Tianyi Bai, Ling Yang, Zhen Hao Wong, Jiahui Peng, Xinlin Zhuang, Chi Zhang, Lijun Wu, Jiantao Qiu, Wentao Zhang, Binhang Yuan, Conghui He

2024-10-14


Summary

This paper introduces multi-agent collaborative data selection, an approach that improves the pretraining of large language models (LLMs) by efficiently choosing the most useful training data.

What's the problem?

Training large language models requires enormous amounts of data, but not all of it is equally useful. Existing data selection methods rely on different criteria that can conflict with one another, which makes it hard to combine them into a single selection strategy. These conflicts waste training compute and can hurt the resulting model's performance.

What's the solution?

The authors propose a framework in which each existing data selection method acts as an independent agent, and an agent console coordinates them: it dynamically combines their scores and resolves conflicts as training progresses, so the most appropriate data is chosen at each stage. This collaborative approach improves data efficiency and speeds up convergence; a rough sketch of the idea follows.
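To make the idea concrete, here is a minimal Python sketch of how such a framework could be wired together. It is not the authors' implementation: the agent names, scoring heuristics, and weight-update rule below are illustrative assumptions, standing in for whatever quality, domain, or influence signals a real pipeline would use.

```python
# Minimal sketch of multi-agent collaborative data selection (illustrative only,
# not the paper's code). Each "agent" scores documents by its own criterion; a
# console combines the scores with weights that can be adjusted as training
# feedback arrives, which is one simple way to arbitrate conflicting criteria.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Agent:
    name: str
    score_fn: Callable[[str], float]  # maps a document to a selection score in [0, 1]
    weight: float = 1.0               # influence on the combined score, tuned by the console


class AgentConsole:
    def __init__(self, agents: List[Agent]):
        self.agents = agents

    def score(self, doc: str) -> float:
        # Weighted average of all agents' scores for one document.
        total_weight = sum(a.weight for a in self.agents)
        return sum(a.weight * a.score_fn(doc) for a in self.agents) / total_weight

    def select(self, docs: List[str], k: int) -> List[str]:
        # Keep the top-k documents by combined score for the next training stage.
        return sorted(docs, key=self.score, reverse=True)[:k]

    def update_weights(self, feedback: Dict[str, float]) -> None:
        # Upweight agents whose preferences correlated with training gains
        # (e.g., validation-loss improvement). Purely illustrative update rule.
        for agent in self.agents:
            agent.weight = max(1e-3, agent.weight * (1.0 + feedback.get(agent.name, 0.0)))


# Hypothetical agents: a crude length-based "quality" proxy and a keyword-based "domain" proxy.
quality = Agent("quality", lambda d: min(len(d.split()) / 100.0, 1.0))
domain = Agent("domain", lambda d: 1.0 if "python" in d.lower() else 0.2)

console = AgentConsole([quality, domain])
batch = ["Short note.", "A long Python tutorial " * 20, "Unrelated long text " * 20]
print(console.select(batch, k=2))
console.update_weights({"quality": 0.1, "domain": -0.05})
```

In this toy version, the console simply reweights agents after each round of feedback; the paper's framework integrates agent information dynamically throughout LLM training, but the exact coordination mechanism is more involved than this sketch.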

Why it matters?

This research is important because it shows how using multiple perspectives in data selection can enhance the training of LLMs. By improving how models are pretrained, this method could lead to better performance in various applications, making AI systems more effective and reliable.

Abstract

Efficient data selection is crucial to accelerate the pretraining of large language models (LLMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LLM pretraining. To tackle this problem, we propose a novel multi-agent collaborative data selection mechanism. In this framework, each data selection method serves as an independent agent, and an agent console is designed to dynamically integrate the information from all agents throughout the LLM training process. We conduct extensive empirical studies to evaluate our multi-agent framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LLM training, and achieves an average performance gain of 10.5% across multiple language model benchmarks compared to the state-of-the-art methods.