Rethinking Data Selection at Scale: Random Selection is Almost All You Need
Tingyu Xia, Bowen Yu, Kai Dang, An Yang, Yuan Wu, Yuan Tian, Yi Chang, Junyang Lin
2024-10-15

Summary
This paper revisits data selection for supervised fine-tuning (SFT) of large language models at realistic scale and finds that, on million-scale data pools, most self-scoring selection methods fail to clearly beat simple random selection.
What's the problem?
Supervised fine-tuning works best when a small but representative subset is selected from a much larger pool of instruction data, so that training on the subset matches or exceeds training on the full pool. However, most existing data selection techniques were designed and evaluated on small data pools, so it is unclear whether they still help in real-world SFT settings with millions of examples.
What's the solution?
The authors replicate several self-scoring selection methods (ones that do not rely on an external model for assistance) on two million-scale datasets. Nearly all of them struggle to significantly outperform random selection at this scale, and the comparisons indicate that diversity in the selected data matters more than focusing only on high-quality examples. The paper also analyzes why several current approaches break down on large pools, and shows that a simple filter based on token length is a stable and efficient way to improve results.
Why it matters?
This research is significant because it challenges the assumption that elaborate data-selection pipelines are necessary for SFT: at realistic scale, random selection is a very strong baseline, and simple heuristics such as token-length filtering can be especially helpful for relatively weaker base models like Llama3 when training on long text.
Abstract
Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from the larger pool, such that fine-tuning with this subset achieves results comparable to or even exceeding those obtained using the entire dataset. However, most existing data selection techniques are designed for small-scale data pools, which fail to meet the demands of real-world SFT scenarios. In this paper, we replicated several self-scoring methods, those that do not rely on external model assistance, on two million-scale datasets, and found that nearly all methods struggled to significantly outperform random selection when dealing with such large-scale data pools. Moreover, our comparisons suggest that, during SFT, diversity in data selection is more critical than simply focusing on high-quality data. We also analyzed the limitations of several current approaches, explaining why they perform poorly on large-scale datasets and why they are unsuitable for such contexts. Finally, we found that filtering data by token length offers a stable and efficient method for improving results. This approach, particularly when training on long text data, proves highly beneficial for relatively weaker base models, such as Llama3.
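
To make the two baselines the abstract highlights concrete, here is a minimal sketch of uniform random selection and token-length filtering over an instruction-tuning pool. The field names ("instruction", "response"), the tokenizer choice, and the selection budget are illustrative assumptions, not the authors' exact implementation.

```python
import random
from transformers import AutoTokenizer

# Assumption: any subword tokenizer works for length counting; the paper's
# exact tokenizer and data schema are not specified here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def random_select(pool, budget, seed=42):
    """Uniform random selection: the strong baseline most methods fail to beat."""
    rng = random.Random(seed)
    return rng.sample(pool, k=min(budget, len(pool)))

def token_length_select(pool, budget):
    """Keep the longest examples by token count, a simple length-based filter
    in the spirit of the stable heuristic the paper reports."""
    def n_tokens(example):
        text = example["instruction"] + example["response"]
        return len(tokenizer(text, add_special_tokens=False)["input_ids"])
    return sorted(pool, key=n_tokens, reverse=True)[:budget]

# Hypothetical usage on a pool of ~2M instruction-response dicts:
# subset = token_length_select(pool, budget=50_000)
# baseline = random_select(pool, budget=50_000)
```

In this sketch, both functions return a subset of the pool under the same budget, so fine-tuning runs on the two subsets can be compared directly against each other and against training on the full pool.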