CritiQ: Mining Data Quality Criteria from Human Preferences
Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
2025-02-27
Summary
This paper introduces a new method called CritiQ that helps choose high-quality data for training AI language models. It uses a team of AI agents to figure out what makes data good by looking at just a small set of examples that humans have rated.
What's the problem?
Current ways of picking good data for AI training need a lot of human effort and expert knowledge. They often use hand-designed rules or rely on existing AI models, which can introduce biases and aren't always easy to understand.
What's the solution?
The researchers created CritiQ, whose core component is called CritiQ Flow. This system has a manager agent that comes up with criteria for judging data quality, and worker agents that use those criteria to compare pairs of data examples. It starts with a knowledge base of criteria drawn from previous research and learns from only about 30 pairs of examples that humans have rated. CritiQ then trains a scorer so it can rate large amounts of data quickly.
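The worker-agent step can be pictured with a small sketch. In the paper, each worker is an LLM applying one verbal criterion to a pair of documents; here the criteria strings and the `worker_judge` function are illustrative stand-ins (a toy heuristic, not the actual model calls), and only the majority-voting structure reflects the described flow.

```python
from collections import Counter

# Illustrative criteria; in CritiQ these are verbal quality criteria
# evolved by the manager agent (these strings are made up for the sketch).
CRITERIA = [
    "prefers well-structured text with clear explanations",
    "prefers self-contained, correct examples",
    "prefers consistent, idiomatic style",
]

def worker_judge(criterion: str, doc_a: str, doc_b: str) -> str:
    """Toy stand-in for an LLM worker agent: votes for the longer document.

    A real worker would prompt a language model with the criterion and
    both documents, then return its preferred side.
    """
    return "A" if len(doc_a) >= len(doc_b) else "B"

def majority_vote(doc_a: str, doc_b: str, criteria=CRITERIA) -> str:
    """Aggregate one vote per criterion and return the majority winner."""
    votes = Counter(worker_judge(c, doc_a, doc_b) for c in criteria)
    return votes.most_common(1)[0][0]

print(majority_vote("def add(a, b):\n    return a + b  # summed", "x=1"))
```

Pairwise judgments like this, compared against the ~30 human-annotated pairs, are what lets the manager agent measure how well a criterion set matches human preferences.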
Why it matters?
This matters because it makes it much easier and faster to find good data for training AI. Better data means better AI models that can understand and generate language more accurately. CritiQ also helps explain why certain data is considered good, which is important for trust and further improvements. The researchers showed it works well for different types of tasks like coding, math, and logic, and it helped create AI models that performed better than those trained on randomly chosen data.
Abstract
Language models heavily depend on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
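The final selection step described in the abstract can be sketched as follows. Once the CritiQ Scorer is trained, every document receives a scalar quality score and the top-scoring fraction of the corpus is kept. The `quality_score` function below is a toy proxy standing in for the trained scorer; the corpus, the keep fraction, and the scoring heuristic are all illustrative assumptions.

```python
def quality_score(doc: str) -> float:
    """Toy proxy score (stand-in for the trained CritiQ Scorer):
    fraction of alphabetic characters in the document."""
    return sum(ch.isalpha() for ch in doc) / max(len(doc), 1)

def select_top(docs: list[str], keep_fraction: float = 0.5) -> list[str]:
    """Rank documents by quality score and keep the top fraction."""
    ranked = sorted(docs, key=quality_score, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

corpus = ["clean prose sample", "@@##!! 123", "another readable doc", "%%%%"]
print(select_top(corpus, keep_fraction=0.5))
```

Scoring with a lightweight trained model rather than re-running the agent workflow on every document is what makes selection over large corpora efficient.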