Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
Kaixin Ding, Yang Zhou, Xi Chen, Miao Yang, Jiarong Ou, Rui Chen, Xin Tao, Hengshuang Zhao
2025-12-19
Summary
This paper introduces a new method, called Alchemist, for intelligently choosing which images and text descriptions to use when training AI models that generate images from text. These models, such as Imagen and Stable Diffusion, have become remarkably capable, but their quality still depends heavily on high-quality training data.
What's the problem?
Current text-to-image AI models are only as good as the data they're trained on. Much of the data available online is poor quality, repetitive, or simply unhelpful for learning. Manually sorting through all this data is time-consuming and expensive, and existing automatic methods aren't very sophisticated, typically judging a sample by a single simple feature. While meta-learning techniques for choosing data exist for language models, they haven't been adapted for images.
What's the solution?
Alchemist tackles this problem by learning to rate how helpful each image-text pair is *before* using it for training. It does this by measuring how each training sample would change the model itself, identifying which samples have the biggest impact on improving image generation. It works in two steps: first, a 'rater' estimates each sample's importance from gradient information (the signals that tell the model how to adjust its parameters), and then a 'pruning' step selects the most informative samples for training. The process is fully automatic and scales to very large datasets.
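To make the meta-gradient idea concrete, here is a minimal sketch of gradient-alignment influence scoring. It illustrates the general principle, not the paper's actual rater; the `model`, `loss_fn`, and batch objects are hypothetical placeholders.

```python
# Sketch only: scores a sample by how well its gradient aligns with the
# gradient of a held-out "meta" batch. Not Alchemist's actual rater.
import torch

def flat_grad(loss, params):
    """Flatten d(loss)/d(params) into a single vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, loss_fn, candidates, meta_batch):
    """A positive dot product means a training step on the sample also
    reduces the held-out loss: a first-order proxy for its influence."""
    params = [p for p in model.parameters() if p.requires_grad]
    meta_g = flat_grad(loss_fn(model, meta_batch), params)  # reference direction
    scores = []
    for sample in candidates:
        g = flat_grad(loss_fn(model, sample), params)
        scores.append(torch.dot(g, meta_g).item())
    return scores
```

In Alchemist itself, a lightweight rater is trained to estimate these influence scores from gradient information, which is presumably what keeps scoring tractable at the scale of web-crawled datasets.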
Why it matters?
This research is important because it offers a way to train powerful image-generating AI models more efficiently and with better results. By automatically selecting the best data, Alchemist can achieve higher-quality images and better performance using only a fraction of the original dataset, saving time and computational resources. It's the first system to use this kind of 'meta-gradient' approach for selecting data in text-to-image models.
Abstract
Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches to Text-to-Image data filtering rely on costly manual curation or heuristic scoring based on single-dimensional features. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose **Alchemist**, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
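The abstract names a Shift-Gsampling strategy for the pruning stage but does not spell out its mechanics, so the sketch below substitutes generic score-proportional sampling without replacement. Treat it as a stand-in illustrating score-based subset selection, not the paper's algorithm; `keep_ratio` and `temperature` are hypothetical knobs.

```python
# Illustrative pruning step: sample a subset with probability increasing
# in the rater's influence score. Not the paper's Shift-Gsampling.
import numpy as np

def select_subset(scores, keep_ratio=0.5, temperature=1.0, seed=0):
    """Return indices of a subset sampled in proportion to softmax(score).
    Stochastic selection keeps some lower-scoring samples for diversity,
    rather than making a hard top-k cut."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=np.float64)
    logits = (scores - scores.max()) / temperature  # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    k = int(len(scores) * keep_ratio)
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# e.g., keep an influence-weighted 50% of the data, mirroring the
# abstract's setting where a selected half outperforms the full set:
# subset_idx = select_subset(scores, keep_ratio=0.5)
```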