Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development

Daoyuan Chen, Haibin Wang, Yilun Huang, Ce Ge, Yaliang Li, Bolin Ding, Jingren Zhou

2024-07-17

Summary

This paper presents the Data-Juicer Sandbox, a tool suite designed to help researchers and developers improve multimodal data and models together rather than in isolation.

What's the problem?

As artificial intelligence (AI) technology advances, especially with large-scale models that can process different types of data (like text, images, and audio), optimizing these models becomes challenging. Historically, models and the data they are trained on have been developed separately, which leads to wasted resources and weaker results.

What's the solution?

The authors propose the Data-Juicer Sandbox as a comprehensive platform for integrated co-development of data and models. The sandbox uses a workflow called 'Probe-Analyze-Refine' to help users quickly experiment with different data and model configurations. The authors validated this workflow on state-of-the-art LLaVA-like and DiT-based models and report significant performance improvements, such as topping the VBench leaderboard. The sandbox also provides insights into how data quality and diversity affect model behavior.
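To make the 'Probe-Analyze-Refine' idea concrete, here is a minimal toy sketch in Python. It is not the Data-Juicer Sandbox API and is not taken from the paper: the function names (`probe`, `analyze`, `refine`), the sample fields (`quality`, `length`), the recipes, and the `mean_quality` scoring stand-in are all hypothetical, chosen only to illustrate one plausible reading of the loop (try candidate data recipes cheaply, rank them against a baseline, then combine the ones that helped).

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types for illustration only; not part of Data-Juicer.
Sample = Dict[str, float]
Recipe = Callable[[Sequence[Sample]], List[Sample]]


def mean_quality(data: Sequence[Sample]) -> float:
    """Stand-in for 'train a small model and evaluate it' (an assumption)."""
    if not data:
        return 0.0
    return sum(s["quality"] for s in data) / len(data)


def probe(pool: Sequence[Sample], recipes: Dict[str, Recipe],
          score: Callable[[Sequence[Sample]], float]) -> Dict[str, float]:
    """Probe: apply each candidate recipe to the pool and score the result."""
    return {name: score(recipe(pool)) for name, recipe in recipes.items()}


def analyze(scores: Dict[str, float], baseline: float) -> List[Tuple[str, float]]:
    """Analyze: keep recipes that beat the baseline, ranked by gain (best first)."""
    gains = [(name, s - baseline) for name, s in scores.items() if s > baseline]
    return sorted(gains, key=lambda item: item[1], reverse=True)


def refine(pool: Sequence[Sample], recipes: Dict[str, Recipe],
           ranked: List[Tuple[str, float]], top_k: int = 2) -> List[Sample]:
    """Refine: chain the top-k helpful recipes into one refined dataset."""
    data = list(pool)
    for name, _gain in ranked[:top_k]:
        data = recipes[name](data)
    return data


# Toy data pool and hypothetical filtering recipes.
pool: List[Sample] = [
    {"quality": 0.9, "length": 50.0},
    {"quality": 0.2, "length": 40.0},
    {"quality": 0.8, "length": 200.0},
    {"quality": 0.1, "length": 300.0},
]
recipes: Dict[str, Recipe] = {
    "high_quality": lambda d: [s for s in d if s["quality"] >= 0.5],
    "short": lambda d: [s for s in d if s["length"] <= 100.0],
    "long": lambda d: [s for s in d if s["length"] > 100.0],
}

baseline = mean_quality(pool)
scores = probe(pool, recipes, mean_quality)
ranked = analyze(scores, baseline)
refined = refine(pool, recipes, ranked, top_k=2)
```

In a real setting the scoring step would be an actual training and evaluation run on a small budget, which is the expensive part the cheap probing stage is meant to keep under control.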

Why does it matter?

This research is important because it gives AI developers a more efficient way to build powerful models that can leverage diverse data sources. By tightening the link between data development and model development, the Data-Juicer Sandbox can lead to more capable and versatile AI applications, which could benefit fields like healthcare, entertainment, and education.

Abstract

The emergence of large-scale multi-modal generative models has drastically advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to historically isolated paths of model-centric and data-centric developments, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a novel sandbox suite tailored for integrated data-model co-development. This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models. Our proposed "Probe-Analyze-Refine" workflow, validated through applications on state-of-the-art LLaVA-like and DiT-based models, yields significant performance boosts, such as topping the VBench leaderboard. We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior. With the hope of fostering deeper understanding and future progress in multi-modal data and generative modeling, our code, datasets, and models are maintained and accessible at https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md.
