OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu

2025-12-17

Summary

This paper starts from a gap: language models themselves are benchmarked rigorously, but the post-training datasets used to fine-tune them are rarely evaluated in any systematic way. It introduces an open platform called OpenDataArena (ODA) for measuring how much a given post-training dataset actually contributes to model quality.

What's the problem?

The datasets used for post-training large language models are largely a black box. We often don't know what's *in* them, where the data came from, or how particular data characteristics change model behavior. That makes it hard to reproduce results, to explain why a model behaves the way it does, and to improve training data deliberately rather than by luck. It's like trying to bake a cake without knowing the ingredients or how they interact.

What's the solution?

The researchers created OpenDataArena, a platform for evaluating post-training data that has four parts: a unified training and evaluation pipeline, so datasets are compared fairly across different models (e.g., Llama, Qwen) and domains; a scoring framework that profiles data quality along tens of distinct axes; an interactive lineage explorer that traces where a dataset's components came from; and an open-source toolkit for training, evaluation, and scoring that anyone can use. They ran it at scale to see what they could learn: over 120 training datasets, 22 benchmarks, more than 600 training runs, and 40 million processed data points.

Why does it matter?

This work is important because it pushes us towards a more scientific approach to building AI. Instead of just randomly trying different datasets, we can now systematically study how data impacts model performance. This could lead to better, more reliable AI models and a deeper understanding of how these models learn, ultimately moving the field from guesswork to a more principled 'Data-Centric AI'.

Abstract

The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box--characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA--covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points--reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.
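The idea behind the unified training-evaluation pipeline can be illustrated with a minimal sketch: hold the base model and training recipe fixed, fine-tune once per dataset, and rank datasets by their average gain over the base model across a shared set of benchmarks. This is not ODA's actual code or scoring formula; the function name `rank_datasets`, the averaging rule, the dataset and benchmark names, and all numbers below are hypothetical placeholders chosen for illustration.

```python
from statistics import mean


def rank_datasets(scores, baseline):
    """Rank datasets by mean benchmark gain over a shared base model.

    scores:   {dataset_name: {benchmark_name: score}} for models fine-tuned
              on each dataset with an identical recipe (assumed).
    baseline: {benchmark_name: score} for the untuned base model.
    Returns a list of (dataset_name, mean_gain), best first.
    """
    ranking = []
    for name, per_benchmark in scores.items():
        # Compare only on benchmarks shared with the baseline, so every
        # dataset is judged on the same evaluation set.
        gains = [per_benchmark[b] - baseline[b] for b in baseline]
        ranking.append((name, mean(gains)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Made-up scores, purely illustrative (not results from the paper).
baseline = {"bench_a": 40, "bench_b": 50}
scores = {
    "dataset_x": {"bench_a": 46, "bench_b": 52},
    "dataset_y": {"bench_a": 41, "bench_b": 59},
}

for name, gain in rank_datasets(scores, baseline):
    print(f"{name}: mean gain {gain}")
```

Because only the dataset varies between runs, differences in the resulting ranking can be attributed to the data rather than to the model or the hyperparameters, which is the sense in which the comparison is meant to be fair.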
