ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
Liyang He, Yuren Zhang, Ziwei Zhu, Zhenghui Li, Shiwei Tong
2025-10-30
Summary
This paper introduces ChronoPlay, a new system for automatically creating benchmarks to test how well AI models can answer questions about video games, specifically using a technique called Retrieval Augmented Generation (RAG).
What's the problem?
Evaluating AI models in fast-changing areas like gaming is difficult because games are constantly updated with new content, and what players are interested in changes quickly too. Existing ways to test these models aren't good at keeping up with these 'dual dynamics'. Also, to make sure the tests are realistic, the questions need to sound like actual players would ask them, which is hard to achieve automatically.
What's the solution?
The researchers built ChronoPlay, which automatically creates and updates benchmarks for game-related questions. It does this by tracking both official game updates *and* what players are talking about online. It then draws on both sources when generating questions and answers, so the results are factually accurate and phrased the way real players would ask them. They tested it on three different games to show how it works.
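The dual-source idea can be sketched roughly as follows. This is a minimal illustration, not ChronoPlay's actual pipeline: the `Snapshot` structure, field names, and the intersection rule (keep only topics present in both sources) are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Snapshot:
    """One time-stamped view of the two dynamic sources (hypothetical structure)."""
    as_of: date
    patch_notes: dict[str, str]    # official source: topic -> latest factual text
    community_topics: list[str]    # player source: topics players discuss, most popular first

def synthesize_benchmark(snapshot: Snapshot, k: int = 3) -> list[dict]:
    """Dual-source synthesis sketch: only topics that appear in BOTH the
    official notes (factual grounding) and the community feed (player-centric
    authenticity) become benchmark items."""
    items = []
    for topic in snapshot.community_topics[:k]:
        answer = snapshot.patch_notes.get(topic)
        if answer is None:
            continue  # popular but unverifiable against official sources -> skip
        items.append({
            "question": f"What changed about {topic} in the latest update?",
            "answer": answer,
            "as_of": snapshot.as_of.isoformat(),
        })
    return items

# Example: a new patch snapshot produces a fresh, grounded benchmark item.
snap = Snapshot(
    as_of=date(2025, 10, 30),
    patch_notes={"crafting": "Crafting costs were reduced by 20%."},
    community_topics=["crafting", "new skins"],  # "new skins" has no official note
)
print(synthesize_benchmark(snap))
```

Re-running this each time either source changes is what makes the benchmark "continuous": a new patch or a shift in community interest yields a different snapshot and therefore different items.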
Why does it matter?
This work is important because it provides a standardized way to measure how well AI models perform in the gaming world. This will help developers improve these models and create better experiences for players, and it provides a tool for ongoing evaluation as games evolve.
Abstract
Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and the player community to ensure both factual correctness and authentic query patterns. We instantiate our framework on three distinct games to create the first dynamic RAG benchmark for the gaming domain, offering new insights into model performance under these complex and realistic conditions. Code is available at: https://github.com/hly1998/ChronoPlay.