VideoGameBunny: Towards vision assistants for video games
Mohammad Reza Taesiri, Cor-Paul Bezemer
2024-07-23

Summary
This paper introduces VideoGameBunny, a specialized model designed to help AI systems better understand images from video games. It aims to improve how AI interacts with video game content by providing a large game-specific dataset and demonstrating that a relatively small model trained on it can perform well.
What's the problem?
Current AI models struggle to accurately understand video game scenes. They often make mistakes, such as misinterpreting what is happening in a game or hallucinating incorrect descriptions of game content. This is especially true for open-source models, which often lack the game-specific training needed for high-quality performance in this domain.
What's the solution?
The authors developed VideoGameBunny, which builds on an existing model called Bunny. They created a large dataset of 185,259 images from 413 different video games and paired these images with 389,565 instructions, including captions and question-answer pairs, to train the model to understand video game visuals better. Their experiments showed that VideoGameBunny, despite having less than a quarter of the parameters of the state-of-the-art model LLaVA-1.6-34b, could outperform it on various video game understanding tasks.
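To make the dataset's structure concrete, a single training record pairing a screenshot with its annotations might look like the minimal Python sketch below. The field names, file paths, and element labels here are hypothetical illustrations, not the paper's released schema:

```python
# A hypothetical sketch of one training record; field names, paths, and
# element labels are illustrative assumptions, not the released schema.
example_record = {
    "image": "images/game_0042/frame_000123.png",  # one of 185,259 screenshots
    "game": "Example Game Title",                  # one of 413 titles
    "caption": "A knight stands on a cliff overlooking a ruined castle at dusk.",
    "qa_pairs": [
        {
            "question": "What time of day is shown in the scene?",
            "answer": "It appears to be dusk, with an orange sky behind the castle.",
        }
    ],
    # The paper also provides a JSON representation of 16 scene elements
    # for a subset of images; the keys below are invented examples.
    "elements": {
        "characters": ["knight"],
        "setting": "cliff overlooking a ruined castle",
        "ui_elements": ["health bar"],
    },
}

print(example_record["caption"])
```

The paper's results suggest that this variety of high-quality, game-related annotations (captions, question-answer pairs, and structured element descriptions) is what allows a comparatively small model to compete with much larger general-purpose ones.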
Why it matters?
This research is important because it opens up new possibilities for AI applications in gaming, such as AI agents that play games, provide commentary, or assist in debugging. Making AI more effective at understanding video games could enhance player experiences and lead to more advanced gaming technologies.
Abstract
Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements for 136,974 images. Our experiments show that our high-quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVA-1.6-34b (which has more than 4x the number of parameters). Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at https://videogamebunny.github.io/