VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press

2025-05-28

Summary

This paper introduces VideoGameBench, a new way to test how well AI models that understand both images and text can play popular video games just by looking at the screen and knowing the main goals.

What's the problem?

The problem is that while AI has become good at understanding images and language separately, these models still struggle to play video games the way humans do, because games demand quick decisions, interpretation of complex visuals, and pursuit of goals all at the same time.

What's the solution?

To tackle this, the researchers created VideoGameBench, which challenges AI models to play real video games using only what they see on the screen and a general idea of what they're supposed to accomplish. This helps reveal where these models struggle with skills like reacting in real time and thinking ahead.
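The setup described above, an agent that sees only screen pixels plus a high-level objective, can be sketched as a simple observe-act loop. This is a minimal illustration, not the benchmark's actual API: the names `Observation`, `policy_stub`, and `run_episode` are hypothetical, and the stub stands in for a real vision-language model call.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """A raw frame from the game screen; no internal game state is exposed."""
    pixels: bytes

def policy_stub(obs: Observation, objective: str) -> str:
    """Hypothetical stand-in for a VLM call: maps a screenshot plus a
    high-level objective to a single controller action."""
    # A real agent would send the frame and the objective text to a
    # vision-language model and parse its reply into a button press.
    return "PRESS_A"

def run_episode(frames, objective, policy, max_steps=10):
    """Minimal agent loop: look at the screen, pick an action, repeat."""
    actions = []
    for frame in frames[:max_steps]:
        actions.append(policy(Observation(frame), objective))
    return actions
```

In a real-time game, each iteration of this loop must finish before the next frame arrives, which is exactly where the paper finds current models struggling.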

Why it matters?

This matters because if AI can get better at playing video games in a human-like way, it could lead to smarter AI that can handle real-world tasks involving vision, decision-making, and adapting to new situations.

Abstract

VideoGameBench evaluates vision-language models' abilities in real-time video game interaction using only visual inputs and high-level objectives, highlighting challenges in human-like skills.