Measuring General Intelligence with Generated Games

Vivek Verma, David Huang, William Chen, Dan Klein, Nicholas Tomlin

2025-05-14

Summary

This paper introduces gg-bench, a new way to test how well AI language models can reason and solve problems by having them play games that are generated on the fly by other AI models.

What's the problem?

Most tests of AI reasoning use fixed sets of questions or games. These can become stale or even be memorized by the models, so they don't reveal whether an AI can actually handle new and unexpected situations.

What's the solution?

The researchers built a system in which an AI writes the rules for brand-new two-player games, implements them as playable code, and then trains a reinforcement learning agent to play each game. Language models are then evaluated by how often they can beat these trained agents at games they have never seen before.
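The evaluation step above boils down to computing a winrate over many head-to-head games. The sketch below illustrates that loop with a toy hand-written game and a random policy standing in for both the trained agent and the language model; in gg-bench the game environments themselves are generated by an AI, so `NimGame` and `random_policy` here are purely hypothetical placeholders, not the paper's actual code.

```python
import random

class NimGame:
    """Toy stand-in for a generated two-player game (simple Nim).
    In gg-bench, the rules and environment code are written by an LLM."""

    def __init__(self, stones=7):
        self.stones = stones
        self.current = 0  # player 0 moves first

    def legal_moves(self):
        return [n for n in (1, 2, 3) if n <= self.stones]

    def step(self, take):
        """Apply a move; return the winner (0 or 1) or None if the game continues."""
        self.stones -= take
        if self.stones == 0:
            return self.current  # taking the last stone wins
        self.current = 1 - self.current
        return None

def random_policy(game):
    """Placeholder for a trained RL agent or an LLM player."""
    return random.choice(game.legal_moves())

def winrate(policy, opponent, n_games=1000):
    """Fraction of games that `policy` (playing as player 0) wins."""
    wins = 0
    for _ in range(n_games):
        game = NimGame()
        players = {0: policy, 1: opponent}
        winner = None
        while winner is None:
            winner = game.step(players[game.current](game))
        wins += winner == 0
    return wins / n_games
```

A language model's score on a generated game would then be something like `winrate(llm_policy, trained_agent)`, averaged across many different generated games.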

Why it matters?

This matters because it provides a fair and flexible way to measure general reasoning in AI, ensuring that models are actually reasoning rather than repeating what they have seen before. It also helps drive the development of smarter, more adaptable AI systems.

Abstract

gg-bench is a dynamic benchmark for evaluating general reasoning in language models by generating new game environments and assessing winrates against reinforcement learning agents.