Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing

Yuguang Yue, Irakli Salia, Samuel Hunt, Chris Green, Wenzhe Shi, Jonathan J Hunt

2026-01-09

Summary

This research builds a powerful AI model that plays video games in real time by learning from recordings of human play. It's also about making such models openly accessible and understanding how to build better ones.

What's the problem?

Traditionally, building AI that plays games at a human level required a lot of complex, game-specific programming. It also wasn't clear how much data or how large a model (the AI's 'brain') was needed to reach good performance, or whether simply making things bigger led to *better* understanding of the game rather than just better mimicking of human actions.

What's the solution?

The researchers developed a method for training a video game AI by showing it tons of recordings of people playing. They released all the data, the code used to train the AI, and the AI itself so others can use and build upon their work. They then experimented with different sizes of AI models and amounts of training data to see how performance changed, and importantly, how well the AI actually *understood* the game's rules and cause-and-effect relationships.
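At its core, the training method described here (behavior cloning) is just supervised learning: given an observation of the game, predict the action the human took. The sketch below is a deliberately tiny illustration of that idea, not the paper's actual model or pipeline — the "demonstrator" rule, the logistic-regression policy, and all hyperparameters are made up for the example.

```python
import math
import random

# Toy behavior cloning: learn to imitate a "human" demonstrator who
# presses "right" (action 1) when x < 0 and "left" (action 0) otherwise.
random.seed(0)

def demonstrator(x):
    return 1 if x < 0.0 else 0

# Collect (observation, action) pairs -- the analogue of gameplay recordings.
data = [(x, demonstrator(x)) for x in (random.uniform(-1, 1) for _ in range(200))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logistic-regression policy: P(action = 1 | x) = sigmoid(w*x + b).
w, b, lr = 0.0, 0.0, 0.5

for _ in range(500):  # full-batch gradient descent on cross-entropy loss
    gw = gb = 0.0
    for x, a in data:
        p = sigmoid(w * x + b)
        gw += (p - a) * x
        gb += (p - a)
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

# How often the cloned policy matches the demonstrator on the training data.
accuracy = sum((sigmoid(w * x + b) > 0.5) == (a == 1) for x, a in data) / len(data)
```

The real system replaces the linear policy with a large neural network and the 1-D toy observations with video frames, but the objective — match the human's action distribution — is the same.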

Why it matters?

This work matters because it provides a clear, open-source recipe for creating strong game-playing AI. It also helps us understand how to scale up these models effectively: simply making a model bigger isn't always the answer, and increasing both data and model size can produce AI that doesn't just copy human behavior, but actually learns how the game world works.

Abstract

Behavior cloning is enjoying a resurgence in popularity as scaling both model and data sizes proves to provide a strong starting point for many tasks of interest. In this work, we introduce an open recipe for training a video game playing foundation model designed for inference in realtime on a consumer GPU. We release all data (8300+ hours of high quality human gameplay), training and inference code, and pretrained checkpoints under an open license. We show that our best model is capable of playing a variety of 3D video games at a level competitive with human play. We use this recipe to systematically examine the scaling laws of behavior cloning to understand how the model's performance and causal reasoning varies with model and data scale. We first show in a simple toy problem that, for some types of causal reasoning, increasing both the amount of training data and the depth of the network results in the model learning a more causal policy. We then systematically study how causality varies with the number of parameters (and depth) and training steps in scaled models of up to 1.2 billion parameters, and we find similar scaling results to what we observe in the toy problem.
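A scaling-law study like the one the abstract describes typically fits a power law, loss ≈ a·N^(−b), to measured performance at different model sizes N; taking logs makes the fit a straight line. The sketch below shows that fitting procedure on synthetic numbers — the model sizes, losses, and exponent here are invented for illustration and are not the paper's measurements.

```python
import math

# Hypothetical (model size, loss) points that follow loss = 2.0 * N**(-0.1).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [2.0 * n ** -0.1 for n in sizes]

# In log-log space the power law is linear: log L = log a - b * log N,
# so an ordinary least-squares line recovers the scaling exponent b.
xs = [math.log(n) for n in sizes]
ys = [math.log(l) for l in losses]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

b = -slope                      # recovered scaling exponent
a = math.exp(my - slope * mx)   # recovered prefactor
```

With real measurements the points scatter around the line, and the fitted exponent summarizes how quickly performance improves as the model (or dataset) grows.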