The key innovation of DIAMOND is its use of a diffusion model as the world model itself, rather than the discrete latent variables relied on by many previous approaches. This lets DIAMOND capture fine-grained visual detail that can be crucial for reinforcement learning tasks. The diffusion world model takes the agent's actions and previous frames as conditioning and generates the next frame of the environment.
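The conditioning interface described above can be sketched as follows. This is a minimal illustration, not DIAMOND's actual implementation: `predict_next_frame` and `toy_denoiser` are hypothetical names, and the simple averaging "denoiser" stands in for a trained network purely to make the example runnable.

```python
import numpy as np

def predict_next_frame(denoiser, past_frames, actions, num_steps=4, rng=None):
    """Sketch of one world-model step: start from pure noise and iteratively
    denoise it into the next frame, conditioned on recent frames and actions."""
    rng = rng or np.random.default_rng(0)
    frame = rng.standard_normal(past_frames[-1].shape)  # start from noise
    for t in reversed(range(num_steps)):
        # the denoiser sees the noisy frame, the conditioning history,
        # and the current noise level
        frame = denoiser(frame, past_frames, actions, noise_level=t / num_steps)
    return frame

# toy stand-in denoiser: ignores actions and simply pulls the noisy frame
# toward the most recent observed frame (a real model would be a trained U-Net)
def toy_denoiser(noisy, past_frames, actions, noise_level):
    return 0.5 * noisy + 0.5 * past_frames[-1]

history = [np.zeros((64, 64, 3)), np.ones((64, 64, 3))]  # two past frames
next_frame = predict_next_frame(toy_denoiser, history, actions=[0, 1])
```

The point of the sketch is the signature: unlike an autoregressive model over discrete tokens, the generator here refines a full image while attending to the action/frame history at every denoising step.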
DIAMOND was initially developed and tested on Atari games, where it achieved state-of-the-art performance. On the Atari 100k benchmark, which evaluates agents trained on only 100,000 frames of gameplay, DIAMOND achieved a mean human-normalized score of 1.46, i.e. 46% above human-level performance, setting a new record for agents trained entirely inside a world model.
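For context, the human-normalized score used by the Atari 100k benchmark rescales each game's raw score so that 0.0 corresponds to random play and 1.0 to the human baseline. The game numbers below are made up for illustration:

```python
def human_normalized_score(agent, random, human):
    """Atari human-normalized score: 0.0 = random play, 1.0 = human level."""
    return (agent - random) / (human - random)

# hypothetical game where random play scores 100, humans 1000, the agent 1414:
hns = human_normalized_score(agent=1414, random=100, human=1000)  # → 1.46
```

The benchmark's headline number is the mean of this quantity across its 26 games, which is why a single score like 1.46 can summarize performance over very differently scaled games.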
DIAMOND has also been scaled beyond Atari to the 3D game Counter-Strike: Global Offensive (CS:GO). The resulting CS:GO world model can be played interactively at about 10 frames per second on an RTX 3090 GPU. While it has limitations and failure modes, it demonstrates the potential of diffusion models to capture complex 3D environments.
Key features of DIAMOND include:
- Diffusion-based world model that captures detailed visual information
- State-of-the-art performance on Atari 100k benchmark
- Ability to model both 2D and 3D game environments
- End-to-end training of the reinforcement learning agent within the world model
- Use of EDM sampling for stable trajectories with few denoising steps
- Two-stage pipeline for modeling complex 3D environments
- Interactive playability of generated world models
- Open-source code and pre-trained models released for further research
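The EDM sampling mentioned above refers to the formulation of Karras et al., whose noise schedule spaces denoising levels in sigma^(1/rho) so that a handful of steps still covers the full noise range. A minimal sketch of that schedule, using the default parameter values from the EDM paper (the function name is mine, not DIAMOND's):

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """EDM (Karras et al.) noise schedule: n noise levels decreasing from
    sigma_max to sigma_min, spaced uniformly in sigma^(1/rho)."""
    ramp = np.linspace(0.0, 1.0, n)
    inv_rho = 1.0 / rho
    return (sigma_max**inv_rho + ramp * (sigma_min**inv_rho - sigma_max**inv_rho))**rho

# even a very short schedule spans the whole noise range, which is what
# makes stable few-step sampling possible
sigmas = karras_sigmas(3)
```

Because the schedule front-loads the large noise levels, a sampler can take very few denoising steps per frame, which is what makes the world model fast enough to roll out trajectories (and, in the CS:GO case, to play interactively).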