D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Suwhan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee
2025-10-13
Summary
This paper introduces a new way to train robot AI: first teach models sensorimotor skills in desktop environments like video games, and then transfer that knowledge to physical robots in the real world.
What's the problem?
Training robots to perform tasks is really expensive and time-consuming because it requires a lot of real-world practice, which can damage the robot or take a long time to set up. Existing attempts to use simulated environments or specific games haven't been broadly applicable or haven't shared their data publicly, limiting progress.
What's the solution?
The researchers created a system called D2E, which stands for Desktop to Embodied AI. It works in three main steps: first, they developed a toolkit (OWA) that standardizes data from different desktop interactions and compresses it heavily, making it much more efficient to store. Second, they built a model (the Generalist-IDM) that watches gameplay videos and infers which keyboard and mouse inputs produced them, using timestamp-based event prediction; this lets it automatically label huge amounts of internet gameplay footage as training data. Finally, they used this desktop-pretrained model as a starting point to teach a real robot manipulation and navigation, achieving high success rates on standard robot benchmarks.
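To make the second step concrete, here is a minimal sketch of how inverse-dynamics pseudo-labeling can work in principle. Everything here is an illustrative assumption: `InputEvent`, `toy_inverse_dynamics`, and `pseudo_label` are hypothetical stand-ins, not the paper's actual Generalist-IDM interface. The idea is simply that a model looks at timestamped frames and infers the input events that most likely caused the observed transitions, turning unlabeled video into (observation, action) training pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical event record; the real OWA format is richer.
@dataclass
class InputEvent:
    timestamp_ms: int  # when the inferred input occurred
    kind: str          # e.g. "key_down", "mouse_move"
    payload: str       # e.g. "W" or "dx=3,dy=-1"

def toy_inverse_dynamics(frames: List[Tuple[int, str]]) -> List[InputEvent]:
    """Stand-in for a learned inverse dynamics model (IDM).

    Given timestamped frames, emit the input events most likely to have
    produced the observed transitions. Here we fake it with a trivial
    rule: if the frame content changed, assume a key press caused it.
    A real IDM would be a trained network, not a rule.
    """
    events = []
    for (t0, f0), (t1, f1) in zip(frames, frames[1:]):
        if f0 != f1:  # screen changed -> infer an action happened
            events.append(InputEvent(timestamp_ms=t1, kind="key_down", payload="W"))
    return events

def pseudo_label(frames: List[Tuple[int, str]]) -> List[Tuple[str, InputEvent]]:
    """Pair each inferred event with the most recent frame before it,
    producing (observation, action) examples from unlabeled video."""
    events = toy_inverse_dynamics(frames)
    labeled = []
    for ev in events:
        obs = max((f for f in frames if f[0] < ev.timestamp_ms),
                  key=lambda f: f[0])
        labeled.append((obs[1], ev))
    return labeled

# Toy usage: four frames, one visible change at t=33 ms.
frames = [(0, "idle"), (16, "idle"), (33, "moving"), (50, "moving")]
pairs = pseudo_label(frames)
# -> one training pair: observation "idle", inferred key press at 33 ms
```

Once enough video is pseudo-labeled this way, the resulting pairs can be mixed with human demonstrations for pretraining, which is how the paper scales beyond the 259 hours of human-collected data.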
Why it matters?
This work is important because it offers a much cheaper and more scalable way to train robots. By leveraging the vast amount of data available from desktop environments, like gameplay, we can significantly reduce the need for expensive and time-consuming real-world robot training, making advanced robotics more accessible and practical. The researchers are also making all their tools and data publicly available, which will help other researchers build on their work.
Abstract
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates that desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), we achieve a total of 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA toolkit, our human-collected and pseudo-labeled datasets, and VAPT-trained models, available at https://worv-ai.github.io/d2e/
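The 152x compression figure above is the paper's result; a plausible intuition for why event-based desktop logging compresses so well can be sketched numerically. The numbers and encoding below are illustrative assumptions, not the OWA Toolkit's actual scheme: dense recording stores an action for every frame, while an event stream stores only the moments the input state changes.

```python
# Toy comparison: dense per-frame action logging vs. an event stream.
# All constants here are made up for illustration.

FPS = 60            # assumed capture rate
DURATION_S = 10     # ten seconds of gameplay

# Dense logging: one action record per frame, even when nothing changes.
dense_records = FPS * DURATION_S  # 600 records

# Event logging: only state changes, e.g. holding "W" for three seconds
# is just two events (press at 1000 ms, release at 4000 ms).
events = [(1000, "key_down", "W"), (4000, "key_up", "W")]

ratio = dense_records / len(events)
print(f"dense: {dense_records} records, events: {len(events)}, "
      f"ratio: {ratio:.0f}x")  # 300x in this toy case
```

Real desktop sessions mix sparse input with long idle stretches, so the achievable ratio depends heavily on the workload; the paper reports 152x for its standardized format across its datasets.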