This&That: Language-Gesture Controlled Video Generation for Robot Planning
Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, Jeong Joon Park
2024-07-11

Summary
This paper introduces a new method called This&That, which helps robots communicate about, plan, and carry out a wide range of tasks using video generation. It combines a video generative model with human gestures and language instructions to make robot planning clearer and more effective.
What's the problem?
Robots often struggle to understand human instructions, especially when tasks are complex or the environment is uncertain. Instructions given through language alone can be ambiguous, and robots need better ways to interpret the intended task and translate it into actions.
What's the solution?
The authors propose language-gesture conditioning, which generates videos of the intended task from a language instruction combined with simple pointing gestures (indicating which object and where it should go). This makes the task specification much less ambiguous than language alone. They also develop a behavioral cloning design in which the robot learns to act conditioned on the generated video plans, allowing it to plan and execute tasks more effectively. A rough code sketch of both pieces follows below.
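As a rough illustration of the language-gesture conditioning idea (this is not the authors' code; `GestureEncoder`, `generate_video_plan`, and all tensor shapes below are hypothetical), a video generator could be conditioned on a text embedding together with an encoding of two gesture points marking the referenced object and target location:

```python
# Minimal sketch, assuming a callable video generative model; not the authors' implementation.
import torch
import torch.nn as nn

class GestureEncoder(nn.Module):
    """Encode two 2D gesture points ("this" and "that") into a conditioning vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, this_xy, that_xy):
        # this_xy, that_xy: (B, 2) normalized image coordinates of the pointing gestures
        return self.mlp(torch.cat([this_xy, that_xy], dim=-1))

def generate_video_plan(video_model, first_frame, text_emb, gesture_emb, num_frames=16):
    """Predict a short video plan conditioned on the first frame, language, and gestures."""
    cond = torch.cat([text_emb, gesture_emb], dim=-1)   # fused language-gesture condition
    return video_model(first_frame, cond, num_frames)   # e.g. (B, T, C, H, W) frames
```

The sketch abstracts the underlying video generative model behind the `video_model` argument; the point is only that gestures enter the pipeline as a compact conditioning signal alongside language.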
Why it matters?
This research is important because it improves how robots interact with humans and perform tasks in real-world situations. By using video generation and combining language with gestures, robots can better understand what is expected of them, leading to more successful task completion and enhanced collaboration between humans and machines.
Abstract
We propose a robot learning method for communicating, planning, and executing a wide range of tasks, dubbed This&That. We achieve robot planning for general tasks by leveraging the power of video generative models trained on internet-scale data containing rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intents, and 3) translating visual planning into robot actions. We propose language-gesture conditioning to generate videos, which is both simpler and clearer than existing language-only methods, especially in complex and uncertain environments. We then suggest a behavioral cloning design that seamlessly incorporates the video plans. This&That demonstrates state-of-the-art effectiveness in addressing the above three challenges, and justifies the use of video generation as an intermediate representation for generalizable task planning and execution. Project website: https://cfeng16.github.io/this-and-that/.
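To make the behavioral-cloning component concrete, here is a minimal sketch (an assumption for illustration, not the paper's architecture; `VideoConditionedPolicy`, the feature dimensions, and the 7-DoF action are hypothetical) of a policy that predicts actions conditioned on features of the generated video plan:

```python
# Minimal sketch of video-conditioned behavioral cloning; not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=512, plan_dim=512, action_dim=7):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, 256)    # current camera observation features
        self.plan_proj = nn.Linear(plan_dim, 256)  # pooled features of the video-plan frames
        self.head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def forward(self, obs_feat, plan_feat):
        z = torch.cat([self.obs_proj(obs_feat), self.plan_proj(plan_feat)], dim=-1)
        return self.head(z)  # predicted action, e.g. end-effector delta pose + gripper

# Behavioral cloning: regress demonstrated actions given observation + video-plan features.
policy = VideoConditionedPolicy()
obs_feat, plan_feat = torch.randn(8, 512), torch.randn(8, 512)
demo_actions = torch.randn(8, 7)
loss = F.mse_loss(policy(obs_feat, plan_feat), demo_actions)
loss.backward()
```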