Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution
Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai
2025-11-26
Summary
This paper introduces a new challenge for AI agents that involves understanding instructions, figuring out where things are in a 3D world, and then planning the most efficient way to complete a task, much like how a person would organize chores around the house.
What's the problem?
Current AI datasets for task scheduling are too simple. They don't require the AI to think about things like how to do multiple tasks *at the same time* to save time, or how to really understand the 3D layout of a room to plan movements. Basically, they miss important aspects of real-world planning that come from fields like operations research and spatial reasoning.
What's the solution?
The researchers created a new dataset called ORS3D-60K, which includes 60,000 composite tasks set in 4,000 realistic 3D scenes. They also developed an AI model called GRANT, which uses a simple scheduling token mechanism to figure out an efficient order for subtasks and then translate that plan into grounded actions in the 3D world. GRANT is designed to understand language, locate objects, and minimize total completion time.
Why does it matter?
This work is important because it pushes AI closer to being able to help us with real-world tasks. By forcing AI to deal with the complexities of 3D spaces and efficient planning, it can become more useful in areas like robotics, home assistance, and even managing complex operations in warehouses or factories.
Abstract
Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
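The parallel-subtask idea (cleaning the sink while the microwave operates) can be made concrete with a toy scheduling model. This is a sketch of the underlying operations-research principle, not the paper's GRANT method; the task names, durations, and the two-phase task model are invented for illustration:

```python
# Toy model: each subtask has an "active" phase that occupies the agent
# and an optional "passive" phase (e.g., a microwave running) that
# proceeds unattended. Running active phases back-to-back and ordering
# tasks by decreasing passive time shortens total completion time.

from typing import List, Tuple

Task = Tuple[str, float, float]  # (name, active minutes, passive minutes)

def makespan(tasks: List[Task]) -> float:
    """Completion time when active phases run back-to-back in the given order."""
    t, finish = 0.0, 0.0
    for _, active, passive in tasks:
        t += active                        # agent is busy during the active phase
        finish = max(finish, t + passive)  # passive phase runs unattended
    return finish

def schedule(tasks: List[Task]) -> List[Task]:
    """Greedy order: start the longest passive phase first."""
    return sorted(tasks, key=lambda task: -task[2])

tasks = [
    ("microwave food", 0.5, 3.0),
    ("clean sink",     2.0, 0.0),
    ("run dishwasher", 1.0, 4.0),
]

naive = sum(a + p for _, a, p in tasks)  # fully sequential: 10.5 minutes
best = makespan(schedule(tasks))         # overlapped: 5.0 minutes
```

For tasks with "delivery-time"-style passive tails like this, the longest-tail-first order is the classic single-machine optimum; the example drops from 10.5 minutes sequentially to 5.0 minutes with overlap.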