MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
2025-12-19
Summary
This paper introduces a new way for household robots to 'understand' the world around them, capturing both where things are and what they *do*. It also provides a large dataset and a new robot 'brain' (a model) that helps robots plan and complete tasks.
What's the problem?
Robots currently struggle with everyday household tasks because they lack a good way to represent everything they need to know about a scene. Existing methods either treat objects as mere locations, ignore how objects change over time, or overlook the specific information needed for the task at hand. In short, robots need a more complete and dynamic understanding of their surroundings to be truly helpful.
What's the solution?
The researchers created something called MomaGraph, which is a detailed 'map' of a scene that combines where objects are, what they're used for, and which parts of them can be interacted with. To make this work, they also built a huge dataset called MomaGraph-Scenes, filled with labeled scenes, and a testing suite called MomaGraph-Bench to measure how well robots can understand these scenes. Finally, they developed MomaGraph-R1, a powerful AI model that can predict these scene graphs and plan tasks based on them.
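To make the idea of a unified, state-aware scene graph concrete, here is a minimal sketch of such a structure in Python. This is not the paper's actual data format: all class names, fields, and the fridge/mug example are illustrative assumptions. It shows the three ingredients the summary mentions: object nodes with mutable states, spatial/functional relation edges, and part-level interactive elements.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a unified scene graph in the spirit of MomaGraph:
# object nodes carry mutable states, edges carry spatial or functional
# relations, and part-level interactive elements (handles, buttons) hang
# off their parent objects. All names here are illustrative, not the
# paper's actual schema.

@dataclass
class Part:
    name: str        # e.g. "fridge_handle"
    action: str      # e.g. "pull", "press"

@dataclass
class ObjectNode:
    name: str
    state: dict = field(default_factory=dict)   # e.g. {"open": False}
    parts: list = field(default_factory=list)   # interactive elements

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # name -> ObjectNode
    edges: list = field(default_factory=list)   # (src, relation, dst) triples

    def add_object(self, name, **state):
        self.nodes[name] = ObjectNode(name, state=dict(state))
        return self.nodes[name]

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def update_state(self, name, **state):
        # Temporal update: object states change as the task unfolds.
        self.nodes[name].state.update(state)

# Usage: a closed fridge with a pullable handle, and a mug inside it.
g = SceneGraph()
fridge = g.add_object("fridge", open=False)
fridge.parts.append(Part("fridge_handle", "pull"))
g.add_object("mug", clean=True)
g.relate("mug", "inside", "fridge")      # spatial relation
g.relate("fridge", "stores", "mug")      # functional relation
g.update_state("fridge", open=True)      # state change after interaction
```

Keeping states mutable on the nodes (rather than re-building the graph each frame) is one simple way to model the "temporal updates" that the paper argues static scene snapshots miss.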
Why it matters?
This work is important because it moves robots closer to being truly useful assistants in our homes. By giving robots a richer understanding of their environment and the ability to plan tasks effectively, they can handle more complex situations and help with a wider range of chores. The new dataset and model also provide a foundation for future research in this area, allowing other scientists to build even more capable robots.
Abstract
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
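The Graph-then-Plan framework described in the abstract can be sketched as a two-stage control flow: first predict a task-oriented scene graph from an observation, then derive an action plan from that graph. The sketch below is a toy illustration under assumed names; the predictor is a hard-coded stub standing in for MomaGraph-R1, and the planning rule (open a closed container before grasping what is inside it) is a hypothetical example, not the model's actual reasoning.

```python
# Illustrative Graph-then-Plan control flow. The graph predictor is a stub
# standing in for a vision-language model such as MomaGraph-R1; the graph
# schema and planning logic are hypothetical.

def predict_scene_graph(image, task):
    # Stub for the VLM call: returns object nodes with states and
    # interactive parts, plus relation triples.
    return {
        "nodes": {
            "cabinet": {"state": {"open": False},
                        "parts": [{"name": "cabinet_handle", "action": "pull"}]},
            "cup": {"state": {}, "parts": []},
        },
        "edges": [("cup", "inside", "cabinet")],
    }

def graph_then_plan(image, task, target):
    graph = predict_scene_graph(image, task)   # stage 1: graph prediction
    plan = []
    # Stage 2: plan from the graph. If the target sits inside a closed
    # container, actuate the container's interactive part first.
    for src, rel, dst in graph["edges"]:
        if src == target and rel == "inside":
            container = graph["nodes"][dst]
            if not container["state"].get("open", True):
                for part in container["parts"]:
                    plan.append((part["action"], part["name"]))
            plan.append(("grasp", target))
    return plan

plan = graph_then_plan(None, "fetch the cup", "cup")
# → [("pull", "cabinet_handle"), ("grasp", "cup")]
```

The point of the two-stage split is that the intermediate graph is both compact and task-oriented: the planner only needs the relations and interactive parts relevant to the current goal, which is the representational gap the paper targets.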