
RynnEC: Bringing MLLMs into Embodied World

Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao

2025-08-21


Summary

This paper introduces RynnEC, a video-based multimodal model that helps robots understand and interact with the physical world. It's like giving a robot a good pair of eyes plus a brain that can zoom in on specific objects in what it sees and reason about them, including where they sit in 3D space.

What's the problem?

Robots and AI systems often struggle to understand videos at a fine-grained level, especially when they need to focus on specific objects or regions and reason about where those objects are in space to perform tasks. Creating enough training data for these systems, particularly annotated data about real 3D environments, is also hard because it's expensive and time-consuming.

What's the solution?

The researchers built a new system called RynnEC, a video understanding model. On top of a general vision-language model, it adds a region encoder and a mask decoder, so it can both take in a specific region of a video (for example, one object marked with a mask) and point back at objects by segmenting them; a rough sketch of this idea appears below. To overcome the data problem, they also developed a pipeline that creates training data from egocentric videos, taken from a robot's own point of view, and they built a new benchmark called RynnEC-Bench to check how well these systems understand the world around them.
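To make the region idea concrete, here is a minimal, hypothetical sketch of a mask-pooling region encoder in PyTorch. The class name, dimensions, and shapes (RegionEncoder, vision_dim, llm_dim) are illustrative assumptions, not the actual RynnEC implementation; the point is only that features inside an object's mask get pooled into tokens the language model can attend to.

```python
# Hypothetical sketch of a mask-pooling region encoder: all names, dimensions,
# and shapes are illustrative assumptions, not the actual RynnEC implementation.
import torch
import torch.nn as nn


class RegionEncoder(nn.Module):
    """Pools video-frame features inside an object mask into region tokens."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Project pooled visual features into the language model's embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, H, W, vision_dim) patch features from a video encoder
        # masks:       (T, H, W) binary masks marking the queried object per frame
        masks = masks.unsqueeze(-1).float()                  # (T, H, W, 1)
        pooled = (frame_feats * masks).sum(dim=(1, 2))       # (T, vision_dim)
        area = masks.sum(dim=(1, 2)).clamp(min=1.0)          # (T, 1), avoid divide-by-zero
        region_tokens = pooled / area                        # mean over masked patches
        return self.proj(region_tokens)                      # (T, llm_dim)


# Example: 8 frames of 16x16 patch features, with a rough mask for one object.
feats = torch.randn(8, 16, 16, 1024)
masks = torch.zeros(8, 16, 16)
masks[:, 4:10, 5:12] = 1
tokens = RegionEncoder()(feats, masks)   # 8 region tokens the LLM can attend to
print(tokens.shape)                      # torch.Size([8, 4096])
```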

Why it matters?

This work is important because it moves us closer to creating robots that can truly understand and interact with the physical world in a nuanced way. By allowing AI to focus on specific details in videos and by making it easier to train these systems, RynnEC could lead to more capable and versatile robots that can perform a wider range of tasks more effectively.

Abstract

We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
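For the segmentation direction mentioned in the abstract, one common design (not necessarily RynnEC's exact decoder) has the language model emit a special token whose hidden state is projected and compared against the video's patch features to produce per-frame masks. The sketch below is a hypothetical illustration under that assumption; MaskDecoder, the dimensions, and the shapes are made up for clarity.

```python
# Hypothetical sketch of the segmentation direction: a special token's hidden state
# is projected and matched against patch features to score each patch. This mirrors
# common "token-prompted" mask decoders; it is not the actual RynnEC code.
import torch
import torch.nn as nn


class MaskDecoder(nn.Module):
    """Turns one LLM hidden state into a soft mask over each frame's patches."""

    def __init__(self, llm_dim: int = 4096, vision_dim: int = 1024):
        super().__init__()
        self.query_proj = nn.Linear(llm_dim, vision_dim)

    def forward(self, seg_hidden: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # seg_hidden:  (llm_dim,) hidden state of the LLM's segmentation token
        # frame_feats: (T, H, W, vision_dim) patch features of the video
        query = self.query_proj(seg_hidden)                  # (vision_dim,)
        logits = torch.einsum("thwc,c->thw", frame_feats, query)
        return logits.sigmoid()                              # (T, H, W) per-patch mask scores


decoder = MaskDecoder()
soft_masks = decoder(torch.randn(4096), torch.randn(8, 16, 16, 1024))
print(soft_masks.shape)                  # torch.Size([8, 16, 16])
```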