SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi

2025-02-19

Summary

This paper introduces SoFar, a new system that helps robots understand and manipulate objects better by using everyday language to describe how objects should be oriented. It's like teaching robots to understand directions the way humans do, making them smarter at handling things in the real world.

What's the problem?

Current AI systems for robots are good at figuring out where objects are, but they struggle with understanding how objects should be positioned or oriented. This makes it hard for robots to do tasks that require precise handling of objects, like plugging in a USB drive or using a knife correctly.

What's the solution?

The researchers created SoFar, which uses natural language to describe object orientations. They built a large dataset called OrienText300K, containing 3D models of objects labeled with descriptions of how they should be oriented, like the 'plug-in direction' of a USB drive. They then integrated these semantic orientations into a vision-language model (VLM) system, so robots can plan actions that satisfy both positional constraints (where an object should go) and orientational constraints (how it should be turned).
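To make the idea concrete, here is a minimal sketch of what a language-labeled orientation record might look like and how a robot could use it. The record layout and helper function are illustrative assumptions, not the paper's actual OrienText300K schema or SoFar's implementation; the rotation uses the standard Rodrigues alignment formula.

```python
# Hypothetical annotation: a 3D model paired with a natural-language
# "semantic orientation" and a unit direction vector in the object's frame.
# (Illustrative only -- not the real OrienText300K schema.)
annotation = {
    "object": "usb_drive",
    "semantic_orientation": "plug-in direction",
    "direction": (0.0, 0.0, 1.0),
}

def rotation_aligning(a, b):
    """Rodrigues' formula: 3x3 matrix rotating unit vector a onto unit vector b."""
    ax, ay, az = a
    bx, by, bz = b
    # cross product v = a x b and cosine c = a . b
    vx, vy, vz = (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)
    c = ax * bx + ay * by + az * bz
    if c < -0.999999:
        # Anti-parallel vectors: a 180-degree rotation axis must be chosen explicitly.
        raise ValueError("pick an explicit orthogonal axis for the 180-degree case")
    k = 1.0 / (1.0 + c)
    return [
        [1 - k * (vy * vy + vz * vz), -vz + k * vx * vy,  vy + k * vx * vz],
        [ vz + k * vx * vy, 1 - k * (vx * vx + vz * vz), -vx + k * vy * vz],
        [-vy + k * vx * vz,  vx + k * vy * vz, 1 - k * (vx * vx + vy * vy)],
    ]

# Goal: align the USB's "plug-in" direction with a port facing along -y.
R = rotation_aligning(annotation["direction"], (0.0, -1.0, 0.0))
```

The point of the sketch is the interface: once language ("plug-in direction") is grounded to a geometric direction, orienting the object reduces to ordinary rotation math that any manipulation planner can consume.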

Why it matters?

This matters because it could make robots much more useful in everyday life. By understanding object orientations the way humans do, robots could perform more complex tasks in homes, factories, or anywhere else they need to handle objects carefully. The system showed big improvements in tests, reaching 48.7% accuracy on the Open6DOR benchmark and 74.9% on SIMPLER, which suggests robots using SoFar could be much better at manipulating objects in the real world, potentially leading to more capable and helpful robots across industries and daily life.

Abstract

Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.