
ObjectReact: Learning Object-Relative Control for Visual Navigation

Sourav Garg, Dustin Craggs, Vineeth Bhat, Lachlan Mares, Stefan Podgorski, Madhava Krishna, Feras Dayoub, Ian Reid

2025-09-12

Summary

This paper introduces a new way for robots to navigate using only a single camera and a lightweight topological map, understanding the environment through the objects it contains rather than through raw images.

What's the problem?

Traditional methods for camera-based robot navigation rely on comparing what the robot currently sees to a goal (or subgoal) image. This "image-relative" approach is brittle because what a camera sees changes drastically with the robot's position and how it is built: the robot gets confused if it views the same place from a slightly different angle or with a different camera setup. Relying solely on images therefore makes it hard for a robot to generalize its learning to new routes, new environments, or even different robots.

What's the solution?

The researchers developed a system called 'ObjectReact' that helps robots navigate by focusing on *objects* in the environment instead of the images themselves. They built a special topometric map, a "relative" 3D scene graph, that represents the environment as a network of objects and the spatial relationships between them. The robot plans a global path over these objects and summarizes the plan as a 'WayObject Costmap', which tells a local controller how desirable it is to move toward each object it can see. Because the controller is conditioned on this costmap rather than raw RGB images, it is much less sensitive to changes in viewpoint or robot design. The robot learns to react to objects, hence the name 'ObjectReact', and no longer needs to constantly compare its current view against stored images.
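To make the object-level planning idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation): objects are nodes in a graph, edges connect objects observed together during mapping, and a shortest-path pass from the goal object yields a cost-to-go per object. Per-object costs like these are the kind of signal a costmap-conditioned controller could consume instead of a goal image. All object names and edge weights below are invented for illustration.

```python
import heapq

# Hypothetical "object graph": nodes are mapped objects, edges connect
# objects that were co-visible during mapping, weights approximate the
# relative effort of moving between them. Values are illustrative only.
graph = {
    "door": {"chair": 1.0, "table": 2.5},
    "chair": {"door": 1.0, "plant": 1.2},
    "table": {"door": 2.5, "plant": 0.8},
    "plant": {"chair": 1.2, "table": 0.8, "sofa": 1.5},
    "sofa": {"plant": 1.5},
}

def object_path_costs(graph, goal):
    """Dijkstra from the goal: returns a cost-to-go for every object.

    These per-object costs could then be projected into the robot's
    current view to form a WayObject-Costmap-style controller input.
    """
    costs = {goal: 0.0}
    frontier = [(0.0, goal)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor))
    return costs

costs = object_path_costs(graph, goal="sofa")
```

With the toy weights above, every object gets a cost toward the goal (e.g. the 'plant' is cheaper to head for than the 'door'), so the controller only needs to steer toward low-cost objects rather than match images.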

Why does it matter?

This work is important because it makes robot navigation more robust and adaptable. By focusing on objects, the robot can follow new routes without being specifically trained on them, and its skills transfer more easily across different robots and environments. It also shows that robots can navigate effectively without detailed 3D maps, which are often difficult and expensive to build, and that a policy trained only in simulation can work well in real-world indoor environments.

Abstract

Visual navigation using only a single camera and a topological map has recently become an appealing alternative to methods that require additional sensors and 3D maps. This is typically achieved through an "image-relative" approach to estimating control from a given pair of current observation and subgoal image. However, image-level representations of the world have limitations because images are strictly tied to the agent's pose and embodiment. In contrast, objects, being a property of the map, offer an embodiment- and trajectory-invariant world representation. In this work, we present a new paradigm of learning "object-relative" control that exhibits several desirable characteristics: a) new routes can be traversed without strictly requiring to imitate prior experience, b) the control prediction problem can be decoupled from solving the image matching problem, and c) high invariance can be achieved in cross-embodiment deployment for variations across both training-testing and mapping-execution settings. We propose a topometric map representation in the form of a "relative" 3D scene graph, which is used to obtain more informative object-level global path planning costs. We train a local controller, dubbed "ObjectReact", conditioned directly on a high-level "WayObject Costmap" representation that eliminates the need for an explicit RGB input. We demonstrate the advantages of learning object-relative control over its image-relative counterpart across sensor height variations and multiple navigation tasks that challenge the underlying spatial understanding capability, e.g., navigating a map trajectory in the reverse direction. We further show that our sim-only policy is able to generalize well to real-world indoor environments. Code and supplementary material are accessible via project page: https://object-react.github.io/