Novel Object 6D Pose Estimation with a Single Reference View
Jian Liu, Wei Sun, Kai Zeng, Jin Zheng, Hui Yang, Lin Wang, Hossein Rahmani, Ajmal Mian
2025-03-11
Summary
This paper introduces SinRef-6D, a method that lets robots and AR apps figure out an object’s exact position and rotation in 3D space using just one reference photo of the object, instead of needing multiple angles or complex 3D models.
What's the problem?
Existing methods for estimating an object’s 3D position and orientation (its 6D pose) rely on detailed CAD models or many photos from different angles, which are hard to acquire and slow to process.
What's the solution?
SinRef-6D compares a single reference photo of the object against the new view and refines its pose guess step by step, iteratively aligning corresponding points between the two. Lightweight state space models (SSMs) capture long-range cues like shapes and colors with low computational cost, making the method both fast and accurate.
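To make the step-by-step idea concrete, here is a minimal sketch of iterative rigid point-set alignment using the classic Kabsch (SVD) solution. This is an illustrative stand-in, not the paper's actual SSM-based pipeline: the function names and the use of known point correspondences are assumptions for the example.

```python
import numpy as np

def best_fit_transform(src, dst):
    """Kabsch algorithm: rigid (R, t) minimizing ||R @ src_i + t - dst_i||.
    Illustrative helper, not from the SinRef-6D codebase."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                   # correct improper reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src
    return R, t

def iterative_align(src, dst, iters=5):
    """Accumulate the pose over several refinement steps, mirroring the
    iterative point-wise alignment idea (correspondences assumed known)."""
    R_acc, t_acc = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        R, t = best_fit_transform(cur, dst)
        cur = cur @ R.T + t                    # apply incremental update
        R_acc = R @ R_acc                      # compose rotations
        t_acc = R @ t_acc + t                  # compose translations
    return R_acc, t_acc
```

In the real method, each iteration would also re-estimate correspondences from RGB and point features; here the correspondences are fixed, so a single iteration already recovers the exact pose.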
Why does it matter?
This makes robots and AR tools more practical for real-world use, like helping robots pick up objects or letting AR apps place virtual items accurately without needing tons of data or expensive 3D scans.
Abstract
Existing novel object 6D pose estimation methods typically rely on CAD models or dense reference views, which are both difficult to acquire. Using only a single reference view is more scalable, but challenging due to large pose discrepancies and limited geometric and spatial information. To address these issues, we propose a Single-Reference-based novel object 6D (SinRef-6D) pose estimation method. Our key idea is to iteratively establish point-wise alignment in the camera coordinate system based on state space models (SSMs). Specifically, iterative camera-space point-wise alignment can effectively handle large pose discrepancies, while our proposed RGB and Points SSMs can capture long-range dependencies and spatial information from a single view, offering linear complexity and superior spatial modeling capability. Once pre-trained on synthetic data, SinRef-6D can estimate the 6D pose of a novel object using only a single reference view, without requiring retraining or a CAD model. Extensive experiments on six popular datasets and real-world robotic scenes demonstrate that we achieve on-par performance with CAD-based and dense reference view-based methods, despite operating in the more challenging single reference setting. Code will be released at https://github.com/CNJianLiu/SinRef-6D.