Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views
Xiangdong Zhang, Shaofeng Zhang, Junchi Yan
2025-09-03
Summary
This paper introduces a new method, Point-PQAE, for teaching computers to understand 3D data represented as point clouds without needing labeled examples. It focuses on improving how these systems learn from the structure of the 3D data itself.
What's the problem?
Current methods for self-supervised learning with point clouds often involve hiding parts of a 3D shape and asking the computer to guess what’s missing. While effective, this approach doesn’t provide enough variety in the learning process. The researchers realized that learning from two different perspectives of the same object could be more challenging and ultimately lead to a better understanding of the 3D structure.
What's the solution?
Point-PQAE tackles this by creating two separate 'views' of a point cloud and then training the computer to reconstruct one view from the other. This 'cross-reconstruction' is harder than reconstructing a point cloud from itself. To make this work, the authors developed a 'cropping' mechanism that carves different views out of the original point cloud, along with a novel positional encoding that represents the 3D relative position between the two decoupled views. This harder task forces the computer to learn more robust features.
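The two-view setup can be illustrated with a minimal sketch. This is not the paper's implementation; the function names (`crop_view`, `decoupled_views`), the nearest-to-anchor cropping rule, and the use of a simple centroid offset as the relative-position signal are all illustrative assumptions, but they capture the idea of producing two decoupled local views plus the 3D relative position a cross-reconstruction model would condition on.

```python
import numpy as np

def crop_view(points, center, ratio=0.5):
    # Illustrative crop: keep the `ratio` fraction of points nearest
    # to `center` (a stand-in for the paper's view-cropping mechanism).
    k = max(1, int(len(points) * ratio))
    dists = np.linalg.norm(points - center, axis=1)
    idx = np.argsort(dists)[:k]
    return points[idx]

def decoupled_views(points, ratio=0.5, seed=0):
    # Generate two views by cropping around two random anchor points,
    # then shift each view into its own local frame (decoupling them).
    rng = np.random.default_rng(seed)
    anchors = points[rng.choice(len(points), size=2, replace=False)]
    views, centroids = [], []
    for a in anchors:
        v = crop_view(points, a, ratio)
        c = v.mean(axis=0)
        views.append(v - c)   # each view centered on its own centroid
        centroids.append(c)
    # 3D offset between the two local frames: the kind of relative
    # position a cross-view positional encoding would be built from.
    rel_pos = centroids[0] - centroids[1]
    return views[0], views[1], rel_pos
```

Because each view is expressed in its own local coordinate frame, reconstructing one from the other is impossible without the relative position, which is why the cross-view positional encoding is essential to the paradigm.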
Why it matters?
This research is important because it improves the performance of self-supervised learning for 3D data. By achieving better results than previous methods, Point-PQAE brings us closer to building AI systems that can understand and interact with the 3D world without relying on large amounts of manually labeled data, which is expensive and time-consuming to create.
Abstract
Point cloud learning, especially in a self-supervised way without manual labels, has gained growing attention in both the vision and learning communities due to its potential utility in a wide range of applications. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. We recognize that a two-view pre-training paradigm inherently introduces greater diversity and variance, and may thus enable more challenging and informative pre-training. Inspired by this, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we are the first to develop a crop mechanism for point cloud view generation, and we further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. Cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to surpass previous single-modal self-reconstruction methods in 3D self-supervised learning. Specifically, it outperforms the self-reconstruction baseline (Point-MAE) by 6.5%, 7.0%, and 6.7% on the three variants of ScanObjectNN with the Mlp-Linear evaluation protocol. The code is available at https://github.com/aHapBean/Point-PQAE.