LitePT: Lighter Yet Stronger Point Transformer

Yuanwen Yue, Damien Robert, Jianyuan Wang, Sunghwan Hong, Jan Dirk Wegner, Christian Rupprecht, Konrad Schindler

2025-12-16

Summary

This paper investigates how to best combine convolutional and attention-based building blocks in neural networks designed to process 3D point cloud data, ultimately proposing a more efficient and effective network architecture.

What's the problem?

Current 3D point cloud processing networks often haphazardly mix convolutional layers and attention mechanisms without a clear understanding of when each is most useful. Attention can be computationally expensive, and simply adding it doesn't always improve performance. The core issue is figuring out the optimal way to structure these networks to leverage the strengths of both approaches while minimizing drawbacks like computational cost and memory usage.

What's the solution?

The researchers found that convolutional layers are best for processing the raw, detailed geometry of point clouds in the early stages of the network, while attention mechanisms excel at capturing broader context and meaning in the later, deeper layers. Based on this, they created a new network called LitePT that uses convolutions initially and then transitions to attention. To prevent losing important spatial information when reducing the number of convolutional layers, they also developed a new method called PointROPE to encode the 3D position of points without needing extra training.
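The paper does not give implementation details of PointROPE here, but the idea of a training-free 3D positional encoding can be sketched by extending standard rotary position embeddings (RoPE) from 1D token indices to continuous 3D coordinates. The sketch below is an illustrative assumption, not the paper's actual method: it splits the feature channels into three groups and rotates each group by one spatial axis, with no learned parameters.

```python
import numpy as np

def rope_1d(feats, coords, base=10000.0):
    """Rotate feature pairs by angles proportional to a continuous 1D
    coordinate (standard rotary encoding, applied to raw positions)."""
    d = feats.shape[-1]
    assert d % 2 == 0
    # one frequency per feature pair, geometrically spaced as in RoPE
    freqs = base ** (-np.arange(d // 2) / (d // 2))   # (d/2,)
    angles = coords[..., None] * freqs                # (N, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = feats[..., 0::2], feats[..., 1::2]
    out = np.empty_like(feats)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def point_rope_3d(feats, xyz):
    """Hypothetical 3D extension: split channels into three equal groups
    and rotate each group by one axis (x, y, z). Training-free: there
    are no parameters to learn, only the fixed frequency schedule."""
    d = feats.shape[-1]
    g = d // 3
    assert g % 2 == 0, "each axis group needs an even channel count"
    parts = [rope_1d(feats[..., i * g:(i + 1) * g], xyz[..., i])
             for i in range(3)]
    return np.concatenate(parts, axis=-1)
```

Because each step is a pure rotation of channel pairs, the encoding preserves feature norms and injects position only into the relative phase between query and key features, which is what makes rotary schemes attractive as drop-in, parameter-free positional encodings.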

Why it matters?

LitePT is a significant improvement over existing methods like Point Transformer V3. It matches or outperforms that model on a range of 3D point cloud tasks while being far more efficient: it has about 3.6× fewer parameters, runs roughly twice as fast, and uses about half the memory. This makes it more practical for real-world applications where computational resources are limited, and it provides a clearer design principle for building future 3D point cloud networks.

Abstract

Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has 3.6× fewer parameters, runs 2× faster, and uses 2× less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.
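The "convolution early, attention late" principle can be made concrete with a rough back-of-the-envelope cost comparison. The numbers and cost models below are illustrative assumptions, not figures from the paper: they take a k-NN point convolution to cost about N·k·C² FLOPs per stage and naive global attention to cost about N²·C, where N is the number of points and C the channel width at that stage.

```python
def conv_cost(n_points, channels, k=16):
    """Approximate FLOPs of a k-NN point convolution at one stage."""
    return n_points * k * channels ** 2

def attn_cost(n_points, channels):
    """Approximate FLOPs of naive global self-attention at one stage."""
    return n_points ** 2 * channels

# Hypothetical encoder schedule: resolution shrinks, width grows.
stages = [(100_000, 32), (25_000, 64), (6_000, 128), (1_500, 256)]
ratios = [attn_cost(n, c) / conv_cost(n, c) for n, c in stages]
```

With these (made-up) stage sizes, attention is orders of magnitude more expensive than convolution at the high-resolution early stage but becomes cheaper than convolution at the deepest, low-resolution stage, which is exactly the regime split the abstract describes.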