Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang

2025-11-03

Summary

This paper focuses on improving how well large vision-language models, AI systems that understand both images and text, can grasp spatial relationships, such as where objects are located and how they relate to one another.

What's the problem?

Currently, teaching these models spatial understanding requires large amounts of labeled data, specialized tools, or tightly controlled environments. This kind of supervision is expensive to obtain and hard to scale to real-world situations, so in practice it is difficult to get these models to 'understand' space without a lot of human help.

What's the solution?

The researchers developed a new method called Spatial-SSRL. It trains models using 'self-supervision,' meaning the models learn from the images themselves without humans labeling everything. The method automatically creates five pretext tasks from ordinary RGB or RGB-D images: reordering shuffled image patches, recognizing flipped patches, filling in cropped-out patches, ordering regions by depth, and predicting the relative 3D positions of points. Because each task comes with a ground-truth answer that can be checked automatically, the model's answers can be scored to produce a verifiable 'reward' signal for reinforcement learning, allowing the model to improve its spatial reasoning without any human annotation.
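To make the idea of a self-verifiable pretext task concrete, here is a minimal sketch (not the authors' code) of the shuffled-patch-reordering task: an image is cut into a grid of patches, shuffled with a known permutation, and the model is rewarded only if it recovers the original order. The function names and the binary reward rule are illustrative assumptions; only the task design comes from the paper.

```python
import numpy as np

def make_patch_reordering_task(image, grid=2, rng=None):
    """Split an image into a grid x grid set of patches and shuffle them.
    The shuffling permutation is the ground-truth label, obtained for free
    with no human or model annotation."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    # Cut the image into patches in row-major order.
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]
    perm = rng.permutation(len(patches))
    # Reassemble the image with patches placed in shuffled order.
    shuffled = np.zeros_like(image[:ph * grid, :pw * grid])
    for dst, src in enumerate(perm):
        r, c = divmod(dst, grid)
        shuffled[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = patches[src]
    return shuffled, perm

def verifiable_reward(predicted_perm, true_perm):
    """Binary reward: 1.0 only if the model's predicted ordering exactly
    matches the ground-truth permutation, 0.0 otherwise."""
    return 1.0 if np.array_equal(predicted_perm, true_perm) else 0.0
```

Because the reward is computed by a trivial comparison against a known permutation, it scales to any number of unlabeled images, which is the key property that makes this kind of reinforcement learning cheap.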

Why it matters?

This work is important because it shows that a model's ability to understand spatial relationships can be significantly improved without relying on expensive and limited human-provided labels. It opens the door to building more capable AI systems that can better interact with and understand the physical world, and it does so in a way that scales to much larger datasets and more complex scenarios.

Abstract

Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.