PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li, Jun Xiao, Lei Zhang, Beng Chin Ooi

2025-10-28

Summary

This paper introduces PixelRefer, a new system that lets AI understand images and videos in much finer detail, reasoning about specific objects within a scene rather than just the picture as a whole.

What's the problem?

Current AI models that understand both images and language are good at getting the general idea of a picture, but they struggle to reason about individual objects within it. They treat everything as one big scene, missing important fine-grained details.

What's the solution?

The researchers created PixelRefer, which first identifies and isolates the objects a user points to in an image or video. A component called the Scale-Adaptive Object Tokenizer (SAOT) turns each user-specified region into a compact, semantically rich 'summary' of that object. They also observed that the model relies on the full image mainly in its early layers, so they built a faster variant, PixelRefer-Lite, which pre-fuses global scene context into these object summaries and then works on the object tokens alone. To teach the model, they curated PixelRefer-2.2M, a large dataset of images paired with object-centric instructions.
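To make the two key ideas concrete, here is a minimal NumPy sketch of (a) pooling a user-specified region into one compact object token and (b) pre-fusing global context into that token before discarding the scene tokens. This is an illustrative simplification: the paper's SAOT and Object-Centric Infusion module are learned neural components, whereas this sketch uses mean pooling and a single untrained dot-product attention step; the function names are ours, not the paper's.

```python
import numpy as np

def object_token(patch_feats, region_mask):
    """Compress the patches inside a user-specified region into one token.

    patch_feats: (num_patches, dim) visual features for the whole image.
    region_mask: (num_patches,) boolean mask marking the referred region.
    Mean pooling stands in for the learned Scale-Adaptive Object Tokenizer.
    """
    return patch_feats[region_mask].mean(axis=0)

def object_centric_infusion(obj_tok, patch_feats):
    """Pre-fuse global scene context into the object token.

    One cross-attention step: the object token attends over all patches,
    so the LLM can later operate on object tokens alone ("object-only"
    inference) without re-reading the full image at every layer.
    """
    scores = patch_feats @ obj_tok / np.sqrt(obj_tok.size)
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    context = weights @ patch_feats           # global context summary
    return obj_tok + context                  # context-enriched object token

# Usage: 16 image patches with 8-dim features; the user refers to patches 0-3.
rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 8))
mask = np.zeros(16, dtype=bool)
mask[:4] = True
tok = object_token(patches, mask)             # compact object summary, shape (8,)
fused = object_centric_infusion(tok, patches) # same shape, now scene-aware
```

The efficiency win in PixelRefer-Lite comes from the last step: after infusion, only the handful of object tokens (not the hundreds of patch tokens) are fed to the language model.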

Why it matters?

This work is important because it allows AI to understand images and videos more like humans do, by focusing on specific objects and their relationships. This could lead to improvements in areas like robotics, image editing, and visual question answering, while also making the process more efficient and less computationally expensive.

Abstract

Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.