
Towards Pixel-Level VLM Perception via Simple Points Prediction

Tianhui Song, Haoyu Lu, Hao Yang, Lin Sui, Haoning Wu, Zaida Zhou, Zhiqi Huang, Yiping Bao, Y. Charles, Xinyu Zhou, Limin Wang

2026-01-28


Summary

This paper introduces a new method called SimpleSeg that allows AI models that combine language and vision (multimodal large language models) to understand images at a very detailed level, pinpointing exactly where objects are within a picture.

What's the problem?

Traditionally, getting AI to understand *where* things are in an image required complicated setups and specialized parts added to the AI. Existing methods for image segmentation, which is the process of identifying objects and their boundaries, often need extra components beyond the core AI model itself, making them complex and less efficient.

What's the solution?

SimpleSeg takes a surprisingly straightforward approach. Instead of adding complex parts, it teaches the AI to simply *draw* around objects by predicting a series of points that outline their shapes. It does this by treating the task like a text generation problem: the AI predicts the coordinates of these points as a sequence of numbers. To make the drawings accurate, the authors use a two-stage training pipeline, first supervised fine-tuning and then reinforcement learning, rewarding the AI when its predicted outlines closely overlap the actual object boundaries (an IoU-based reward). A rough sketch of the point-sequence idea is shown below.
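To make the "points as text" idea concrete, here is a minimal sketch of how an object outline could be serialized into a textual coordinate sequence and parsed back out of the model's generated text. The `(x,y)` syntax and the helper functions below are purely illustrative assumptions; the paper's exact coordinate format is not specified here.

```python
# Illustrative sketch (not the authors' exact format): an object outline is
# written out as a plain-text sequence of (x, y) points, which a VLM can be
# trained to generate, and the generated text is parsed back into coordinates.
import re

def points_to_text(points):
    """Serialize polygon vertices into a textual coordinate sequence."""
    return " ".join(f"({x},{y})" for x, y in points)

def text_to_points(text):
    """Parse a generated coordinate string back into a list of (x, y) points."""
    return [(int(x), int(y)) for x, y in re.findall(r"\((\d+),\s*(\d+)\)", text)]

outline = [(12, 30), (48, 28), (55, 70), (10, 74)]
encoded = points_to_text(outline)          # "(12,30) (48,28) (55,70) (10,74)"
assert text_to_points(encoded) == outline  # the representation round-trips
```

Because both the input prompt and the output live entirely in the model's language space, no extra segmentation head or decoder is needed, which is the core simplification the paper argues for.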

Why it matters?

This work is important because it shows that powerful AI models already have the potential to understand images in detail without extra, complicated additions. It suggests that we can build more unified and capable AI systems that handle both language and vision tasks effectively, simply by teaching them to predict points, a much simpler approach than previous methods.

Abstract

We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SFT-to-RL training pipeline, where Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture. On segmentation benchmarks, SimpleSeg achieves performance that is comparable to, and often surpasses, methods relying on complex, task-specific designs. This work demonstrates that precise spatial understanding can emerge from simple point prediction, challenging the prevailing need for auxiliary components and paving the way for more unified and capable VLMs. Homepage: https://simpleseg.github.io/
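The abstract mentions an IoU-based reward for the reinforcement learning stage. Below is a minimal sketch of what such a reward could look like, assuming the predicted points and the ground-truth contour are both simple polygons in pixel coordinates; the paper's exact reward formulation may differ, and the function names here are hypothetical.

```python
# A minimal sketch of an IoU-style reward: rasterize the predicted and
# ground-truth polygons into binary masks and score their overlap.
import numpy as np
from PIL import Image, ImageDraw

def polygon_to_mask(points, height, width):
    """Rasterize a list of (x, y) vertices into a binary mask."""
    img = Image.new("L", (width, height), 0)
    ImageDraw.Draw(img).polygon(points, outline=1, fill=1)
    return np.array(img, dtype=bool)

def iou_reward(pred_points, gt_points, height, width):
    """Reward = intersection-over-union between predicted and ground-truth masks."""
    pred = polygon_to_mask(pred_points, height, width)
    gt = polygon_to_mask(gt_points, height, width)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

# Example: a slightly shifted prediction earns a reward below 1.0.
gt = [(10, 10), (60, 10), (60, 60), (10, 60)]
pred = [(12, 12), (62, 12), (62, 62), (12, 62)]
print(iou_reward(pred, gt, height=80, width=80))
```

A reward of this kind is dense and directly tied to segmentation quality, which is why it can refine point sequences produced after the supervised fine-tuning stage.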