Detect Anything via Next Point Prediction
Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, Lei Zhang
2025-10-15
Summary
This paper introduces Rex-Omni, a new multimodal AI model that is very good at both identifying objects in images and understanding natural-language requests about those objects.
What's the problem?
Traditionally, object detection relied on models that directly predict the coordinates of objects, i.e., where a box should be drawn around them. Recently, researchers have tried using large language models that also understand images, but these struggled to find *all* the objects, often made duplicate detections, and had trouble pinpointing exact locations. In short, these newer models weren't as precise as the older methods.
What's the solution?
The researchers created Rex-Omni, a multimodal language model with 3 billion parameters. They improved it in three main ways. First, they simplified how the model predicts object locations by representing coordinates with a limited set of discrete values (0 to 999). Second, they built data engines that generate large amounts of training data specifically teaching the model to connect language descriptions with objects in images. Finally, they used a two-stage training process: supervised fine-tuning on 22 million examples, followed by a reinforcement learning stage that rewards accurate object placement and discourages duplicate detections.
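The coordinate quantization idea can be sketched as below. The 1,000-bin range (0 to 999) comes from the paper's abstract; the function names and the bin-center dequantization are illustrative assumptions, since the paper's exact implementation isn't shown here. Each quantized value would then correspond to one special coordinate token in the model's vocabulary.

```python
def quantize_coords(box, img_w, img_h, num_bins=1000):
    """Map a continuous (x0, y0, x1, y1) pixel box to discrete bins in [0, num_bins - 1]."""
    x0, y0, x1, y1 = box

    def q(v, size):
        # Scale to [0, num_bins) and clamp the right/bottom edge into range.
        return min(int(v / size * num_bins), num_bins - 1)

    return (q(x0, img_w), q(y0, img_h), q(x1, img_w), q(y1, img_h))


def dequantize_coords(qbox, img_w, img_h, num_bins=1000):
    """Invert quantization, mapping each bin back to its center in pixel space."""
    qx0, qy0, qx1, qy1 = qbox

    def d(qv, size):
        return (qv + 0.5) / num_bins * size

    return (d(qx0, img_w), d(qy0, img_h), d(qx1, img_w), d(qy1, img_h))
```

With this scheme, a box in a 640x480 image is encoded as four tokens, e.g. `quantize_coords((320, 240, 640, 480), 640, 480)` yields `(500, 500, 999, 999)`, which keeps coordinate prediction short and vocabulary-bounded regardless of image resolution.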
Why it matters?
Rex-Omni is important because it shows that large language models *can* be excellent at object detection, even surpassing traditional methods. But it's not just about finding objects; because it understands language, it can also perform other tasks like responding to specific requests about objects ('point to the red car'), understanding images based on text prompts, and even interacting with computer interfaces. This moves us closer to AI systems that can truly 'see' and understand the visual world like humans do.
Abstract
Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million samples with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.
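The abstract's "geometry-aware rewards" for GRPO post-training could take a form like the sketch below: each predicted box is greedily matched to an unmatched ground-truth box, matches earn their IoU as reward, and unmatched predictions (including duplicates) incur a penalty. The greedy matching, the IoU threshold, and the `dup_penalty` value are assumptions for illustration, not the paper's exact reward.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def geometry_reward(preds, gts, iou_thresh=0.5, dup_penalty=0.25):
    """Hypothetical geometry-aware reward: IoU for matched boxes,
    a penalty for unmatched/duplicate predictions, normalized by #ground truths."""
    matched = set()
    reward = 0.0
    for p in preds:
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j in matched:
                continue  # each ground truth can be claimed at most once
            v = iou(p, g)
            if v > best_iou:
                best_iou, best_j = v, j
        if best_iou >= iou_thresh:
            matched.add(best_j)
            reward += best_iou  # graded reward encourages tighter boxes
        else:
            reward -= dup_penalty  # duplicates and spurious boxes are penalized
    return reward / max(len(gts), 1)
```

Because the reward is computed on decoded continuous boxes rather than token log-likelihoods, it directly optimizes localization quality and makes duplicate predictions strictly worse than stopping, which is how such a reward would counteract SFT-induced over-prediction.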