Artemis: Structured Visual Reasoning for Perception Policy Learning

Wei Tang, Yanpeng Sun, Shan Zhang, Xiaofan Li, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li

2025-12-03

Summary

This paper explores how to improve artificial intelligence systems that 'think' while looking at images, specifically when those systems try to explain their reasoning using language.

What's the problem?

Current AI systems often try to explain their image-based decisions by generating chains of thought written in natural language. However, the researchers found that adding this linguistic explanation actually *decreased* the AI's accuracy. The issue isn't that reasoning is bad, but that language isn't the right tool for understanding images. Images require thinking about *where* things are and *what* objects are present, which is hard to do effectively just with words.

What's the solution?

The researchers created a new AI framework called Artemis. Instead of using language to explain its reasoning, Artemis reasons through 'proposals': at each step it names an object and draws a bounding box (the rectangle around it) in the image. This way, every intermediate step of the reasoning is tied to something visually present and verifiable, making it more accurate and easier to supervise. Artemis is built on an existing AI model called Qwen2.5-VL-3B, performs well on grounding and object-detection tasks, and generalizes well to counting and geometric-perception tasks.
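To make the idea concrete, here is a minimal sketch of what a proposal-based reasoning trace could look like in code. The `Proposal` type and `reasoning_trace` function are illustrative assumptions, not the paper's actual implementation; they only show how each intermediate step becomes a (label, bounding-box) pair rather than a sentence.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch (not Artemis's real API): each reasoning step
# is a (label, bounding-box) pair, so intermediate states are
# explicit visual objects rather than free-form text.
@dataclass
class Proposal:
    label: str
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def reasoning_trace(proposals: List[Proposal]) -> str:
    """Render the chain of proposals so each step is inspectable."""
    return " -> ".join(f"{p.label}@{p.bbox}" for p in proposals)

# A two-step trace: the model first localizes the dog, then the frisbee.
trace = [
    Proposal("dog", (34.0, 50.0, 210.0, 300.0)),
    Proposal("frisbee", (180.0, 40.0, 260.0, 110.0)),
]
print(reasoning_trace(trace))
```

Because every step carries a box, each one can be checked against ground-truth annotations, which is what makes this form of reasoning directly supervisable.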

Why it matters?

This work is important because it shows that for visual tasks, AI reasoning should be grounded in spatial understanding – meaning it should focus on where things are and what they are, rather than relying on abstract language. By improving how AI reasons about images, this research could lead to more reliable and capable AI systems for a variety of applications, and even improve general AI performance on tasks that require both vision and language.

Abstract

Recent reinforcement-learning frameworks for visual perception policy have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, direct supervision for proposal quality, and avoids ambiguity introduced by language-based reasoning. Artemis is built on Qwen2.5-VL-3B, achieves strong performance on grounding and detection tasks, and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.
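The abstract's claim that proposals admit "direct supervision for proposal quality" can be illustrated with the standard intersection-over-union (IoU) measure: a predicted box is scored against a ground-truth box, yielding a verifiable signal that free-text reasoning steps do not provide. The function below is a generic IoU sketch, not the paper's actual reward.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

# An intermediate proposal can be checked against ground truth directly,
# giving a dense supervision signal (values here are illustrative).
pred = (0.0, 0.0, 10.0, 10.0)
gt = (5.0, 5.0, 15.0, 15.0)
print(round(iou(pred, gt), 3))  # intersection 25, union 175
```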