SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu

2026-01-05

Summary

This paper introduces a system called SenseNova-MARS that makes Vision-Language Models (AI models that process both images and text) much better at solving complex problems that require both visual understanding and searching for information.

What's the problem?

Current AI models are good at thinking through problems step-by-step in text, or at using individual tools, but they struggle to combine the two smoothly, especially when a task involves detailed images and requires finding information online. They aren't as flexible as humans at using tools like image search and cropping *while* they're still reasoning about a problem.

What's the solution?

The researchers created SenseNova-MARS, which uses reinforcement learning to train the model to intelligently combine image search, text search, and image cropping as it reasons. They developed a new algorithm, BN-GSPO (Batch-Normalized Group Sequence Policy Optimization), to make this training more stable and effective. They also built a challenging new benchmark, HR-MMSearch, with high-resolution images and questions that demand substantial knowledge and searching.
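The paper does not spell out BN-GSPO's details in this summary, but GSPO-style methods compute group-relative advantages over sampled responses. Below is a minimal, hypothetical sketch of what "batch-normalized" group advantages could look like: rewards are first normalized within each group of rollouts, then re-normalized across the whole batch. The function name and the batch-level step are assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

def bn_gspo_advantages(rewards, eps=1e-8):
    """Sketch of group-relative advantages with an extra batch-level
    normalization step (an assumption about the "BN" in BN-GSPO).

    rewards: array of shape (num_groups, group_size), one scalar
    reward per sampled response in each group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Group-relative baseline, as in GRPO/GSPO-style methods:
    # center and scale rewards within each group of rollouts.
    group_mean = rewards.mean(axis=1, keepdims=True)
    group_std = rewards.std(axis=1, keepdims=True)
    adv = (rewards - group_mean) / (group_std + eps)
    # Hypothetical batch-level normalization: re-center and re-scale
    # across the entire batch to keep update magnitudes stable.
    adv = (adv - adv.mean()) / (adv.std() + eps)
    return adv

# Example: two groups of four rollouts each, with 0/1 rewards.
adv = bn_gspo_advantages([[1.0, 0.0, 0.0, 1.0],
                          [0.0, 0.0, 1.0, 0.0]])
```

After both steps the batch of advantages has roughly zero mean and unit variance, which is the kind of property that tends to stabilize policy-gradient updates.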

Why does it matter?

SenseNova-MARS significantly improves the performance of these AI models, even surpassing some of the best proprietary models like Gemini and GPT-5 on certain tasks. This is a big step towards creating AI that can truly act as an 'agent,' intelligently using tools and reasoning to solve complex, real-world problems, and the researchers are sharing their code and data to help others build on this work.

Abstract

While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
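The abstract describes interleaving tool invocation with continuous reasoning. The loop below is a minimal sketch of that pattern, assuming a hypothetical `model` interface that returns either a tool call or a final answer at each step; the tool names match the three tools named in the abstract, but their implementations here are stubs.

```python
def run_agent(model, question, image, max_steps=8):
    """Sketch of an interleaved reason-and-act loop: the model thinks,
    optionally calls a tool, and the tool's result is fed back into
    the context before the next reasoning step."""
    tools = {
        # Stub tools standing in for real search/crop backends.
        "image_search": lambda q: f"[image-search results for {q!r}]",
        "text_search": lambda q: f"[text-search results for {q!r}]",
        "image_crop": lambda box: f"[cropped region {box}]",
    }
    context = [("question", question), ("image", image)]
    for _ in range(max_steps):
        # Hypothetical interface: the model returns a dict with its
        # reasoning plus either a tool call or a final answer.
        step = model(context)
        context.append(("thought", step["thought"]))
        if step.get("tool") is None:  # model chose to answer
            return step["answer"]
        result = tools[step["tool"]](step["args"])
        context.append(("observation", result))  # feed result back in
    return None  # step budget exhausted without an answer
```

Reinforcement learning then shapes *when* the model calls each tool and how it folds the observations into its chain of thought, rather than scripting the loop by hand.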