MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing

2026-04-02

Summary

This paper introduces MiroEval, a new way to test how well AI systems can do in-depth research, going beyond just checking the final answer.

What's the problem?

Current tests of AI research ability focus too heavily on the final report and don't look at *how* the AI did the research. Existing benchmarks often use simple, synthetic tasks and handle images and videos poorly. They also become outdated quickly as new information emerges, so they stop reflecting the real-world research challenges people actually face.

What's the solution?

The researchers created MiroEval, a benchmark of 100 real-world research tasks, 70 text-only and 30 combining text and images, designed to be refreshed regularly with new information. MiroEval doesn't just grade the final answer; it also checks how well the AI finds information, verifies facts, and refines its approach during the research process. The authors used it to evaluate 13 different AI systems.
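To make the three-dimensional grading concrete, here is a minimal sketch of how per-task scores along synthesis quality, factuality, and process quality might be aggregated into an overall score. All names and the equal-weight scheme are illustrative assumptions, not details from the MiroEval paper or codebase.

```python
# Hypothetical aggregation of multi-dimension agent-evaluation scores.
# TaskResult fields and the weighting below are illustrative only.
from dataclasses import dataclass


@dataclass
class TaskResult:
    synthesis: float   # rubric-graded report quality, 0-100
    factuality: float  # share of claims verified against sources, 0-100
    process: float     # quality of search/verify/refine behavior, 0-100


def overall_score(results, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Average each dimension over all tasks, then combine with weights."""
    n = len(results)
    dims = (
        sum(r.synthesis for r in results) / n,
        sum(r.factuality for r in results) / n,
        sum(r.process for r in results) / n,
    )
    return sum(w * d for w, d in zip(weights, dims)), dims


score, dims = overall_score([
    TaskResult(synthesis=82, factuality=75, process=70),
    TaskResult(synthesis=64, factuality=80, process=58),
])
# dims → (73.0, 77.5, 64.0); score → 71.5 under equal weights
```

Keeping the per-dimension averages alongside the combined score matters here: the paper's finding is that the dimensions are complementary, so a single scalar would hide exactly the strengths and weaknesses the benchmark is meant to reveal.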

Why it matters?

MiroEval matters because it offers a more complete and realistic way to evaluate AI research systems. It surfaces strengths and weaknesses that older, answer-only tests miss, and it can guide the development of better AI research tools, ultimately making them more reliable and useful.

Abstract

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics; agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments; and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.