Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling
Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan
2025-08-08
Summary
This paper introduces MACT, a framework in which multiple AI agents collaborate to understand visual documents and answer questions about them more effectively than a single model.
What's the problem?
Understanding complex visual documents and answering questions about them is difficult for a single AI model, because the task combines different types of information with several distinct reasoning skills.
What's the solution?
The authors design a framework in which four specialized agents collaborate, each handling a different part of the task. The framework pairs these agents with mixed reward modeling and test-time scaling, which improves performance while using fewer parameters.
Why does it matter?
This matters because more accurate and parameter-efficient AI systems can help people extract information from documents and get answers quickly, improving productivity in areas such as business and education.
Abstract
MACT, a Multi-Agent Collaboration framework with Test-Time scaling, enhances visual document understanding and VQA by using four specialized agents and mixed reward modeling, achieving superior performance with reduced parameters.
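To make the idea concrete, the following is a minimal sketch of how a chain of collaborating agents could be combined with test-time scaling and a mixed reward. It is not the authors' implementation. Everything in it is an illustrative assumption: the `Agent` interface, the names `run_pipeline`, `mixed_reward`, and `best_of_n`, the best-of-N selection strategy, and the weighting parameter `alpha`. The summary above does not specify agent roles or the reward formula, so the sketch treats agents as opaque callables and blends an average per-step score with a final-answer score.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: each agent maps the current state (text) to a new state.
Agent = Callable[[str], str]
Scorer = Callable[[str], float]


@dataclass
class Candidate:
    steps: List[str]  # intermediate outputs, one per agent
    answer: str       # output of the last agent in the chain


def run_pipeline(agents: List[Agent], question: str) -> Candidate:
    """Pass the question through each agent in turn, recording every step."""
    steps: List[str] = []
    state = question
    for agent in agents:
        state = agent(state)
        steps.append(state)
    return Candidate(steps=steps, answer=state)


def mixed_reward(cand: Candidate, step_score: Scorer, answer_score: Scorer,
                 alpha: float = 0.5) -> float:
    """Blend a process-level score with an outcome-level score.

    alpha is an assumed weight, not a value taken from the paper.
    """
    if cand.steps:
        process = sum(step_score(s) for s in cand.steps) / len(cand.steps)
    else:
        process = 0.0
    outcome = answer_score(cand.answer)
    return alpha * process + (1.0 - alpha) * outcome


def best_of_n(make_agents: Callable[[int], List[Agent]], question: str, n: int,
              step_score: Scorer, answer_score: Scorer) -> Candidate:
    """Test-time scaling by sampling: run n pipelines and keep the best one."""
    candidates = [run_pipeline(make_agents(i), question) for i in range(n)]
    return max(candidates,
               key=lambda c: mixed_reward(c, step_score, answer_score))
```

In this sketch, spending more compute at inference time means raising `n`, so that more candidate runs are scored before one answer is returned. A real system would replace the toy scorers with learned reward models.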