ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models
Matteo Merler, Nicola Dainese, Minttu Alakuijala, Giovanni Bonetta, Pietro Ferrazzi, Yu Tian, Bernardo Magnini, Pekka Marttinen
2025-05-20
Summary
This paper introduces ViPlan, a new benchmark that measures how well AI systems can plan actions from what they see in images, either by reasoning step by step with logical rules or by proposing actions directly.
What's the problem?
The problem is that it is not clear which of these approaches works better when an AI system has to understand and act in visual situations, especially when the actions it chooses must match what is actually happening in the images.
What's the solution?
To figure this out, the researchers created ViPlan, a benchmark for testing and comparing symbolic planning, which relies on logical rules about the scene, against vision-language models that propose actions directly. They found that symbolic planning works better on tasks where the AI must accurately ground what it sees in the image.
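The contrast between the two approaches can be made concrete with a short sketch. The code below is a minimal illustration only, not the ViPlan API: the names vlm.evaluate_predicate, vlm.propose_action, planner.solve, and the env and goal objects are hypothetical placeholders standing in for whatever model, planner, and environment a given setup uses.

```python
# Hypothetical sketch of the two planning paradigms being compared.
# None of these objects or methods come from ViPlan itself.

def symbolic_planning_with_vlm_grounding(env, vlm, planner, goal):
    """VLM-as-grounder: the VLM only answers true/false questions about
    symbolic facts in the image; a classical planner does the reasoning."""
    while True:
        image = env.observe()
        # Ground each symbolic predicate (e.g. "block A is on block B")
        # in the current image using the vision-language model.
        state = {pred: vlm.evaluate_predicate(image, pred)
                 for pred in env.predicates}
        if goal.satisfied(state):
            return True
        plan = planner.solve(state, goal)  # e.g. a classical symbolic planner
        if not plan:
            return False
        env.step(plan[0])  # execute the first planned action, then replan


def direct_vlm_planning(env, vlm, goal, max_steps=50):
    """VLM-as-planner: the VLM looks at the image and proposes the next
    action directly, with no symbolic state in between."""
    for _ in range(max_steps):
        image = env.observe()
        action = vlm.propose_action(image, goal.description)
        if action == "done":
            return goal.check(env)
        env.step(action)
    return False
```

The first loop keeps the model's role narrow (perception only) and leaves decision-making to a planner with explicit rules, while the second trusts the model to both perceive and decide in one step; the benchmark's question is which division of labor works better in practice.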
Why it matters?
This matters because it shows that when accurate understanding of images is important, combining logical reasoning with vision helps AI make better decisions. That is useful for robots, self-driving cars, and other technology that needs to see and plan at the same time.
Abstract
ViPlan is an open-source benchmark that compares symbolic planning grounded by Vision-Language Models against direct action proposal by the same models in visual domains, finding that symbolic planning excels in tasks requiring accurate image grounding.