CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Shresth Grover, Priyank Pathak, Akash Kumar, Vibhav Vineet, Yogesh S Rawat

2025-12-17


Summary

This paper investigates how well large AI models that understand both images and language can handle tasks that require planning a series of actions, like solving a maze or rearranging objects. It focuses on situations where the AI might make mistakes along the way and needs to correct itself.

What's the problem?

Current AI models are really good at complex reasoning, but they haven't been thoroughly tested on tasks that require planning multiple steps in a visual environment. Real-world planning isn't perfect; things go wrong. The challenge is getting these models to not only plan a sequence of actions but also to recognize when they've made a mistake and fix it to still reach the goal. Existing models struggle with this, even when using advanced techniques like breaking down problems into smaller steps or creating visual maps of the situation.

What's the solution?

The researchers created a new set of challenges, called CoSPlan, specifically designed to test an AI's ability to plan and correct errors in visual tasks. They then developed a new technique called Scene Graph Incremental Updates (SGI). This method helps the AI reason through the steps by adding intermediate reasoning points between the starting point and the final goal, making the planning process clearer and more manageable. Importantly, SGI doesn't require any additional training of the AI model.

Why it matters?

This work is important because it highlights a weakness in current AI models: their inability to reliably plan and recover from errors in visual tasks. By creating a challenging benchmark and a simple, effective solution like SGI, the researchers are pushing the field towards more robust and practical AI systems that can handle the complexities of the real world. The SGI method also improves performance on other planning tasks.

Abstract

Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying a non-optimal action) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental Updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.
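
To make the idea of intermediate states between the initial and goal scene more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation, and it does not involve a VLM at all. It assumes a hypothetical block-rearrangement scene graph stored as a dictionary that maps each block to what it rests on. The names `apply_action`, `incremental_states`, `mismatches`, and `first_non_improving_step` are illustrative and do not come from the paper. The "error" check is a crude heuristic (an action that does not bring the scene closer to the goal), used here only to illustrate what detecting a non-optimal step could look like.

```python
from typing import Dict, List, Tuple

# Hypothetical scene graph for a block-rearrangement task (an assumption,
# not the paper's representation): block name -> what it rests on.
SceneGraph = Dict[str, str]
Action = Tuple[str, str]  # (block to move, destination)


def apply_action(graph: SceneGraph, action: Action) -> SceneGraph:
    """Return a new scene graph with one block moved; the input is unchanged."""
    block, destination = action
    updated = dict(graph)
    updated[block] = destination
    return updated


def incremental_states(initial: SceneGraph, actions: List[Action]) -> List[SceneGraph]:
    """List the scene graph after every action, starting with the initial one."""
    states = [initial]
    for action in actions:
        states.append(apply_action(states[-1], action))
    return states


def mismatches(graph: SceneGraph, goal: SceneGraph) -> int:
    """Count blocks whose support differs from the goal scene graph."""
    return sum(1 for block, support in goal.items() if graph.get(block) != support)


def first_non_improving_step(
    initial: SceneGraph, actions: List[Action], goal: SceneGraph
) -> int:
    """Index of the first action that fails to reduce the mismatch count, or -1.

    Crude heuristic only: some valid plans need steps that temporarily
    move away from the goal (for example, unstacking a block).
    """
    states = incremental_states(initial, actions)
    for i in range(len(actions)):
        if mismatches(states[i + 1], goal) >= mismatches(states[i], goal):
            return i
    return -1


if __name__ == "__main__":
    initial = {"A": "table", "B": "table", "C": "table"}
    goal = {"A": "B", "B": "C", "C": "table"}
    # The middle action is redundant: it leaves the scene unchanged.
    actions = [("B", "C"), ("A", "table"), ("A", "B")]

    for step, state in enumerate(incremental_states(initial, actions)):
        print(step, state, "mismatches:", mismatches(state, goal))
    print("first non-improving action index:", first_non_improving_step(initial, actions, goal))
```

The design choice worth noting is that every intermediate state is kept as its own object, so each step can be compared against the goal independently instead of reasoning about the initial and final scenes only.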
