Do What? Teaching Vision-Language-Action Models to Reject the Impossible
Wen-Han Hsieh, Elvis Hsieh, Dantong Niu, Trevor Darrell, Roei Herzig, David M. Chan
2025-08-25
Summary
This paper explores how robots that combine vision and language can handle instructions that don't match what's actually around them, such as being asked to pick up an object that isn't there.
What's the problem?
Imagine you tell a robot to 'pick up the red block,' but there *is* no red block in the room. Current robots with vision and language understanding struggle with these 'false premise' instructions. They might try to find something red anyway, or just fail without explaining why, instead of realizing the instruction itself is flawed. The core issue is that robots need to not only follow instructions but also understand *when* an instruction is impossible given the real world.
What's the solution?
The researchers developed a system called Instruct-Verify-and-Act (IVA). IVA works in three steps: first, it verifies whether the instruction is possible based on what it 'sees'; second, if it detects a false premise, it doesn't just stop, but instead figures out what the user likely *meant* and asks for clarification or suggests a reasonable alternative; third, it grounds that alternative in perception and action. The researchers trained the robot on a large dataset of both correct and incorrect instructions, helping it learn to identify and respond to these situations effectively.
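The verify-then-respond loop can be pictured with a minimal sketch. This is an illustrative toy, not the paper's actual implementation: the function name `iva_step`, the string-matching "perception" check, and the clarification templates are all assumptions made for the example.

```python
# Toy sketch of an IVA-style verify-then-act loop.
# The premise check (substring matching against a list of visible objects)
# stands in for the model's real perception grounding.

def iva_step(instruction, visible_objects):
    """Return ("act", ...) or ("clarify", ...) for one instruction."""
    # Step 1: verify the instruction's premise against what is visible.
    referenced = [obj for obj in visible_objects if obj in instruction]
    if not referenced:
        # Step 2: false premise detected -- respond in language, don't act blindly.
        if visible_objects:
            alternative = visible_objects[0]
            return ("clarify",
                    f"I don't see that object. Did you mean the {alternative}?")
        return ("clarify", "I don't see that object here.")
    # Step 3: premise holds -- ground the instruction in an action.
    return ("act", f"pick up {referenced[0]}")

print(iva_step("pick up the red block", ["blue cup", "green ball"]))
# -> ('clarify', "I don't see that object. Did you mean the blue cup?")
print(iva_step("pick up the green ball", ["blue cup", "green ball"]))
# -> ('act', 'pick up green ball')
```

In the real system these decisions are made by a single trained VLA model rather than hand-written rules; the sketch only shows the control flow that the three steps imply.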
Why it matters?
This research is important because it makes robots more reliable and user-friendly. If a robot can understand when an instruction is flawed and ask for help, it's less likely to make mistakes or frustrate the person giving the commands. This is a big step towards robots that can truly assist us in everyday tasks, even when we aren't perfectly clear in our requests.
Abstract
Recently, Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. These models rely on multimodal inputs, with language instructions playing a crucial role -- not only in predicting actions, but also in robustly interpreting user intent, even when the requests are impossible to fulfill. In this work, we investigate how VLAs can recognize, interpret, and respond to false-premise instructions: natural language commands that reference objects or conditions absent from the environment. We propose Instruct-Verify-and-Act (IVA), a unified framework that (i) detects when an instruction cannot be executed due to a false premise, (ii) engages in language-based clarification or correction, and (iii) grounds plausible alternatives in perception and action. Towards this end, we construct a large-scale instruction tuning setup with structured language prompts and train a VLA model capable of handling both accurate and erroneous requests. Our approach leverages a contextually augmented, semi-synthetic dataset containing paired positive and false-premise instructions, enabling robust detection and natural language correction. Our experiments show that IVA improves false premise detection accuracy by 97.56% over baselines, while increasing successful responses in false-premise scenarios by 50.78%.
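The paired positive / false-premise data described in the abstract can be illustrated with a small sketch. This is a hypothetical construction under assumed details: the object vocabulary, the `make_pair` helper, and the single instruction template are inventions for the example, not the paper's actual generation pipeline.

```python
# Toy sketch: from a scene's object list, produce one executable instruction
# (referencing a present object) paired with one false-premise instruction
# (referencing an object absent from the scene).

VOCAB = ["red block", "blue cup", "green ball", "yellow mug"]

def make_pair(scene_objects):
    """Return (positive, false_premise) instruction strings for one scene."""
    present = scene_objects[0]  # any object actually in the scene
    absent = next(o for o in VOCAB if o not in scene_objects)  # one that isn't
    template = "pick up the {}"
    return template.format(present), template.format(absent)

pos, neg = make_pair(["blue cup", "green ball"])
print(pos)  # pick up the blue cup
print(neg)  # pick up the red block
```

Training on such pairs gives the model contrastive supervision: identical scenes and near-identical commands, differing only in whether the premise holds.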