LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung
2026-04-07
Summary
This paper investigates how well robots that combine vision and language understanding can follow instructions when those instructions are worded differently but mean the same thing.
What's the problem?
Current robot control systems that combine vision and language often struggle with paraphrased instructions, where the same task is requested using different words. Because they are typically trained on limited examples, they memorize specific phrasings instead of truly understanding the task. This research shows that even small wording changes, such as swapping in a synonym, can significantly reduce a robot's success rate. Moreover, the failure usually isn't in the robot *doing* the action, but in it understanding *which* action to plan in the first place.
What's the solution?
The researchers created a new test set called LIBERO-Para, specifically designed to measure how robots handle paraphrased instructions. It varies the action wording and the object references independently, so failures can be pinned down to one or the other. They also developed PRIDE, a metric that scores how difficult a paraphrase is by considering both the meaning and the structure of the sentence. This reveals whether robots succeed only on easy paraphrases while failing on harder ones.
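To make the idea of a difficulty score concrete, here is a minimal toy sketch of how a PRIDE-style metric could combine a semantic factor with a syntactic factor. This is not the paper's actual formula: the function name, the equal weighting, and the use of word overlap and token order as stand-ins for learned semantic and syntactic measures are all illustrative assumptions.

```python
from difflib import SequenceMatcher

def pride_like_score(original: str, paraphrase: str) -> float:
    """Toy paraphrase-difficulty score in [0, 1]; higher means harder.

    Combines a lexical-overlap factor (crude proxy for semantic drift)
    with a token-order factor (crude proxy for syntactic change).
    The real PRIDE metric is defined differently; this is only a sketch.
    """
    orig_tokens = original.lower().split()
    para_tokens = paraphrase.lower().split()

    # Semantic proxy: 1 minus the Jaccard overlap of the word sets.
    orig_set, para_set = set(orig_tokens), set(para_tokens)
    lexical_drift = 1.0 - len(orig_set & para_set) / len(orig_set | para_set)

    # Syntactic proxy: 1 minus the similarity of the token sequences.
    order_drift = 1.0 - SequenceMatcher(None, orig_tokens, para_tokens).ratio()

    # Equal weighting is an arbitrary choice for illustration.
    return 0.5 * lexical_drift + 0.5 * order_drift
```

Under this sketch, a single synonym swap ("pick up the mug" vs. "pick up the cup") scores low, while a full rewording ("grab that coffee container") scores high, which is the kind of graded difficulty signal a binary success rate cannot provide.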
Why it matters?
This work matters because it exposes a major weakness in current robot control systems. If robots can't understand instructions given in different ways, they won't be very useful in real-world settings, where people rarely phrase requests in a perfectly consistent manner. By identifying this problem and providing a way to measure it, the researchers are helping to develop more robust and reliable robot assistants.
Abstract
Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para