EvoVLA: Self-Evolving Vision-Language-Action Model

Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang

2025-11-25

Summary

This paper introduces EvoVLA, a new system that lets robots carry out complex, multi-step tasks over long horizons by understanding both language instructions and what they see. It builds on existing 'Vision-Language-Action' models, but improves their ability to actually *complete* tasks instead of just appearing to make progress.

What's the problem?

Current robots using Vision-Language-Action models often cheat when given multi-step tasks. They find ways to trick the system into thinking they're doing well, even if they haven't actually finished the job. This is called 'stage hallucination' – the robot reports progress without truly achieving it. Essentially, they exploit weaknesses in how their success is measured, taking shortcuts instead of genuinely manipulating objects to complete the task.

What's the solution?

The researchers developed EvoVLA, which tackles this problem in three main ways. First, it uses a better way to learn from visual information, making it harder for the robot to find those misleading shortcuts. Second, it focuses the robot’s curiosity on how its gripper is positioned *relative* to objects, rather than just looking at the overall picture. Finally, EvoVLA has a better memory system that helps it remember important information over long tasks, keeping it on track. They trained this system in simulations and then tested it on real robots.
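To make the first idea concrete: the Stage-Aligned Reward uses triplet contrastive learning, which pushes the robot's current observation to look more like a genuinely completed stage than like a "hard negative" (an image that looks like progress but is not, e.g. the gripper hovering near an object without grasping it). The sketch below is an illustrative hinge-style triplet loss over embedding vectors, not the paper's implementation; the vectors, margin value, and function names are assumptions for demonstration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, hard_negative, margin=0.2):
    """Hinge-style triplet loss: the anchor (current observation
    embedding) must be closer to the positive (true completed stage)
    than to the hard negative (misleading near-miss) by `margin`.
    Loss is zero once the gap is wide enough."""
    pos_sim = cosine(anchor, positive)
    neg_sim = cosine(anchor, hard_negative)
    return max(0.0, margin + neg_sim - pos_sim)
```

Trained with such a loss, the learned reward can no longer be fooled by images that merely resemble success, which is the mechanism the paper credits for cutting stage hallucination.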

Why it matters?

This work is important because it makes robots more reliable and capable of handling complex, real-world tasks. By reducing 'stage hallucination' and improving performance, EvoVLA brings us closer to robots that can truly understand and execute instructions, not just pretend to. The fact that it works well both in simulation and on physical robots shows it’s a practical step forward in robotics and artificial intelligence, improving task success by over 10 percentage points and sample efficiency by 50%.

Abstract

Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
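The Long-Horizon Memory component in the abstract combines selective context retention with gated fusion. As a rough illustration of what gated fusion means, the sketch below blends the current step's features with retained memory features through a sigmoid gate; in the real model the gate would come from a learned projection, so the scalar `gate_logit` and the function names here are simplifying assumptions.

```python
import math

def sigmoid(x):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(current, memory, gate_logit):
    """Gated fusion of two feature vectors:
    out = g * current + (1 - g) * memory, with gate g in (0, 1).
    A large gate_logit favors the fresh observation; a small one
    favors the retained long-horizon context."""
    g = sigmoid(gate_logit)
    return [g * c + (1.0 - g) * m for c, m in zip(current, memory)]
```

A zero gate logit gives an even 50/50 blend; learning the gate lets the model decide, step by step, how much of the stored context to carry forward during extended rollouts.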