
FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, Rudolf Lioutikov

2025-09-15


Summary

This paper introduces a new approach to building AI systems that let robots understand instructions combining vision (what they see) and language (what they're told to do), and then take appropriate actions. The researchers created a system called FLOWER that is far more efficient than previous methods, requiring less computing power and less training data to learn.
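The "flow policies" in the title refer to flow matching, the generative technique such models use to produce continuous robot actions: the network learns the velocity that carries random noise toward an expert action. As a rough illustration only (the action values and the toy predictor here are made-up stand-ins, not the paper's model), one flow-matching training example looks like:

```python
def flow_matching_loss(action, noise, t, predict_velocity):
    """One conditional flow-matching training example: interpolate
    between noise and the expert action at time t, then regress the
    constant velocity (action - noise) that connects the two."""
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action)]
    target = [a - n for a, n in zip(action, noise)]   # true velocity
    pred = predict_velocity(x_t, t)
    return sum((p - y) ** 2 for p, y in zip(pred, target)) / len(target)

# Toy 3-DoF action and noise sample (illustrative values only).
action = [0.5, -0.2, 0.1]
noise = [0.9, 0.4, -0.3]

# A stand-in "policy" that happens to predict the true velocity,
# playing the role of the learned action head.
oracle = lambda x_t, t: [a - n for a, n in zip(action, noise)]

assert flow_matching_loss(action, noise, 0.3, oracle) == 0.0
```

At inference time the policy integrates the predicted velocity from pure noise to a clean action over a few steps, which is what makes flow-based heads fast compared to long diffusion chains.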

What's the problem?

Existing AI systems for robots that combine vision, language, and action are incredibly large and require a huge amount of computational resources – think billions of parameters and massive datasets – to work well. This makes them impractical for many real-world applications because not everyone has access to supercomputers or enormous amounts of training data. Essentially, these systems are too big and expensive to be widely used.

What's the solution?

The researchers tackled this problem in two main ways. First, they streamlined how the system processes information: by pruning up to half of the language model's layers, they freed up capacity that could be reallocated to the part of the model that generates actions. Second, they developed a modular way to adapt the system to the specific action the robot needs to perform, rather than using a one-size-fits-all approach, which cut the parameter count by a further 20%. Together these improvements produced FLOWER, a system with roughly 950 million parameters, far fewer than previous multi-billion-parameter models, yet still capable of strong performance.
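The parameter savings from the second idea can be sketched with simple arithmetic. In the common DiT-style setup, each transformer block needs six per-channel modulation vectors (scale, shift, and gate for attention and MLP), produced by a projection from the conditioning signal; a "global" variant shares one projection across all blocks instead of giving each block its own. The dimensions and the 6x layout below are illustrative assumptions, not FLOWER's exact configuration:

```python
def adaln_param_count(d_model, n_layers, shared):
    """Parameters spent on AdaLN conditioning. Each block consumes
    6 * d_model modulation values (scale/shift/gate for attention
    and for the MLP), produced by a linear projection from a
    d_model-sized conditioning vector. Per-layer AdaLN gives every
    block its own projection; a global variant shares one."""
    per_block = d_model * 6 * d_model + 6 * d_model  # weight + bias
    return per_block if shared else n_layers * per_block

# Illustrative sizes only (not the paper's actual architecture).
d, n = 1024, 18
per_layer = adaln_param_count(d, n, shared=False)
global_ada = adaln_param_count(d, n, shared=True)

# Sharing one projection divides the conditioning cost by n_layers.
assert global_ada * n == per_layer
```

Because the modulation projections scale with d_model squared, sharing them is one of the cheapest ways to shrink a conditioned transformer without touching its depth or width.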

Why it matters?

This work is important because it makes advanced robotics more accessible. By creating a more efficient system, they've lowered the barrier to entry for researchers and developers who want to build robots that can understand and respond to complex instructions. FLOWER's strong performance across 190 tasks in both simulated and real-world environments demonstrates its potential as a practical solution for a wide range of robotic applications, and it sets a new state of the art on the CALVIN ABC benchmark.

Abstract

Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950 M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers competitive performance with bigger VLAs across 190 tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER achieves a new SoTA of 4.53 on the CALVIN ABC benchmark. Demos, code and pretrained weights are available at https://intuitive-robots.github.io/flower_vla/.