F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, Jiangmiao Pang
2025-09-10

Summary
This paper introduces a new AI model, F1, designed to better understand and interact with visual environments based on language instructions. It focuses on making AI agents that can 'think ahead' instead of just reacting to what they see.
What's the problem?
Current AI models that combine vision, language, and action often struggle in dynamic situations because they only respond to the present moment. They don't plan for the future, leading to mistakes and difficulty adapting when things change. Imagine trying to navigate a busy hallway – simply reacting to the person *right* in front of you won't prevent you from bumping into someone further down the hall.
What's the solution?
The researchers created F1, a model that predicts what the visual environment should look like in the near future in order to accomplish a given task. It uses a Mixture-of-Transformer architecture with dedicated modules for understanding the current scene, imagining (generating) future visual states, and deciding which action to take. Essentially, F1 doesn't just ask 'what should I do now?' but 'what should the world look like next, and which action gets me there?' They also developed a three-stage training process on a large dataset (over 330,000 trajectories spanning 136 tasks) to help F1 learn to reason and predict effectively; a rough sketch of the resulting pipeline appears below.
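To make those separate parts concrete, here is a minimal PyTorch-style sketch of the perceive, imagine, act pipeline described above. The class name, module layout, feature dimensions, and 7-dimensional action output are illustrative placeholders (single linear layers standing in for transformer experts), not F1's actual implementation.

```python
import torch
import torch.nn as nn

class ForesightPolicy(nn.Module):
    """Foresight-guided policy sketch: perceive, imagine a future goal state, then act."""

    def __init__(self, obs_dim: int = 256, act_dim: int = 7):
        super().__init__()
        # Placeholder single-layer modules; the real model uses transformer experts.
        self.perception = nn.Linear(obs_dim, obs_dim)    # "what do I see?"
        self.foresight = nn.Linear(obs_dim, obs_dim)     # "what should the scene look like soon?"
        self.control = nn.Linear(2 * obs_dim, act_dim)   # inverse dynamics: (current, goal) -> action

    def forward(self, obs_feat: torch.Tensor) -> torch.Tensor:
        state = self.perception(obs_feat)
        goal = self.foresight(state)                              # predicted future visual state
        action = self.control(torch.cat([state, goal], dim=-1))  # act to realize the imagined goal
        return action

# Usage: a batch of 4 pre-extracted observation features mapped to 7-DoF actions.
policy = ForesightPolicy()
actions = policy(torch.randn(4, 256))
print(actions.shape)  # torch.Size([4, 7])
```

The key design choice this sketch mirrors is that control is framed as inverse dynamics over a predicted goal state: the model first commits to what the scene should look like, then solves for the action that gets it there.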
Why it matters?
This work is important because it represents a step towards more intelligent and reliable AI agents. By enabling an agent to anticipate future visual states, F1 can perform tasks more successfully in complex, real-world applications such as robotics or virtual assistants. It's about moving beyond simple reactions to proactive planning and problem-solving.
Abstract
Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework that integrates visual foresight generation into the decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate that F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.
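The "next-scale prediction mechanism" in the abstract refers to generating the visual foresight coarse-to-fine: feature maps are predicted at progressively larger resolutions, each conditioned on the scales produced so far. The sketch below only illustrates that idea; the `generate_foresight` function, the `DummyTransformer` stand-in, the scale schedule, and the feature dimensions are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def generate_foresight(model: nn.Module, context: torch.Tensor,
                       scales=(1, 2, 4, 8), dim: int = 256) -> torch.Tensor:
    """Predict feature maps of a future frame at increasing spatial scales."""
    batch = context.shape[0]
    canvas = torch.zeros(batch, dim, scales[0], scales[0])  # running coarse estimate
    for s in scales:
        # Upsample everything generated so far to the current resolution.
        prior = F.interpolate(canvas, size=(s, s), mode="nearest")
        # Predict this scale's tokens, conditioned on the task context
        # (instruction + current observation) and the coarser prior.
        tokens = model(prior.flatten(2).transpose(1, 2), context)  # (B, s*s, dim)
        canvas = tokens.transpose(1, 2).reshape(batch, dim, s, s)
    return canvas  # finest-scale feature map of the imagined future frame

# Usage with a dummy conditioner standing in for the foresight transformer.
class DummyTransformer(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, tokens, context):
        ctx = context.unsqueeze(1).expand(-1, tokens.shape[1], -1)
        return self.proj(torch.cat([tokens, ctx], dim=-1))

foresight = generate_foresight(DummyTransformer(), context=torch.randn(2, 256))
print(foresight.shape)  # torch.Size([2, 256, 8, 8])
```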