Act2Goal: From World Model To General Goal-conditioned Policy

Pengfei Zhou, Liliang Chen, Shengcong Chen, Di Chen, Wenzhi Zhao, Rongjun Jin, Guanghui Ren, Jianlan Luo

2025-12-30

Summary

This paper introduces a new approach called Act2Goal for teaching robots how to perform complex manipulation tasks, like moving objects around, by showing them what the desired end result should look like.

What's the problem?

Currently, robots struggle with tasks that require many steps because they typically plan one action at a time, which makes it hard to look ahead and complete longer, more complicated sequences. Simply showing a robot the final goal isn't enough; it also needs a way to work out the sequence of intermediate states and actions that lead there.

What's the solution?

Act2Goal solves this by giving the robot a kind of 'visual imagination'. A goal-conditioned world model first predicts a series of intermediate images showing what the scene should look like along the way to the goal. A mechanism called Multi-Scale Temporal Hashing (MSTH) then splits this imagined plan into dense near-term frames, used for fine-grained, reactive control, and sparse far-off frames that keep the robot anchored to the overall task. This lets the robot react to unexpected changes while staying focused on the final goal. The system can also improve through its own trial and error: it relabels its experience in hindsight and finetunes itself, quickly getting better without constant human guidance.
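The dense/sparse split can be illustrated with a small sketch. This is not the paper's actual MSTH implementation; the function name, parameters, and the specific temporal scales below are illustrative assumptions about how one might sample proximal and distal frames from an imagined trajectory.

```python
def multiscale_frame_indices(horizon, num_proximal=4, scales=(8, 16, 32)):
    """Pick frame indices from an imagined trajectory of `horizon` frames.

    Hypothetical sketch of a multi-scale split: dense proximal frames for
    fine-grained closed-loop control, plus sparse distal frames that anchor
    global task consistency. The paper's exact MSTH scheme may differ.
    """
    # Dense proximal frames: the next few imagined states, checked every step.
    proximal = list(range(min(num_proximal, horizon)))
    # Sparse distal frames: coarser anchors at increasing temporal scales,
    # clipped to the last imagined frame (the goal).
    distal = sorted({min(s, horizon - 1) for s in scales})
    return proximal, distal
```

For a 40-frame imagined rollout this yields proximal indices [0, 1, 2, 3] for immediate control and distal anchors [8, 16, 32] for long-horizon consistency; the policy would attend to both sets when producing motor commands.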

Why it matters?

This research matters because it significantly improves a robot's ability to handle complex, multi-step tasks in the real world. By letting robots plan ahead and adapt on the fly, Act2Goal makes them more reliable in situations that demand precise sequences of actions, and it does so with minimal human intervention: in the paper's real-robot experiments, success rates on challenging out-of-distribution tasks rose from 30% to 90% within minutes of autonomous interaction.

Abstract

Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation due to their reliance on single-step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero-shot generalization to novel objects, spatial layouts, and environments. We further enable reward-free online adaptation through hindsight goal relabeling with LoRA-based finetuning, allowing rapid autonomous improvement without external supervision. Real-robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, validating that goal-conditioned world models with multi-scale temporal control provide structured guidance necessary for robust long-horizon manipulation. Project page: https://act2goal.github.io/
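The reward-free online adaptation described above relies on hindsight goal relabeling: whatever state a rollout actually ends in is treated as the goal it "succeeded" at, so every trajectory becomes goal-conditioned training data. A minimal sketch of that idea, assuming a hypothetical trajectory format of (observation, action) pairs (the paper's data structures are not specified here):

```python
def hindsight_relabel(trajectory):
    """Relabel a rollout with the goal it actually achieved.

    `trajectory` is a list of (observation, action) pairs. The final
    observation is taken as the achieved goal, and every transition is
    rewritten as (observation, goal, action) so it can supervise a
    goal-conditioned policy without any external reward signal.
    """
    goal = trajectory[-1][0]  # the state the robot actually reached
    return [(obs, goal, action) for obs, action in trajectory]
```

In Act2Goal this relabeled data would then drive lightweight LoRA-based finetuning of the policy, which is what allows rapid autonomous improvement without external supervision.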