Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto
2026-01-06
Summary
This paper introduces a new system called Talk2Move that lets you move, rotate, or resize objects in a picture just by telling it what to do with words. It's a way to make images respond to natural language instructions about object placement.
What's the problem?
Current AI systems are pretty good at changing *how* things look in an image based on text, like making a picture more 'artistic' or changing colors. However, they struggle with actually *moving* objects around, rotating them, or changing their size. This is because it's hard to get enough examples of text paired with the exact pixel changes needed to make these kinds of geometric adjustments, and directly tweaking pixels isn't very effective.
What's the solution?
Talk2Move uses a technique called reinforcement learning, where the AI learns by trial and error. Instead of needing a ton of perfectly matched examples, it explores different ways to move objects by making small changes and seeing how well those changes match the text instructions. It uses a clever reward system that specifically encourages the AI to get the object's position, rotation, and size correct, and it focuses its learning on the most important steps in the transformation process. It also uses slight variations in the text to help it learn more robustly.
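The trial-and-error loop described above can be pictured as a group-relative scoring step in the spirit of GRPO: sample several candidate transformations ("rollouts") for one instruction, score each one, then normalize the scores within the group so the model is nudged toward above-average candidates. This is an illustrative sketch under that assumption, not the paper's implementation; the reward values are placeholders.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: center and scale each rollout's reward by the
    group's mean and standard deviation, so the policy update favors
    rollouts that beat the group average."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical rewards for four candidate transformations of one instruction:
# less-negative means the result matched the text instruction better.
rewards = [-1.0, -0.5, -2.0, -0.7]
adv = group_relative_advantages(rewards)
```

Because the advantages are computed relative to the group rather than an absolute target, no paired "correct" image is needed, which is what lets the method sidestep the scarce-supervision problem.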
Why does it matter?
This work is important because it brings us closer to AI systems that can truly understand and respond to our instructions about the visual world. Being able to manipulate objects in images with natural language has a lot of potential applications, like image editing, creating virtual environments, and even robotics where a robot could rearrange objects based on spoken commands.
Abstract
We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations, such as translating, rotating, or resizing objects, due to scarce paired supervision and the limits of pixel-level optimization. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial-reward-guided model aligns geometric transformations with linguistic descriptions, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
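The object-centric spatial rewards mentioned in the abstract can be pictured as a scoring function over an object's pose: penalize displacement, rotation, and scale errors separately, then combine them. The sketch below assumes a simple dictionary pose representation and made-up weights; the function name, fields, and weighting are illustrative, not the paper's actual reward.

```python
import math

def spatial_reward(pred, target, w_pos=1.0, w_rot=1.0, w_scale=1.0):
    """Toy object-centric spatial reward: 0 for a perfect match, more
    negative the further the predicted pose is from the instructed one.
    Pose dicts hold 'pos' (x, y), 'rot' (degrees), 'scale' (multiplier);
    all names here are assumptions for illustration."""
    dx = pred["pos"][0] - target["pos"][0]
    dy = pred["pos"][1] - target["pos"][1]
    pos_err = math.hypot(dx, dy)                      # displacement term
    rot_err = abs(pred["rot"] - target["rot"]) % 360.0
    rot_err = min(rot_err, 360.0 - rot_err)           # shortest angular distance
    scale_err = abs(math.log(pred["scale"] / target["scale"]))  # symmetric in log space
    return -(w_pos * pos_err + w_rot * math.radians(rot_err) + w_scale * scale_err)
```

Keeping the three geometric terms separate is what makes the reward interpretable: a low score can be traced to the specific behavior (translate, rotate, or resize) that diverged from the instruction.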