Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao
2025-12-12
Summary
This paper explores using reinforcement learning, a type of machine learning where an agent learns to make decisions by trial and error, to improve the creation of 3D models from text descriptions. It builds on the success of similar techniques in 2D image generation and large language models, but finds that 3D generation is much more difficult.
What's the problem?
Creating 3D models is harder than creating 2D images because 3D objects are complex: they need to look right from all angles and have detailed surfaces, so small mistakes in the generation process are very noticeable. Existing methods struggle because it's difficult to design 'rewards' that guide the AI toward good 3D models, and current benchmarks don't accurately measure how well these models actually understand what they're supposed to create. Essentially, telling an AI what a good 3D model *should* look like is tricky.
What's the solution?
The researchers conducted a thorough investigation, testing different reward systems and reinforcement learning algorithms, and even created a new benchmark called MME-3DR to better evaluate 3D generation. They found that rewards based on human preferences work best, and that powerful, general-purpose AI models give reliable assessments of 3D attributes. They also developed a new algorithm, Hi-GRPO, that breaks the 3D creation process into stages (first the overall shape, then the finer details) and optimizes each stage separately with its own dedicated set of rewards, as sketched below. This led to AR3D-R1, a new AI model that generates high-quality 3D models from text.
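The paper describes Hi-GRPO only at this high level, so the following is a minimal sketch of the stage-wise idea, assuming each sampled generation is rendered per stage, scored by an ensemble of reward models, and normalized GRPO-style within the group. All names here (ensemble_reward, hi_grpo_advantages, the coarse_views/refined_views fields) are hypothetical, not the paper's actual API.

```python
import numpy as np

def grouped_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within a sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def ensemble_reward(renders, prompt, judges):
    """Average the scores of several reward models ('judges') on one sample."""
    return float(np.mean([judge(renders, prompt) for judge in judges]))

def hi_grpo_advantages(group, prompt, shape_judges, texture_judges):
    """Score each stage with its own reward ensemble, then normalize within
    the group, so coarse-shape tokens and texture-refinement tokens each
    receive a dedicated advantage signal."""
    shape_adv = grouped_advantages(
        [ensemble_reward(s["coarse_views"], prompt, shape_judges) for s in group])
    texture_adv = grouped_advantages(
        [ensemble_reward(s["refined_views"], prompt, texture_judges) for s in group])
    return shape_adv, texture_adv
```

Normalizing per stage and within the group means a sample's texture tokens are credited only relative to other samples' textures, so a weak texture cannot drag down the advantage assigned to a good coarse shape.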
Why it matters?
This work is important because it opens the door to using reinforcement learning for more advanced 3D content creation. It provides valuable insights into what works and what doesn't when applying these techniques to 3D models, and the new benchmark will help researchers measure progress in the field. Ultimately, this could lead to AI systems that can automatically generate complex 3D objects for applications like gaming, design, and virtual reality.
Abstract
Reinforcement learning (RL), previously proven effective in large language and multi-modal models, has recently been extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward design and the choice of RL algorithm. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signals for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical generation process through dedicated reward ensembles. Building on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, which excels from coarse shape generation to fine texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
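For reference, GRPO (the algorithm family the abstract studies) samples a group of generations per prompt, normalizes their rewards into advantages, and applies a PPO-style clipped surrogate; "token-level optimization" applies that clipped ratio at every generated token. Below is a minimal sketch of such a loss, assuming standard GRPO; the paper's exact variant and hyperparameters may differ.

```python
import torch

def grpo_token_loss(logp_new, logp_old, advantages, mask, clip_eps=0.2):
    """Token-level GRPO loss for one group of G sampled sequences.

    logp_new, logp_old: (G, T) per-token log-probs under the current
        and the behavior policy; mask: (G, T) with 1 for generated tokens.
    advantages: (G,) group-normalized rewards, broadcast to every token.
    """
    ratio = torch.exp(logp_new - logp_old)            # per-token importance ratio
    adv = advantages.unsqueeze(1)                     # same advantage at each token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)        # PPO-style clipped surrogate
    return (per_token * mask).sum() / mask.sum()      # mean over valid tokens
```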