LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation
Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo
2025-11-06
Summary
This paper focuses on the challenge of creating realistic 3D scenes from text descriptions using artificial intelligence, and then accurately evaluating how well the created scene matches the original description.
What's the problem?
Current AI systems struggle to generate 3D scenes that reflect the complexity and realism of real-world environments, largely because the instructions they are given are too simple and lack detail. Furthermore, existing methods for checking whether a generated scene matches its instructions are unreliable: they have only a shallow understanding of the 3D scene and often misinterpret how objects relate to each other.
What's the solution?
The researchers developed two new tools. LEGO-Eval is a framework that carefully checks how well a 3D scene aligns with its text description by grounding the specific objects and spatial arrangements the description mentions, and LEGO-Bench is a benchmark of highly detailed instructions for creating complex 3D scenes. They showed that LEGO-Eval judges scene-instruction alignment more accurately than existing methods, and benchmarking with LEGO-Bench revealed that current AI models are poor at following detailed instructions for 3D scene creation.
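To make the idea of "grounding specific details" concrete, here is a minimal, hypothetical sketch of tool-augmented per-component checking. The scene format, check kinds, and thresholds below are illustrative assumptions for this summary, not the paper's actual implementation: a fine-grained instruction is decomposed into atomic checks (object existence, attribute, spatial relation), each check is grounded against the scene, and the scene counts as aligned only if every check passes.

```python
# Hypothetical sketch of fine-grained scene-instruction checking.
# Scene format, check kinds, and the decomposition are assumptions
# made for illustration, not the paper's actual tools.
from dataclasses import dataclass

# A toy 3D scene: each object has a category, attributes, and a 2D position.
scene = {
    "sofa_1":  {"category": "sofa",  "color": "red", "pos": (0.0, 0.0)},
    "table_1": {"category": "table", "color": "oak", "pos": (1.0, 0.0)},
}

@dataclass
class Check:
    kind: str    # "exists", "attribute", or "near"
    args: tuple  # arguments for the check

def run_check(scene: dict, check: Check) -> bool:
    """Ground one atomic requirement against the scene."""
    if check.kind == "exists":
        (category,) = check.args
        return any(o["category"] == category for o in scene.values())
    if check.kind == "attribute":
        category, attr, value = check.args
        return any(o["category"] == category and o.get(attr) == value
                   for o in scene.values())
    if check.kind == "near":
        cat_a, cat_b, max_dist = check.args
        objs = lambda c: [o for o in scene.values() if o["category"] == c]
        return any(
            ((a["pos"][0] - b["pos"][0]) ** 2 +
             (a["pos"][1] - b["pos"][1]) ** 2) ** 0.5 <= max_dist
            for a in objs(cat_a) for b in objs(cat_b)
        )
    raise ValueError(f"unknown check kind: {check.kind}")

# "A red sofa next to a table" decomposed into atomic checks.
checks = [
    Check("exists", ("sofa",)),
    Check("attribute", ("sofa", "color", "red")),
    Check("near", ("sofa", "table", 1.5)),
]

results = [run_check(scene, c) for c in checks]
aligned = all(results)  # aligned only if every atomic check passes
```

The key design point this sketch illustrates is that alignment is judged per component, so a failure can be traced to a specific object or relation rather than a single holistic score.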
Why it matters?
This work is important because if AI can't create realistic 3D environments, it will be difficult to train robots and virtual agents to operate effectively in the real world. If a robot learns to navigate a fake, unrealistic environment, it won't perform well when it's actually deployed in a real-life situation. Better evaluation tools help us improve these AI systems and make them more reliable.
Abstract
Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.
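The abstract reports that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 in assessing scene-instruction alignment. As a reminder of what that metric measures, here is a small self-contained F1 computation over binary alignment verdicts; the labels below are made up for illustration and are not data from the paper.

```python
# Illustrative F1 computation for binary alignment judgments.
# The gold/pred labels are invented examples, not the paper's data.

def f1_score(gold: list, pred: list) -> float:
    """F1 of predicted verdicts against human (gold) labels."""
    tp = sum(g and p for g, p in zip(gold, pred))          # true positives
    fp = sum((not g) and p for g, p in zip(gold, pred))    # false positives
    fn = sum(g and (not p) for g, p in zip(gold, pred))    # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [True, True, False, False, True]  # human alignment labels
pred = [True, False, False, True, True]  # an evaluator's verdicts
score = f1_score(gold, pred)             # balances precision and recall
```

On this toy data, precision and recall are both 2/3, so F1 is also 2/3; a 0.41 gap on this scale is a very large difference in judging reliability.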