
Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo

2025-02-19


Summary

This paper introduces Text2World, a new way to test how well large language models (LLMs) can create symbolic world models from written descriptions. It's like a standardized test for AI, measuring how well a model can understand and represent complex environments based on text alone.

What's the problem?

Previous ways of testing LLMs for world modeling had issues: the evaluations suffered from randomness, relied on indirect metrics rather than directly checking the generated models, and covered only a narrow range of domains. This made it hard to know how good these models really were at understanding and representing complex worlds.

What's the solution?

The researchers created Text2World, a new benchmark built on PDDL (Planning Domain Definition Language), a formal language from automated planning that can describe many different kinds of worlds. The benchmark includes hundreds of diverse domains and uses multiple execution-based criteria to directly test how well the generated models work. Using Text2World to evaluate current LLMs, the researchers found that reasoning models trained with large-scale reinforcement learning performed best, though even the strongest model still showed limited world-modeling ability. A sketch of the kind of PDDL output involved appears below.
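
To make the task concrete, here is a minimal sketch in Python of the kind of PDDL domain an LLM might be asked to generate from a text description, together with a toy structural check. The simple-move domain and the looks_like_pddl_domain helper are illustrative assumptions, not artifacts from the paper; Text2World's actual evaluation is multi-criteria and execution-based rather than a simple syntax check.

```python
# Illustrative only: a tiny PDDL domain of the sort an LLM would generate
# from a textual description, plus a toy structural sanity check.
# Hypothetical example; not Text2World's real evaluation pipeline.

DOMAIN_TEXT = """
(define (domain simple-move)
  (:requirements :strips)
  (:predicates (at ?obj ?loc) (connected ?from ?to))
  (:action move
    :parameters (?obj ?from ?to)
    :precondition (and (at ?obj ?from) (connected ?from ?to))
    :effect (and (at ?obj ?to) (not (at ?obj ?from)))))
"""

def looks_like_pddl_domain(text: str) -> bool:
    """Toy check: balanced parentheses plus required PDDL keywords."""
    balance = 0
    for ch in text:
        if ch == "(":
            balance += 1
        elif ch == ")":
            balance -= 1
            if balance < 0:
                return False  # a closing paren appeared before its opener
    required = ("(define", "(domain", ":action")
    return balance == 0 and all(key in text for key in required)

print(looks_like_pddl_domain(DOMAIN_TEXT))  # True
```

A real execution-based metric, as the paper describes, would go further: feeding the generated domain to a planner and checking its behavior, not just its syntax.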

Why it matters?

This matters because as AI becomes more advanced, we need better ways to test and improve its ability to understand and model complex situations. Text2World provides a standardized way to do this, which could help researchers develop AI that can better understand and interact with the world around us. This could lead to more capable AI assistants, better planning systems, and improvements in fields like robotics and automated decision-making.

Abstract

Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.