Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L. Littman, Stephen H. Bach
2024-07-05

Summary
This paper introduces Planetarium, a new benchmark for evaluating how well language models can translate natural language descriptions of planning tasks into structured planning languages like PDDL, which is used for automated planning.
What's the problem?
The main problem is that while language models can generate PDDL code, it is hard to measure how faithful that code is to the task it is meant to encode. Current evaluations often check only whether the generated problem can be solved by a planner, which does not ensure that the code matches the original task description. In addition, many existing test sets use natural language descriptions that closely mirror the correct PDDL code, making the evaluation easier than it should be.
What's the solution?
To address these issues, the authors developed Planetarium, which includes a new algorithm that rigorously compares generated PDDL code against ground-truth examples. They also created a dataset of 132,037 pairs of natural language descriptions and corresponding PDDL representations, spanning a variety of tasks and difficulty levels, which allows for a more comprehensive evaluation of how well models translate tasks into PDDL. The authors then tested several language models and found that while many generated valid code, only a small percentage of outputs accurately reflected the intended task.
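To make the idea of flexibly comparing generated PDDL against a ground truth more concrete, here is a minimal Python sketch: two planning problems are treated as the same task if their initial and goal conditions match under some renaming of objects. This is an illustrative simplification, not the authors' algorithm; the problems are hand-encoded as sets of ground atoms rather than parsed from PDDL, and the brute-force search over object bijections would not scale beyond toy examples.

```python
from itertools import permutations

def rename(atoms, mapping):
    """Apply an object renaming to every ground atom."""
    return {(pred, *(mapping[a] for a in args)) for pred, *args in atoms}

def equivalent(p, q):
    """True if some bijection between the two object sets makes the init and
    goal conditions identical. Brute force over permutations: fine for toy
    problems, far too slow for real ones."""
    if len(p["objects"]) != len(q["objects"]):
        return False
    for perm in permutations(q["objects"]):
        mapping = dict(zip(p["objects"], perm))
        if (rename(p["init"], mapping) == q["init"]
                and rename(p["goal"], mapping) == q["goal"]):
            return True
    return False

# Two Blocksworld-style problems with different object names but the same task:
# both start with two blocks on the table and ask for one stacked on the other.
p = {"objects": ["a", "b"],
     "init": {("on-table", "a"), ("on-table", "b"), ("clear", "a"), ("clear", "b")},
     "goal": {("on", "a", "b")}}
q = {"objects": ["x", "y"],
     "init": {("on-table", "x"), ("on-table", "y"), ("clear", "x"), ("clear", "y")},
     "goal": {("on", "x", "y")}}
print(equivalent(p, q))  # True: same task up to renaming

# A problem whose goal simply leaves both blocks on the table is also solvable
# by a planner, yet it describes a different task.
r = {"objects": ["x", "y"],
     "init": {("on-table", "x"), ("on-table", "y"), ("clear", "x"), ("clear", "y")},
     "goal": {("on-table", "x"), ("on-table", "y")}}
print(equivalent(p, r))  # False
```

The second check also illustrates the problem raised above: a generated problem can be perfectly solvable while encoding a different task from the one the natural language description intended.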
Why it matters?
This research is important because it provides a better way to assess the capabilities of language models in generating structured planning languages. By highlighting the gaps in current methods and offering a rigorous evaluation framework, Planetarium can help improve the development of AI systems that need to understand and execute complex planning tasks based on natural language instructions.
Abstract
Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the Planning Domain Definition Language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code continues to pose significant challenges. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often have natural language descriptions of the planning task that closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's complexity. For example, 87.6% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 82.2% are valid, solvable problems, but only 35.1% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.