
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang

2024-12-13


Summary

This paper introduces RuleArena, a new benchmark designed to test how well large language models (LLMs) can follow complex rules in real-world settings, such as computing airline baggage fees, checking NBA transactions, and applying tax regulations.

What's the problem?

As LLMs become more advanced, it is increasingly important to know whether they can understand and apply the kinds of complicated rules people encounter in everyday life. Traditional rule-based reasoning benchmarks rely on simplified, formal rule representations and rarely capture the complexity of these real-world scenarios, so we don't know how well LLMs can truly perform in practical applications.

What's the solution?

RuleArena addresses this issue with a set of challenging tests built from actual rules in three different areas: airline baggage fees, NBA player transactions, and tax laws. The benchmark includes 816 test problems that require LLMs to read long natural-language rule documents, pick out the applicable rules, and carry out the calculations those rules demand. The authors also analyze where LLMs struggle, such as confusing similar but distinct rules or making arithmetic errors even after identifying the right rule.
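
To make the rule-plus-arithmetic flavor of these problems concrete, here is a minimal sketch of the kind of check a baggage-fee problem might involve. The thresholds, fee amounts, and function names below are hypothetical illustrations, not the actual rules or data format used in RuleArena.

```python
# Hypothetical illustration of rule-guided fee computation.
# The thresholds and fees below are invented for this sketch and do not
# come from RuleArena or any real airline policy.

from dataclasses import dataclass


@dataclass
class Bag:
    weight_kg: float
    is_checked: bool


def baggage_fee(bags: list[Bag], free_checked_allowance: int = 1) -> float:
    """Apply a small set of made-up rules:
    - the first `free_checked_allowance` checked bags are free,
    - each additional checked bag costs 50,
    - any checked bag over 23 kg incurs a 75 overweight surcharge.
    """
    fee = 0.0
    checked_seen = 0
    for bag in bags:
        if not bag.is_checked:
            continue  # carry-on bags are free under these made-up rules
        checked_seen += 1
        if checked_seen > free_checked_allowance:
            fee += 50.0  # extra-bag fee
        if bag.weight_kg > 23:
            fee += 75.0  # overweight surcharge
    return fee


# Example: two checked bags, one overweight -> 50 (extra bag) + 75 (overweight) = 125
print(baggage_fee([Bag(20, True), Bag(25, True)]))
```

In RuleArena, an LLM has to do this kind of rule selection and arithmetic directly from long natural-language policy text rather than from structured code, which is exactly where the paper finds models most often go wrong.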

Why it matters?

This research is significant because it provides a way to evaluate LLMs in realistic contexts, helping developers understand their strengths and weaknesses. By identifying the challenges LLMs face in rule-guided reasoning, this benchmark can lead to improvements in AI systems that need to operate effectively in real-world applications.

Abstract

This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly in the benchmark. These results highlight significant challenges in advancing LLMs' rule-guided reasoning capabilities in real-life applications.