Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi

2024-06-14

Summary

This paper introduces 'Test of Time,' a new benchmark designed to evaluate how well large language models (LLMs) understand and reason about time-related information. Its goal is to make the assessment of LLMs' temporal reasoning abilities more reliable and systematic.

What's the problem?

While LLMs are good at reasoning in general, they often make mistakes with complex temporal concepts, such as the order of events or how long something takes. Previous studies have tested LLMs on real-world data, which the models may have already seen during training, or on anonymized data, which can introduce factual inconsistencies. Both issues make it hard to measure the models' true temporal reasoning capabilities.

What's the solution?

To tackle these issues, the authors created synthetic datasets specifically designed for testing LLMs on temporal reasoning. Because the questions are generated rather than drawn from real-world sources, the correct answers are known by construction and cannot have leaked from training data. The datasets cover a variety of question types, letting researchers systematically study how factors such as problem structure, size, question type, and fact order affect LLM performance. The authors are also open-sourcing the datasets and evaluation tools to encourage further research in this area.
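To make the idea of synthetic temporal questions concrete, here is a minimal sketch, not the authors' actual generation pipeline, of how one might build a temporal-ordering question from randomly generated event facts so that the gold answer is known by construction:

```python
import random

# Minimal illustration (not the paper's generation code): create a few fake
# events with random start years, shuffle the facts, and ask an ordering
# question whose correct answer is determined by the generated years.
events = [f"E{i}" for i in range(1, 5)]
start_years = random.sample(range(1950, 2020), k=len(events))

facts = [f"Event {e} started in {y}." for e, y in zip(events, start_years)]
random.shuffle(facts)  # fact order can be varied to probe its effect on the model

a, b = random.sample(events, k=2)
question = f"Did event {a} start before event {b}?"
answer = "Yes" if start_years[events.index(a)] < start_years[events.index(b)] else "No"

prompt = "\n".join(facts) + "\n" + question
print(prompt)
print("Gold answer:", answer)
```

Because every fact is synthetic, a correct model answer must come from reasoning over the stated facts rather than from memorized real-world knowledge.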

Why it matters?

This research is important because it provides a clearer way to evaluate how well AI models can understand and reason about time, which is crucial for many real-world applications like scheduling, planning, and understanding narratives. By improving the benchmarks used for testing, this work can help advance the development of more capable and reliable language models.

Abstract

Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster further research in this area, we are open-sourcing the datasets and evaluation framework used in our experiments: https://huggingface.co/datasets/baharef/ToT.
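Since the datasets are released on the Hugging Face Hub, they can presumably be loaded with the `datasets` library. This is a hypothetical usage sketch; the exact configuration and split names are not specified here, so check the dataset card at https://huggingface.co/datasets/baharef/ToT for the correct arguments.

```python
from datasets import load_dataset

# Hypothetical loading sketch: the repository may expose multiple
# configurations, in which case a config name must be passed, e.g.
# load_dataset("baharef/ToT", "<config>"). Consult the dataset card.
tot = load_dataset("baharef/ToT")
print(tot)
```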