TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang

2025-05-26

Summary

This paper introduces TIME, a new benchmark designed to test how well large language models can understand and reason about time in real-world situations.

What's the problem?

The problem is that language models often struggle to keep track of time: they have trouble following how events unfold, figuring out the order and timing of things, and handling situations that are dense with dates, involve many people, or change quickly.

What's the solution?

The researchers created the TIME benchmark, which challenges language models with a variety of real-world tasks that require strong temporal reasoning, like understanding timelines, event sequences, and social interactions that change over time. They also studied how performance is affected by test-time scaling, that is, giving models more compute or longer reasoning during inference rather than making the models themselves bigger.

Why it matters?

This is important because being able to reason about time is crucial for AI to be useful in the real world, whether it's for planning, understanding news, or helping people make decisions in situations where timing really matters.

Abstract

A benchmark called TIME assesses temporal reasoning in LLMs across varied real-world challenges, including intensive temporal information, fast-changing event dynamics, and complex social interactions, and evaluates the impact of test-time scaling.