Temporal Reasoning Transfer from Text to Video
Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, Qi Liu
2024-10-10

Summary
This paper introduces Textual Temporal reasoning Transfer (T3), a method that improves how video large language models (Video LLMs) reason about time in videos by training them on purely text-based temporal tasks.
What's the problem?
Video LLMs struggle to track and reason about time-related changes in videos, which is crucial for tasks like following actions as they unfold. Previous research blamed this on how visual inputs are temporally encoded, but the authors' diagnostic study finds that the video representations already contain enough temporal information for even small probing classifiers to recover it perfectly; the real bottleneck is the underlying LLM, which struggles with temporal concepts even in purely textual question answering. A sketch of such a probing check follows.
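To make the diagnostic concrete, here is a minimal sketch of what a probing check like this could look like. The feature extraction, labels, and data are placeholder assumptions for illustration, not the authors' actual probing setup.

```python
# Minimal probing sketch: fit a small linear classifier on frozen video
# representations to test whether they already encode temporal order.
# Features and labels below are synthetic stand-ins (assumptions), not the
# paper's actual data or probing protocol.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_temporal_order(features, labels):
    """features: (N, D) pooled representations of frame pairs or short clips;
    labels: (N,) binary flags, e.g. 1 if the pair appears in correct order."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)  # small linear probe
    probe.fit(X_train, y_train)
    # High held-out accuracy would indicate the order information is present
    # in the representations themselves.
    return probe.score(X_test, y_test)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 64))   # placeholder features
    labs = rng.integers(0, 2, size=200)  # placeholder order labels
    print("probe accuracy:", probe_temporal_order(feats, labs))
```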
What's the solution?
To address this, the researchers developed T3, which synthesizes diverse temporal reasoning tasks in pure text form from existing image-text datasets, so the model can learn about time without relying on scarce video data covering complex temporal scenarios. By training on these text-only temporal tasks, T3 improves the temporal understanding of the LongVA-7B model and yields better performance on video benchmarks without using any video data during training; a sketch of how such text tasks could be generated follows.
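As a rough illustration of the idea, the sketch below builds a text-only temporal-order question from a short list of ordered captions. The function name, prompt wording, and captions are assumptions for illustration and do not reproduce the paper's released data pipeline.

```python
# Illustrative sketch: synthesize a text-only temporal-order QA pair from
# ordered captions (e.g., captions of consecutive frames). Names and prompt
# format are hypothetical, not the paper's exact task templates.
import random

def make_order_question(captions):
    """Shuffle the captions, present them as labeled events, and ask which
    event happened first; the answer is derived from the original order."""
    indices = list(range(len(captions)))
    random.shuffle(indices)
    shuffled = [captions[i] for i in indices]

    narration = " ".join(
        f"Event {chr(65 + pos)}: {cap}" for pos, cap in enumerate(shuffled)
    )
    first_pos = indices.index(0)  # where the original first caption landed
    answer = f"Event {chr(65 + first_pos)}"

    question = (
        f"{narration}\n"
        "Which event happened first? Answer with the event letter."
    )
    return {"question": question, "answer": answer}

if __name__ == "__main__":
    captions = [
        "A man picks up an empty cup.",
        "He fills the cup with coffee.",
        "He takes a sip.",
    ]
    print(make_order_question(captions))
```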
Why it matters?
This research is important because it shows that LLMs can enhance their ability to understand time-related concepts by learning from text. This approach not only improves their performance on video tasks but also highlights the connection between text and video reasoning. By advancing how AI models handle temporal reasoning, this work can lead to better applications in areas like video analysis, content creation, and interactive media.
Abstract
Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning capability stems from the underlying LLM's inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, without using any video data, T3 enhances LongVA-7B's temporal understanding, yielding a 5.3 absolute accuracy improvement on the challenging TempCompass benchmark, which enables our model to outperform ShareGPT4Video-8B trained on 28,000 video samples. Additionally, the enhanced LongVA-7B model achieves competitive performance on comprehensive video benchmarks. For example, it achieves a 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between textual and video temporal task performance, validating the efficacy of transferring temporal reasoning abilities from text to video domains.