Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin

2025-01-15

Summary

This paper introduces Tarsier2, a new AI system that's really good at understanding videos. It can describe what's happening in a video in detail and with high accuracy, and it's also strong at answering questions about videos.

What's the problem?

Current AI systems struggle to fully understand videos. They might miss important details or fail to grasp the overall meaning of what's happening. This makes it hard for AI to do things like describe videos accurately or answer complex questions about them.

What's the solution?

The researchers created Tarsier2, which is like a super-smart video-watching AI. They made it better in three main ways. First, they showed it far more videos and descriptions during training, scaling up from about 11 million video-text pairs to 40 million. Second, they taught it to pay closer attention to how things change over time in videos, using fine-grained temporal alignment during fine-tuning. Third, they had the model generate its own candidate descriptions and then trained it to favor the better ones (a technique called DPO), so it learns what kinds of video descriptions people prefer; a rough sketch of that step appears below. They tested Tarsier2 against other top AI systems and found it did better on a wide range of video-related tasks.
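
To make the third step more concrete, here is a minimal, hypothetical sketch of a DPO-style preference loss in PyTorch. It assumes preference pairs have already been built by sampling candidate descriptions for each clip from the model itself and marking one as preferred and one as rejected; the function name, variable names, and the beta value are illustrative assumptions, not Tarsier2's actual code.

```python
# Minimal sketch of a DPO preference-optimization step (illustrative only,
# not Tarsier2's actual implementation).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to rank the preferred
    description above the rejected one, relative to a frozen reference model."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen      # log pi/ref for preferred
    rejected_ratio = policy_logp_rejected - ref_logp_rejected  # log pi/ref for rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up summed log-probabilities for a batch of 4 clips.
policy_chosen = torch.tensor([-42.0, -37.5, -51.2, -40.3])
policy_rejected = torch.tensor([-45.1, -39.0, -50.8, -44.7])
ref_chosen = torch.tensor([-43.0, -38.0, -50.0, -41.0])
ref_rejected = torch.tensor([-44.0, -38.5, -50.5, -43.0])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
print(loss.item())  # scalar loss to backpropagate through the policy model
```

In practice, the log-probabilities would come from the model being trained and a frozen reference copy of it, and the loss would be backpropagated only through the trained model.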

Why it matters?

This matters because it could change how we interact with videos. Imagine having an AI that can watch a video and tell you exactly what's happening, answer any questions you have about it, or even help people who can't see understand what's in a video. It could be used for things like making video content more accessible, helping with video search engines, or even in security systems that need to understand what's happening in surveillance footage. Plus, since Tarsier2 is so good at understanding videos in general, it could lead to new ways of using AI with videos that we haven't even thought of yet.

Abstract

We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.