CoS: Chain-of-Shot Prompting for Long Video Understanding

Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong

2025-02-12

Summary

This paper introduces Chain-of-Shot prompting (CoS), a new method that helps AI systems better understand long videos by cleverly selecting the most important parts of the video to analyze.

What's the problem?

AI models that can understand both text and images (called Multi-modal Large Language Models or MLLMs) have trouble with long videos because there's too much information to process all at once. It's hard for these models to figure out which parts of the video are important for answering questions or understanding the content. If they look at too little of the video, they might miss important details, but if they try to look at everything, they get overwhelmed and confused.

What's the solution?

The researchers created CoS, which works in two main steps. First, it quickly scans through the video and marks each part as either relevant or not relevant to the task at hand, kind of like highlighting important sentences in a textbook. Then, it pairs up the relevant parts with some of the irrelevant parts to help the AI model understand the difference between what's important and what's not. This helps the AI focus on the right parts of the video without getting distracted by unnecessary information.
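The two steps above can be sketched in code. This is a minimal illustration, not the authors' implementation: the relevance scores, shot names, and threshold are all hypothetical stand-ins for what the paper's binary video summary and co-reasoning modules would compute.

```python
def binary_video_summary(scores, threshold=0.5):
    """Step 1 (sketch): pseudo temporal grounding -- mark each shot
    as task-relevant (1) or irrelevant (0) via a binary code.
    `scores` are hypothetical relevance scores, one per shot."""
    return [1 if s >= threshold else 0 for s in scores]

def co_reasoning_pairs(shots, code):
    """Step 2 (sketch): pair each task-relevant (positive) shot with
    an irrelevant (negative) shot so the model can contrast them."""
    positives = [shot for shot, c in zip(shots, code) if c == 1]
    negatives = [shot for shot, c in zip(shots, code) if c == 0]
    pairs = []
    for i, pos in enumerate(positives):
        # Cycle through negatives if there are fewer of them.
        neg = negatives[i % len(negatives)] if negatives else None
        pairs.append((pos, neg))
    return pairs

# Toy example: four shots with made-up relevance scores.
shots = ["shot_A", "shot_B", "shot_C", "shot_D"]
scores = [0.9, 0.2, 0.7, 0.1]
code = binary_video_summary(scores)
print(code)                        # [1, 0, 1, 0]
print(co_reasoning_pairs(shots, code))
```

The paired positive/negative shots are then fed back with the original video, which is what lets the model learn to focus on relevant content.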

Why it matters?

This matters because it could make AI much better at understanding long videos, which is important for things like analyzing security footage, studying educational videos, or even helping robots understand their surroundings better. By making AI smarter about which parts of a video to pay attention to, we can create systems that are more efficient and accurate in tasks that involve long video content.

Abstract

Multi-modal Large Language Models (MLLMs) struggle with long videos due to the need for excessive visual tokens. These tokens massively exceed the context length of MLLMs, which becomes filled with redundant task-irrelevant shots. How to select shots is an unsolved critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adaptive to the semantic video understanding task by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimise long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code is available at https://lwpyh.github.io/CoS.