
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu

2025-06-30

Summary

ShotBench is a benchmark created to test how well vision-language models understand the detailed visual language used in movies. It includes over 3,500 expert-annotated question-answer pairs drawn from images and video clips of more than 200 acclaimed, mostly Oscar-nominated films. The goal is to see whether AI models can grasp core filmmaking elements such as camera angles, lighting, and shot composition.

What's the problem?

Current AI models that combine vision and language are good at general understanding, but they struggle with the specific and subtle visual grammar professionals use in filmmaking. This gap limits their ability to analyze films in detail or to generate cinematically accurate video content.

What's the solution?

ShotBench provides a thorough test covering eight key dimensions of cinematography, such as shot size and camera movement, to measure AI's skills precisely. Alongside the benchmark, a large training dataset called ShotQA, with about 70,000 question-answer pairs, was created. Using this data, the authors trained ShotVL, a model that outperforms previous models at understanding cinematic language.
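To make the benchmark's structure concrete, here is a minimal sketch of what a ShotQA-style multiple-choice record and a per-dimension accuracy evaluation could look like. The field names (media_path, dimension, options, answer) and the model.predict interface are illustrative assumptions, not the paper's actual schema or API.

```python
from dataclasses import dataclass

# Hypothetical layout for one ShotQA-style multiple-choice item.
# Field names are illustrative assumptions, not the paper's schema.
@dataclass
class ShotQAItem:
    media_path: str      # path to the film frame or video clip
    dimension: str       # e.g. "shot size" or "camera movement"
    question: str        # expert-written question about the shot
    options: list[str]   # candidate answers (multiple choice)
    answer: str          # the correct option


def evaluate(model, items: list[ShotQAItem]) -> dict[str, float]:
    """Compute per-dimension accuracy for a vision-language model.

    `model.predict` is a stand-in for whatever inference call the
    evaluated VLM exposes; here it is assumed to return one option.
    """
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for item in items:
        total[item.dimension] = total.get(item.dimension, 0) + 1
        prediction = model.predict(item.media_path, item.question, item.options)
        if prediction == item.answer:
            correct[item.dimension] = correct.get(item.dimension, 0) + 1
    # Accuracy per cinematography dimension, e.g. {"shot size": 0.85, ...}
    return {dim: correct.get(dim, 0) / n for dim, n in total.items()}
```

Scoring accuracy separately for each dimension, as sketched above, shows where a model falls short (for instance, strong on shot size but weak on camera movement) rather than reporting a single aggregate number.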

Why it matters?

This work is important because it pushes AI toward expert-level understanding of film. Stronger skills in this area can improve film analysis, automated editing, and AI-assisted storytelling, and open up new creative and technical tools for the film industry.

Abstract

The ShotBench benchmark and ShotQA dataset, along with the ShotVL model, enhance AI's understanding and generation capabilities by specifically targeting nuanced cinematic language comprehension.