Building a Precise Video Language with Human-AI Oversight
Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan
2026-04-27
Summary
This paper focuses on improving how well AI models understand and describe videos through language, ultimately aiming for more accurate and detailed video captioning and generation.
What's the problem?
Video-language models currently struggle to produce truly precise and comprehensive captions for videos. Existing captions often omit important details such as what the subjects are doing, how they move through the scene, and how the camera itself moves. Writing such detailed captions manually is time-consuming and expensive, while relying solely on AI often yields inaccurate or incomplete descriptions.
What's the solution?
The researchers developed a new system called CHAI (Critique-based Human-AI Oversight) that combines the strengths of humans and AI. First, they created a detailed 'vocabulary' of visual elements, such as shot types, camera movements, and objects, with input from professional filmmakers. An AI model then generates a first-draft caption, and trained human experts critique and revise it, correcting errors and adding missing detail. Beyond producing better captions, this process yields rich supervision: the critiques, together with the preference between the draft and the revised caption, are used to train the model further via supervised fine-tuning (SFT) and direct preference optimization (DPO). Applied to the open-source Qwen3-VL model, this approach makes it outperform closed-source models such as Gemini-3.1-Pro.
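As a rough illustration, one annotation round could be packaged into SFT and DPO training examples along these lines; the class and field names below are hypothetical sketches, not structures from the paper:

```python
from dataclasses import dataclass

@dataclass
class CaptionRecord:
    """One CHAI-style annotation round: an AI draft, an expert critique,
    and the expert's revision. (Hypothetical structure for illustration.)"""
    video_id: str
    pre_caption: str   # model-generated first draft
    critique: str      # expert's error report: missed details, wrong claims
    post_caption: str  # expert-revised caption

def to_dpo_pair(record: CaptionRecord, prompt: str) -> dict:
    """Package one round as a DPO preference example: the revised
    caption is preferred over the unrevised draft."""
    return {
        "prompt": f"{prompt}\n[video: {record.video_id}]",
        "chosen": record.post_caption,   # human-verified caption wins
        "rejected": record.pre_caption,  # raw AI draft loses
    }

def to_sft_example(record: CaptionRecord, prompt: str) -> dict:
    """Package the same round as an SFT example, keeping the critique
    as a possible target for training a critic model."""
    return {
        "prompt": f"{prompt}\n[video: {record.video_id}]",
        "completion": record.post_caption,
        "critique_target": record.critique,
    }
```

The point of the sketch is that a single expert pass yields both training signals at once: a preference pair for DPO and a verified target for SFT.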
Why it matters?
This work matters because it demonstrates a practical route to professional-level video understanding and generation. By combining human expertise with AI, the authors built a system that generates highly detailed captions, which can in turn be used to steer video generation models, enabling much finer control over elements like camera angles and movements. This has implications for filmmaking, video editing, and the creation of realistic virtual environments.
Abstract
Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial relationships, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to focus on verification. Additionally, these critiques and the preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality (precision, recall, and constructiveness), which our oversight framework ensures, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
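To make the structured specification concrete, the sketch below shows what one structured caption record might look like; the field names and controlled-vocabulary values are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical structured caption record, illustrating the kind of
# specification the abstract describes; field names and the vocabulary
# values are invented for this sketch, not taken from the paper.
structured_caption = {
    "subjects": [
        {"entity": "woman in a red coat", "action": "walks toward the camera"},
    ],
    "scene": {"setting": "rain-soaked city street", "time_of_day": "night"},
    "motion": {"direction": "left to right", "speed": "slow"},
    "camera": {
        "shot_type": "medium close-up",  # each value drawn from a fixed vocabulary
        "movement": "dolly-in",
        "angle": "eye-level",
        "lens": "35mm",
        "focus": "shallow depth of field",
        "point_of_view": "third person",
        "framing": "centered",
    },
}
```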