ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

Junfu Pu, Teng Wang, Yixiao Ge, Yuying Ge, Chen Li, Ying Shan

2025-11-20

Summary

This paper focuses on automatically dividing long videos, like lectures or documentaries, into chapters. It introduces a new model called ARC-Chapter that's much better at this task than previous methods.

What's the problem?

Currently, it's hard to automatically create good chapters for long videos because most existing methods are trained on small amounts of data with simple chapter labels. These labels often just give a basic title and don't capture the nuances of what's happening throughout the video, making it difficult for the models to generalize to new, complex videos. Essentially, existing systems struggle with the length and detail needed for truly useful chaptering.

What's the solution?

The researchers built a huge dataset of long videos with detailed, multi-level chapter annotations in both English and Chinese. They combined speech transcripts, text appearing on screen, and visual captions to create these annotations, which range from short titles to longer summaries. They then trained ARC-Chapter on this dataset and also designed a new evaluation metric, called GRACE, that is more realistic than older measures: it accounts for the fact that there can be multiple valid ways to divide a video into chapters, crediting predicted chapters that overlap a ground-truth segment even when the boundaries don't line up exactly.
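The summary doesn't give GRACE's exact formula, but its core idea (many-to-one matching between predicted and ground-truth segments, weighted by how semantically similar their titles are) can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's implementation: the `Chapter` structure, the IoU-based matching rule, and the word-overlap similarity standing in for a real semantic model.

```python
from dataclasses import dataclass

@dataclass
class Chapter:
    start: float  # start time in seconds
    end: float    # end time in seconds
    title: str    # chapter title (or summary)

def temporal_iou(a: Chapter, b: Chapter) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = max(a.end, b.end) - min(a.start, b.start)
    return inter / union if union > 0 else 0.0

def title_similarity(a: str, b: str) -> float:
    """Word-overlap (Jaccard) similarity as a cheap stand-in for the
    embedding-based semantic similarity a real metric would use."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def grace_like_score(pred: list[Chapter], gt: list[Chapter]) -> float:
    """Toy GRACE-style score: match each predicted chapter to the
    ground-truth chapter it overlaps most (several predictions may share
    one ground-truth segment, i.e. many-to-one), then weight the temporal
    overlap by the semantic similarity of the titles."""
    if not pred or not gt:
        return 0.0
    total = 0.0
    for p in pred:
        best = max(gt, key=lambda g: temporal_iou(p, g))
        total += temporal_iou(p, best) * title_similarity(p.title, best.title)
    return total / len(pred)

# A fine-grained prediction scored against a coarser ground truth:
pred = [Chapter(0, 280, "Introduction"),
        Chapter(280, 600, "Gradient descent basics"),
        Chapter(600, 900, "Gradient descent variants")]
gt = [Chapter(0, 300, "Course introduction"),
      Chapter(300, 900, "Gradient descent")]
print(f"GRACE-like score: {grace_like_score(pred, gt):.3f}")
```

In this example the two fine-grained "gradient descent" predictions both match the single ground-truth "Gradient descent" chapter, which is exactly the flexibility GRACE is meant to reward: a reasonable over- or under-segmentation still earns credit instead of being zeroed out for missing one canonical boundary.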

Why it matters?

This work is important because it significantly improves the accuracy of automatic video chaptering. ARC-Chapter outperforms previous models by a large margin, and it also works well when applied to other video-related tasks, like automatically generating descriptions of what's happening in a video. Better chaptering makes long videos much easier to navigate and understand, which is useful for students, researchers, and anyone who consumes online video content.

Abstract

The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training with annotations that are typically short and coarse, restricting generalization to nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on over a million long-video chapters, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene text, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, both in data volume and label intensity. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state of the art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. ARC-Chapter also shows excellent transferability, improving the state of the art on downstream tasks like dense video captioning on YouCook2.
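For reference, the F1 score quoted above is, in chaptering work, conventionally computed by matching predicted chapter boundaries to ground-truth boundaries within a time tolerance. The sketch below shows that standard recipe; the 5-second tolerance and the greedy matching rule are illustrative assumptions, not the paper's exact evaluation protocol.

```python
def boundary_f1(pred_times, gt_times, tol=5.0):
    """Boundary F1 with a temporal tolerance: a predicted boundary is a
    hit if it falls within `tol` seconds of a still-unmatched ground-truth
    boundary (greedy one-to-one matching)."""
    if not pred_times or not gt_times:
        return 0.0
    unmatched = sorted(gt_times)
    hits = 0
    for t in sorted(pred_times):
        best = min(unmatched, key=lambda g: abs(g - t), default=None)
        if best is not None and abs(best - t) <= tol:
            hits += 1
            unmatched.remove(best)
    precision, recall = hits / len(pred_times), hits / len(gt_times)
    return 2 * precision * recall / (precision + recall) if hits else 0.0

# Two of three predicted boundaries land within 5 s of a true boundary:
print(boundary_f1([0, 298, 610], [0, 300, 600]))  # -> 0.667
```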