NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu
2026-02-13
Summary
This paper introduces NarraScore, a new system for automatically creating music soundtracks for long videos like movies or documentaries.
What's the problem?
Making good soundtracks for long videos is still really hard for computers. It's difficult to create music that spans the entire length of a video, keeps a consistent style, and, most importantly, understands what's happening on screen and matches the *feeling* of the story as it unfolds. Existing methods struggle to stay efficient, hold a consistent mood, and truly 'get' the narrative.
What's the solution?
NarraScore solves this by treating emotion as a compact way to understand the story. It uses existing, frozen image-and-text understanding models (called Vision-Language Models) to 'watch' the video and track how the emotional feeling changes over time. It then applies a technique called 'Dual-Branch Injection' to generate the music: one branch keeps the overall style consistent, while the other adjusts the music to the specific emotional moments in the video. The design is efficient and avoids needing huge amounts of training data.
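As a rough illustration of the 'affective sensing' step, the sketch below scores sampled video frames against valence and arousal text anchors using an off-the-shelf frozen CLIP model. This is a stand-in rather than NarraScore's actual pipeline; the model choice, anchor prompts, and frame-sampling rate are all assumptions made for the example.

```python
# Minimal sketch (not the paper's exact pipeline): approximate a Valence-Arousal
# trajectory by scoring sampled frames against text anchors with a frozen CLIP model.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text anchors spanning the two affective axes (illustrative wording, an assumption).
ANCHORS = {
    "valence": ["a joyful, pleasant scene", "a sad, unpleasant scene"],
    "arousal": ["a tense, high-energy scene", "a calm, low-energy scene"],
}

@torch.no_grad()
def affect_score(frame, prompts):
    """Return a score in [-1, 1]: probability of the positive anchor minus the negative."""
    inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return (probs[0] - probs[1]).item()

def va_trajectory(frames):
    """frames: list of PIL images sampled at a fixed rate (e.g. one per second)."""
    return [(affect_score(f, ANCHORS["valence"]),
             affect_score(f, ANCHORS["arousal"])) for f in frames]
```

Sampling at, say, one frame per second would yield a dense valence-arousal curve that can then be smoothed and resampled to whatever rate the music generator works at.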
Why it matters?
This work matters because it enables fully automatic soundtrack generation for long videos, something that hasn't been done well before. Filmmakers and video creators could have custom music created for their projects without a composer, and the system does this in a computationally efficient way while producing high-quality, story-aligned music.
Abstract
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
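To make the Dual-Branch Injection concrete, here is a minimal PyTorch sketch of the two branches as the abstract describes them: a Global Semantic Anchor added once per clip for stylistic stability, and a Token-Level Affective Adapter that injects the Valence-Arousal trajectory as a direct element-wise residual. The layer sizes, the small MLP adapter, and how the music backbone consumes these hidden states are assumptions, not details taken from the paper.

```python
# Minimal PyTorch sketch of the dual-branch idea from the abstract.
# Only the element-wise residual injection mirrors the text; everything else
# (dimensions, MLP adapter, where the backbone sits) is assumed for illustration.
import torch
import torch.nn as nn

class AffectiveAdapter(nn.Module):
    """Maps a per-token (valence, arousal) pair to a residual added element-wise
    to the music-token hidden states (the 'Token-Level Affective Adapter')."""
    def __init__(self, hidden_dim: int, va_dim: int = 2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(va_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, hidden, va):  # hidden: (B, T, D), va: (B, T, 2)
        return hidden + self.proj(va)  # direct element-wise residual injection

class DualBranchConditioner(nn.Module):
    """Global branch: one style/semantic embedding per clip keeps the track coherent.
    Local branch: the affective adapter modulates tension token by token."""
    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.global_anchor = nn.Linear(style_dim, hidden_dim)
        self.local_adapter = AffectiveAdapter(hidden_dim)

    def forward(self, hidden, style_emb, va_traj):
        # hidden: (B, T, D) music-token states; style_emb: (B, style_dim);
        # va_traj: (B, T, 2) Valence-Arousal trajectory resampled to the token rate.
        hidden = hidden + self.global_anchor(style_emb).unsqueeze(1)
        return self.local_adapter(hidden, va_traj)
```

Because the local injection is a per-token addition rather than cross-attention over the full visual stream, it adds almost no compute and no new attention layers, which is how the abstract frames the efficiency and overfitting argument.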