RFLAV: Rolling Flow matching for infinite Audio Video generation

Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

2025-03-12

Summary

This paper presents RFLAV, an AI model that generates videos of unlimited length with matching sound. It keeps audio and visuals in sync, like a music video where every beat matches the dancer’s moves.

What's the problem?

Making AI-generated videos with sound that stays in sync and looks natural is hard, especially for long videos where mismatched audio or choppy visuals can ruin the experience.

What's the solution?

RFLAV uses a transformer-based design that processes sound and video separately at first, then combines them later with a special module to keep everything aligned, frame by frame, without needing pre-set video lengths.
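To make the idea of frame-by-frame alignment concrete, here is a minimal toy sketch in Python. It is not the paper's fusion module: the names `fuse_frame` and `fuse_stream` are hypothetical, and the averaging rule is an assumption chosen only for illustration. It shows two per-frame feature streams that exchange information one time step at a time, so the total length never has to be known in advance.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List, Tuple

Vector = List[float]


def fuse_frame(video_feat: Vector, audio_feat: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical per-frame fusion: add a shared mean signal to both streams."""
    if len(video_feat) != len(audio_feat):
        raise ValueError("video and audio features must have the same size")
    # Shared signal for this time step (assumed rule: element-wise mean).
    shared = [(v + a) / 2.0 for v, a in zip(video_feat, audio_feat)]
    fused_video = [v + s for v, s in zip(video_feat, shared)]
    fused_audio = [a + s for a, s in zip(audio_feat, shared)]
    return fused_video, fused_audio


def fuse_stream(
    video_frames: Iterable[Vector], audio_frames: Iterable[Vector]
) -> Iterator[Tuple[Vector, Vector]]:
    """Fuse two aligned streams lazily, one frame at a time."""
    for video_feat, audio_feat in zip(video_frames, audio_frames):
        yield fuse_frame(video_feat, audio_feat)


# Usage: the streams can be endless generators; take only as many frames as needed.
endless_video = ([float(t), 0.0] for t in count())
endless_audio = ([0.0, float(t)] for t in count())
first_three = list(islice(fuse_stream(endless_video, endless_audio), 3))
```

Because `fuse_stream` is a generator, each fused frame depends only on the audio and video features of the same time step, which is what lets this toy version run on streams with no fixed length.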

Why does it matter?

This helps create realistic videos of any length for movies, games, or social media while maintaining quality and audio-video sync, making AI-generated content more professional and engaging.

Abstract

Joint audio-video (AV) generation is still a significant challenge in generative AI, primarily due to three critical requirements: quality of the generated samples; seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa; and limitless video duration. In this paper, we present RFLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross-modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that RFLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
