Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models
Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen
2024-07-13

Summary
This paper introduces Live2Diff, a video diffusion model designed for translating live streaming video using uni-directional temporal attention. This mechanism lets the model attend only to the current frame and past frames rather than to both past and future frames, which is essential for real-time video processing.
What's the problem?
While large language models have been very successful at generating streaming data such as text and audio, streaming video generation remains much less explored. Current video diffusion models use bi-directional temporal attention, meaning each frame attends to both past and future frames. This is unsuitable for live streaming, because it requires knowledge of future frames that are not yet available in a real-time scenario.
What's the solution?
Live2Diff solves this issue with a uni-directional temporal attention mechanism: each frame attends only to itself, its predecessors, and a few initial warmup frames, never to future frames. This keeps the translated video temporally consistent and smooth. In addition, the model uses an efficient denoising scheme built around a KV-cache and pipelining, so that live video can be translated at interactive frame rates; a sketch of the attention masking idea follows below.
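
As an illustration only (this is not the authors' code), the masking idea can be sketched as follows: each frame attends to itself, to a window of recent past frames, and to a fixed number of initial warmup frames. The window size, warmup count, and function names here are assumptions made for the sketch.

```python
# Minimal sketch of a uni-directional temporal attention mask with warmup frames.
# Frame counts, window size, and names are illustrative assumptions.
import torch


def unidirectional_mask(num_frames: int, num_warmup: int, window: int) -> torch.Tensor:
    """Boolean mask of shape (num_frames, num_frames).

    mask[i, j] is True if frame i may attend to frame j, i.e. if
    - j is the current frame or one of the `window` most recent past frames, or
    - j is one of the first `num_warmup` warmup frames.
    Future frames (j > i outside the warmup set) are never attended to.
    """
    idx = torch.arange(num_frames)
    rel = idx.unsqueeze(1) - idx.unsqueeze(0)          # rel[i, j] = i - j
    causal_window = (rel >= 0) & (rel < window)        # self + recent past only
    warmup = (idx.unsqueeze(0) < num_warmup).expand(num_frames, -1)
    return causal_window | warmup


def masked_temporal_attention(q, k, v, mask):
    """Scaled dot-product attention over the frame axis with the mask applied."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    frames, dim = 8, 64
    q = k = v = torch.randn(frames, dim)
    mask = unidirectional_mask(frames, num_warmup=2, window=3)
    out = masked_temporal_attention(q, k, v, mask)
    print(out.shape)  # torch.Size([8, 64])
```

Because the mask never marks a future frame as visible, each output frame depends only on information already seen in the stream, which is what makes live translation possible.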
Why it matters?
This research is significant because it addresses the growing need for effective live video processing and translation. By improving how models handle real-time video, Live2Diff can enhance applications in areas like live broadcasting, online gaming, and virtual meetings, making them more efficient and user-friendly.
Abstract
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio, thanks to their temporally uni-directional attention mechanism, which models correlations between the current token and previous tokens. However, video streaming remains much less explored, despite a growing need for live video processing. State-of-the-art video diffusion models leverage bi-directional temporal attention to model the correlations between the current frame and all the surrounding (i.e. including future) frames, which hinders them from processing streaming videos. To address this problem, we present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation. Compared to previous works, our approach ensures temporal consistency and smoothness by correlating the current frame with its predecessors and a few initial warmup frames, without any future frames. Additionally, we use a highly efficient denoising scheme featuring a KV-cache mechanism and pipelining, to facilitate streaming video translation at interactive framerates. Extensive experiments demonstrate the effectiveness of the proposed attention mechanism and pipeline, outperforming previous methods in terms of temporal smoothness and/or efficiency.
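
To illustrate the KV-cache idea mentioned in the abstract, below is a minimal, hedged sketch of streaming temporal attention with cached keys and values: the keys and values of already-processed frames are stored and reused, so each incoming frame only computes its own query against the cache instead of re-encoding the whole history. The class name, cache size, and eviction rule are assumptions for this sketch, not the paper's actual implementation.

```python
# Illustrative KV-cache for streaming temporal attention (not the paper's code).
import torch


class StreamingKVCache:
    def __init__(self, max_frames: int):
        self.max_frames = max_frames           # cap on cached past frames (assumed)
        self.keys: list[torch.Tensor] = []
        self.values: list[torch.Tensor] = []

    def step(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """Process one new frame: cache its K/V, then attend over the cached history."""
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.max_frames:   # evict the oldest non-warmup frame;
            self.keys.pop(1)                   # index 0 is kept as a warmup frame
            self.values.pop(1)
        K = torch.stack(self.keys)             # (num_cached, dim)
        V = torch.stack(self.values)
        scores = (q @ K.T) / q.shape[-1] ** 0.5
        return torch.softmax(scores, dim=-1) @ V


if __name__ == "__main__":
    dim = 64
    cache = StreamingKVCache(max_frames=4)
    for _ in range(10):                        # simulate a live stream of frames
        feat = torch.randn(dim)
        out = cache.step(feat, feat, feat)
    print(out.shape)  # torch.Size([64])
```

The point of the cache is that per-frame cost stays roughly constant as the stream grows, which is what allows interactive frame rates; the pipelined denoising described in the abstract is a separate optimization not shown here.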