Representation Shift: Unifying Token Compression with FlashAttention
Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
2025-08-06
Summary
This paper introduces Representation Shift, a new way to decide which parts of the data, called tokens, are important to keep when processing images or videos. It helps models run faster without extra training and without relying on attention maps.
What's the problem?
As AI models grow and handle more tokens, the computation they need grows rapidly, making inference slow and memory-heavy. Token compression can help, but most compression methods rely on attention maps to score token importance, and fast kernels like FlashAttention never materialize those maps, so the usual scoring tools are unavailable.
What's the solution?
Representation Shift measures how much each token's representation changes as it passes through the model, with no attention maps and no retraining required. The system can then prune the tokens that change the least, since they carry little new information, while remaining fully compatible with FlashAttention. This boosts speed and efficiency in tasks like video-text retrieval and video question answering.
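The idea can be sketched in a few lines. This is a minimal, hypothetical illustration based only on the summary's description: we assume the shift is scored as the L2 distance between a token's representation before and after a transformer block, and that the lowest-scoring tokens are dropped. The function names and the exact metric are assumptions, not the paper's actual implementation.

```python
import numpy as np

def representation_shift(x_in, x_out):
    # Per-token score: L2 distance between a token's representation
    # before (x_in) and after (x_out) a block.
    # (Assumed definition, sketched from the summary's description.)
    return np.linalg.norm(x_out - x_in, axis=-1)

def prune_tokens(x_in, x_out, keep_ratio=0.5):
    # Keep the tokens whose representations shifted the most;
    # tokens that barely change are treated as redundant.
    scores = representation_shift(x_in, x_out)
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]          # indices of top-k shifts
    return x_out[np.sort(keep)]             # preserve token order

# Toy example: 8 tokens with hidden dimension 4.
rng = np.random.default_rng(0)
x_in = rng.standard_normal((8, 4))
x_out = x_in + rng.standard_normal((8, 4)) * np.linspace(0, 1, 8)[:, None]
kept = prune_tokens(x_in, x_out, keep_ratio=0.5)
print(kept.shape)  # (4, 4): half the tokens survive
```

Because the score depends only on hidden states, it works unchanged whether attention is computed naively or with a fused kernel like FlashAttention.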
Why it matters?
This matters because it makes AI models significantly faster and more memory-efficient while preserving accuracy, which helps build practical systems for understanding and generating visual content like images and videos.
Abstract
Representation Shift is a training-free, model-agnostic metric that integrates token compression with FlashAttention, enabling significant speedups in video-text retrieval and video QA.