Representation Shift: Unifying Token Compression with FlashAttention
Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, Hyunwoo J. Kim
2025-08-06
Summary
This paper introduces Representation Shift, a new way to decide which parts of the data, called tokens, are important to keep when processing images or videos. It helps models run faster without extra training and without relying on attention maps.
What's the problem?
As AI models grow and handle more tokens, the computation they need grows rapidly, making inference slow and memory-heavy. Token compression can help, but most compression methods rely on attention maps to score token importance, and fast kernels like FlashAttention never materialize those maps, so the usual scoring tools are unavailable.
What's the solution?
Representation Shift measures how much each token's representation changes as it passes through the model, with no attention maps and no retraining required. The system can then prune the tokens that change the least, since they carry little new information, while remaining fully compatible with FlashAttention. This boosts speed and efficiency in tasks like video-text retrieval and video question answering.
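The idea can be sketched in a few lines. This is a minimal, hypothetical illustration based only on the summary's description: we assume the shift is scored as the L2 distance between a token's representation before and after a transformer block, and that the lowest-scoring tokens are dropped. The function names and the exact metric are assumptions, not the paper's actual implementation.

```python
import numpy as np

def representation_shift(x_in, x_out):
    # Per-token score: L2 distance between a token's representation
    # before (x_in) and after (x_out) a block.
    # (Assumed definition, sketched from the summary's description.)
    return np.linalg.norm(x_out - x_in, axis=-1)

def prune_tokens(x_in, x_out, keep_ratio=0.5):
    # Keep the tokens whose representations shifted the most;
    # tokens that barely change are treated as redundant.
    scores = representation_shift(x_in, x_out)
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]          # indices of top-k shifts
    return x_out[np.sort(keep)]             # preserve token order

# Toy example: 8 tokens with hidden dimension 4.
rng = np.random.default_rng(0)
x_in = rng.standard_normal((8, 4))
x_out = x_in + rng.standard_normal((8, 4)) * np.linspace(0, 1, 8)[:, None]
kept = prune_tokens(x_in, x_out, keep_ratio=0.5)
print(kept.shape)  # (4, 4): half the tokens survive
```

Because the score depends only on hidden states, it works unchanged whether attention is computed naively or with a fused kernel like FlashAttention.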
Why it matters?
This matters because it makes AI models significantly faster and more memory-efficient while preserving accuracy, which helps build practical systems for understanding and generating visual content like images and videos.
Abstract
Representation Shift is a training-free, model-agnostic metric that integrates token compression with FlashAttention, enabling significant speedups in video-text retrieval and video QA.