
CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu

2026-02-16


Summary

This paper introduces a new way to help AI understand videos more efficiently, focusing on how to use the information in a video without fully processing every single frame as an image.

What's the problem?

Current AI systems that try to understand videos struggle with two main issues. First, they typically look at only a few keyframes to save time and computing power, which means they can miss important events or small details that happen between those frames. Second, even for the frames they do look at, turning each full image into tokens takes a lot of computational resources, making the systems slow and expensive to run.

What's the solution?

The researchers found a way to reuse information that is already built into video compression technology, specifically 'motion vectors' and 'residuals'. These pieces of data describe how things move in a video and what changes between frames, so the full image does not have to be processed each time. They then built lightweight encoders that understand these compression signals and combine them with the information from the keyframes, allowing the AI to understand the video faster and with less computing power. They also developed a pre-training step that teaches these new encoders to produce outputs similar to a regular image encoder, so the whole system can be fine-tuned efficiently.
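To make the idea more concrete, here is a minimal, hypothetical sketch of what such a lightweight encoder could look like in PyTorch. The class name, grid size, and layer dimensions are assumptions for illustration, not the paper's actual design: the sketch simply projects per-cell motion vectors and residual patches into tokens, runs them through a small transformer, and maps them into the embedding space the language model would normally receive from the image encoder.

```python
# Illustrative sketch only; not the paper's actual architecture.
# Assumes motion vectors arrive as one 2-D vector per spatial cell and
# residuals as flattened image-like patches; all sizes are made up here.
import torch
import torch.nn as nn


class CodecPrimitiveEncoder(nn.Module):
    """Lightweight transformer that turns codec primitives into LM-ready tokens."""

    def __init__(self, mv_grid=(14, 14), residual_patch_dim=3 * 16 * 16,
                 d_model=256, n_layers=2, n_heads=4, lm_dim=1024):
        super().__init__()
        num_tokens = mv_grid[0] * mv_grid[1]
        # Project 2-D motion vectors and flattened residual patches into d_model.
        self.mv_proj = nn.Linear(2, d_model)
        self.res_proj = nn.Linear(residual_patch_dim, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, num_tokens, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Map into the same embedding space the image encoder feeds to the LM.
        self.to_lm = nn.Linear(d_model, lm_dim)

    def forward(self, motion_vectors, residual_patches):
        # motion_vectors:   (B, N, 2)  one 2-D vector per spatial cell
        # residual_patches: (B, N, P)  flattened residual patch per cell
        x = self.mv_proj(motion_vectors) + self.res_proj(residual_patches)
        x = self.encoder(x + self.pos_emb)
        return self.to_lm(x)  # (B, N, lm_dim) tokens for the VideoLM


# Toy usage: a 14x14 grid of codec primitives for a batch of 2 inter-frames.
enc = CodecPrimitiveEncoder()
mv = torch.randn(2, 14 * 14, 2)
res = torch.randn(2, 14 * 14, 3 * 16 * 16)
tokens = enc(mv, res)
print(tokens.shape)  # torch.Size([2, 196, 1024])
```

The point of the sketch is the cost asymmetry: a two-layer encoder over a small grid of motion vectors and residuals is far cheaper than running a full vision transformer on every frame, which is where the reported savings in tokens and time-to-first-token come from.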

Why it matters?

This research is important because it makes video understanding AI much more practical. By significantly reducing the time and resources needed to process videos, it opens the door to more advanced AI applications that can analyze and understand video content in real-time, like answering questions about videos, understanding complex actions, and generally making AI 'watch' and comprehend videos more like humans do.

Abstract

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods use keyframe sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
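The abstract mentions a pre-training strategy that aligns the codec-primitive representations with image encoder embeddings. As a rough, hypothetical illustration of what such an alignment objective could look like (the paper's actual loss may differ), one can push the codec tokens toward the image-encoder tokens of the same frame with a simple cosine-similarity loss:

```python
# Hypothetical alignment pre-training objective (illustrative only): encourage
# codec-primitive tokens to match the image-encoder tokens of the same frame.
import torch
import torch.nn.functional as F


def alignment_loss(codec_tokens, image_tokens):
    # codec_tokens, image_tokens: (B, N, D) token embeddings for the same frames.
    sim = F.cosine_similarity(codec_tokens, image_tokens, dim=-1)  # (B, N)
    return (1.0 - sim).mean()


# Toy usage with random stand-in embeddings.
a = torch.randn(2, 196, 1024)
b = torch.randn(2, 196, 1024)
print(alignment_loss(a, b))
```

Pre-training the lightweight encoders against a frozen image encoder in this way gives them a sensible starting point, which is consistent with the abstract's claim that alignment accelerates convergence during end-to-end fine-tuning.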