InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang

2025-12-11

Summary

This paper introduces InfiniteVL, a new type of Vision-Language Model (VLM) designed to be both powerful and efficient, especially when dealing with long sequences of information like in videos or long documents.

What's the problem?

Existing VLMs struggle with long inputs. Window attention can only attend to a limited span of the input at a time, so performance degrades once the sequence grows beyond that window. Linear attention is faster but underperforms on information-dense tasks, such as recognizing text in images (OCR) or understanding complex documents, where fine-grained detail must be retained.

What's the solution?

The researchers combined two techniques: 'sliding window attention', which focuses on local parts of the input, and 'Gated DeltaNet', a linear-attention variant that maintains a compact memory so the model can retain important information over long distances. They also developed a three-stage training process: first distillation pretraining on a large dataset, then instruction tuning, and finally supervised fine-tuning on long sequences. This allows InfiniteVL to process long inputs effectively and efficiently.
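To make the two ingredients concrete, here is a toy sketch of each: a causal attention restricted to a sliding window, and a simplified (ungated) delta-rule memory that updates a fixed-size state at every step, so memory cost stays constant with sequence length. This is an illustrative assumption, not the paper's implementation; Gated DeltaNet adds gating and other refinements that are omitted here.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Causal attention where token t attends only to the last `window` tokens."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        s = max(0, t - window + 1)                      # start of the local window
        scores = q[t] @ k[s:t + 1].T / np.sqrt(d)       # scaled dot-product scores
        w = np.exp(scores - scores.max())
        w /= w.sum()                                    # softmax over the window
        out[t] = w @ v[s:t + 1]
    return out

def delta_rule_memory(q, k, v, beta):
    """Simplified delta-rule recurrence: S <- S + beta * (v_t - S k_t) k_t^T.
    The state S has fixed size (d x d), independent of sequence length."""
    T, d = q.shape
    S = np.zeros((v.shape[1], d))
    out = np.zeros_like(v)
    for t in range(T):
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-6)       # normalize the key
        S = S + beta * np.outer(v[t] - S @ kt, kt)      # correct memory toward v_t
        out[t] = S @ q[t]                               # read out with the query
    return out
```

In a hybrid stack, some layers would use the windowed attention (sharp local detail) and others the recurrent memory (unbounded context at constant state size), which is the intuition behind combining the two.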

Why it matters?

InfiniteVL is significant because it achieves performance comparable to leading VLMs with far less computing power and memory. It is much faster than existing methods, especially on continuous data like video, and maintains constant latency and memory use regardless of input length. This makes it practical for resource-constrained settings such as phones or real-time video analysis.

Abstract

Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.