Revisiting Multimodal Positional Encoding in Vision-Language Models
Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, Shuai Bai
2025-11-03
Summary
This paper investigates how to best represent the position of information in vision-language models, which are AI systems that can understand both images and text. It focuses on a specific technique called Rotary Positional Embedding (RoPE) and proposes improvements to it.
What's the problem?
Vision-language models need to understand not just *what* is in an image and text, but also *where* things are located relative to each other. Existing methods for encoding this positional information in a way that works well for both images and text haven't been thoroughly studied. The paper points out that simply applying RoPE to both types of data doesn't necessarily give the best results, and there's a lack of clear guidelines on how to do it effectively.
What's the solution?
The researchers analyzed RoPE and discovered three important principles for combining image and text position information: the positional information should be clear and consistent, all possible frequency ranges should be used to represent positions, and the model should leverage what it already knows about text from pre-training. Based on these principles, they created two new versions of RoPE, called Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I). These new methods are designed to be easily added to existing models without requiring major changes.
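The idea behind these variants can be sketched in code. The snippet below is not the authors' implementation; it is a minimal illustration, assuming an MRoPE-I-style scheme in which a visual token's (t, h, w) position axes are assigned to rotary frequency pairs in a round-robin (interleaved) fashion, so that each axis spans the full frequency range rather than one contiguous band, and a text token with t = h = w equal to its 1-D index reduces exactly to the LLM's original RoPE (preserving the textual prior).

```python
import numpy as np

def rope_angles(pos_thw, head_dim, base=10000.0):
    """Rotation angle per frequency pair for a token at position (t, h, w).

    Interleaved allocation (assumed): pair i is driven by axis i % 3, so
    every axis sees high- and low-frequency components alike.
    """
    n_pairs = head_dim // 2
    freqs = base ** (-np.arange(n_pairs) * 2.0 / head_dim)  # standard RoPE freqs
    axes = np.arange(n_pairs) % 3                            # t, h, w, t, h, w, ...
    positions = np.asarray(pos_thw, dtype=float)[axes]
    return positions * freqs

def apply_rope(x, angles):
    """Rotate consecutive (even, odd) channel pairs of x by the given angles."""
    x1, x2 = x[0::2], x[1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Visual token at grid position (t=5, h=3, w=7); a text token at index k
# would call rope_angles((k, k, k), ...), matching plain 1-D RoPE.
q = np.random.default_rng(0).standard_normal(64)
q_rot = apply_rope(q, rope_angles((5, 3, 7), head_dim=64))
```

Because the rotation is orthogonal, it preserves vector norms, and setting all three coordinates equal recovers the pre-trained 1-D behavior, which is the "textual prior" property the paper's guidelines call for.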
Why it matters?
These improvements to positional encoding are important because they lead to better performance in vision-language models. The new methods consistently outperform previous approaches on various tasks, meaning the AI can better understand both the general content and the specific details within images and text. This is a step towards creating AI systems that can more accurately interpret the world around us.
Abstract
Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, ensuring unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple, plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.