
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu

2026-02-04


Summary

This paper investigates whether feeding large language models (LLMs) images of code, rather than the raw code text, can help them understand programs more efficiently.

What's the problem?

LLMs are getting really good at understanding code, but as programs grow, processing them takes a lot more computing power. Current models treat code as a long sequence of text tokens, so the longer the code, the longer the context and the higher the cost. This becomes a major bottleneck for large codebases.

What's the solution?

The researchers explored Multimodal LLMs (MLLMs), which can understand both text *and* images. They converted code into images (essentially screenshots of the code) and had the MLLM analyze those images instead of the raw text. By lowering the resolution of these images, they could significantly shrink the amount of data the model needs to process without losing too much information about the code itself, and visual cues such as syntax highlighting even helped the model cope with compression.
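
To make the rendering-and-compression idea concrete, here is a minimal sketch in Python using Pillow. It is not the paper's actual pipeline: the default bitmap font, the 16x16-pixel patch size, and the 2x downscale factor are illustrative assumptions.

```python
# Minimal sketch: render source code as an image, then downscale it so a
# vision-capable model would see fewer image patches ("visual tokens").
# The font, 16x16 patch size, and 2x downscale are illustrative assumptions.
import math
from PIL import Image, ImageDraw, ImageFont

def render_code(code: str, width: int = 1024, line_height: int = 18) -> Image.Image:
    """Draw the code as plain text on a white canvas (no syntax highlighting)."""
    lines = code.splitlines() or [""]
    img = Image.new("RGB", (width, line_height * (len(lines) + 1)), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real pipeline would use a monospaced font
    draw.multiline_text((8, 8), "\n".join(lines), fill="black", font=font)
    return img

def approx_visual_tokens(img: Image.Image, patch: int = 16) -> int:
    """Rough token count if the encoder splits the image into patch x patch tiles."""
    w, h = img.size
    return math.ceil(w / patch) * math.ceil(h / patch)

code = "def add(a, b):\n    return a + b\n"
full = render_code(code)
half = full.resize((full.width // 2, full.height // 2), Image.LANCZOS)

print("full resolution:", approx_visual_tokens(full), "tokens")
print("2x downscale:   ", approx_visual_tokens(half), "tokens")  # roughly 4x fewer
half.save("code.png")  # this PNG is what would be sent to the MLLM
```

Halving each side of the image roughly quarters the number of patches, which is why resolution is such a direct lever on the model's input cost.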

Why it matters?

This research shows that representing code as images could make LLMs much faster and more efficient at understanding large software projects. It suggests a potential shift in how we feed code to these models: away from plain text and toward visual representations, which could unlock work on even larger and more complex systems.

Abstract

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, pointing to image-modality code representation as a pathway to more efficient inference.
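
As a rough back-of-the-envelope illustration of what the 4x and 8x compression ratios above mean in terms of visual tokens (the 1024x1024 render and the 16-pixel patch size are assumptions, not values from the paper):

```python
# Patch arithmetic for a hypothetical ViT-style encoder with 16x16 patches.
# The 1024x1024 full-resolution render is an illustrative assumption.
import math

def patches(width: int, height: int, patch: int = 16) -> int:
    return math.ceil(width / patch) * math.ceil(height / patch)

base_w = base_h = 1024
base_tokens = patches(base_w, base_h)   # 64 * 64 = 4096 visual tokens

for ratio in (1, 4, 8):                 # target area-compression ratios
    scale = 1 / math.sqrt(ratio)        # shrink each side by sqrt(ratio)
    w, h = round(base_w * scale), round(base_h * scale)
    print(f"{ratio}x: {w}x{h} px -> {patches(w, h)} tokens "
          f"(vs. {base_tokens} at full resolution)")
```

Under these assumptions, an 8x target ratio shrinks a 4096-token render to roughly 529 patches, which is the kind of reduction the paper reports as still usable for code understanding.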