Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
Yanhong Li, Zixuan Lan, Jiawei Zhou
2025-10-23
Summary
This paper explores a new way to feed text into multimodal large language models: rendering the text as an image instead of passing it as tokens. The goal is to reduce the amount of data the model needs to process, potentially making it faster and cheaper to use, without losing accuracy.
What's the problem?
Large language models work by breaking down text into smaller pieces called tokens. Processing many tokens can be expensive and slow, especially with long documents. The problem is finding a way to reduce the number of tokens needed to represent text without making the model perform worse at tasks like understanding information or summarizing documents.
What's the solution?
The researchers found that converting long texts into images and then feeding those images to the language model works surprisingly well. Instead of processing each word as one or more separate tokens, the entire text is rendered as a single image. This dramatically reduces the number of tokens the model needs to handle, often cutting it nearly in half. They tested this approach on finding information in long texts (the RULER benchmark) and summarizing news articles (CNN/DailyMail).
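The intuition behind the savings can be sketched with a back-of-the-envelope estimate: a patch-based vision encoder emits one token per fixed-size image patch, so the image token count depends on the rendered canvas size rather than the text length. The patch size, canvas size, and characters-per-token ratio below are illustrative assumptions, not values taken from the paper.

```python
import math

def text_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Rough heuristic: ~4 characters per subword token for English
    (an assumption; real tokenizers vary by text and vocabulary)."""
    return math.ceil(len(text) / chars_per_token)

def image_token_estimate(width: int, height: int, patch: int = 16) -> int:
    """A ViT-style encoder emits one token per patch; a 16-pixel patch
    size is an assumption, not the paper's actual configuration."""
    return math.ceil(width / patch) * math.ceil(height / patch)

# Example: a ~6,000-character passage rendered on a 448x448 canvas.
passage = "lorem " * 1000           # stand-in for a long input text
as_text = text_token_estimate(passage)       # -> 1500 tokens
as_image = image_token_estimate(448, 448)    # -> 28 * 28 = 784 tokens
print(as_text, as_image)            # image input needs roughly half
```

The key property is that the image token count is fixed by the canvas, so the longer the text you can legibly fit on one rendered page, the larger the compression ratio.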
Why it matters?
This research is important because it offers a new method for compressing text inputs for large language models. Reducing the number of tokens processed can lead to significant cost savings and faster processing times, making these powerful models more accessible and practical for a wider range of applications. It shows a creative way to leverage the image processing capabilities of modern language models.
Abstract
Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering a long text input as a single image and providing it directly to the model. This dramatically reduces the number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.