
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov

2024-08-19


Summary

This paper introduces JPEG-LM, a new way to generate images and videos using large language models (LLMs) by treating them like compressed files rather than raw pixel data.

What's the problem?

Generating images and videos with LLMs is complicated because continuous data (like images) must first be converted into discrete tokens the model can process. Existing methods either model raw pixel values, which makes sequences prohibitively long, or rely on vector quantization, which requires training a separate visual tokenizer beforehand.

What's the solution?

The authors propose a simpler approach: model images and videos as the compressed files already saved on computers, such as JPEG for images and AVC/H.264 for videos. Instead of generating every pixel, the model directly outputs the bytes of the compressed file. They trained JPEG-LM from scratch on JPEG file bytes (and AVC-LM on video bytes as a proof of concept) using a standard Llama architecture with no vision-specific changes, and it outperformed both pixel-based modeling and vector-quantization baselines, reducing FID by 31%.
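To make the idea concrete, here is a minimal illustrative sketch (not the authors' code) of the encoding side: an image is compressed with a canonical JPEG codec and each byte of the resulting file is treated as one discrete token in the range 0-255, the kind of sequence an autoregressive LLM could be trained on. It assumes the Pillow library is available, and "example.png" is a placeholder path.

```python
# Sketch: turn an image into a JPEG byte stream and treat each byte as a token.
# Assumes Pillow is installed; "example.png" is a hypothetical input file.
import io
from PIL import Image

def image_to_jpeg_tokens(path: str, quality: int = 25) -> list[int]:
    """Encode an image with the canonical JPEG codec and return its bytes as integer tokens."""
    img = Image.open(path).convert("RGB")
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)  # standard JPEG encoding
    return list(buf.getvalue())                    # each byte (0..255) becomes one token

tokens = image_to_jpeg_tokens("example.png")
print(f"{len(tokens)} byte tokens; first few: {tokens[:8]}")
```

Because the token vocabulary is just byte values, no separate visual tokenizer has to be trained, which is the simplification the paper highlights.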

Why it matters?

This research is important because it simplifies image and video generation: a standard LLM can be trained directly on compressed file bytes, without specialized visual tokenizers or vision-specific architecture changes. By lowering the barrier between language generation and visual generation, it opens up new possibilities for multi-modal systems that handle text, images, and video together.

Abstract

Recent work in image and video generation has been adopting the autoregressive LLM architecture due to its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training in language generation to visual generation is discretization -- representing continuous data like images and videos as discrete tokens. Common methods of discretizing images and videos include modeling raw pixel values, which are prohibitively lengthy, or vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept), by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (on which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has an especial advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.
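For intuition on the generation side described in the abstract, here is a minimal sketch (an assumption, not the paper's released code) of how a sequence of generated byte tokens would be turned back into an image: the bytes are simply passed through the same canonical JPEG decoder.

```python
# Sketch: decode generated byte tokens back into an image via the JPEG codec.
# Assumes Pillow is installed and `tokens` is a list of integers in 0..255
# produced by a byte-level model (or by the encoder sketched earlier).
import io
from PIL import Image

def jpeg_tokens_to_image(tokens: list[int]) -> Image.Image:
    """Reassemble byte tokens into a JPEG byte stream and decode it."""
    data = bytes(tokens)                  # byte tokens -> raw JPEG file bytes
    return Image.open(io.BytesIO(data))   # standard JPEG decoder does the rest

# Example round trip with the earlier encoder sketch:
# img = jpeg_tokens_to_image(tokens)
# img.save("reconstructed.jpg")
```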