Reconstruction Alignment Improves Unified Multimodal Models

Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang

2025-09-10

Summary

This paper introduces a new way to improve unified AI models that handle both images and text, focusing on how well they generate and edit images from text prompts.

What's the problem?

Current AI models that handle both images and text are often trained using image-text pairs where the text descriptions, or captions, aren't detailed enough to capture all the important visual information in the image. Even long captions can miss subtle details, limiting the model's ability to truly 'understand' what it's seeing and recreate it accurately.

What's the solution?

The researchers developed a technique called Reconstruction Alignment, or RecA. Instead of relying on detailed captions, RecA uses the model's *own* understanding of an image, represented as embeddings from its visual understanding encoder, as a dense prompt for reconstructing the original image. Essentially, the model tries to recreate the image from its internal representation, which pulls its understanding and generation abilities into closer alignment. This happens *after* the model has already been trained, making it a relatively quick and inexpensive post-training step.
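To make the idea concrete, here is a minimal sketch of what one RecA-style post-training step could look like, based only on the description above. The module names (`understanding_encoder`, `generator`) and the pixel-space MSE loss are illustrative assumptions, not the authors' actual implementation; the real loss would depend on whether the model is autoregressive, masked-autoregressive, or diffusion-based.

```python
import torch
import torch.nn.functional as F

def reca_step(umm, images, optimizer):
    """One self-supervised reconstruction-alignment update on a batch of images.

    `umm` is assumed to expose an `understanding_encoder` and a conditional
    `generator`; both names are hypothetical placeholders.
    """
    # 1. Embed each image with the model's own visual understanding encoder.
    #    These embeddings play the role of a dense "text prompt" -- no caption needed.
    with torch.no_grad():
        understanding_emb = umm.understanding_encoder(images)

    # 2. Condition the generative pathway on those embeddings and ask it to
    #    reconstruct the original image.
    reconstruction = umm.generator(condition=understanding_emb)

    # 3. Self-supervised reconstruction loss (illustrative choice: pixel MSE).
    loss = F.mse_loss(reconstruction, images)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the supervision comes from the image itself rather than a caption, this loop can run on unlabeled images as a short fine-tuning phase on top of an already-trained model.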

Why it matters?

RecA is important because it's a very efficient way to significantly boost the performance of these combined image-text models. It needs only a small fraction of the compute required to retrain the entire model (about 27 GPU-hours), and it works across different model architectures. The improvements in image generation and editing quality are substantial, even surpassing larger open-source models, making it a practical and broadly applicable way to improve AI's visual capabilities.

Abstract

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.