LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

2025-06-15

Summary

This paper introduces LaTtE-Flow, a new type of AI model that combines image understanding and image generation in one system, making it faster and more efficient than previous unified models. It builds on existing powerful vision-language models and adds new architectural designs that speed up image creation while keeping quality high.

What's the problem?

The problem is that existing models that try to do both image understanding and image generation tend to be slow at generating images, or fall short of specialized models that focus on just one of the two tasks. This makes them less practical for real applications.

What's the solution?

The solution is to design LaTtE-Flow around two main innovations. First, it divides the model's transformer layers into groups called Layerwise Timestep Experts: each group is responsible for a specific range of timesteps in the flow-based image-generation process, so only a fraction of the layers does work at any given step, which makes generation faster. Second, it introduces Timestep-Conditioned Residual Attention, a mechanism that lets the model efficiently reuse important information across layers during training and generation. Together, these designs let the model generate images much faster while still understanding images well; two versions of the model combine understanding and generation in slightly different ways.
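To make the Layerwise Timestep Experts idea concrete, here is a minimal, purely illustrative Python sketch (not the paper's actual code; all names and sizes are hypothetical) showing how generation timesteps could be routed to different groups of transformer layers, so that only one group is active per step:

```python
# Illustrative sketch of "Layerwise Timestep Experts" routing.
# Assumption: timesteps t lie in [0, 1) and are split evenly among
# expert groups; the real model's assignment may differ.

def expert_group_for_timestep(t: float, num_groups: int) -> int:
    """Map a flow timestep t in [0, 1) to one expert-group index."""
    assert 0.0 <= t < 1.0
    return min(int(t * num_groups), num_groups - 1)


class LaTtEFlowSketch:
    """Toy model: layers are just labels standing in for transformer blocks."""

    def __init__(self, num_layers: int = 24, num_groups: int = 4):
        assert num_layers % num_groups == 0
        self.num_groups = num_groups
        self.layers_per_group = num_layers // num_groups
        self.layers = [f"layer_{i}" for i in range(num_layers)]

    def active_layers(self, t: float) -> list[str]:
        """Only one group of layers runs at timestep t, cutting per-step cost."""
        g = expert_group_for_timestep(t, self.num_groups)
        start = g * self.layers_per_group
        return self.layers[start:start + self.layers_per_group]


model = LaTtEFlowSketch()
print(model.active_layers(0.0))   # early timesteps -> first group of layers
print(model.active_layers(0.99))  # late timesteps  -> last group of layers
```

The point of the sketch is the efficiency argument from the paper: because each denoising step activates only one layer group instead of the full stack, the per-step compute shrinks roughly by a factor of the number of groups.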

Why it matters?

This matters because LaTtE-Flow can produce high-quality images quickly while also understanding visual and textual content, which is useful for applications like AI art, multimedia tools, and any technology that needs to both interpret and create images. By speeding up generation without sacrificing quality, it makes advanced unified image models far more practical and accessible.

Abstract

LaTtE-Flow, a new architecture, unifies image understanding and generation with high performance and faster inference by using a Layerwise Timestep Experts flow-based Transformer and Timestep-Conditioned Residual Attention mechanism.