
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie, Sen He, Yukang Yang, Shuming Liu, Yuren Cong, Jiadong Guo, Hongyu Xu, Ke Xu, Kam-Woh Ng, Juan C. Pérez, Juan-Manuel Pérez-Rúa, Tao Xiang, Wei Liu, Shikun Liu, Jürgen Schmidhuber

2025-11-20


Summary

This paper introduces a new way to combine different types of information, like text and images, when generating content with AI, specifically with a technique called diffusion models.

What's the problem?

Existing methods for combining information in these AI models often struggle to efficiently figure out *how* different parts of each type of information relate to each other during the generation process, and they typically need very large models and a lot of computing power to get good results.

What's the solution?

The researchers developed a system called MoS, which stands for Mixture of States. It works by intelligently choosing which parts of each type of information to focus on at each step of the creation process. Imagine it like a smart router that connects the most relevant pieces of text to the corresponding parts of the image being generated. This 'router' learns what connections are important and does so without adding a lot of extra complexity or needing a massive model.
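To make the routing idea concrete, here is a minimal sketch (in PyTorch) of token-wise top-k routing: each image token scores all of the text's hidden states, keeps the k highest-scoring ones, and mixes them in with learned gates. The projection layer, dot-product scoring, and softmax gating here are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def route_text_states(img_tokens, text_states, proj, k=4):
    # img_tokens: (B, N_img, D); text_states: (B, N_txt, D)
    # Score every text hidden state against every image token
    # (dot-product scoring through a learned projection; an assumption).
    scores = torch.einsum("bnd,bmd->bnm", proj(img_tokens), text_states)
    # Keep only the k highest-scoring text states per image token.
    topk_scores, topk_idx = scores.topk(k, dim=-1)           # (B, N_img, k)
    gates = topk_scores.softmax(dim=-1).unsqueeze(-1)        # (B, N_img, k, 1)
    # Gather the selected text states for each image token.
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, -1, text_states.size(-1))
    expanded = text_states.unsqueeze(1).expand(-1, img_tokens.size(1), -1, -1)
    chosen = torch.gather(expanded, 2, idx)                  # (B, N_img, k, D)
    # Fuse: each image token receives a gated mix of its chosen states.
    return img_tokens + (gates * chosen).sum(dim=2)

B, N_img, N_txt, D = 2, 16, 8, 64
proj = nn.Linear(D, D)
out = route_text_states(torch.randn(B, N_img, D), torch.randn(B, N_txt, D), proj)
print(out.shape)  # torch.Size([2, 16, 64])
```

Because only k states are gathered per token, the fusion step adds a small, fixed amount of work regardless of how long the text sequence is, which is the source of the efficiency claim.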

Why it matters?

MoS is important because it allows for creating high-quality results with significantly smaller and faster models compared to previous approaches. This means it's more practical to use and opens the door to building even more powerful multimodal AI systems that can understand and generate content from multiple sources.

Abstract

We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-k hidden states and is trained with an ε-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4× larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
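The abstract mentions training the router with an ε-greedy strategy. Below is a minimal sketch of how such exploration could look, assuming ε-greedy is applied to the top-k index selection during training; the exact placement and ε schedule are assumptions, not details from the paper.

```python
import torch

def epsilon_greedy_topk(scores, k, eps=0.1, training=True):
    # scores: (B, N, M) routing scores from N query tokens to M states.
    topk_idx = scores.topk(k, dim=-1).indices        # greedy (exploit) choice
    if not training:
        return topk_idx
    B, N, M = scores.shape
    # With probability eps, replace a token's routed indices with a
    # random draw so rarely picked states still receive gradient signal.
    random_idx = torch.randint(M, (B, N, k), device=scores.device)
    explore = torch.rand(B, N, 1, device=scores.device) < eps
    return torch.where(explore, random_idx, topk_idx)

idx = epsilon_greedy_topk(torch.randn(2, 16, 8), k=4, eps=0.2)
print(idx.shape)  # torch.Size([2, 16, 4])
```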