
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang

2025-11-18


Summary

This paper introduces Uni-MoE 2.0, an improved open-source artificial intelligence model that can understand and generate content across different types of data, such as text, images, and audio. It builds on previous work (Lychee's Uni-MoE series) and aims to be better at handling multiple types of information at once.

What's the problem?

Existing large AI models often struggle to effectively combine and process information from various sources – like understanding a video that has both visuals and spoken words. Training these models to be good at *everything* (understanding and creating different types of content) is also very hard and requires a lot of computing power and carefully prepared data. On top of that, previous models weren't fully open-source, which limited further research and development.

What's the solution?

The researchers created Uni-MoE 2.0 using a design called a 'Mixture of Experts'. Think of it like having different specialists within the AI, each good at a specific task; the model dynamically chooses which experts to use depending on the input. They also developed a training process that gradually introduces different types of data, plus a reinforcement-learning step (an iterative GSPO-DPO method) that stabilizes learning and improves reasoning. Finally, they trained on a large collection of publicly available data, adding special tokens that help the model learn to generate images and speech from text prompts.
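
To make the 'specialists' idea concrete, here is a minimal, illustrative sketch of a Mixture-of-Experts layer with shared, routed, and null experts – the three expert types named in the paper's abstract. The class and parameter names (`ToyMoELayer`, `n_routed`, `n_null`, `top_k`) are invented for this example; the real Uni-MoE 2.0 layer (expert counts, routing rules, capacity handling) is more sophisticated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy MoE layer with shared, routed, and 'null' experts (illustrative only)."""

    def __init__(self, dim: int, n_routed: int = 4, n_null: int = 1, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Shared expert: every token always passes through it.
        self.shared_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Routed experts: the router picks top-k experts per token.
        self.routed_experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_routed)
        ])
        # The router scores routed experts plus n_null "null" slots; a token
        # routed to a null slot gets no extra computation (dynamic capacity).
        self.router = nn.Linear(dim, n_routed + n_null)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)        # (T, n_routed + n_null)
        weights, indices = gates.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for t in range(x.size(0)):                       # per-token loop (slow but clear)
            for w, idx in zip(weights[t], indices[t]):
                idx = int(idx)
                if idx < len(self.routed_experts):       # null slots contribute nothing
                    routed_out[t] = routed_out[t] + w * self.routed_experts[idx](x[t])
        return self.shared_expert(x) + routed_out

# Example: route 6 token embeddings of width 32 through the toy layer.
layer = ToyMoELayer(dim=32)
tokens = torch.randn(6, 32)
print(layer(tokens).shape)  # torch.Size([6, 32])
```

The key point is the null experts: when the router sends a token to one, no extra expert computation happens, which is one simple way to give each token a 'dynamic' amount of compute.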

Why it matters?

Uni-MoE 2.0 is important because it pushes the boundaries of what's possible with open-source AI. It performs as well as, or even better than, many leading models on a variety of tasks, especially those involving multiple types of data like videos and audio. Being open-source means other researchers can build upon this work, accelerating progress in the field and making powerful AI technology more accessible.

Abstract

We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
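
The abstract's Omni-Modality 3D RoPE is described as aligning modalities over space and time inside self-attention. As a rough illustration of how a multi-axis rotary position embedding can work in general (an assumption-based sketch, not the paper's exact formulation), the snippet below splits each attention head's dimension into three groups and rotates each group according to a separate time/height/width coordinate; the helper names (`rope_angles`, `apply_3d_rope`) are invented for this example.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for 1-D positions; `dim` must be even."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[:, None] * inv_freq[None, :]   # (T, dim/2)

def apply_3d_rope(q: torch.Tensor, pos_thw: torch.Tensor) -> torch.Tensor:
    """Rotate vectors with per-axis (time, height, width) positions.

    q:       (T, head_dim), head_dim divisible by 6
    pos_thw: (T, 3) integer positions along time/height/width; plain text
             tokens could reuse their sequence index on all three axes.
    """
    d = q.size(-1)
    per_axis = d // 3                                    # split the head dim across the 3 axes
    rotated = []
    for axis in range(3):
        chunk = q[:, axis * per_axis:(axis + 1) * per_axis]
        ang = rope_angles(pos_thw[:, axis], per_axis)    # (T, per_axis/2)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = chunk[:, 0::2], chunk[:, 1::2]          # interleaved even/odd pairs
        rot = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        rotated.append(rot.flatten(start_dim=1))         # back to (T, per_axis)
    return torch.cat(rotated, dim=-1)

# Example: 4 tokens with (time, height, width) grid positions in a 96-dim head.
q = torch.randn(4, 96)
pos = torch.tensor([[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]])
print(apply_3d_rope(q, pos).shape)  # torch.Size([4, 96])
```

Giving every modality positions on the same three axes is one way image patches, video frames, audio segments, and text tokens can share a single positional scheme inside attention, which is the alignment role the abstract attributes to Omni-Modality 3D RoPE.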