LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue

2026-04-01

Summary

This paper introduces a new way to build AI models that can understand and work with different types of information – text, images, and audio – all at the same time, rather than treating them as separate pieces.

What's the problem?

Current AI systems that handle multiple types of data, like images and text, often treat these different types as add-ons to a core language model. This means they aren't truly integrated and don't work together as effectively as they could, leading to limitations in how well the AI understands and generates content across these different formats.

What's the solution?

The researchers developed a framework called DiNA (Discrete Native Autoregressive), which represents all types of data – text, images, and audio – as discrete tokens in a single shared space. They also created a new visual component, dNaViT, that can tokenize and de-tokenize images at any resolution, converting continuous visual signals into hierarchical discrete tokens. This allows them to build a model, LongCat-Next, that processes everything under one autoregressive objective, improving how it understands and creates multimodal content.
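To make the "shared discrete space" idea concrete, here is a minimal sketch (not the paper's implementation; all vocabulary sizes and function names are illustrative assumptions): text, image, and audio tokens occupy disjoint ID ranges of one vocabulary, so any interleaved sequence can be modeled with a single next-token objective and no per-modality heads.

```python
# Hypothetical vocabulary sizes; the paper does not specify these.
TEXT_VOCAB = 50_000
IMAGE_VOCAB = 16_384
AUDIO_VOCAB = 8_192

IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_VOCAB
TOTAL_VOCAB = TEXT_VOCAB + IMAGE_VOCAB + AUDIO_VOCAB


def to_shared_ids(modality: str, local_ids: list[int]) -> list[int]:
    """Map modality-local token IDs into the shared vocabulary."""
    offset = {"text": 0, "image": IMAGE_OFFSET, "audio": AUDIO_OFFSET}[modality]
    return [offset + i for i in local_ids]


def interleave(*segments: tuple[str, list[int]]) -> list[int]:
    """Flatten (modality, ids) segments into one token stream,
    suitable for a single autoregressive model."""
    stream: list[int] = []
    for modality, ids in segments:
        stream.extend(to_shared_ids(modality, ids))
    return stream


# A mixed text/image/audio sequence becomes one uniform ID stream.
seq = interleave(("text", [12, 99]), ("image", [3, 7]), ("audio", [1]))
# → [12, 99, 50003, 50007, 66385]
```

Because every ID range is disjoint, the model can predict the next token with one softmax over the whole vocabulary, which is the sense in which modalities are "lexicalized."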

Why it matters?

This work is important because it moves beyond simply combining different AI models and instead creates a truly unified system. LongCat-Next performs well on various tasks involving images, text, and audio, and the researchers are sharing their work with the AI community to encourage further advancements in building more capable and versatile AI systems.

Abstract

The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
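The step of "transforming continuous visual signals into discrete tokens" is, in general form, vector quantization: each image patch's feature vector is replaced by the index of its nearest codebook entry. The following is a hedged, self-contained sketch of that general technique; the codebook, vector dimensions, and function names are illustrative and are not dNaViT's actual design.

```python
# Illustrative sketch of discrete visual tokenization via nearest-neighbor
# vector quantization. Real tokenizers learn the codebook; this one is fixed.

def quantize(patch_vec: list[float], codebook: list[list[float]]) -> int:
    """Return the index of the codebook vector nearest to patch_vec
    under squared Euclidean distance."""
    def dist(entry: list[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(patch_vec, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def tokenize_patches(patches: list[list[float]],
                     codebook: list[list[float]]) -> list[int]:
    """Map each patch feature vector to a discrete token ID."""
    return [quantize(p, codebook) for p in patches]


# Toy 2-D codebook with four entries (token IDs 0..3).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = tokenize_patches([[0.9, 0.1], [0.1, 0.8]], codebook)
# → [1, 2]
```

Once patches are token IDs, visual content can flow through the same autoregressive objective as text, which is what lets understanding and generation share one model.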