MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

MOSI.AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu

2026-01-07

Summary

This paper introduces MOSS Transcribe Diarize, a system that automatically transcribes what people say in meetings and identifies *who* said *what* and *when*, all in a single pass.

What's the problem?

Current systems for this task are usually built from separate components stitched together rather than as one end-to-end model. They struggle with long recordings, lose track of who is speaking over long stretches, and cannot directly output the time at which something was said. They also have trouble with real-world recordings that are not perfectly clean.

What's the solution?

The researchers created MOSS Transcribe Diarize, a single multimodal large language model that handles all three tasks at once: transcribing speech, identifying who is talking, and adding timestamps. It was trained on a large amount of real-world audio and has a 128k context window, which lets it process up to 90 minutes of recording in one input and keep track of speakers across that whole span.

Why does it matter?

This work offers a more accurate, unified way to transcribe meetings and conversations. According to the authors, MOSS Transcribe Diarize outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks, which could mean better meeting notes, easier searching of recordings, and improved accessibility for people who rely on transcripts.

Abstract

Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine when each speaker talks, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real-world, in-the-wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.
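
To make the SATS task concrete, here is a minimal Python sketch of what a speaker-attributed, time-stamped transcript could look like as data, plus a back-of-the-envelope calculation of the context budget implied by the abstract's figures. Everything in it is an illustrative assumption rather than the paper's actual output format or API: the `Segment` class, the `render` layout, the speaker labels, and the reading of "128k" as 128,000 tokens are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One SATS unit: who spoke, when, and what (hypothetical schema)."""
    speaker: str   # speaker label, e.g. "S01" (illustrative naming)
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed words


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Print segments in chronological order, one line per segment."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(
        f"[{format_timestamp(seg.start)}-{format_timestamp(seg.end)}] "
        f"{seg.speaker}: {seg.text}"
        for seg in ordered
    )


def token_budget_per_second(context_tokens: int = 128_000,
                            minutes: int = 90) -> float:
    """Average context tokens available per second of audio.

    Assumes "128k" means 128,000 tokens and that the whole window is
    shared by the audio input and the generated transcript.
    """
    return context_tokens / (minutes * 60)


if __name__ == "__main__":
    demo = [
        Segment("S02", 65.2, 68.9, "Sure, I can take that action item."),
        Segment("S01", 61.0, 65.0, "Who wants to own the follow-up?"),
    ]
    print(render(demo))
    print(f"{token_budget_per_second():.1f} tokens per second of audio")
```

Under these assumptions, a 90-minute recording leaves roughly 24 context tokens per second of audio to cover both the encoded speech and the output text, which gives a sense of why fitting long meetings into a single window is a design constraint.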