LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

Yirong Sun, Yizhong Geng, Peidong Wei, Yanjun Chen, Jinghan Yang, Rongfei Chen, Wei Zhang, Xiaoyu Shen

2025-08-22

Summary

This paper introduces LLaSO, a complete, openly available framework designed to speed up and standardize research on models that jointly handle speech and language.

What's the problem?

Progress on these speech-language models is slowed because researchers often release only the finished model weights, not the training data or the exact training configuration. That makes it hard to compare models fairly, reproduce results, and build on prior work, unlike in vision-language research, where data, code, and training recipes are routinely shared.

What's the solution?

The researchers created LLaSO, which has three main parts: LLaSO-Align, a large corpus of paired speech and text data for alignment; LLaSO-Instruct, a multi-task dataset of instructions for the model to follow; and LLaSO-Eval, a standardized benchmark for testing and evaluating models. They also built and released a reference model, LLaSO-Base, trained only on this public data, to demonstrate the framework and give others a starting point. Everything, including the data, code, model, and results, is publicly available online.

Why it matters?

LLaSO is important because it provides a common foundation for researchers in this field. By making everything open and reproducible, it allows for more collaboration, faster progress, and more reliable results in the development of speech and language models. It addresses a key bottleneck in the field and sets a new standard for transparency.

Abstract

The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results at https://github.com/EIT-NLP/LLaSO.
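The abstract reports a single "normalized score" of 0.72 aggregated across benchmark tasks. The sketch below shows one common way such a score is computed: min-max scale each task's raw metric to [0, 1] (flipping metrics where lower is better, such as word error rate), then average. This is an illustrative assumption, not necessarily LLaSO-Eval's exact protocol, and the task metrics and bounds in the example are hypothetical.

```python
def normalized_score(task_results):
    """Average of per-task min-max normalized scores.

    task_results: list of (raw, worst, best) tuples per task; for
    lower-is-better metrics, pass worst > best and the scaling flips.
    """
    scaled = []
    for raw, worst, best in task_results:
        s = (raw - worst) / (best - worst)
        scaled.append(min(max(s, 0.0), 1.0))  # clamp to [0, 1]
    return sum(scaled) / len(scaled)

# Hypothetical example: three tasks with different metric ranges.
results = [
    (80.0, 0.0, 100.0),  # e.g. accuracy in percent
    (0.12, 1.0, 0.0),    # e.g. word error rate, lower is better
    (0.68, 0.0, 1.0),    # e.g. an already-normalized score
]
print(round(normalized_score(results), 2))
```

The clamp step guards against raw values falling outside the declared worst/best bounds, which keeps each task's contribution within [0, 1] regardless of outliers.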