
SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios

Kai Li, Wendi Sang, Chang Zeng, Runxuan Yang, Guo Chen, Xiaolin Hu

2024-10-03

Summary

This paper introduces SonicSim, a new simulation platform that generates realistic speech data for scenarios where sound sources are moving, helping researchers train and evaluate speech processing models.

What's the problem?

Evaluating how well speech separation and enhancement models work in real-life scenarios can be difficult because there isn't enough diverse data available. Real-world datasets often lack the variety needed for training, while synthetic datasets (computer-generated data) may not sound realistic enough. This means that neither type of data fully meets the needs of researchers trying to improve these models.

What's the solution?

To solve this problem, the authors developed SonicSim, a customizable toolkit that generates high-quality synthetic data for moving sound sources. SonicSim allows users to adjust different aspects of the simulation, such as the environment, microphone placement, and the characteristics of the sound sources. Using SonicSim, they created a benchmark dataset called SonicSet, which includes various scenarios to evaluate speech models. They also compared this synthetic data with real-world recordings to verify that models trained on it generalize to real conditions.
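To make the "multi-level" customization concrete, here is a minimal sketch of what a scene-, microphone-, and source-level configuration could look like in code. All class and field names below are hypothetical illustrations for this explainer, not the actual SonicSim API (see https://cslikai.cn/SonicSim/ for the real toolkit).

```python
# Hypothetical sketch of a SonicSim-style configuration; names are
# illustrative only, NOT the actual SonicSim API.

from dataclasses import dataclass, field
from typing import List, Tuple

Point3D = Tuple[float, float, float]

@dataclass
class SourceConfig:
    """Source-level settings: the audio clip and its movement path."""
    audio_path: str
    trajectory: List[Point3D]   # waypoints the source moves through
    speed_mps: float = 1.0      # roughly walking speed

@dataclass
class SimulationConfig:
    """Scene-, microphone-, and source-level settings in one place."""
    scene_path: str             # scene-level: e.g. a Matterport3D scene
    mic_positions: List[Point3D]  # microphone-level: array geometry
    sources: List[SourceConfig] = field(default_factory=list)
    sample_rate: int = 16000

config = SimulationConfig(
    scene_path="scenes/matterport3d_room.glb",
    mic_positions=[(0.0, 1.5, 0.0), (0.1, 1.5, 0.0)],  # two-mic array
)
config.sources.append(SourceConfig(
    audio_path="librispeech/speaker1.flac",
    trajectory=[(0.0, 1.6, 0.0), (2.0, 1.6, 1.0), (4.0, 1.6, 0.0)],
))
# A renderer built on Habitat-sim would then trace the room acoustics of
# the scene and synthesize the microphone signals as the source moves.
```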

Why it matters?

This research is important because it provides a way to generate realistic speech data that can help improve models used for separating and enhancing speech in noisy environments. By using SonicSim, researchers can create better tools for applications like voice recognition and hearing aids, ultimately leading to more effective communication technologies.

Abstract

The systematic evaluation of speech separation and enhancement models under moving sound source conditions typically requires extensive data comprising diverse scenarios. However, real-world datasets often contain insufficient data to meet the training and evaluation requirements of models. Although synthetic datasets offer a larger volume of data, their acoustic simulations lack realism. Consequently, neither real-world nor synthetic datasets effectively fulfill practical needs. To address these issues, we introduce SonicSim, a synthetic toolkit designed to generate highly customizable data for moving sound sources. SonicSim is developed based on the embodied AI simulation platform Habitat-sim, supporting multi-level adjustments, including scene-level, microphone-level, and source-level, thereby generating more diverse synthetic data. Leveraging SonicSim, we constructed a moving sound source benchmark dataset, SonicSet, using LibriSpeech, the Freesound Dataset 50k (FSD50K), the Free Music Archive (FMA), and 90 scenes from Matterport3D to evaluate speech separation and enhancement models. Additionally, to validate the differences between synthetic data and real-world data, we randomly selected 5 hours of raw data without reverberation from the SonicSet validation set to record a real-world speech separation dataset, which was then compared with the corresponding synthetic datasets. Similarly, we utilized the real-world speech enhancement dataset RealMAN to validate the acoustic gap between other synthetic datasets and the SonicSet dataset for speech enhancement. The results indicate that the synthetic data generated by SonicSim can effectively generalize to real-world scenarios. Demo and code are publicly available at https://cslikai.cn/SonicSim/.
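For intuition about what acoustic simulation of a moving source involves, here is a minimal, self-contained sketch: the source signal is split into short blocks, each block is convolved with the room impulse response (RIR) for the source's position at that moment, and the reverberant blocks are overlap-added. This is a conceptual illustration only; SonicSim's actual rendering is built on Habitat-sim, and the function below is not part of its codebase.

```python
# Conceptual sketch (assumption, not SonicSim's implementation): render a
# moving source via block-wise convolution with position-dependent RIRs.

import numpy as np

def simulate_moving_source(signal, rirs, block_len):
    """signal: 1-D source audio; rirs: one RIR per block along the
    trajectory; block_len: samples per block. Returns the wet signal."""
    rir_len = max(len(r) for r in rirs)
    out = np.zeros(len(signal) + rir_len - 1)
    for i, rir in enumerate(rirs):
        start = i * block_len
        block = signal[start:start + block_len]
        if block.size == 0:
            break  # trajectory longer than the signal
        wet = np.convolve(block, rir)        # reverberant block here
        out[start:start + len(wet)] += wet   # overlap-add across blocks
    return out

# Toy usage: 1 s of noise moving through 10 trajectory points, with
# synthetic exponentially decaying RIRs standing in for traced acoustics.
sr = 16000
src = np.random.randn(sr)
rirs = [np.random.randn(400) * np.exp(-np.linspace(0, 8, 400))
        for _ in range(10)]
mix = simulate_moving_source(src, rirs, block_len=sr // 10)
```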