PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
Zeyu Xie, Xuenan Xu, Zhizheng Wu, Mengyue Wu
2024-07-04

Summary
This paper introduces PicoAudio, a system that gives precise control over when sound events happen and how often they occur when generating audio from text, making the output more dynamic and expressive.
What's the problem?
The main problem with existing audio generation systems is that they cannot accurately control the timing and occurrence frequency of sound events. When creating audio from text, it is hard to make a sound start and stop at the right moments or occur a specific number of times, which leads to less convincing and less useful audio.
What's the solution?
To address this, the authors developed PicoAudio, a framework whose model design integrates temporal (time-related) information directly into the audio generation process. To train it, they build fine-grained, temporally aligned audio-text data by crawling audio from the internet, segmenting and filtering it, and then simulating clips whose sound events match the timestamps in the accompanying text descriptions. Evaluations show that PicoAudio controls when sounds happen and how often they occur far better than previous models.
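
To make the idea of temporally aligned audio-text data concrete, here is a minimal Python sketch that parses a timestamp-style caption into a frame-level event-presence matrix a generator could be conditioned on. The caption format ("dog_barking at 0.5-2.0 and 4.0-5.5"), the event vocabulary, and the frame rate are illustrative assumptions, not PicoAudio's exact data format.

    import re
    import numpy as np

    def caption_to_timestamp_matrix(caption, event_vocab, duration=10.0, frame_rate=25):
        # Turn a caption like "dog_barking at 0.5-2.0 and 4.0-5.5" into a
        # (num_frames, num_events) binary matrix marking when each event is active.
        num_frames = int(duration * frame_rate)
        matrix = np.zeros((num_frames, len(event_vocab)), dtype=np.float32)
        for event_idx, event in enumerate(event_vocab):
            clause = re.search(rf"{re.escape(event)} at ([0-9\.\- and]+)", caption)
            if clause is None:
                continue  # this event is not mentioned in the caption
            for onset, offset in re.findall(r"(\d+\.?\d*)-(\d+\.?\d*)", clause.group(1)):
                start = int(float(onset) * frame_rate)
                end = min(int(float(offset) * frame_rate), num_frames)
                matrix[start:end, event_idx] = 1.0  # frames where the event occurs
        return matrix

    # Example with two events and explicit onset-offset annotations.
    vocab = ["dog_barking", "door_knocking"]
    caption = "dog_barking at 0.5-2.0 and 4.0-5.5, door_knocking at 6.0-7.0"
    print(caption_to_timestamp_matrix(caption, vocab).shape)  # (250, 2)

Such a matrix encodes both timestamps (which frames an event covers) and occurrence frequency (how many separate segments it has), which is the kind of fine-grained information the paper aims to control.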
Why it matters?
This research is important because it enhances how we can create audio for various applications, such as storytelling, video games, and interactive media. By allowing for detailed control over sound events, PicoAudio can lead to richer and more immersive audio experiences, which are crucial for engaging audiences.
Abstract
Recently, audio generation tasks have attracted considerable research interest. Precise temporal controllability is essential to integrate audio generation with real applications. In this work, we propose a temporally controlled audio generation framework, PicoAudio. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation of fine-grained temporally-aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in terms of timestamp and occurrence frequency controllability. The generated samples are available on the demo website https://PicoAudio.github.io.
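
As a rough illustration of how frame-level timing information could guide generation, the sketch below fuses a timestamp matrix with a pooled text embedding into a per-frame conditioning sequence. The abstract only states that temporal information is integrated through tailored model design, so the additive fusion, module names, and dimensions here are all hypothetical, not the paper's architecture.

    import torch
    import torch.nn as nn

    class TimestampConditioner(nn.Module):
        # Combine a frame-level event matrix with a global text embedding into
        # a per-frame conditioning sequence for an audio generation backbone.
        def __init__(self, num_events, text_dim, hidden_dim):
            super().__init__()
            self.event_proj = nn.Linear(num_events, hidden_dim)  # per-frame event presence
            self.text_proj = nn.Linear(text_dim, hidden_dim)     # global text description

        def forward(self, timestamp_matrix, text_embedding):
            # timestamp_matrix: (batch, frames, num_events)
            # text_embedding:   (batch, text_dim)
            frame_cond = self.event_proj(timestamp_matrix)
            text_cond = self.text_proj(text_embedding).unsqueeze(1)  # (batch, 1, hidden)
            return frame_cond + text_cond                            # (batch, frames, hidden)

    conditioner = TimestampConditioner(num_events=2, text_dim=512, hidden_dim=256)
    out = conditioner(torch.zeros(1, 250, 2), torch.zeros(1, 512))
    print(out.shape)  # torch.Size([1, 250, 256])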