WHISTRESS: Enriching Transcriptions with Sentence Stress Detection
Iddo Yosha, Dorin Shteyman, Yossi Adi
2025-05-27
Summary
This paper introduces WHISTRESS, a system that transcribes speech and also detects which words in a sentence are spoken with extra emphasis, known as sentence stress. It does this without needing to align the audio with the written text word by word, and it was trained on computer-generated examples.
What's the problem?
Most speech transcription systems only write down the words someone says and leave out details such as which words are stressed. Sentence stress can change what a sentence means, so knowing it can help computers understand speech better, but current methods for detecting it are not very accurate or flexible.
What's the solution?
The authors created WHISTRESS, which learns to detect sentence stress from synthetic (computer-generated) training data instead of needing lots of real-world examples. It does not need to align the audio with the text exactly, which removes a preprocessing step and makes it easier to use. Tests show that WHISTRESS outperforms existing methods and generalizes well to diverse benchmarks.
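To make the idea of a stress-enriched transcription concrete, here is a minimal sketch of how such output could be represented and displayed. It is an illustration only, not the authors' code or output format: the `Word` class, the `annotate` and `render` helpers, the per-word binary stress flag, and the example sentence are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    stressed: bool  # hypothetical per-word binary stress label


def annotate(tokens, stress_flags):
    """Pair each transcribed token with its (assumed) stress flag."""
    if len(tokens) != len(stress_flags):
        raise ValueError("tokens and stress_flags must have the same length")
    return [Word(t, bool(s)) for t, s in zip(tokens, stress_flags)]


def render(words, marker="*"):
    """Join words into one string, wrapping stressed words in the marker."""
    return " ".join(
        f"{marker}{w.text}{marker}" if w.stressed else w.text for w in words
    )


# Illustrative input: a transcription plus made-up stress flags.
tokens = "I never said she stole the money".split()
flags = [0, 0, 0, 1, 0, 0, 0]
print(render(annotate(tokens, flags)))
```

With the flag on "she", the rendered sentence hints that someone else stole the money. Moving the flag to "money" would instead hint that she stole something else, which is the kind of meaning a plain transcription loses.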
Why does it matter?
This matters because adding sentence stress information to transcriptions can help computers understand spoken language more like humans do. This could improve things like voice assistants, language learning apps, and any technology that relies on understanding how people speak, making them smarter and more helpful.
Abstract
WHISTRESS is an alignment-free method for sentence stress detection trained on synthetic data, outperforming existing methods and generalizing well to diverse benchmarks.