Pisets: A Robust Speech Recognition System for Lectures and Interviews

Ivan Bondarenko, Daniil Grebenkin, Oleg Sedukhin, Mikhail Klementev, Roman Derunets, Lyudmila Budneva

2026-02-09

Summary

This paper introduces Pisets, a speech-to-text system for transcribing lectures and interviews that aims to be more accurate and reliable than Whisper and WhisperX, especially on long recordings and in challenging audio conditions.

What's the problem?

Current speech-to-text models, like Whisper, can sometimes make mistakes, adding incorrect words (hallucinations) or misinterpreting what's said, particularly when dealing with lengthy audio or poor sound quality. This is a big issue for professionals like scientists and journalists who need precise transcriptions.

What's the solution?

The researchers built Pisets as a three-stage pipeline. First, Wav2Vec2 produces an initial transcription and locates the segments that appear to contain speech. Next, an Audio Spectrogram Transformer (AST) filters out false positives, that is, segments that were wrongly detected as speech. Finally, Whisper produces the polished transcription for the segments that survive the filter. The team also used 'curriculum learning', where the system trains on easier examples before harder ones, and drew on a diverse set of Russian-language speech corpora. They additionally improved how the system estimates its own uncertainty, so it can flag sections of the transcript that are likely to be wrong. A rough sketch of the pipeline appears below.
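Here is a minimal sketch of such a three-stage pipeline built with Hugging Face transformers. The checkpoints (facebook/wav2vec2-base-960h, MIT/ast-finetuned-audioset-10-10-0.4593, openai/whisper-small) and the accept/reject rules are illustrative stand-ins chosen for this example, not the models or logic Pisets actually uses; the real implementation is in the GitHub repository linked below.

```python
import numpy as np
import torch
from transformers import (
    Wav2Vec2Processor, Wav2Vec2ForCTC,
    ASTFeatureExtractor, ASTForAudioClassification,
    WhisperProcessor, WhisperForConditionalGeneration,
)

SR = 16_000  # all three models expect 16 kHz mono audio

# Stage 1: draft CTC transcription (English stand-in checkpoint; Pisets
# targets Russian and uses its own models).
w2v_proc = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
w2v = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Stage 2: AST audio tagger, used here as a speech/non-speech filter.
ast_fe = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
ast = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# Stage 3: Whisper produces the final transcription of accepted segments.
wh_proc = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def transcribe_segment(waveform: np.ndarray) -> str | None:
    """Run one 16 kHz mono chunk through all three stages; None = rejected."""
    # Stage 1: if the CTC model decodes nothing, treat the chunk as non-speech.
    w2v_in = w2v_proc(waveform, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        ctc_logits = w2v(w2v_in.input_values).logits
    draft = w2v_proc.batch_decode(ctc_logits.argmax(dim=-1))[0]
    if not draft.strip():
        return None
    # Stage 2: drop chunks whose top AudioSet tag is not speech, filtering
    # false positives that slipped through stage 1.
    ast_in = ast_fe(waveform, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        tag_probs = ast(**ast_in).logits.softmax(dim=-1)
    if "Speech" not in ast.config.id2label[int(tag_probs.argmax())]:
        return None
    # Stage 3: final, higher-quality transcription with Whisper.
    wh_in = wh_proc(waveform, sampling_rate=SR, return_tensors="pt")
    with torch.no_grad():
        ids = whisper.generate(wh_in.input_features)
    return wh_proc.batch_decode(ids, skip_special_tokens=True)[0]
```

The ordering is the point of the design: Whisper, the component prone to hallucinating, only ever sees chunks that two independent models agree contain speech, which is how the architecture keeps invented words out of long, noisy recordings.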

Why it matters?

Pisets is important because it provides a more robust and accurate way to convert speech to text, especially for long audio files and recordings made in imperfect conditions. That matters in fields where accurate records are essential, like scientific research and journalism. And because the code is publicly available, others can build on and improve it.

Abstract

This work presents "Pisets", a speech-to-text system for scientists and journalists based on a three-component architecture that improves speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false-positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The implementation of curriculum learning methods and the use of diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches yield more robust transcription of long audio recordings across varied acoustic conditions than WhisperX and the standard Whisper model. The source code of the "Pisets" system is publicly available on GitHub: https://github.com/bond005/pisets.
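The abstract does not spell out the uncertainty modeling, but the underlying idea, scoring how confident the recognizer is in its own output, can be sketched with Whisper's token probabilities. Everything below (the checkpoint, the mean log-probability score, and the threshold of -1.0) is an illustrative assumption for this sketch, not the technique the paper actually introduces.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

proc = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def transcribe_with_confidence(audio, threshold: float = -1.0):
    """Return (text, mean_logprob, flagged); flagged marks dubious output."""
    feats = proc(audio, sampling_rate=16_000, return_tensors="pt").input_features
    out = model.generate(feats, output_scores=True, return_dict_in_generate=True)
    # Mean log-probability of the chosen tokens as a crude confidence proxy;
    # out.scores[i] holds the logits for the token generated at step i.
    logprobs = [
        torch.log_softmax(step, dim=-1)[0, tok].item()
        for step, tok in zip(out.scores, out.sequences[0, 1:])
    ]
    mean_lp = sum(logprobs) / max(len(logprobs), 1)
    text = proc.batch_decode(out.sequences, skip_special_tokens=True)[0]
    return text, mean_lp, mean_lp < threshold
```

A segment whose mean log-probability falls below the threshold would be flagged for human review rather than silently trusted, which is the practical payoff of uncertainty estimation for scientists and journalists.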