Conformer2

One of the key advancements of Conformer2 over its predecessor is its increased model size, which has grown from 270 million parameters in Conformer1 to 450 million. The larger model can capture more complex patterns in speech data, leading to better performance across various metrics. Training used a technique known as noisy student-teacher training, a semi-supervised approach in which teacher models trained on labeled data generate pseudo-labels for unlabeled audio, increasing the amount of usable training data. Conformer2 uses multiple teacher models to generate these pseudo-labels, which improves label quality and reduces the risk that the student model inherits the errors of any single teacher.
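The idea of pseudo-labeling with several teachers can be illustrated with a toy sketch. This is not AssemblyAI's training pipeline: the majority vote over whole transcripts, the `min_agreement` cutoff, and all function names below are assumptions chosen for illustration, and the noise or augmentation normally applied to the student's inputs is omitted.

```python
from collections import Counter


def ensemble_pseudo_label(teacher_outputs, min_agreement=0.6):
    """Return the most common transcript if enough teachers agree, else None.

    Majority voting is a simplification used here for illustration only.
    """
    transcript, votes = Counter(teacher_outputs).most_common(1)[0]
    if votes / len(teacher_outputs) >= min_agreement:
        return transcript
    return None


def build_student_training_set(labeled, unlabeled_audio, teachers, min_agreement=0.6):
    """Combine human-labeled pairs with pseudo-labeled pairs from the teachers."""
    training_set = list(labeled)
    for audio in unlabeled_audio:
        outputs = [teacher(audio) for teacher in teachers]
        label = ensemble_pseudo_label(outputs, min_agreement)
        if label is not None:  # drop clips the teachers disagree on
            training_set.append((audio, label))
    return training_set


# Hypothetical teachers: lookup tables standing in for trained ASR models.
teacher_a = {"clip1": "order 4512", "clip2": "hello world"}.get
teacher_b = {"clip1": "order 4512", "clip2": "hollow word"}.get
teacher_c = {"clip1": "order 4512", "clip2": "hello word"}.get

labeled = [("clip0", "thank you for calling")]
student_data = build_student_training_set(
    labeled, ["clip1", "clip2"], [teacher_a, teacher_b, teacher_c]
)
print(student_data)
```

Here all three teachers agree on clip1, so it joins the training set with a pseudo-label, while clip2 is discarded because no transcript reaches the agreement cutoff.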

Conformer2 has shown substantial improvements in specific areas critical for effective speech recognition. For instance, it achieves a 31.7% improvement in recognizing alphanumerics, which is crucial for applications involving numbers such as credit card information or order numbers. Additionally, it boasts a 6.8% reduction in proper noun error rates and a 12% boost in noise robustness, making it better suited for real-world audio conditions where background noise can interfere with clarity. These enhancements are particularly beneficial for industries that rely on accurate transcription for customer service interactions or content creation.
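Percentages like these are usually relative changes against a baseline, although the text does not say so explicitly. A minimal sketch of that arithmetic, assuming a relative reduction in error rate and using made-up error rates that are not published figures:

```python
def relative_improvement(baseline_error, new_error):
    """Percentage reduction in error rate relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error * 100


# Hypothetical error rates (in percent) chosen only to illustrate the formula.
baseline_error = 10.0
new_error = 6.83
improvement = relative_improvement(baseline_error, new_error)
print(round(improvement, 1))
```

Under this reading, a drop from a 10.0% to a 6.83% error rate is a 31.7% relative improvement, even though the absolute change is only 3.17 percentage points.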

The release also gives users more control over transcription costs through a feature called Speech Thresholds. It lets users set a minimum proportion of speech that an audio file must contain before it is processed for transcription. Files that fall below the threshold, such as music or empty recordings, are skipped, so users avoid paying to transcribe audio with little or no spoken content.
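As a rough sketch of how this might look in practice: to the best of my understanding, AssemblyAI's transcript endpoint accepts a `speech_threshold` field between 0 and 1, the minimum fraction of the file that must contain speech. The helper name, placeholder API key, and example audio URL below are hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request

API_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(api_key, audio_url, speech_threshold=None):
    """Build (but do not send) a transcription request with an optional threshold."""
    payload = {"audio_url": audio_url}
    if speech_threshold is not None:
        if not 0.0 <= speech_threshold <= 1.0:
            raise ValueError("speech_threshold must be between 0 and 1")
        payload["speech_threshold"] = speech_threshold
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


# Placeholder key and URL; replace with real values before sending.
request = build_transcript_request(
    "YOUR_API_KEY", "https://example.com/call.mp3", speech_threshold=0.5
)
# To submit the job: urllib.request.urlopen(request)
```

With a threshold of 0.5, a file would need to be at least half speech to be transcribed. Check the current API documentation for the exact field name and behavior before relying on this.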

Conformer2 is already integrated into AssemblyAI's API as the default speech recognition model, making it readily accessible for developers looking to incorporate advanced ASR capabilities into their applications. Users can obtain a free API token and access comprehensive documentation to facilitate integration into their products.

Key Features of Conformer2:

Enhanced Model Size: Increased from 270 million to 450 million parameters for improved performance.
Extensive Training Data: Trained on 1.1 million hours of English audio for robust recognition capabilities.
Noisy Student-Teacher Training: Utilizes semi-supervised learning techniques to enhance data quality and quantity.
Improved Recognition Metrics: Improves alphanumeric recognition by 31.7%, reduces proper noun error rates by 6.8%, and improves noise robustness by 12%.
Speech Thresholds: Allows users to control transcription costs by setting the minimum proportion of speech an audio file must contain to be processed.
Real-World Application Focus: Designed to perform well across various domains including telephony and podcasts.
Seamless Integration: Available through AssemblyAI's API with easy access via a free API token and detailed documentation.

Conformer2 represents a significant step forward in automatic speech recognition technology, providing enhanced accuracy and flexibility for users across multiple sectors. Its ability to adapt to real-world challenges makes it a valuable tool for anyone needing precise speech-to-text solutions.
