
Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers

Atakan Işık, Selin Vulga Işık, Ahmet Feridun Işık, Mahşuk Taylan

2026-01-01


Summary

This paper focuses on improving the accuracy of automatically identifying different respiratory sounds, like wheezes or crackles, using artificial intelligence.

What's the problem?

Currently, it's hard to build good AI models for this task because the datasets used to train them are small, noisy, and heavily imbalanced across the different sound classes. As a result, models often get confused and don't work well on new patients. In particular, powerful models called Transformers can easily 'memorize' the training data instead of learning general patterns, leading to poor performance on unseen data.

What's the solution?

The researchers improved a Transformer model called AST (Audio Spectrogram Transformer) using a technique called Sharpness-Aware Minimization (SAM). Instead of just trying to get the model to make accurate predictions on the training data, SAM also looks for a solution that stays accurate even when the model's weights are nudged slightly. Think of it like settling into a wide, flat valley rather than a narrow, steep one: small shifts in position barely change the altitude. They also used a weighted sampling strategy so the model saw all the different respiratory sounds often enough during training, even the rare ones.

Why it matters?

This work is important because it achieved the best results yet on the standard ICBHI 2017 respiratory sound benchmark, and more importantly, it significantly improved the model's sensitivity: its ability to correctly identify patients who *do* have a respiratory problem. This is crucial for building reliable tools that can help doctors screen patients and diagnose lung conditions more effectively.

Abstract

Respiratory sound classification is hindered by the limited size, high noise levels, and severe class imbalance of benchmark datasets like ICBHI 2017. While Transformer-based models offer powerful feature extraction capabilities, they are prone to overfitting and often converge to sharp minima in the loss landscape when trained on such constrained medical data. To address this, we introduce a framework that enhances the Audio Spectrogram Transformer (AST) using Sharpness-Aware Minimization (SAM). Instead of merely minimizing the training loss, our approach optimizes the geometry of the loss surface, guiding the model toward flatter minima that generalize better to unseen patients. We also implement a weighted sampling strategy to handle class imbalance effectively. Our method achieves a state-of-the-art score of 68.10% on the ICBHI 2017 dataset, outperforming existing CNN and hybrid baselines. More importantly, it reaches a sensitivity of 68.31%, a crucial improvement for reliable clinical screening. Further analysis using t-SNE and attention maps confirms that the model learns robust, discriminative features rather than memorizing background noise.
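The weighted sampling strategy mentioned in the abstract can be sketched as inverse-frequency sampling: each example is drawn with probability proportional to one over its class's frequency, so rare classes appear as often as common ones. The class labels and counts below are hypothetical, not the actual ICBHI 2017 statistics.

```python
import numpy as np

# Hypothetical imbalanced label set: class 0 (e.g. normal) dominates.
labels = np.array([0] * 900 + [1] * 80 + [2] * 20)

counts = np.bincount(labels)
weights = 1.0 / counts[labels]      # each sample weighted by 1 / its class count
probs = weights / weights.sum()     # normalize into a sampling distribution

# Draw a large number of training samples with these probabilities.
rng = np.random.default_rng(0)
batch = rng.choice(len(labels), size=10000, p=probs, replace=True)
sampled_counts = np.bincount(labels[batch])
# Each class now appears roughly equally often, despite the 900/80/20 split.
```

In a PyTorch training loop the same idea is typically realized with `torch.utils.data.WeightedRandomSampler`, passing these per-sample weights to the `DataLoader`.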