HI-TransPA: Hearing Impairments Translation Personal Assistant

Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng

2025-11-17

Summary

This paper introduces HI-TransPA, a new AI system designed to help people with hearing loss communicate more easily, acting as a personal assistant that understands both speech and visual cues such as lip movements.

What's the problem?

Existing AI models struggle to accurately understand speech from people with hearing loss, and they often have trouble combining information from both audio and video—like someone’s lip movements—to improve understanding. The raw data, like videos of people speaking, can also be messy and inconsistent, making it hard for the AI to learn effectively. Current 'Omni-Models' aren't specifically tailored for the unique challenges of hearing-impaired speech.

What's the solution?

The researchers built a pipeline that carefully prepares the audio and video data, focusing on lip movements: it detects facial landmarks, isolates and stabilizes the lip region, and quantitatively scores the quality of each multimodal sample (a minimal sketch of this kind of pipeline appears below). Those quality scores then drive a curriculum learning strategy: the model first trains on clean, high-confidence examples and progressively takes on harder ones to build robustness. To handle high-frame-rate lip motion efficiently, the model pairs a SigLIP encoder with a Unified 3D-Resampler. The resulting system, HI-TransPA, fuses audio and video information to both translate speech and hold conversations.
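To make the data-preparation steps concrete, here is a minimal sketch of that kind of preprocessing, not the authors' actual pipeline. It assumes MediaPipe FaceMesh for landmark detection; the crop margin, output size, input filename, and the Laplacian-sharpness stand-in for the paper's quality score are all illustrative choices.

```python
# Sketch: detect facial landmarks, crop/stabilize the lip region,
# and attach a crude per-frame quality score (NOT the paper's code).
import cv2
import mediapipe as mp
import numpy as np

mp_face_mesh = mp.solutions.face_mesh

# Indices of lip landmarks in MediaPipe's 468-point face mesh.
LIP_IDXS = sorted({i for pair in mp_face_mesh.FACEMESH_LIPS for i in pair})

def crop_lip_region(frame_bgr, face_mesh, out_size=96, margin=0.15):
    """Detect landmarks, then crop a stabilized square around the lips.

    Returns (crop, quality), or (None, 0.0) if no face is found.
    """
    h, w = frame_bgr.shape[:2]
    results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return None, 0.0

    lms = results.multi_face_landmarks[0].landmark
    pts = np.array([(lms[i].x * w, lms[i].y * h) for i in LIP_IDXS])

    # Center the crop on the lip centroid so the mouth stays stabilized
    # across frames; pad by a relative margin around the lip extent.
    cx, cy = pts.mean(axis=0)
    half = (pts.max(axis=0) - pts.min(axis=0)).max() * (0.5 + margin)
    x0, y0 = int(max(cx - half, 0)), int(max(cy - half, 0))
    x1, y1 = int(min(cx + half, w)), int(min(cy + half, h))
    if x1 <= x0 or y1 <= y0:
        return None, 0.0
    crop = cv2.resize(frame_bgr[y0:y1, x0:x1], (out_size, out_size))

    # Crude quality proxy: crop sharpness (variance of the Laplacian).
    # The paper scores multimodal sample quality; this stands in for that.
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    quality = cv2.Laplacian(gray, cv2.CV_64F).var()
    return crop, quality

with mp_face_mesh.FaceMesh(static_image_mode=False,
                           refine_landmarks=True) as fm:
    cap = cv2.VideoCapture("speaker.mp4")  # hypothetical input video
    ok, frame = cap.read()
    while ok:
        crop, q = crop_lip_region(frame, fm)
        # ... store (crop, q) for curriculum ordering downstream ...
        ok, frame = cap.read()
    cap.release()
```

Storing a quality score alongside each processed sample is what lets the training stage order examples from easy to hard, as described above.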

Why it matters?

This work is important because it shows how powerful AI models can be adapted to create better assistive technology for people with hearing loss. It provides a complete framework and tools that other researchers can use to build even more advanced communication aids in the future, potentially leading to more natural and effective communication for everyone.

Abstract

To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
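To illustrate the quality-guided curriculum the abstract describes, here is a minimal sketch, assuming each sample carries the quality score produced at preprocessing time. The linear schedule for growing the training pool is an illustrative choice, not the paper's exact strategy.

```python
# Sketch: quality-guided curriculum learning (NOT the paper's code).
from dataclasses import dataclass

@dataclass
class Sample:
    audio_path: str
    lip_frames_path: str
    transcript: str
    quality: float  # higher = cleaner, higher-confidence sample

def curriculum_pools(samples, num_epochs, start_frac=0.3):
    """Yield (epoch, pool): start with the cleanest samples and linearly
    grow the pool until all samples, including hard ones, are in play."""
    ranked = sorted(samples, key=lambda s: s.quality, reverse=True)
    for epoch in range(num_epochs):
        frac = start_frac + (1.0 - start_frac) * epoch / max(num_epochs - 1, 1)
        cutoff = max(1, int(len(ranked) * frac))
        yield epoch, ranked[:cutoff]

# Usage: train on a growing pool each epoch.
# for epoch, pool in curriculum_pools(all_samples, num_epochs=10):
#     train_one_epoch(model, pool)  # hypothetical training loop
```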