JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching
Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh
2025-07-03
Summary
This paper introduces JAM-Flow, a new AI system that generates realistic speech and matching facial movements at the same time. By unifying how sound and facial motion are produced, it keeps the voice and the animation synchronized in videos and virtual avatars.
What's the problem?
Existing methods typically treat speech generation and facial animation as separate tasks, which can lead to mismatches between what is said and how the face moves. These mismatches make talking animations look less natural and believable.
What's the solution?
The researchers developed JAM-Flow, which combines flow matching with a Multi-Modal Diffusion Transformer architecture. It has two linked parts: one generates facial motion, especially lip movements, and the other produces speech audio. The two parts communicate through shared attention layers so their outputs stay synchronized, as sketched below. The model is trained in stages, first with each part on its own and then jointly, to produce smooth and accurate audio-visual output.
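To make the attention-based coupling concrete, here is a minimal PyTorch sketch of joint attention between a motion stream and an audio stream. The class name, dimensions, and the choice of separate per-modality projections are illustrative assumptions for exposition, not the paper's exact implementation.

```python
# Minimal sketch of coupling two token streams with joint attention.
# Names, dimensions, and layout are illustrative assumptions; they do
# not reproduce JAM-Flow's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointAttention(nn.Module):
    """Attention over the concatenation of motion and audio tokens.

    Each modality keeps its own Q/K/V projections, but attention runs
    over both token sequences at once, so lip-motion tokens can attend
    to speech tokens (and vice versa) to stay synchronized.
    """

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.qkv_motion = nn.Linear(dim, 3 * dim)
        self.qkv_audio = nn.Linear(dim, 3 * dim)
        self.out_motion = nn.Linear(dim, dim)
        self.out_audio = nn.Linear(dim, dim)

    def forward(self, motion: torch.Tensor, audio: torch.Tensor):
        B, Tm, D = motion.shape
        Ta = audio.shape[1]
        H = self.n_heads

        def split_heads(x):  # (B, T, 3D) -> three (B, H, T, D/H) tensors
            q, k, v = x.chunk(3, dim=-1)
            to_heads = lambda t: t.view(B, -1, H, D // H).transpose(1, 2)
            return to_heads(q), to_heads(k), to_heads(v)

        qm, km, vm = split_heads(self.qkv_motion(motion))
        qa, ka, va = split_heads(self.qkv_audio(audio))

        # Concatenate both modalities along the sequence axis so that
        # attention is computed jointly across motion and audio tokens.
        q = torch.cat([qm, qa], dim=2)
        k = torch.cat([km, ka], dim=2)
        v = torch.cat([vm, va], dim=2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, Tm + Ta, D)

        # Split back into per-modality streams with separate output heads.
        return self.out_motion(out[:, :Tm]), self.out_audio(out[:, Tm:])


# Usage: 50 motion frames and 120 audio frames in one shared attention pass.
block = JointAttention(dim=256)
motion_out, audio_out = block(torch.randn(2, 50, 256), torch.randn(2, 120, 256))
```

The key design idea is that each modality keeps its own projections while attention runs over the concatenated token sequences, which is what lets lip-motion tokens condition on speech tokens (and vice versa) inside a single layer.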
Why it matters?
This matters because it enables more natural and realistic virtual talking characters in video games, digital assistants, and virtual meetings. Better audio-visual synchronization improves how people interact with AI and makes digital communication feel more lifelike.
Abstract
JAM-Flow is a unified flow-matching framework built on a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech, coupling the two modalities through selective joint attention layers and temporally aligned positional embeddings.
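The "temporally aligned positional embeddings" in the abstract can be pictured as follows: motion and audio tokens are produced at different frame rates, so instead of embedding raw token indices, each token is stamped with its wall-clock time, and tokens covering the same instant receive matching positional codes. The sketch below uses a standard sinusoidal embedding and example frame rates (25 fps motion, 100 fps audio) as assumptions; the paper's exact scheme may differ.

```python
# Sketch of temporally aligned positional embeddings across frame rates.
# The sinusoidal scheme and the fps values are illustrative assumptions.
import torch


def aligned_positions(n_tokens: int, fps: float) -> torch.Tensor:
    """Timestamps in seconds for a token stream sampled at `fps` tokens/sec."""
    return torch.arange(n_tokens, dtype=torch.float32) / fps


def sinusoidal_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding evaluated at continuous times `t`."""
    half = dim // 2
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * torch.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)


# Motion at 25 fps and audio frames at 100 fps: by embedding wall-clock
# time rather than token index, tokens covering the same instant get
# matching positional codes, which helps joint attention align them.
motion_pe = sinusoidal_embedding(aligned_positions(50, fps=25.0), dim=256)
audio_pe = sinusoidal_embedding(aligned_positions(200, fps=100.0), dim=256)
```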