OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng
2026-04-14
Summary
This paper introduces a new system called OmniShow that creates realistic videos of people interacting with objects, guided jointly by a text description, a reference image, audio, and information about the person's pose.
What's the problem?
Creating these kinds of videos is currently difficult because existing methods struggle to combine all the necessary information – text, images, audio, and pose – into high-quality, controllable results. There is a trade-off between precisely controlling what happens in the video and making it look realistic. On top of that, the field lacks sufficient training data and good ways to measure how well these systems perform.
What's the solution?
The researchers developed OmniShow, which uses a few key techniques to address these problems. First, Unified Channel-wise Conditioning efficiently injects the reference image and pose into the video model (a toy sketch of this idea appears below). Second, Gated Local-Context Attention keeps the generated video precisely synchronized with the audio. Third, a Decoupled-Then-Joint Training strategy cleverly combines different sub-task datasets, using model merging, to overcome the lack of data. Finally, they created a new benchmark, called HOIVG-Bench, to properly evaluate how well these video generation systems work.
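To make the conditioning idea concrete, here is a minimal PyTorch sketch of channel-wise conditioning. It illustrates the general technique, not the paper's actual architecture: the module name, the latent shapes, and the 1x1 Conv3d projection are all assumptions for illustration.

```python
# Minimal sketch of channel-wise conditioning (NOT the paper's implementation):
# fuse noisy video latents with reference-image and pose latents by
# concatenating along the channel dimension, then project back to the
# backbone's expected channel count.
import torch
import torch.nn as nn

class ChannelwiseConditioning(nn.Module):
    def __init__(self, latent_channels: int = 16):
        super().__init__()
        # 3 latent streams (video, reference image, pose) -> backbone channels
        self.proj = nn.Conv3d(3 * latent_channels, latent_channels, kernel_size=1)

    def forward(self, video_latents, ref_latents, pose_latents):
        # All inputs: (batch, channels, frames, height, width).
        # The single reference-image latent is broadcast across all frames.
        ref_latents = ref_latents.expand(-1, -1, video_latents.shape[2], -1, -1)
        fused = torch.cat([video_latents, ref_latents, pose_latents], dim=1)
        return self.proj(fused)

# Toy usage with random latents.
b, c, t, h, w = 1, 16, 8, 32, 32
cond = ChannelwiseConditioning(latent_channels=c)
out = cond(torch.randn(b, c, t, h, w),   # noisy video latents
           torch.randn(b, c, 1, h, w),   # reference-image latent
           torch.randn(b, c, t, h, w))   # pose latents
print(out.shape)  # torch.Size([1, 16, 8, 32, 32])
```

The appeal of this style of conditioning is that the backbone receives a single fused latent tensor, so only its input projection needs to change, which is plausibly why the paper describes the injection as efficient.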
Why it matters?
This work matters because it moves us closer to automatically generating videos for things like online shopping demonstrations, short videos for social media, and more interactive entertainment experiences. By building a system that handles multiple types of input and produces high-quality videos, and by providing a standard way to measure performance, this research sets a solid baseline for the field and opens up possibilities for more realistic and useful video generation.
Abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
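As with the conditioning sketch above, the following is a hypothetical PyTorch illustration of what gated local-context attention for audio-visual synchronization could look like; the window size, the gating mechanism, and the assumption of frame-aligned audio features are illustrative choices, not the paper's exact design.

```python
# Hypothetical sketch of gated local-context cross-attention for audio sync:
# each video frame attends only to audio features within a local temporal
# window, and a learned gate scales how much audio context is injected.
import torch
import torch.nn as nn

class GatedLocalAudioAttention(nn.Module):
    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window  # frames of audio context on each side
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Linear(dim, 1)  # per-frame scalar gate

    def forward(self, video_feats, audio_feats):
        # video_feats, audio_feats: (batch, frames, dim), time-aligned.
        t = video_feats.shape[1]
        idx = torch.arange(t)
        # Boolean mask blocks audio frames outside the local window
        # (True = not allowed to attend).
        mask = (idx[:, None] - idx[None, :]).abs() > self.window
        ctx, _ = self.attn(video_feats, audio_feats, audio_feats, attn_mask=mask)
        g = torch.sigmoid(self.gate(video_feats))  # (batch, frames, 1)
        return video_feats + g * ctx

# Toy usage: 2 clips, 10 frames, 64-dim features.
m = GatedLocalAudioAttention(dim=64)
out = m(torch.randn(2, 10, 64), torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

Restricting each frame to a narrow audio window encourages tight action-to-sound alignment, while the learned gate lets the model downweight audio context for frames where it is uninformative.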