
MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim

2025-10-09

Summary

This research investigates how video generation models, specifically Diffusion Transformers (DiTs), understand and represent interactions between different objects or people within a video. The authors create a new dataset and a method to improve how these models handle complex, interaction-heavy scenes.

What's the problem?

Current video generation models are very good at producing videos, but they often struggle when multiple things are happening at once or when objects are interacting with each other. It is also unclear *how* these models represent interactions internally, that is, which parts of the model are responsible for understanding 'who is doing what to whom' in a video. The result is videos that don't quite make sense or in which objects behave unrealistically.

What's the solution?

The researchers built a new dataset called MATRIX-11K, which pairs videos with detailed captions focused on the interactions taking place and with mask tracks that follow each object across frames. They then analyzed existing video generation models to see which parts of the model were 'paying attention' to these interactions, and found that only a few specific layers were crucial. Based on this, they developed a technique called MATRIX, which guides those important layers to focus on the correct interactions using the mask tracks from their new dataset (see the sketch below). They also created a new evaluation protocol, InterGenEval, for measuring how well a model understands and generates interactions.
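To give a concrete sense of what "guiding a layer's attention" might look like, here is a minimal sketch of an attention-to-mask alignment penalty: it rewards attention mass that falls inside an instance's mask track and penalizes attention that leaks outside it. The function name, tensor shapes, and loss form are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def mask_alignment_loss(attn, mask, eps=1e-8):
    """Encourage a layer's attention to stay inside an instance's mask track.

    attn: (B, T, H, W) attention weights from one chosen DiT layer, e.g. how
          strongly each video token attends to a noun or verb text token.
    mask: (B, T, H, W) binary mask track for the corresponding instance.

    Illustrative sketch only; the paper's exact objective may differ.
    """
    # Normalize attention over the spatial positions of each frame.
    attn = attn.flatten(2)                                  # (B, T, H*W)
    attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)
    mask = mask.flatten(2)                                  # (B, T, H*W)

    # Fraction of attention mass that lands inside the mask, per frame.
    inside = (attn * mask).sum(dim=-1)                      # (B, T)

    # Penalize attention that leaks outside the tracked instance.
    return -(torch.log(inside + eps)).mean()
```

In this picture, a term like the one above would be added to the usual training objective only for the handful of interaction-dominant layers identified in the analysis, leaving the rest of the network untouched.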

Why it matters?

This work is important because it helps us understand the inner workings of advanced video generation models. By improving how these models represent interactions, we can create more realistic, coherent, and believable videos. This has implications for a wide range of applications, from creating special effects in movies to generating training data for robots and developing more engaging virtual experiences.

Abstract

Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.
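As a rough illustration of the semantic-grounding analysis described in the abstract, the sketch below scores, layer by layer, how much of a noun token's video-to-text attention falls inside that instance's mask track; layers with the highest scores would be candidates for the interaction-dominant subset. Shapes and names are assumptions for illustration and do not reflect the released code.

```python
import torch

def grounding_score(attn_per_layer, mask, eps=1e-8):
    """Score semantic grounding for one text token, one value per layer.

    attn_per_layer: (L, T, H, W) video-to-text attention for a single noun
                    or verb token, one map per transformer layer.
    mask:           (T, H, W) binary mask track for the instance the token
                    refers to.

    Returns an (L,) tensor: the fraction of each layer's attention mass that
    falls inside the instance's mask, averaged over frames.
    """
    attn = attn_per_layer.flatten(2)                        # (L, T, H*W)
    attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)
    inside = (attn * mask.flatten(1)).sum(dim=-1)           # (L, T)
    return inside.mean(dim=-1)                              # (L,)
```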