Composing Concepts from Images and Videos via Concept-prompt Binding

Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao

2025-12-11

Summary

This paper introduces Bind & Compose, a one-shot technique for combining visual concepts drawn from both images and videos into a single generated image or video, so that the result more accurately reflects what the user wants.

What's the problem?

Currently, it's difficult for computers to understand complex visual ideas and combine them effectively when creating new images or videos. Existing methods struggle to accurately pull out the important parts of images and videos and then blend those concepts together in a way that makes sense and looks good. They often miss details or don't quite capture the intended meaning.

What's the solution?

Bind & Compose works by binding visual concepts – like 'a red car' or 'a sunny beach' – to specific words, or 'tokens', in the text prompt. A hierarchical 'binder' structure inside a Diffusion Transformer (a type of generative AI model) then combines these bound tokens when producing the target output. To make the binding more accurate, the authors designed a 'Diversify-and-Absorb Mechanism', which trains with varied prompts and adds an extra 'absorbent' token that soaks up details irrelevant to the concept. They also created a 'Temporal Disentanglement Strategy' that splits the training of video concepts into two stages with a dual-branch binder, so that appearance and how things change over time are handled separately.
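The absorbent-token idea can be pictured with plain cross-attention. The sketch below is a toy illustration only, not the paper's implementation: all shapes, variable names, and the random features are made up for the example. It shows the mechanical effect the mechanism relies on: appending one extra key/value token necessarily diverts some attention mass away from the concept tokens, giving concept-irrelevant detail somewhere else to go.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention; returns output and attention weights."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights

rng = np.random.default_rng(0)
d = 8
patch_queries = rng.normal(size=(4, d))    # hypothetical image-patch queries
concept_tokens = rng.normal(size=(3, d))   # prompt tokens bound to visual concepts
absorbent_token = rng.normal(size=(1, d))  # extra token meant to soak up
                                           # concept-irrelevant detail

# Without the absorbent token, all attention mass must land on concept tokens.
_, w_plain = cross_attention(patch_queries, concept_tokens, concept_tokens)

# With it, part of the mass is diverted away from the concept tokens.
kv = np.concatenate([concept_tokens, absorbent_token])
_, w_absorb = cross_attention(patch_queries, kv, kv)

# Attention mass remaining on the three concept tokens, per query patch.
concept_mass = w_absorb[:, :3].sum(axis=1)
```

Because softmax weights are strictly positive, `concept_mass` is always below 1 once the absorbent token is present; during training, the binder can then route irrelevant detail onto that token instead of contaminating the concept tokens.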

Why it matters?

This research is important because it improves the ability of AI to create visuals based on complex instructions, opening up possibilities for more creative tools and applications. It allows for more precise control over the generated images and videos, leading to results that are more consistent with the user’s vision and higher quality overall.

Abstract

Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
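The Temporal Disentanglement Strategy can be illustrated with a toy decomposition. This is an assumption-laden sketch, not the paper's actual dual-branch training procedure: it simply splits a video concept's frame features into a static appearance component and zero-mean motion residuals, which is the kind of appearance/motion separation the two branches are designed to learn in their respective stages.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy "video": T frames of d-dim features (a stand-in for latent frame embeddings).
T, d = 6, 8
video = rng.normal(size=(T, d))

# Stage 1 (appearance branch): capture the time-invariant component,
# approximated here by the temporal mean of the frame features.
appearance = video.mean(axis=0, keepdims=True)

# Stage 2 (temporal branch): capture only the residual motion dynamics,
# i.e. what remains after the static appearance is removed.
motion = video - appearance

# The two components together reconstruct the original features, and the
# motion component is zero-mean over time by construction.
recon = appearance + motion
```

Keeping the motion residuals free of appearance information is what makes a video-derived concept compatible with concepts extracted from still images.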