Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, B. S. Manjunath

2025-12-03

Summary

This paper introduces Click2Graph, a new system that lets users interactively build a detailed understanding of what's happening in a video, going beyond just identifying objects to understanding *how* they relate to each other over time.

What's the problem?

Current video understanding systems can automatically figure out what's in a video and how things are connected, but they give people no easy way to guide or correct them. On the other hand, some newer systems are great at letting users point to things in a video, but they don't 'understand' the bigger picture or the relationships between objects. Essentially, there's a gap between automated understanding and user control.

What's the solution?

Click2Graph bridges this gap by combining the best of both worlds. When a user clicks on something in a video, the system not only segments and tracks that object throughout the video, but also figures out which other objects it interacts with and *how* – for example, 'a person is riding a bike'. It does this using a 'Dynamic Interaction Discovery Module' to predict which other objects might be involved, and a 'Semantic Classification Head' to figure out the relationships between them. The result is a complete 'scene graph' showing all the objects and their connections over time.
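To make the flow concrete, here is a minimal sketch of the stages described above. Everything here is illustrative: the function names, return values, and toy outputs are assumptions for explanation, not the authors' actual implementation or API.

```python
# Hypothetical sketch of the Click2Graph flow: a single click on the subject
# leads to tracking, interaction discovery, and predicate classification.
# All names and toy values below are illustrative, not the paper's code.

def segment_and_track(video, click):
    """Stand-in for promptable segmentation + tracking (SAM2-style):
    turns one user click into a subject mask tracked across frames."""
    return {"subject": "person", "num_frames": len(video)}

def discover_interactions(video, subject_track):
    """Stand-in for the Dynamic Interaction Discovery Module:
    proposes objects the tracked subject might be interacting with."""
    return ["bike", "helmet"]

def classify_relations(subject_track, objects):
    """Stand-in for the Semantic Classification Head:
    assigns a predicate to each subject-object pair."""
    toy_predicates = {"bike": "riding", "helmet": "wearing"}
    return [(subject_track["subject"], toy_predicates[obj], obj)
            for obj in objects]

def click2graph(video, click):
    subject = segment_and_track(video, click)
    objects = discover_interactions(video, subject)
    return classify_relations(subject, objects)

# One click at pixel (120, 80) on a 30-frame toy "video".
triplets = click2graph(video=list(range(30)), click=(120, 80))
print(triplets)  # [('person', 'riding', 'bike'), ('person', 'wearing', 'helmet')]
```

The key design point the sketch mirrors is that object discovery is *conditioned on the subject*: the user picks one entity, and the system proposes its interaction partners rather than detecting everything in the scene.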

Why it matters?

This work is important because it moves video understanding towards being more controllable and interpretable. Instead of a computer just *telling* you what's happening, it allows you to guide the process and verify that the computer is understanding the video the way *you* do. This could be useful for things like video editing, creating summaries, or even helping people with visual impairments understand video content.

Abstract

State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.
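The abstract's "temporally consistent scene graph" of <subject, object, predicate> triplets can be represented minimally as a per-frame collection of relations. The sketch below is an assumed in-memory representation for illustration only, not the paper's data structure.

```python
from dataclasses import dataclass
from collections import defaultdict

# Illustrative representation of one <subject, predicate, object> relation.
@dataclass(frozen=True)
class Triplet:
    subject: str
    predicate: str
    object: str

class VideoSceneGraph:
    """Toy container for temporally grounded triplets (illustrative only)."""

    def __init__(self):
        # frame index -> set of triplets active in that frame
        self.frames = defaultdict(set)

    def add(self, triplet, start, end):
        """Mark a relation as holding over frames [start, end]."""
        for f in range(start, end + 1):
            self.frames[f].add(triplet)

    def relations_at(self, frame):
        """Return the relations active at a given frame, sorted for display."""
        return sorted((t.subject, t.predicate, t.object)
                      for t in self.frames[frame])

g = VideoSceneGraph()
g.add(Triplet("person", "riding", "bike"), start=0, end=10)
g.add(Triplet("person", "wearing", "helmet"), start=0, end=30)
print(g.relations_at(5))   # both relations hold at frame 5
print(g.relations_at(20))  # only 'wearing helmet' persists
```

Indexing relations by frame is what lets a graph like this answer "what is happening *now*" as objects enter and leave interactions over the course of the video.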