CauSight: Learning to Supersense for Visual Causal Discovery

Yize Zhang, Meiqi Chen, Sirui Chen, Bo Peng, Yanxi Zhang, Tianyu Li, Chaochao Lu

2025-12-02

Summary

This paper focuses on teaching AI to understand *why* things happen in images, not just *what* is happening, a skill called visual causal discovery.

What's the problem?

Current AI systems are good at identifying objects in images, but they struggle to understand the cause-and-effect relationships between those objects. For example, an AI might see a person pushing a box, but not understand that the push *causes* the box to move. This limits their ability to truly understand the world like humans do.

What's the solution?

The researchers built a large dataset, VCG-32K, of over 32,000 images labeled with the causal relationships between the objects they contain. They then trained a new vision-language model called CauSight on this data using a special training process. This process synthesizes 'reasoning paths' (via a technique they call Tree-of-Causal-Thought) and uses reinforcement learning to reward the AI when it correctly identifies causes and effects. Essentially, they teach the AI to reason through the 'why' behind what it sees.
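The paper does not spell out its reward formula here, but one plausible way to score a predicted entity-level causal graph against a gold annotation is F1 over directed cause-and-effect edges. The sketch below is a hypothetical illustration of that idea, not the paper's actual reward design; the function name and the example edges are invented for clarity.

```python
# Hypothetical sketch: score a predicted causal graph against a gold
# annotation. Edges are directed (cause, effect) pairs of entity labels.
# This is an illustrative stand-in for a causal reward, not CauSight's
# actual design.

def causal_edge_f1(predicted: set[tuple[str, str]],
                   gold: set[tuple[str, str]]) -> float:
    """F1 score over directed cause->effect edges."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # correctly recovered edges
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model recovers the true edge ("push" causes "box moves")
# but also predicts a spurious reversed edge.
gold = {("person pushes", "box moves")}
pred = {("person pushes", "box moves"), ("box moves", "person pushes")}
print(round(causal_edge_f1(pred, gold), 3))  # 0.667
```

A reward like this penalizes both missed causal links (low recall) and hallucinated ones (low precision), which is the kind of signal a reinforcement-learning loop needs to refine the model's reasoning policy.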

Why does it matter?

This work is important because it's a step towards building AI that can reason and understand the world more like humans. This could lead to improvements in areas like robotics, where robots need to understand how their actions affect the environment, and in image understanding, where AI needs to go beyond simply recognizing objects to understanding their interactions.

Abstract

Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model to perform visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at project page: https://github.com/OpenCausaLab/CauSight.