Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao
2025-12-30
Summary
This paper tackles the difficult problem of getting computers to 'see' transparent and reflective objects in videos, something current systems struggle with because light bends and bounces in complicated ways.
What's the problem?
Typical computer vision systems rely on techniques like comparing images from two cameras (stereo vision) or measuring how long light takes to return (Time-of-Flight) to estimate depth. These methods break down on transparent objects like glass and on shiny surfaces like metal, because light refracts and reflects instead of traveling in straight lines, producing confusing signals and inaccurate depth maps. As a result, existing methods often yield incomplete or temporally unstable depth estimates for these materials.
What's the solution?
The researchers noticed that advanced video generation models, which create realistic videos from scratch, are already quite good at simulating how light interacts with transparent and reflective surfaces. So they created a large dataset of synthetic videos featuring these kinds of objects using a 3D rendering program. They then took a pre-trained video model and fine-tuned it on this new dataset, teaching it to predict depth and surface normals (which describe the orientation of a surface) specifically for transparent and reflective scenes. The fine-tuning was done efficiently with a technique called LoRA, which adjusts only a small set of added parameters while leaving the original model frozen. The resulting model, called DKT, feeds the video's colors and a noisy version of the depth into the network together, producing more accurate and temporally stable results.
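To make the LoRA idea concrete, here is a minimal numpy sketch of a low-rank-adapted linear layer. This is a hypothetical illustration, not the authors' code: the frozen pretrained weight `W` never changes, and only the two small factors `A` and `B` would be trained, which is why the method touches so few parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 64, 64, 4
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-initialized

def lora_forward(x):
    # Output = frozen path + low-rank update (B @ A) added on top.
    return x @ W.T + x @ (B @ A).T

x = rng.standard_normal((1, d_in))
# With B zero-initialized, the adapter starts as an exact no-op,
# so fine-tuning begins from the pretrained model's behavior.
assert np.allclose(lora_forward(x), x @ W.T)

# Parameter count: low-rank factors vs. the full weight matrix.
full_params = W.size                 # 64 * 64 = 4096
lora_params = A.size + B.size        # 4 * (64 + 64) = 512
print(lora_params / full_params)     # prints 0.125
```

At real model scale the ratio is far smaller still, since `rank` stays tiny while `d_in` and `d_out` grow into the thousands.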
Why it matters?
This work is important because it shows that we can leverage the power of generative AI – models that *create* things – to improve computer *perception* – how computers 'see' the world. DKT significantly improves depth estimation for transparent and reflective objects, even when tested on real-world videos. This is crucial for applications like robotics, where robots need to accurately understand their environment to grasp objects, and augmented reality, where virtual objects need to realistically interact with the real world. The results suggest that these generative models have already 'learned' a lot about how light works, and we can tap into that knowledge without needing tons of labeled data.
Abstract
Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
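The abstract's training recipe, concatenating the clean RGB latent with a noised depth latent before feeding the DiT backbone, can be sketched schematically. The shapes, channel counts, and the linear noising rule below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shapes: (frames, channels, height, width).
T, C, H, W = 8, 16, 32, 32
rgb_latent = rng.standard_normal((T, C, H, W))    # encoded input video (conditioning)
depth_latent = rng.standard_normal((T, C, H, W))  # encoded depth target

# One diffusion training step: noise only the depth latent, keep the
# RGB latent clean, then concatenate along the channel axis so the
# DiT sees both and learns to denoise depth conditioned on RGB.
t = 0.5                                           # noise level in [0, 1]
noise = rng.standard_normal(depth_latent.shape)
noisy_depth = (1 - t) * depth_latent + t * noise  # simple linear interpolation

dit_input = np.concatenate([rgb_latent, noisy_depth], axis=1)
print(dit_input.shape)  # prints (8, 32, 32, 32)
```

The same channel-concatenation scheme extends directly to normals by swapping the depth latent for an encoded normal map.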